Approximate Distinct Counts for Billions of Datasets

SIGMOD 2019 (Amsterdam, The Netherlands, June 30 - July 5, 2019)

Cardinality estimation plays an important role in processing big data. We consider the challenging problem of computing millions or more distinct count aggregations in a single pass while allowing these aggregations to be further combined into coarser aggregations. Such aggregations arise naturally in many applications, including networking, databases, and real-time business reporting. We demonstrate that existing approaches to this problem are inherently flawed, exhibiting bias that can be arbitrarily large, and we propose new methods that have theoretical guarantees of correctness and tight, practical error estimates.

This is achieved by carefully combining CountMin and HyperLogLog sketches and a theoretical analysis using statistical estimation techniques. These methods also advance cardinality estimation for individual multisets, as they provide a provably consistent estimator and tight confidence intervals that have exactly the correct asymptotic coverage.
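
The abstract does not spell out the construction, so the following is only a minimal Python sketch of the most naive way to combine the two sketch types: a CountMin-style grid whose cells are small HyperLogLog sketches, queried by taking the minimum estimate across rows. It is not the estimator proposed in the paper, and a simple combination of this kind may well be among the approaches the abstract describes as biased. The class names, the parameters (`depth`, `width`, `p`), and the choice of BLAKE2b as the hash are all assumptions made for illustration.

```python
import hashlib
import math


def _hash64(value, seed=0):
    """64-bit hash of `value`; `seed` selects an (assumed) independent hash."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class HyperLogLog:
    """Basic HyperLogLog with 2**p registers (illustrative, no bias tables)."""

    def __init__(self, p=8):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = _hash64(item)
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Union of two sketches: element-wise maximum of registers."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # standard constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)       # small-range linear counting
        return raw


class CountMinHLL:
    """Hypothetical CountMin grid whose cells are HyperLogLog sketches."""

    def __init__(self, depth=3, width=64, p=8):
        self.depth = depth
        self.width = width
        self.cells = [[HyperLogLog(p) for _ in range(width)] for _ in range(depth)]

    def _col(self, key, row):
        # Seeds 1..depth are reserved for keys; seed 0 is used for items.
        return _hash64(key, seed=row + 1) % self.width

    def add(self, key, item):
        """Record that `item` was seen in the dataset identified by `key`."""
        for row in range(self.depth):
            self.cells[row][self._col(key, row)].add(item)

    def estimate(self, key):
        """Naive query: minimum cell estimate over the rows for `key`."""
        return min(
            self.cells[row][self._col(key, row)].estimate()
            for row in range(self.depth)
        )
```

A usage sketch: call `add("dataset-a", item)` once per observed item, then `estimate("dataset-a")` to read back an approximate distinct count. Because HyperLogLog registers merge by element-wise maximum, two grids built with identical parameters could in principle be combined cell by cell, which is the property that makes coarser aggregations possible.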



Daniel Ting