Tighter estimation using bottom k sketches

Edith Cohen*, Haim Kaplan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of a subpopulation specified by a predicate over the records' attributes. Bottom-k sketches are a powerful summarization format for weighted items that includes priority sampling [22] and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data, including distributed databases and data streams, and they support coordinated and all-distances sketches. We derive novel unbiased estimators and confidence bounds for subpopulation weight. Our rank conditioning (RC) estimator is applicable when the total weight of the sketched set cannot be computed by the summarization algorithm without significant use of additional resources (as with sketches of network neighborhoods), and our tighter subset conditioning (SC) estimator is applicable when the total weight is available (as with sketches of data streams). Our estimators are derived using clever applications of the Horvitz-Thompson estimator, which is not directly applicable to bottom-k sketches. We develop efficient computational methods and conduct a performance evaluation using a range of synthetic and real data sets. We demonstrate considerable benefits of the SC estimator on larger subpopulations (over all other estimators), of the RC estimator (over existing estimators for weighted sampling without replacement), and of our confidence bounds (over all previous approaches).
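
The abstract does not spell out the estimators, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes exponentially distributed ranks (the rank family that corresponds to weighted sampling without replacement) and assumes that the rank conditioning estimate divides each sampled weight by its probability of ranking below the (k+1)-th smallest rank. The function names `bottom_k_sketch` and `rc_estimate` are hypothetical.

```python
import heapq
import math
import random


def bottom_k_sketch(items, k, rng):
    """Return (sample, threshold) for an iterable of (key, weight) pairs.

    sample    -- the k items with the smallest ranks
    threshold -- the (k+1)-th smallest rank, or math.inf if there are
                 at most k items (the sketch then holds everything)
    Assumption: ranks are exponential with rate equal to the weight.
    """
    ranked = []
    for key, weight in items:
        # 1 - rng.random() lies in (0, 1], so the logarithm is defined.
        rank = -math.log(1.0 - rng.random()) / weight
        ranked.append((rank, key, weight))
    smallest = heapq.nsmallest(k + 1, ranked, key=lambda t: t[0])
    if len(smallest) <= k:
        return [(key, w) for _, key, w in smallest], math.inf
    threshold = smallest[k][0]
    return [(key, w) for _, key, w in smallest[:k]], threshold


def rc_estimate(sample, threshold, predicate):
    """Estimate the total weight of items whose key satisfies predicate.

    Each sampled item contributes weight / p, where p is the probability
    that an exponential rank with that weight falls below the threshold
    (a Horvitz-Thompson style adjustment, conditioned on the threshold).
    """
    total = 0.0
    for key, weight in sample:
        if predicate(key):
            p = -math.expm1(-weight * threshold)  # 1 - exp(-w * threshold)
            total += weight / p
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(i, 1.0 + (i % 5)) for i in range(100)]
    sample, threshold = bottom_k_sketch(data, 10, rng)
    print(rc_estimate(sample, threshold, lambda key: key % 2 == 0))
```

When the data set has at most k items, the threshold is infinite, every inclusion probability is 1, and the estimate equals the exact subpopulation weight.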

Original language: English
Pages (from-to): 213-229
Number of pages: 17
Journal: Proceedings of the VLDB Endowment
Volume: 1
Issue number: 1
State: Published - 2008
