Summarizing data using bottom-k sketches

Edith Cohen, Haim Kaplan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank drawn from a probability distribution that depends on the weight of the item and including the k items with smallest rank value. Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the k minimum ranked items in k independent rank assignments, and of min-hash [5] sketches, where hash functions replace random rank assignments. Sketches support approximate aggregations, including weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable for datasets where items lie in some metric space such as data streams (time) or networks. These sketches compactly encode the respective plain sketches of all neighborhoods of a location. These sketches support queries posed over time windows or neighborhoods and time/spatially decaying aggregates. An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches. Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators since each bottom-k sketch is a distribution over k-mins sketches.)
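The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made for illustration only: ranks are exponentially distributed with rate equal to the item weight, randomness comes from a SHA-256 hash of the item key so that sketches of different subsets are coordinated, and the helper names `rank`, `bottom_k_sketch`, and `jaccard_estimate` are hypothetical. The Jaccard estimator shown is a commonly used one for coordinated samples and assumes unit weights.

```python
import hashlib
import heapq
import math


def rank(item, weight, seed=0):
    """Hash-derived rank, exponentially distributed with rate `weight`.

    Assumption: exponential ranks are one possible weight-dependent
    distribution. Hashing the item key gives the same rank to the same
    item in every subset, which is what makes sketches coordinated.
    """
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    # Map the first 8 bytes of the digest to a uniform value in (0, 1].
    u = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)
    return -math.log(u) / weight


def bottom_k_sketch(weights, k, seed=0):
    """Return the k (rank, item) pairs with smallest rank, sorted by rank.

    `weights` maps item -> nonnegative weight. Zero-weight items are
    skipped because their rank would be infinite.
    """
    ranked = ((rank(item, w, seed), item) for item, w in weights.items() if w > 0)
    return heapq.nsmallest(k, ranked)


def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two coordinated unit-weight sketches.

    The k smallest ranks among both sketches form a bottom-k sketch of the
    union; the fraction of those items present in both sketches estimates
    |A ∩ B| / |A ∪ B|.
    """
    union_k = heapq.nsmallest(k, set(sketch_a) | set(sketch_b))
    items_a = {item for _, item in sketch_a}
    items_b = {item for _, item in sketch_b}
    hits = sum(1 for _, item in union_k if item in items_a and item in items_b)
    return hits / len(union_k) if union_k else 0.0


if __name__ == "__main__":
    a = {f"item{i}": 1.0 for i in range(0, 1000)}
    b = {f"item{i}": 1.0 for i in range(500, 1500)}
    k = 64
    sa, sb = bottom_k_sketch(a, k), bottom_k_sketch(b, k)
    # True Jaccard similarity is 500 / 1500; the estimate is approximate.
    print(jaccard_estimate(sa, sb, k))
```

Because the rank of an item depends only on its key, seed, and weight, the sketch of the union can be obtained from the two sketches alone, without revisiting the underlying data.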

Original language: English
Title of host publication: PODC'07
Subtitle of host publication: Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing
Pages: 225-234
Number of pages: 10
State: Published - 2007
Event: PODC'07: 26th Annual ACM Symposium on Principles of Distributed Computing - Portland, OR, United States
Duration: 12 Aug 2007 to 15 Aug 2007

Publication series

Name: Proceedings of the Annual ACM Symposium on Principles of Distributed Computing

Conference

Conference: PODC'07: 26th Annual ACM Symposium on Principles of Distributed Computing
Country/Territory: United States
City: Portland, OR
Period: 12/08/07 to 15/08/07

Keywords

  • All-distances sketches
  • Bottom-k sketches
  • Data streams
