TY - GEN
T1 - Summarizing data using bottom-k sketches
AU - Cohen, Edith
AU - Kaplan, Haim
PY - 2007
Y1 - 2007
N2 - A Bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank drawn from a probability distribution that depends on the weight of the item and including the k items with smallest rank value. Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the k minimum ranked items in k independent rank assignments, and of min-hash [5] sketches, where hash functions replace random rank assignments. Sketches support approximate aggregations, including weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable for datasets where items lie in some metric space such as data streams (time) or networks. These sketches compactly encode the respective plain sketches of all neighborhoods of a location. These sketches support queries posed over time windows or neighborhoods and time/spatially decaying aggregates. An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches. Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators since each bottom-k sketch is a distribution over k-mins sketches).
AB - A Bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank drawn from a probability distribution that depends on the weight of the item and including the k items with smallest rank value. Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the k minimum ranked items in k independent rank assignments, and of min-hash [5] sketches, where hash functions replace random rank assignments. Sketches support approximate aggregations, including weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable for datasets where items lie in some metric space such as data streams (time) or networks. These sketches compactly encode the respective plain sketches of all neighborhoods of a location. These sketches support queries posed over time windows or neighborhoods and time/spatially decaying aggregates. An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches. Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators since each bottom-k sketch is a distribution over k-mins sketches).
KW - All-distances sketches
KW - Bottom-k sketches
KW - Data streams
UR - http://www.scopus.com/inward/record.url?scp=36849001315&partnerID=8YFLogxK
U2 - 10.1145/1281100.1281133
DO - 10.1145/1281100.1281133
M3 - Conference contribution
AN - SCOPUS:36849001315
SN - 1595936165
SN - 9781595936165
T3 - Proceedings of the Annual ACM Symposium on Principles of Distributed Computing
SP - 225
EP - 234
BT - PODC'07
T2 - PODC'07: 26th Annual ACM Symposium on Principles of Distributed Computing
Y2 - 12 August 2007 through 15 August 2007
ER -