TY - GEN
T1 - Processing top-k queries from samples
AU - Cohen, Edith
AU - Grossaug, Nadav
AU - Kaplan, Haim
PY - 2006
Y1 - 2006
N2 - Top-k queries are desired aggregation operations on data sets. Examples of queries on network data include the top 100 source AS's, top 100 ports, or top domain names over IP packets or over IP flow records. Since the complete dataset is often not available or not feasible to examine, we are interested in processing top-k queries from samples. If all records can be processed, the top-k items can be obtained by counting the frequency of each item. Even when the full dataset is observed, however, resources are often insufficient for such counting, and techniques were developed to overcome this issue. When we can observe only a random sample of the records, an orthogonal complication arises: the top frequencies in the sample are biased estimates of the actual top-k frequencies. This bias depends on the distribution and must be accounted for when seeking the actual value. We address this by designing and evaluating several schemes that derive rigorous confidence bounds for top-k estimates. Simulations on various data sets, including IP flow data, show that schemes that exploit more of the structure of the sample distribution produce much tighter confidence intervals with an order of magnitude fewer samples than simpler schemes that utilize only the sampled top-k frequencies. The simpler schemes, however, are more efficient in terms of computation. Our work is basic and is widely applicable to all applications that process top-k and heavy hitters queries over a random sample of the actual records.
AB - Top-k queries are desired aggregation operations on data sets. Examples of queries on network data include the top 100 source AS's, top 100 ports, or top domain names over IP packets or over IP flow records. Since the complete dataset is often not available or not feasible to examine, we are interested in processing top-k queries from samples. If all records can be processed, the top-k items can be obtained by counting the frequency of each item. Even when the full dataset is observed, however, resources are often insufficient for such counting, and techniques were developed to overcome this issue. When we can observe only a random sample of the records, an orthogonal complication arises: the top frequencies in the sample are biased estimates of the actual top-k frequencies. This bias depends on the distribution and must be accounted for when seeking the actual value. We address this by designing and evaluating several schemes that derive rigorous confidence bounds for top-k estimates. Simulations on various data sets, including IP flow data, show that schemes that exploit more of the structure of the sample distribution produce much tighter confidence intervals with an order of magnitude fewer samples than simpler schemes that utilize only the sampled top-k frequencies. The simpler schemes, however, are more efficient in terms of computation. Our work is basic and is widely applicable to all applications that process top-k and heavy hitters queries over a random sample of the actual records.
UR - http://www.scopus.com/inward/record.url?scp=77953845657&partnerID=8YFLogxK
U2 - 10.1145/1368436.1368446
DO - 10.1145/1368436.1368446
M3 - Conference contribution
AN - SCOPUS:77953845657
SN - 1595934561
SN - 9781595934567
T3 - Proceedings of CoNEXT'06 - 2nd Conference on Future Networking Technologies
BT - Proceedings of CoNEXT'06 - 2nd Conference on Future Networking Technologies
Y2 - 4 December 2006 through 7 December 2006
ER -