TY - GEN

T1 - Stream sampling for variance-optimal estimation of subset sums

AU - Cohen, Edith

AU - Duffield, Nick

AU - Kaplan, Haim

AU - Lund, Carsten

AU - Thorup, Mikkel

PY - 2009

Y1 - 2009

N2 - From a high-volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VAROPT_k, that dominates all previous schemes in terms of estimation quality. VAROPT_k provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combining samples from different streams in a distributed setting.

UR - http://www.scopus.com/inward/record.url?scp=70349089219&partnerID=8YFLogxK

U2 - 10.1137/1.9781611973068.136

DO - 10.1137/1.9781611973068.136

M3 - Conference contribution

AN - SCOPUS:70349089219

SN - 9780898716801

T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

SP - 1255

EP - 1264

BT - Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms

PB - Association for Computing Machinery (ACM)

T2 - 20th Annual ACM-SIAM Symposium on Discrete Algorithms

Y2 - 4 January 2009 through 6 January 2009

ER -