TY - JOUR
T1 - Efficient stream sampling for variance-optimal estimation of subset sums
AU - Cohen, Edith
AU - Duffield, Nick
AU - Kaplan, Haim
AU - Lund, Carsten
AU - Thorup, Mikkel
PY - 2011
Y1 - 2011
AB - From a high-volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt_k, that dominates all previous schemes in terms of estimation quality. VarOpt_k provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for combinations of samples from different streams in a distributed setting.
KW - Reservoir sampling
KW - Sampling without replacement
KW - Subset sum estimation
KW - Weighted sampling
UR - http://www.scopus.com/inward/record.url?scp=81255191687&partnerID=8YFLogxK
U2 - 10.1137/10079817X
DO - 10.1137/10079817X
M3 - Article
AN - SCOPUS:81255191687
SN - 0097-5397
VL - 40
SP - 1402
EP - 1431
JO - SIAM Journal on Computing
JF - SIAM Journal on Computing
IS - 5
ER -