TY - JOUR
T1 - Tracking join and self-join Sizes in limited storage
AU - Alon, Noga
AU - Gibbons, Phillip B.
AU - Matias, Yossi
AU - Szegedy, Mario
N1 - Funding Information:
The first author is supported in part by a USA–Israel BSF grant and by the Fund for Basic Research administered by the Israel Academy of Sciences. The third author is supported in part by an Alon Fellowship, by a Tel Aviv University Grant, by the Israel Science Foundation founded by the Academy of Sciences and Humanities, and by the Israeli Ministry of Science. This work was done while the second author was with Bell Labs and the fourth author was with AT&T Labs.
PY - 2002/5
Y1 - 2002/5
N2 - This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed in N. Alon et al. (The space complexity of approximating the frequency moments, J. Comput. System Sci. 58 (1999), 137-147), which we denote tug-of-war and sample-count. We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4-256 memory words, depending on the data set, in order to estimate the self-join size to within a 15% relative error; on average, this is over 4 times (50 times) fewer memory words than needed by sample-count (standard sampling, resp.) to obtain a similar accuracy. For joins, we propose schemes based on maintaining a small signature of each relation independently, such that join sizes can be quickly and accurately estimated between any pair of relations using only these signatures. We show that taking random samples for join signatures can lead to an inaccurate estimation unless the sample size is quite large; moreover, we show that no other signature scheme can significantly improve upon sampling without further assumptions. These negative results are shown to hold even in the presence of sanity bounds. On the other hand, we present a fast join signature scheme based on tug-of-war signatures that provides guarantees on join size estimation as a function of the self-join sizes of the joining relations; this scheme can significantly improve upon the sampling scheme.
AB - This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed in N. Alon et al. (The space complexity of approximating the frequency moments, J. Comput. System Sci. 58 (1999), 137-147), which we denote tug-of-war and sample-count. We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4-256 memory words, depending on the data set, in order to estimate the self-join size to within a 15% relative error; on average, this is over 4 times (50 times) fewer memory words than needed by sample-count (standard sampling, resp.) to obtain a similar accuracy. For joins, we propose schemes based on maintaining a small signature of each relation independently, such that join sizes can be quickly and accurately estimated between any pair of relations using only these signatures. We show that taking random samples for join signatures can lead to an inaccurate estimation unless the sample size is quite large; moreover, we show that no other signature scheme can significantly improve upon sampling without further assumptions. These negative results are shown to hold even in the presence of sanity bounds. On the other hand, we present a fast join signature scheme based on tug-of-war signatures that provides guarantees on join size estimation as a function of the self-join sizes of the joining relations; this scheme can significantly improve upon the sampling scheme.
UR - http://www.scopus.com/inward/record.url?scp=0344794028&partnerID=8YFLogxK
U2 - 10.1006/jcss.2001.1813
DO - 10.1006/jcss.2001.1813
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
AN - SCOPUS:0344794028
VL - 64
SP - 719
EP - 747
JO - Journal of Computer and System Sciences
JF - Journal of Computer and System Sciences
SN - 0022-0000
IS - 3
ER -