TY - CONF
T1 - Tracking join and self-join sizes in limited storage
AU - Alon, Noga
AU - Gibbons, Phillip B.
AU - Matias, Yossi
AU - Szegedy, Mario
N1 - Funding Information:
The first author is supported in part by a USA–Israel BSF grant and by the Fund for Basic Research administered by the Israel Academy of Sciences. The third author is supported in part by an Alon Fellowship, by a Tel Aviv University Grant, by the Israel Science Foundation founded by the Academy of Sciences and Humanities, and by the Israeli Ministry of Science. This work was done while the second author was with Bell Labs and the fourth author was with AT&T Labs.
PY - 1999
Y1 - 1999
N2 - Query optimizers rely on fast, high-quality estimates of result sizes in order to select between various join plans. Self-join sizes of relations provide bounds on the join size of any pair of such relations. The self-join size also indicates the degree of skew in the data, and has been advocated for several estimation procedures. Exact computation of the self-join size requires storage proportional to the number of distinct attribute values, which may be prohibitively large. In this paper, we study algorithms for tracking (approximate) self-join sizes in limited storage in the presence of insertions and deletions to the relations. Such algorithms detect changes in the degree of skew without an expensive recomputation from the base data. We show that an algorithm based on a tug-of-war approach provides a more accurate estimation than one based on a sample-and-count approach, which is in turn more accurate than a sampling-only approach. Next, we study algorithms for tracking (approximate) join sizes in limited storage; the goal is to maintain a small signature of each relation such that join sizes can be accurately estimated for any pair of relations. We show that taking random samples for join signatures can lead to inaccurate estimation unless the sample size is quite large; moreover, we show by a lower bound that no other signature scheme can significantly improve upon sampling without further assumptions. These negative results are shown to hold even in the presence of sanity bounds. On the other hand, we present a join signature scheme based on tug-of-war signatures that provides guarantees on join size estimation as a function of the self-join sizes of the joining relations; this scheme can significantly improve upon the sampling scheme.
UR - http://www.scopus.com/inward/record.url?scp=0032678623&partnerID=8YFLogxK
U2 - 10.1145/303976.303978
DO - 10.1145/303976.303978
M3 - Paper
AN - SCOPUS:0032678623
SP - 10
EP - 20
T2 - Proceedings of the 1999 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '99
Y2 - 31 May 1999 through 2 June 1999
ER -