TY - GEN

T1 - External sampling

AU - Andoni, Alexandr

AU - Indyk, Piotr

AU - Onak, Krzysztof

AU - Rubinfeld, Ronitt

N1 - Funding Information:
The research was supported in part by a David and Lucile Packard Fellowship, by MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Association), by Marie Curie IRG Grant 231077, by NSF grants 0514771, 0728645, and 0732334, and by a Symantec Research Fellowship.

PY - 2009

Y1 - 2009

N2 - We initiate the study of sublinear-time algorithms in the external memory model [1]. In this model, the data is stored in blocks of a certain size B, and the algorithm is charged a unit cost for each block access. This model is well studied, since it reflects the computational issues that arise when the (massive) input is stored on a disk. Since each block access operates on B data elements in parallel, many problems have external memory algorithms whose number of block accesses is only a small fraction (e.g., 1/B) of their main memory complexity. However, to the best of our knowledge, no such reduction in complexity is known for any sublinear-time algorithm. One plausible explanation is that the vast majority of sublinear-time algorithms use random sampling and thus exhibit no locality of reference. This state of affairs is quite unfortunate, since both sublinear-time algorithms and the external memory model are important approaches to dealing with massive data sets, and ideally they should be combined to achieve the best performance. In this paper we show that such a combination is indeed possible. In particular, we consider three well-studied problems: testing of distinctness, uniformity, and identity of an empirical distribution induced by data. For these problems we show random-sampling-based algorithms whose number of block accesses is up to a factor of 1/√B smaller than the main memory complexity of those problems. We also show that this improvement is optimal for those problems. Since these problems are natural primitives for a number of sampling-based algorithms for other problems, our tools improve the external memory complexity of other problems as well.

UR - http://www.scopus.com/inward/record.url?scp=70449090401&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-02927-1_9

DO - 10.1007/978-3-642-02927-1_9

M3 - Conference contribution

AN - SCOPUS:70449090401

SN - 3642029264

SN - 9783642029264

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 83

EP - 94

BT - Automata, Languages and Programming - 36th International Colloquium, ICALP 2009, Proceedings

T2 - 36th International Colloquium on Automata, Languages and Programming, ICALP 2009

Y2 - 5 July 2009 through 12 July 2009

ER -