An efficient MapReduce cube algorithm for varied data distributions

Tova Milo, Eyal Altshuler

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Data cubes allow users to discover insights from their data and are commonly used in data analysis. While very useful, the data cube is expensive to compute, in particular when the input relation is very large. To address this problem, we consider cube computation in MapReduce, the popular paradigm for distributed big data processing, and present an efficient algorithm for computing cubes over large data sets. We show that our new algorithm consistently performs better than the previous solutions. In particular, existing techniques for cube computation in MapReduce suffer from sensitivity to the distribution of the input data and their performance heavily depends on whether or not, and how exactly, the data is skewed. In contrast, the cube algorithm that we present here is resilient and significantly outperforms previous solutions for varying data distributions. At the core of our solution is a dedicated data structure called the Skews and Partitions Sketch (SP-Sketch for short). The SPSketch is compact in size and fast to compute, and records all needed information for identifying skews and effectively partitioning the workload between the machines. Our algorithm uses the sketch to speed up computation and minimize communication overhead. Our theoretical analysis and thorough experimental study demonstrate the feasibility and efficiency of our solution, including comparisons to state of the art tools for big data processing such as Pig and Hive.

Original languageEnglish
Title of host publicationSIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages1151-1165
Number of pages15
ISBN (Electronic)9781450335317
DOIs
StatePublished - 26 Jun 2016
Event2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 - San Francisco, United States
Duration: 26 Jun 20161 Jul 2016

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
Volume26-June-2016
ISSN (Print)0730-8078

Conference

Conference2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
Country/TerritoryUnited States
CitySan Francisco
Period26/06/161/07/16

Fingerprint

Dive into the research topics of 'An efficient MapReduce cube algorithm for varied data distributions'. Together they form a unique fingerprint.

Cite this