TY - GEN

T1 - An efficient MapReduce cube algorithm for varied data distributions

AU - Milo, Tova

AU - Altshuler, Eyal

N1 - Publisher Copyright:
© 2016 ACM.

PY - 2016/6/26

Y1 - 2016/6/26

N2 - Data cubes allow users to discover insights from their data and are commonly used in data analysis. While very useful, the data cube is expensive to compute, in particular when the input relation is very large. To address this problem, we consider cube computation in MapReduce, the popular paradigm for distributed big data processing, and present an efficient algorithm for computing cubes over large data sets. We show that our new algorithm consistently performs better than the previous solutions. In particular, existing techniques for cube computation in MapReduce suffer from sensitivity to the distribution of the input data and their performance heavily depends on whether or not, and how exactly, the data is skewed. In contrast, the cube algorithm that we present here is resilient and significantly outperforms previous solutions for varying data distributions. At the core of our solution is a dedicated data structure called the Skews and Partitions Sketch (SP-Sketch for short). The SPSketch is compact in size and fast to compute, and records all needed information for identifying skews and effectively partitioning the workload between the machines. Our algorithm uses the sketch to speed up computation and minimize communication overhead. Our theoretical analysis and thorough experimental study demonstrate the feasibility and efficiency of our solution, including comparisons to state of the art tools for big data processing such as Pig and Hive.

AB - Data cubes allow users to discover insights from their data and are commonly used in data analysis. While very useful, the data cube is expensive to compute, in particular when the input relation is very large. To address this problem, we consider cube computation in MapReduce, the popular paradigm for distributed big data processing, and present an efficient algorithm for computing cubes over large data sets. We show that our new algorithm consistently performs better than the previous solutions. In particular, existing techniques for cube computation in MapReduce suffer from sensitivity to the distribution of the input data and their performance heavily depends on whether or not, and how exactly, the data is skewed. In contrast, the cube algorithm that we present here is resilient and significantly outperforms previous solutions for varying data distributions. At the core of our solution is a dedicated data structure called the Skews and Partitions Sketch (SP-Sketch for short). The SPSketch is compact in size and fast to compute, and records all needed information for identifying skews and effectively partitioning the workload between the machines. Our algorithm uses the sketch to speed up computation and minimize communication overhead. Our theoretical analysis and thorough experimental study demonstrate the feasibility and efficiency of our solution, including comparisons to state of the art tools for big data processing such as Pig and Hive.

UR - http://www.scopus.com/inward/record.url?scp=84979658415&partnerID=8YFLogxK

U2 - 10.1145/2882903.2882922

DO - 10.1145/2882903.2882922

M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???

AN - SCOPUS:84979658415

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 1151

EP - 1165

BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data

PB - Association for Computing Machinery

T2 - 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016

Y2 - 26 June 2016 through 1 July 2016

ER -