TY - GEN
T1 - An efficient MapReduce cube algorithm for varied data distributions
AU - Milo, Tova
AU - Altshuler, Eyal
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/6/26
Y1 - 2016/6/26
N2 - Data cubes allow users to discover insights from their data and are commonly used in data analysis. While very useful, the data cube is expensive to compute, in particular when the input relation is very large. To address this problem, we consider cube computation in MapReduce, the popular paradigm for distributed big data processing, and present an efficient algorithm for computing cubes over large data sets. We show that our new algorithm consistently performs better than the previous solutions. In particular, existing techniques for cube computation in MapReduce suffer from sensitivity to the distribution of the input data and their performance heavily depends on whether or not, and how exactly, the data is skewed. In contrast, the cube algorithm that we present here is resilient and significantly outperforms previous solutions for varying data distributions. At the core of our solution is a dedicated data structure called the Skews and Partitions Sketch (SP-Sketch for short). The SPSketch is compact in size and fast to compute, and records all needed information for identifying skews and effectively partitioning the workload between the machines. Our algorithm uses the sketch to speed up computation and minimize communication overhead. Our theoretical analysis and thorough experimental study demonstrate the feasibility and efficiency of our solution, including comparisons to state of the art tools for big data processing such as Pig and Hive.
AB - Data cubes allow users to discover insights from their data and are commonly used in data analysis. While very useful, the data cube is expensive to compute, in particular when the input relation is very large. To address this problem, we consider cube computation in MapReduce, the popular paradigm for distributed big data processing, and present an efficient algorithm for computing cubes over large data sets. We show that our new algorithm consistently performs better than the previous solutions. In particular, existing techniques for cube computation in MapReduce suffer from sensitivity to the distribution of the input data and their performance heavily depends on whether or not, and how exactly, the data is skewed. In contrast, the cube algorithm that we present here is resilient and significantly outperforms previous solutions for varying data distributions. At the core of our solution is a dedicated data structure called the Skews and Partitions Sketch (SP-Sketch for short). The SPSketch is compact in size and fast to compute, and records all needed information for identifying skews and effectively partitioning the workload between the machines. Our algorithm uses the sketch to speed up computation and minimize communication overhead. Our theoretical analysis and thorough experimental study demonstrate the feasibility and efficiency of our solution, including comparisons to state of the art tools for big data processing such as Pig and Hive.
UR - http://www.scopus.com/inward/record.url?scp=84979658415&partnerID=8YFLogxK
U2 - 10.1145/2882903.2882922
DO - 10.1145/2882903.2882922
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:84979658415
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1151
EP - 1165
BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
Y2 - 26 June 2016 through 1 July 2016
ER -