TY - GEN
T1 - Compressing Random Forests
AU - Painsky, Amichai
AU - Rosset, Saharon
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/1/31
Y1 - 2017/1/31
AB - Ensemble methods are considered among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. The problem is most acute in a subscriber-based environment, where a user-specific ensemble must be stored on a personal device with strict storage limitations (such as a cellular phone). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on Random Forests. Our method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees and is, at the same time, small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, it enables predictions from the compressed format as well as perfect reconstruction of the original ensemble.
KW - Compression
KW - Entropy coding
KW - Random forest
UR - http://www.scopus.com/inward/record.url?scp=85014515961&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2016.72
DO - 10.1109/ICDM.2016.72
M3 - Conference contribution
AN - SCOPUS:85014515961
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 1131
EP - 1136
BT - Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
A2 - Bonchi, Francesco
A2 - Wu, Xindong
A2 - Baeza-Yates, Ricardo
A2 - Domingo-Ferrer, Josep
A2 - Zhou, Zhi-Hua
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th IEEE International Conference on Data Mining, ICDM 2016
Y2 - 12 December 2016 through 15 December 2016
ER -