TY - GEN
T1 - Towards Fast Crash-Consistent Cluster Checkpointing
AU - Wood, Andrew
AU - Hershcovitch, Moshik
AU - Ennmouri, Ilias
AU - Zong, Weiyu
AU - Chennuri, Saurav
AU - Cohen, Sarel
AU - Sundararaman, Swaminathan
AU - Waddington, Daniel
AU - Chin, Peter
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Machine Learning models are expensive to train: they require expensive high-compute hardware and have long training times. Therefore, models are extra sensitive to program faults or unexpected system crashes, which can erase hours if not days worth of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is called checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is an effective strategy, however, it requires carefully balancing two operations: how often a checkpoint is made (the checkpointing schedule), and the cost of creating a checkpoint itself. In this paper, we leverage Python Memory Manager (PyMM), which provides Python support for Persistent Memory and emerging Persistent Memory technology (Optane DC) to accelerate the checkpointing operation while maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach with the KMeans and Gaussian Mixture Model algorithms on two real-world datasets, MNIST and MusicNet. Through evaluation, we show that these two algorithms achieve a checkpointing speedup of a factor between 10 and 75x for KMeans and over 3x for GMM against the current state-of-the-art checkpointing approaches. We also verify that our solution recovers from crashes, while traditional approaches cannot.
AB - Machine Learning models are expensive to train: they require expensive high-compute hardware and have long training times. Therefore, models are extra sensitive to program faults or unexpected system crashes, which can erase hours if not days worth of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is called checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is an effective strategy, however, it requires carefully balancing two operations: how often a checkpoint is made (the checkpointing schedule), and the cost of creating a checkpoint itself. In this paper, we leverage Python Memory Manager (PyMM), which provides Python support for Persistent Memory and emerging Persistent Memory technology (Optane DC) to accelerate the checkpointing operation while maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach with the KMeans and Gaussian Mixture Model algorithms on two real-world datasets, MNIST and MusicNet. Through evaluation, we show that these two algorithms achieve a checkpointing speedup of a factor between 10 and 75x for KMeans and over 3x for GMM against the current state-of-the-art checkpointing approaches. We also verify that our solution recovers from crashes, while traditional approaches cannot.
UR - http://www.scopus.com/inward/record.url?scp=85142291811&partnerID=8YFLogxK
U2 - 10.1109/HPEC55821.2022.9926330
DO - 10.1109/HPEC55821.2022.9926330
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:85142291811
T3 - 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022
BT - 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022
Y2 - 19 September 2022 through 23 September 2022
ER -