Towards Fast Crash-Consistent Cluster Checkpointing

Andrew Wood, Moshik Hershcovitch, Ilias Ennmouri, Weiyu Zong, Saurav Chennuri, Sarel Cohen, Swaminathan Sundararaman, Daniel Waddington, Peter Chin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Machine learning models are expensive to train: they require costly high-compute hardware and have long training times. Training runs are therefore especially vulnerable to program faults and unexpected system crashes, which can erase hours, if not days, of work. While there are many strategies for mitigating the risk of unexpected system downtime, the most popular in machine learning is checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is effective, but it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating each checkpoint. In this paper, we leverage the Python Memory Manager (PyMM), which provides Python support for Persistent Memory, together with emerging Persistent Memory technology (Optane DC), to accelerate the checkpointing operation while maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach with the KMeans and Gaussian Mixture Model (GMM) algorithms on two real-world datasets, MNIST and MusicNet. Through evaluation, we show that these two algorithms achieve checkpointing speedups of between 10x and 75x for KMeans and over 3x for GMM compared with current state-of-the-art checkpointing approaches. We also verify that our solution recovers from crashes, while traditional approaches cannot.
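As a rough illustration of the checkpointing pattern the abstract describes (periodic saves on a schedule, plus recovery after a crash), the sketch below uses ordinary file storage and the Python standard library. It does not use PyMM or persistent memory and is not the authors' method; the function names `save_checkpoint`, `load_checkpoint`, and `train`, and the toy running-sum "model", are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state so that a crash leaves either the old or the new file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # push the bytes to stable storage
        os.replace(tmp_path, path)  # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # never leave a partial file behind
        raise
    # Assumption: a stricter version would also fsync the directory entry.


def load_checkpoint(path, default):
    """Return the last complete checkpoint, or default if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def train(path, total_steps, checkpoint_every, crash_at=None):
    """Toy iterative job that resumes from the last checkpoint."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]  # stand-in for one training iteration
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:  # the checkpointing schedule
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

Here `checkpoint_every` plays the role of the checkpointing schedule, and the body of `save_checkpoint` is the per-checkpoint cost. A smaller interval loses less work on a crash but pays the serialization and `fsync` cost more often, which is the trade-off the abstract refers to.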

Original language: English
Title of host publication: 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665497862
State: Published - 2022
Externally published: Yes
Event: 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022 - Virtual, Online, United States
Duration: 19 Sep 2022 to 23 Sep 2022

Publication series

Name: 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022


Conference: 2022 IEEE High Performance Extreme Computing Conference, HPEC 2022
Country/Territory: United States
City: Virtual, Online

