DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Nicolae, Bogdan; Li, Jiali; Wozniak, Justin M.; Bosilca, George; Dorier, Matthieu; Cappello, Franck

doi:10.1109/CCGrid49817.2020.00-76

Conference · Tue Dec 31 23:00:00 EST 2019

DOI: https://doi.org/10.1109/CCGrid49817.2020.00-76 · OSTI ID: 1770321

In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the popularity of distributed data-parallel training approaches, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers caused by only a single process being involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O and to distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
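
The general idea of asynchronous checkpointing described in the abstract can be illustrated with a minimal Python sketch. This is not the DeepFreeze implementation, which this record does not describe in detail. The class `AsyncCheckpointer`, the helper `shard_owner`, the round-robin sharding rule, and the use of `pickle` are all assumptions made for illustration. The sketch blocks training only for an in-memory copy of the state, then serializes and writes in a background thread; `shard_owner` shows one hypothetical way to spread tensors over all processes instead of a single writer.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: copy the state in memory, then serialize and
    write it from a background thread so training is not blocked on I/O."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = None  # background writer thread, if one is in flight

    def save(self, state, step):
        # Allow at most one checkpoint in flight at a time.
        self.wait()
        # The only blocking part: a deep copy, so the caller may mutate
        # `state` as soon as save() returns.
        snapshot = copy.deepcopy(state)
        path = os.path.join(self.directory, f"ckpt_{step}.pkl")
        self._pending = threading.Thread(target=self._write, args=(snapshot, path))
        self._pending.start()
        return path

    @staticmethod
    def _write(snapshot, path):
        # Serialization and file I/O happen off the training thread.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)  # rename so readers never see a partial file

    def wait(self):
        # Block until the in-flight checkpoint (if any) is fully written.
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def shard_owner(tensor_index, num_ranks):
    """Assumed round-robin rule: which rank writes which tensor, so that
    every process in a data-parallel job carries part of the load."""
    return tensor_index % num_ranks


def demo():
    """Save, mutate the live state, then read back the checkpoint."""
    with tempfile.TemporaryDirectory() as directory:
        ckpt = AsyncCheckpointer(directory)
        state = {"weights": [1.0, 2.0], "step": 0}
        path = ckpt.save(state, step=0)
        state["weights"][0] = 99.0  # training continues during the write
        ckpt.wait()
        with open(path, "rb") as f:
            return pickle.load(f)
```

Calling `demo()` returns the state as it was when `save` was called, even though the live state is modified while the write is in flight. In CPython the pickling step still contends for the interpreter lock, so the sketch shows the structure of the approach rather than its performance.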

Research Organization: Argonne National Laboratory (ANL)

Sponsoring Organization: USDOE Exascale Computing Project (ECP)

DOE Contract Number: AC02-06CH11357

OSTI ID: 1770321

Country of Publication: United States

Language: English

Related Subjects

checkpointing
deep learning
fine-grain asynchronous I/O
multi-level data persistence
state preservation
