U.S. Department of Energy
Office of Scientific and Technical Information

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Conference

In the age of big data, deep learning has emerged as a powerful tool to extract insight from data and exploit its value, in both industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, which is needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the popularity of distributed data-parallel training approaches, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers caused by only a single process being involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations, introducing efficient asynchronous techniques that hide the overhead of serialization and I/O and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
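
The two ideas named in the abstract, hiding serialization and I/O behind training and spreading the checkpoint load over all processes, can be illustrated with a minimal Python sketch. This is not the DeepFreeze implementation, which this record does not describe in detail. The names `AsyncCheckpointer` and `shard_for_rank`, the use of `pickle`, and the round-robin sharding policy are all assumptions made for illustration, and a production system would likely avoid Python threads for the serialization step.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: copy state in the foreground, then
    serialize and write it from a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self._worker = None

    def save(self, state, step):
        # Wait for the previous checkpoint so at most one write is in flight.
        self.wait()
        # Foreground cost: an in-memory copy, so training may mutate `state` afterwards.
        snapshot = copy.deepcopy(state)
        path = os.path.join(self.directory, f"ckpt_{step:06d}.pkl")
        self._worker = threading.Thread(target=self._write, args=(snapshot, path))
        self._worker.start()
        return path

    @staticmethod
    def _write(snapshot, path):
        # Background cost: serialization and file I/O.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename into place so readers never observe a partially written file.
        os.replace(tmp, path)

    def wait(self):
        # Block until the in-flight checkpoint (if any) has been written.
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def shard_for_rank(keys, rank, world_size):
    """Round-robin assignment of tensor names to ranks (an assumed, simple policy)."""
    return [k for i, k in enumerate(sorted(keys)) if i % world_size == rank]


if __name__ == "__main__":
    state = {"layer1": [0.0, 1.0], "layer2": [2.0], "layer3": [3.0]}
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(d)
        path = ckpt.save(state, step=1)
        state["layer1"][0] = 99.0  # training continues while the write proceeds
        ckpt.wait()
        with open(path, "rb") as f:
            restored = pickle.load(f)
        print(restored["layer1"][0])  # the snapshot keeps the value from save time
```

In a data-parallel run where every process holds a replica of the model, each rank could save only the entries returned by `shard_for_rank`, so that no single process becomes the straggler that writes the whole checkpoint.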

Research Organization:
Argonne National Laboratory (ANL)
Sponsoring Organization:
USDOE Exascale Computing Project (ECP)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1770321
Country of Publication:
United States
Language:
English
