DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
In the age of big data, deep learning has emerged as a powerful tool to extract insight from data and exploit its value, both in industry and in scientific applications. One pattern common to such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with the training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the growing popularity of distributed data-parallel training, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers caused by the fact that only a single process performs the checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations: it introduces efficient asynchronous techniques that hide the overhead of serialization and I/O and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvements over the state of the art, both in checkpointing duration and in runtime overhead.
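To make the idea concrete, the following is a minimal sketch of asynchronous, sharded checkpointing for data-parallel training, assuming PyTorch and `torch.distributed`. It is illustrative only and not the DeepFreeze API: the function name `async_checkpoint`, the round-robin sharding scheme, and the checkpoint directory are assumptions made for the example.

```python
# Illustrative sketch (not the paper's implementation): each data-parallel rank
# serializes only its shard of the model in a background thread, so training can
# continue while the checkpoint is written to storage.
import os
import threading
import torch
import torch.distributed as dist

def _shard_state(state_dict, rank, world_size):
    """Assign each tensor to exactly one rank (round-robin over sorted keys)."""
    items = sorted(state_dict.items())
    return {k: v for i, (k, v) in enumerate(items) if i % world_size == rank}

def async_checkpoint(model, step, ckpt_dir="/tmp/ckpt"):
    rank = dist.get_rank() if dist.is_initialized() else 0
    world = dist.get_world_size() if dist.is_initialized() else 1

    # Copy this rank's shard to host memory first, so training may keep
    # mutating the live parameters while the background thread serializes.
    shard = {k: v.detach().cpu().clone()
             for k, v in _shard_state(model.state_dict(), rank, world).items()}

    def _write():
        os.makedirs(ckpt_dir, exist_ok=True)
        torch.save(shard, os.path.join(ckpt_dir, f"step{step}-rank{rank}.pt"))

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # caller may join() before the next checkpoint to bound memory use
```

In a training loop, each rank would call `async_checkpoint(model, step)` every few iterations and join the previous writer before starting the next one; on restore, every rank loads all shard files and merges them back into a full state dictionary.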
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: USDOE Exascale Computing Project (ECP)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1770321
- Country of Publication: United States
- Language: English
References
- DataSpaces: an interaction and coordination framework for coupled simulation workflows (journal, February 2011)
- Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System (report, April 2010)
- ImageNet: A large-scale hierarchical image database (conference, June 2009)
- Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems (conference, September 2018)
- Parallel I/O Optimizations for Scalable Deep Learning (conference, December 2017)
- Polynomial Codes Over Certain Finite Fields (journal, June 1960)
- Scaling Deep Learning for Cancer with Advanced Workflow Storage Integration (conference, November 2018)
- Efficient User-Level Storage Disaggregation for Deep Learning (conference, September 2019)
- FTI: high performance fault tolerance interface for hybrid systems (conference, January 2011)
- Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training (conference, November 2019)
- DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression (conference, June 2019)
- Deep Residual Learning for Image Recognition (conference, June 2016)
- VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale (conference, May 2019)
- AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing (conference, October 2018)
- CANDLE/Supervisor: a workflow framework for machine learning applied to cancer research (journal, December 2018)
- Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O (conference, September 2012)
- Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal (conference, May 2013)
- Optimizing I/O forwarding techniques for extreme-scale event tracing (journal, June 2013)
- DeepHyper: Asynchronous Hyperparameter Search for Deep Neural Networks (conference, December 2018)
- Caffe: Convolutional Architecture for Fast Feature Embedding (conference, January 2014)
Similar Records
- Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library
- Accelerating Flash-X Simulations with Asynchronous I/O