Towards Low-Overhead Resilience for Data Parallel Deep Learning
Data-parallel techniques have been widely adopted in both academia and industry as a tool to enable scalable training of deep learning (DL) models. At scale, DL training jobs can fail due to software or hardware bugs, may need to be preempted or terminated due to unexpected events, or may perform suboptimally because they were misconfigured. Under such circumstances, there is a need to recover and/or reconfigure data-parallel DL training jobs on the fly while minimizing both the impact on the accuracy of the DNN model and the runtime overhead. In this regard, state-of-the-art techniques adopted by the HPC community mostly rely on checkpoint-restart, which inevitably leads to loss of progress and thus increases the runtime overhead. In this paper, we explore alternative techniques that exploit a property of modern deep learning frameworks, namely the overlapping of gradient averaging and weight updates with local gradient computations through pipeline parallelism, to reduce the overhead of resilience and elasticity. To this end, we introduce a failure simulation framework and two resilience strategies (immediate mini-batch rollback and lossy forward recovery), which we compare against checkpoint-restart approaches in a variety of settings in order to understand the trade-offs between the accuracy loss of the DNN model and the runtime overhead.
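To make the trade-off between the two strategies concrete, the following is a minimal sketch, not the paper's implementation: a toy data-parallel SGD loop on a least-squares problem in which workers fail at random each step. Everything here (the problem setup, the `train` and `local_grad` helpers, the failure probability) is an illustrative assumption; it shows only the recovery semantics the abstract names.

```python
# Hypothetical sketch of the two recovery strategies on a toy
# data-parallel SGD loop; not the authors' framework.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least-squares regression, data sharded across workers.
n_workers, dim, shard = 4, 8, 64
X = [rng.normal(size=(shard, dim)) for _ in range(n_workers)]
w_true = rng.normal(size=dim)
y = [x @ w_true + 0.01 * rng.normal(size=shard) for x in X]

def local_grad(k, w):
    """Gradient of the local least-squares loss on worker k's shard."""
    return X[k].T @ (X[k] @ w - y[k]) / shard

def train(strategy, p_fail=0.1, steps=200, lr=0.05):
    w = np.zeros(dim)
    for _ in range(steps):
        # Each worker independently survives this step or fails.
        alive = rng.random(n_workers) >= p_fail
        if alive.all() or strategy == "rollback":
            # Immediate mini-batch rollback: on failure, discard the
            # step and re-execute it in full, so the update is exactly
            # the one a failure-free run would compute. The cost (not
            # modeled here) is the re-execution time.
            workers = range(n_workers)
        else:
            # Lossy forward recovery: average only the survivors'
            # gradients and keep going. No progress is lost, but the
            # update comes from a smaller effective mini-batch.
            workers = [k for k in range(n_workers) if alive[k]]
            if not workers:          # everyone failed: must roll back
                workers = range(n_workers)
        grads = [local_grad(k, w) for k in workers]
        w -= lr * np.mean(grads, axis=0)
    return np.linalg.norm(w - w_true)

for s in ("rollback", "lossy"):
    print(f"{s:8s} final parameter error: {train(s):.4f}")
```

The sketch makes the dichotomy visible: rollback reproduces the exact failure-free update at the price of re-executed work (runtime overhead), while lossy forward recovery keeps wall-clock progress but perturbs the gradient estimate (potential accuracy impact), which is precisely the trade-off the paper studies against checkpoint-restart.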
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1887187
- Country of Publication: United States
- Language: English
Similar Records
DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
Conference · 2020 · OSTI ID: 1770321
Exploring the feasibility of lossy compression for PDE simulations
Journal Article · 2018 · International Journal of High Performance Computing Applications · OSTI ID: 1425688