
Towards Low-Overhead Resilience for Data Parallel Deep Learning

Conference

Data parallel techniques have been widely adopted in both academia and industry as a tool to enable scalable training of deep learning (DL) models. At scale, DL training jobs can fail due to software or hardware bugs, may need to be preempted or terminated due to unexpected events, or may perform suboptimally because they were misconfigured. Under such circumstances, there is a need to recover and/or reconfigure data-parallel DL training jobs on the fly while minimizing both the impact on the accuracy of the DNN model and the runtime overhead. In this regard, state-of-the-art techniques adopted by the HPC community mostly rely on checkpoint-restart, which inevitably leads to loss of progress and thus increases the runtime overhead. In this paper, we explore alternative techniques that exploit a property of modern deep learning frameworks (the overlapping of gradient averaging and weight updates with local gradient computations through pipeline parallelism) to reduce the overhead of resilience/elasticity. To this end, we introduce a failure simulation framework and two resilience strategies (immediate mini-batch rollback and lossy forward recovery), which we compare against checkpoint-restart approaches in a variety of settings in order to understand the trade-offs between the accuracy loss of the DNN model and the runtime overhead.
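The abstract names the two recovery strategies but does not spell out their mechanics here. The sketch below is a minimal, self-contained illustration of the underlying idea in plain PyTorch, with failures simulated on toy in-process "workers" rather than real distributed ranks. The toy model, the backprop/average helpers, and the failure probability are assumptions made for illustration only; this is not the paper's actual framework or API.

import random
import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

WORLD = 4                                        # number of simulated workers
replicas = [nn.Linear(8, 1) for _ in range(WORLD)]
for r in replicas[1:]:                           # data parallelism requires identical replicas
    r.load_state_dict(replicas[0].state_dict())
opt = torch.optim.SGD(replicas[0].parameters(), lr=0.1)

def backprop(rank, batch, target):
    """Local gradient computation on one worker's data shard."""
    m = replicas[rank]
    m.zero_grad()
    nn.functional.mse_loss(m(batch[rank]), target[rank]).backward()
    return [p.grad.clone() for p in m.parameters()]

def average(grad_lists):
    """Stand-in for the gradient-averaging allreduce."""
    return [torch.stack(gs).mean(dim=0) for gs in zip(*grad_lists)]

def train_step(batch, target, strategy):
    grads = [backprop(r, batch, target) for r in range(WORLD)]
    # Simulated failure: one worker dies before contributing to the allreduce.
    # The 30% rate is an arbitrary assumption for demonstration.
    if random.random() < 0.3:
        survivors = range(WORLD - 1)
        if strategy == "rollback":
            # Immediate mini-batch rollback: discard the partial iteration and
            # redo the whole mini-batch on the surviving workers only.
            grads = [backprop(r, batch, target) for r in survivors]
        else:
            # Lossy forward recovery: accept the incomplete gradient average
            # (missing the failed worker's contribution) and keep going.
            grads = [grads[r] for r in survivors]
    for p, g in zip(replicas[0].parameters(), average(grads)):
        p.grad = g
    opt.step()
    for r in replicas[1:]:                       # re-sync weights across replicas
        r.load_state_dict(replicas[0].state_dict())

batch = torch.randn(WORLD, 16, 8)                # one shard of 16 samples per worker
target = torch.randn(WORLD, 16, 1)
for step in range(10):
    train_step(batch, target, strategy="lossy_forward")

Even in this toy setting, the trade-off the paper studies is visible: rollback repeats work on the surviving workers (runtime overhead), while lossy forward recovery drops the failed worker's gradient contribution (potential accuracy loss) but loses no time.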

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1887187
Resource Relation:
Conference: 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 16-19, 2022, Messina, Italy
Country of Publication:
United States
Language:
English
