
Towards Low-Overhead Resilience for Data Parallel Deep Learning

Conference

Data parallel techniques have been widely adopted in both academia and industry as a tool to enable scalable training of deep learning (DL) models. At scale, DL training jobs can fail due to software or hardware bugs, may need to be preempted or terminated due to unexpected events, or may perform suboptimally because they were misconfigured. Under such circumstances, there is a need to recover and/or reconfigure data-parallel DL training jobs on the fly while minimizing both the impact on the accuracy of the DNN model and the runtime overhead. In this regard, state-of-the-art techniques adopted by the HPC community mostly rely on checkpoint-restart, which inevitably leads to loss of progress and thus increases the runtime overhead. In this paper, we explore alternative techniques that exploit a property of modern deep learning frameworks (the overlapping of gradient averaging and weight updates with local gradient computations through pipeline parallelism) to reduce the overhead of resilience/elasticity. To this end, we introduce a failure simulation framework and two resilience strategies (immediate mini-batch rollback and lossy forward recovery), which we compare against checkpoint-restart approaches in a variety of settings in order to understand the trade-offs between the accuracy loss of the DNN model and the runtime overhead.
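The abstract names the two recovery strategies but does not spell out their mechanics here. The sketch below is a minimal, self-contained illustration of the underlying idea in plain PyTorch, with failures simulated on toy in-process "workers" rather than real distributed ranks. The toy model, the backprop/average helpers, and the failure probability are assumptions made for illustration only; this is not the paper's actual framework or API.

import random
import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

WORLD = 4                                        # number of simulated workers
replicas = [nn.Linear(8, 1) for _ in range(WORLD)]
for r in replicas[1:]:                           # data parallelism requires identical replicas
    r.load_state_dict(replicas[0].state_dict())
opt = torch.optim.SGD(replicas[0].parameters(), lr=0.1)

def backprop(rank, batch, target):
    """Local gradient computation on one worker's data shard."""
    m = replicas[rank]
    m.zero_grad()
    nn.functional.mse_loss(m(batch[rank]), target[rank]).backward()
    return [p.grad.clone() for p in m.parameters()]

def average(grad_lists):
    """Stand-in for the gradient-averaging allreduce."""
    return [torch.stack(gs).mean(dim=0) for gs in zip(*grad_lists)]

def train_step(batch, target, strategy):
    grads = [backprop(r, batch, target) for r in range(WORLD)]
    # Simulated failure: one worker dies before contributing to the allreduce.
    # The 30% rate is an arbitrary assumption for demonstration.
    if random.random() < 0.3:
        survivors = range(WORLD - 1)
        if strategy == "rollback":
            # Immediate mini-batch rollback: discard the partial iteration and
            # redo the whole mini-batch on the surviving workers only.
            grads = [backprop(r, batch, target) for r in survivors]
        else:
            # Lossy forward recovery: accept the incomplete gradient average
            # (missing the failed worker's contribution) and keep going.
            grads = [grads[r] for r in survivors]
    for p, g in zip(replicas[0].parameters(), average(grads)):
        p.grad = g
    opt.step()
    for r in replicas[1:]:                       # re-sync weights across replicas
        r.load_state_dict(replicas[0].state_dict())

batch = torch.randn(WORLD, 16, 8)                # one shard of 16 samples per worker
target = torch.randn(WORLD, 16, 1)
for step in range(10):
    train_step(batch, target, strategy="lossy_forward")

Even in this toy setting, the trade-off the paper studies is visible: rollback repeats work on the surviving workers (runtime overhead), while lossy forward recovery drops the failed worker's gradient contribution (potential accuracy loss) but loses no time.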

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1887187
Resource Relation:
Conference: 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 16-19, 2022, Messina, Italy
Country of Publication:
United States
Language:
English
