OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: What does fault tolerant Deep Learning need from MPI?

Abstract

Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications are therefore susceptible to faults, requiring a fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: what is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification through an in-depth consideration of recent innovations in DL algorithms and the properties that drive their specific fault tolerance requirements. We discuss the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of critical data structures; and, most importantly, several fault tolerance proposals for MPI (user-level failure mitigation (ULFM) and Reinit) and their applicability to fault tolerant DL implementations. We leverage a distributed-memory implementation of Caffe, available under the Machine Learning Toolkit for Extreme Scale (MaTEx), and implement our approaches by extending MaTEx-Caffe with a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using Open MPI-based ULFM.
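The recovery pattern the abstract argues for — continuing synchronous data-parallel training on the surviving ranks after a permanent fault, rather than restarting from a checkpoint — can be sketched in plain Python. The sketch below is illustrative only, not MaTEx-Caffe code: the `train` function and its list-based "communicator" are stand-ins for MPI ranks and for the group shrink that ULFM's `MPIX_Comm_shrink` provides in a real MPI program, and the toy quadratic loss is a hypothetical example.

```python
# Illustrative sketch (assumed names, not MaTEx-Caffe): synchronous
# data-parallel SGD in which a permanent rank failure removes one rank
# from the group, and the surviving ranks simply keep training --
# the shrink-and-continue pattern ULFM enables in real MPI.

def allreduce_mean(values):
    """Stand-in for MPI_Allreduce over the live group, then divide by size."""
    return sum(values) / len(values)

def local_gradient(weight, shard):
    """Gradient of the toy loss 0.5*(weight - x)^2 averaged over one shard."""
    return sum(weight - x for x in shard) / len(shard)

def train(shards, weight=0.0, lr=0.5, steps=20, fail_at=5, failed_rank=1):
    alive = list(range(len(shards)))          # communicator membership
    for step in range(steps):
        if step == fail_at:                   # permanent fault on one rank
            alive.remove(failed_rank)         # "shrink": survivors proceed
        grads = [local_gradient(weight, shards[r]) for r in alive]
        weight -= lr * allreduce_mean(grads)  # synchronous SGD update
    return weight, alive

# Three ranks, each with a data shard; rank 1 fails at step 5.
final_weight, survivors = train([[1.0, 2.0], [3.0], [2.0, 1.0]])
```

After the simulated shrink, the update rule still converges, but to the minimizer of the survivors' data only — which is why the paper's discussion centers on which DL data structures (if any) must be checkpointed versus what can be absorbed by the remaining ranks.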

Authors:
Amatya, Vinay C.; Vishnu, Abhinav; Siegel, Charles M.; Daily, Jeffrey A.
Publication Date:
September 2017
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1415701
Report Number(s):
PNNL-SA-127971
KJ0402000
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 24th European MPI Users' Group Meeting, September 25-28, 2017, Chicago, Illinois, Paper No. 13
Country of Publication:
United States
Language:
English

Citation Formats

Amatya, Vinay C., Vishnu, Abhinav, Siegel, Charles M., and Daily, Jeffrey A. What does fault tolerant Deep Learning need from MPI?. United States: N. p., 2017. Web. doi:10.1145/3127024.3127037.
Amatya, Vinay C., Vishnu, Abhinav, Siegel, Charles M., & Daily, Jeffrey A. What does fault tolerant Deep Learning need from MPI?. United States. doi:10.1145/3127024.3127037.
Amatya, Vinay C., Vishnu, Abhinav, Siegel, Charles M., and Daily, Jeffrey A. 2017. "What does fault tolerant Deep Learning need from MPI?". United States. doi:10.1145/3127024.3127037.
@article{osti_1415701,
title = {What does fault tolerant Deep Learning need from MPI?},
author = {Amatya, Vinay C. and Vishnu, Abhinav and Siegel, Charles M. and Daily, Jeffrey A.},
doi = {10.1145/3127024.3127037},
place = {United States},
year = {2017},
month = {9}
}


