OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: What does fault tolerant Deep Learning need from MPI?

Abstract

Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large scale data analysis. DL algorithms are computationally expensive -- even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications thus become susceptible to faults, requiring development of a fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification through an in-depth consideration of recent innovations in DL algorithms and the properties that drive the need for specific fault tolerance features. We present an in-depth discussion of the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of any critical data structures; and, most importantly, several fault tolerance proposals in MPI (user-level fault mitigation (ULFM), Reinit) and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe with a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using Open MPI-based ULFM.

Authors:
Amatya, Vinay C.; Vishnu, Abhinav; Siegel, Charles M.; Daily, Jeffrey A.
Publication Date:
September 25, 2017
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1415701
Report Number(s):
PNNL-SA-127971
KJ0402000
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 24th European MPI Users' Group Meeting, September 25-28, 2017, Chicago, Illinois, Paper No. 13
Country of Publication:
United States
Language:
English

Citation Formats

Amatya, Vinay C., Vishnu, Abhinav, Siegel, Charles M., and Daily, Jeffrey A. What does fault tolerant Deep Learning need from MPI? United States: N. p., 2017. Web. doi:10.1145/3127024.3127037.
Amatya, Vinay C., Vishnu, Abhinav, Siegel, Charles M., & Daily, Jeffrey A. What does fault tolerant Deep Learning need from MPI? United States. doi:10.1145/3127024.3127037.
Amatya, Vinay C., Vishnu, Abhinav, Siegel, Charles M., and Daily, Jeffrey A. 2017. "What does fault tolerant Deep Learning need from MPI?" United States. doi:10.1145/3127024.3127037.
@inproceedings{osti_1415701,
title = {What does fault tolerant Deep Learning need from MPI?},
author = {Amatya, Vinay C. and Vishnu, Abhinav and Siegel, Charles M. and Daily, Jeffrey A.},
abstractNote = {Deep Learning (DL) algorithms have become the {\em de facto} Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive -- even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults -- requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: {\em What is needed from MPI for designing fault tolerant DL implementations?} In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM.},
doi = {10.1145/3127024.3127037},
booktitle = {Proceedings of the 24th European MPI Users' Group Meeting},
place = {United States},
year = {2017},
month = sep
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
