OSTI.GOV | U.S. Department of Energy
Office of Scientific and Technical Information

Title: Towards Native Execution of Deep Learning on a Leadership-Class HPC System

Abstract

Large parallel machines generally offer the best parallel performance with "native execution" that is achieved using codes developed with the optimized compilers, communication libraries, and runtimes offered on the machines. In this paper, we report and analyze performance results from native execution of deep learning on a leadership-class high-performance computing (HPC) system. Using our new code called DeepEx, we present a study of the parallel speed up and convergence rates of learning achieved with native parallel execution. In the trade-off between computational parallelism and synchronized convergence, we first focus on maximizing parallelism while still obtaining convergence. Scaling results are reported from execution on up to 15,000 GPUs using two scientific data sets from atom microscopy and protein folding applications, and also using the popular ImageNet data set. In terms of the traditional measure of parallel speed up, excellent scaling is observed up to 12,000 GPUs. Additionally, accounting for convergence rates of deep learning accuracy or error, a deep learning-specific metric called "learning speed up" is also tracked. The performance results indicate the need to evaluate parallel deep learning execution in terms of learning speed up, and point to additional directions for improved exploitation of high-end HPC systems.
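For clarity, the two scaling metrics discussed in the abstract can be sketched as follows; the formalization and notation (N, T, epsilon) are illustrative choices made here, not necessarily the paper's exact definitions:

\[ S_{\mathrm{parallel}}(N) = \frac{T_{\mathrm{step}}(1)}{T_{\mathrm{step}}(N)}, \qquad S_{\mathrm{learning}}(N, \epsilon) = \frac{T_{1}(\epsilon)}{T_{N}(\epsilon)} \]

Here T_step(N) is the wall-clock time per training step (or epoch) on N GPUs, and T_N(epsilon) is the wall-clock time for an N-GPU run to first reach a target accuracy or error level epsilon. Traditional parallel speed up measures only the first quantity, whereas learning speed up also accounts for convergence: if larger aggregate batch sizes slow convergence per sample, learning speed up can lag well behind parallel speed up even when per-step scaling is excellent, which reflects the trade-off between computational parallelism and synchronized convergence noted above.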

Authors:
Yoginath, Srikanth B. [1]; Alam, Maksudul [1]; Ramanathan, Arvind [2]; Bhowmik, Debsindhu [1]; Laanait, Nouamane [1]; Perumalla, Kalyan R S [1]
  1. Oak Ridge National Laboratory (ORNL)
  2. Argonne National Laboratory (ANL)
Publication Date:
May 2019
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1550753
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IPDPS 2019 Workshop on Scalable Deep Learning on Parallel and Distributed Infrastructures, Rio de Janeiro, Brazil, 20-24 May 2019
Country of Publication:
United States
Language:
English

Citation Formats

Yoginath, Srikanth B., Alam, Maksudul, Ramanathan, Arvind, Bhowmik, Debsindhu, Laanait, Nouamane, and Perumalla, Kalyan R S. Towards Native Execution of Deep Learning on a Leadership-Class HPC System. United States: N. p., 2019. Web. doi:10.1109/IPDPSW.2019.00160.
Yoginath, Srikanth B., Alam, Maksudul, Ramanathan, Arvind, Bhowmik, Debsindhu, Laanait, Nouamane, & Perumalla, Kalyan R S. Towards Native Execution of Deep Learning on a Leadership-Class HPC System. United States. doi:10.1109/IPDPSW.2019.00160.
Yoginath, Srikanth B., Alam, Maksudul, Ramanathan, Arvind, Bhowmik, Debsindhu, Laanait, Nouamane, and Perumalla, Kalyan R S. 2019. "Towards Native Execution of Deep Learning on a Leadership-Class HPC System". United States. doi:10.1109/IPDPSW.2019.00160. https://www.osti.gov/servlets/purl/1550753.
@inproceedings{osti_1550753,
title = {Towards Native Execution of Deep Learning on a Leadership-Class HPC System},
author = {Yoginath, Srikanth B. and Alam, Maksudul and Ramanathan, Arvind and Bhowmik, Debsindhu and Laanait, Nouamane and Perumalla, Kalyan R S},
abstractNote = {Large parallel machines generally offer the best parallel performance with "native execution" that is achieved using codes developed with the optimized compilers, communication libraries, and runtimes offered on the machines. In this paper, we report and analyze performance results from native execution of deep learning on a leadership-class high-performance computing (HPC) system. Using our new code called DeepEx, we present a study of the parallel speed up and convergence rates of learning achieved with native parallel execution. In the trade-off between computational parallelism and synchronized convergence, we first focus on maximizing parallelism while still obtaining convergence. Scaling results are reported from execution on up to 15,000 GPUs using two scientific data sets from atom microscopy and protein folding applications, and also using the popular ImageNet data set. In terms of the traditional measure of parallel speed up, excellent scaling is observed up to 12,000 GPUs. Additionally, accounting for convergence rates of deep learning accuracy or error, a deep learning-specific metric called "learning speed up" is also tracked. The performance results indicate the need to evaluate parallel deep learning execution in terms of learning speed up, and point to additional directions for improved exploitation of high-end HPC systems.},
doi = {10.1109/IPDPSW.2019.00160},
booktitle = {IPDPS 2019 Workshop on Scalable Deep Learning on Parallel and Distributed Infrastructures},
place = {United States},
year = {2019},
month = {5}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
