OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers

Abstract

Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and high-speed interconnect networks are needed to support DL on HPC systems. However, using many GPUs without weighing the actual benefit leads to inefficient utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and assess the viability of next-generation DL-optimized heterogeneous supercomputers, enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster shows performance trends that differ from those reported in the existing literature for single- and multi-node training. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.
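For concreteness, below is a minimal, hypothetical sketch of the kind of single-node, multi-GPU data-parallel TensorFlow training the paper benchmarks. It uses tf.distribute.MirroredStrategy with a synthetic ImageNet-shaped dataset and ResNet-50 as a stand-in for the "well-known DL models"; the paper's actual models, datasets, and training harness are not reproduced here.

# Hypothetical sketch of single-node, multi-GPU data-parallel training in
# TensorFlow; not the authors' actual benchmark harness.
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# all-reduces gradients across them (the GPU-GPU traffic the paper measures).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Synthetic ImageNet-shaped data; a real run would stream from shared
# storage, which is where the storage-resource effects come in.
images = tf.random.uniform([256, 224, 224, 3])
labels = tf.random.uniform([256], maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(
    64 * strategy.num_replicas_in_sync)  # scale global batch with GPU count

with strategy.scope():
    # ResNet-50 stands in for the evaluated models; weights=None trains
    # from scratch.
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

model.fit(dataset, epochs=1)

Scaling the global batch size with the number of replicas, as above, is the common data-parallel convention; it keeps the per-GPU batch size fixed as GPUs are added, which is how per-GPU utilization and scaling efficiency are typically compared.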

Authors:
Han, Jingoo [1]; Xu, Luna [1]; Rafique, Mustafa [2]; Butt, Ali R. [1]; Lim, Seung-Hwan [3]
  1. Virginia Tech, Blacksburg, VA
  2. Rochester Institute of Technology, Rochester, NY
  3. Oak Ridge National Laboratory (ORNL), Oak Ridge, TN
Publication Date:
September 2019
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1569375
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (IEEE Cluster 2019), Albuquerque, New Mexico, United States of America, September 23-26, 2019
Country of Publication:
United States
Language:
English

Citation Formats

Han, Jingoo, Xu, Luna, Rafique, Mustafa, Butt, Ali R., and Lim, Seung-Hwan. A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers. United States: N. p., 2019. Web.
Han, Jingoo, Xu, Luna, Rafique, Mustafa, Butt, Ali R., & Lim, Seung-Hwan. A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers. United States.
Han, Jingoo, Xu, Luna, Rafique, Mustafa, Butt, Ali R., and Lim, Seung-Hwan. 2019. "A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers". United States. https://www.osti.gov/servlets/purl/1569375.
@article{osti_1569375,
title = {A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers},
author = {Han, Jingoo and Xu, Luna and Rafique, Mustafa and Butt, Ali R. and Lim, Seung-Hwan},
abstractNote = {Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and high-speed interconnect networks are needed to support DL on HPC systems. However, using many GPUs without weighing the actual benefit leads to inefficient utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and assess the viability of next-generation DL-optimized heterogeneous supercomputers, enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster shows performance trends that differ from those reported in the existing literature for single- and multi-node training. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.},
place = {United States},
year = {2019},
month = {9}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
