OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers

Conference

Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and a high-speed interconnect are needed to support DL on HPC systems. However, provisioning GPUs without accounting for their effective benefit leads to inefficient utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and assess the viability of next-generation DL-optimized heterogeneous supercomputers, enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster exhibits performance trends for single- and multi-node training that differ from those reported in the existing literature. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.
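The record does not include the benchmark code itself. As a minimal sketch of the single-node, data-parallel TensorFlow pattern whose scaling behavior such a study measures, the example below runs one model replica per visible GPU with tf.distribute.MirroredStrategy and scales the global batch with the replica count. The ResNet-50 model, per-GPU batch size, and synthetic ImageNet-style input pipeline are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' benchmark code): single-node, multi-GPU
# data-parallel training in TensorFlow. Model, batch size, and the synthetic
# dataset are placeholders standing in for the paper's real workloads.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch with the number of GPUs, as in typical weak scaling.
per_gpu_batch = 64
global_batch = per_gpu_batch * strategy.num_replicas_in_sync

# Synthetic stand-in for an ImageNet-style input pipeline (224x224 RGB, 1000 classes).
images = tf.random.uniform([256, 224, 224, 3])
labels = tf.random.uniform([256], maxval=1000, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .batch(global_batch)
           .prefetch(tf.data.AUTOTUNE))

# Variables created under the strategy scope are mirrored across the GPUs;
# gradients are all-reduced across replicas each step.
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(dataset, epochs=1)
```

For multi-node training of the kind the abstract also evaluates, the analogous pattern would swap in tf.distribute.MultiWorkerMirroredStrategy (or a library such as Horovod), with the same per-replica batch scaling.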

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1569375
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (IEEE Cluster 2019), Albuquerque, New Mexico, USA, September 23-26, 2019
Country of Publication:
United States
Language:
English

