OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers

Conference

Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and a high-speed interconnect are needed to support DL on HPC systems. However, provisioning GPUs without accounting for their effective benefit leads to inefficient utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and assess the viability of next-generation DL-optimized heterogeneous supercomputers, enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster exhibits performance trends for single- and multi-node training that differ from those reported in the existing literature. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.
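The record does not include the benchmark code itself. As a minimal sketch of the single-node, data-parallel TensorFlow pattern whose scaling behavior such a study measures, the example below runs one model replica per visible GPU with tf.distribute.MirroredStrategy and scales the global batch with the replica count. The ResNet-50 model, per-GPU batch size, and synthetic ImageNet-style input pipeline are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' benchmark code): single-node, multi-GPU
# data-parallel training in TensorFlow. Model, batch size, and the synthetic
# dataset are placeholders standing in for the paper's real workloads.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch with the number of GPUs, as in typical weak scaling.
per_gpu_batch = 64
global_batch = per_gpu_batch * strategy.num_replicas_in_sync

# Synthetic stand-in for an ImageNet-style input pipeline (224x224 RGB, 1000 classes).
images = tf.random.uniform([256, 224, 224, 3])
labels = tf.random.uniform([256], maxval=1000, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .batch(global_batch)
           .prefetch(tf.data.AUTOTUNE))

# Variables created under the strategy scope are mirrored across the GPUs;
# gradients are all-reduced across replicas each step.
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(dataset, epochs=1)
```

For multi-node training of the kind the abstract also evaluates, the analogous pattern would swap in tf.distribute.MultiWorkerMirroredStrategy (or a library such as Horovod), with the same per-replica batch scaling.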

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1569375
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (IEEE Cluster 2019), Albuquerque, New Mexico, USA, September 23-26, 2019
Country of Publication:
United States
Language:
English

