Title: MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems

Conference

Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the multi-dimensional, large-scale data associated with DL, especially atop the multiple graphics processing units (GPUs) in modern supercomputers, poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures exhibit different training-throughput trends than those reported in existing studies. Existing DL optimizations such as larger batch sizes and GPU locality-aware scheduling do little to improve DL training throughput because of fast CPU-to-GPU connections. Additionally, DL training on multiple GPUs scales sublinearly, so simply adding more GPUs to a system is ineffective. To this end, we design MARBLE, a first-of-its-kind job scheduler that considers the non-linear scalability of GPUs at the intra-node level to schedule an appropriate number of GPUs per node for a job. By sharing the GPU resources on a node among multiple DL jobs, MARBLE avoids the low GPU utilization of current multi-GPU DL training on HPC systems. Our comprehensive evaluation on the Summit supercomputer shows that MARBLE improves DL training performance by up to 48.3% compared to the popular Platform Load Sharing Facility (LSF) scheduler. Compared to the state-of-the-art DL scheduler, Optimus, MARBLE reduces job completion time by up to 47%.
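
The abstract does not describe MARBLE's actual algorithm, so the following is only an illustrative sketch of the general idea it names: when throughput scales sublinearly, give each job only as many GPUs as it can use efficiently, then let several jobs share one node. Everything in it is an assumption made for illustration: the function names `pick_gpu_count` and `pack_jobs`, the 0.8 efficiency threshold, the first-fit packing rule, and the throughput numbers are all hypothetical. The node size of 6 GPUs matches a Summit compute node but is not stated in this record.

```python
def pick_gpu_count(throughput_by_gpus, min_efficiency=0.8):
    """Return the largest GPU count whose scaling efficiency is >= min_efficiency.

    throughput_by_gpus maps a GPU count to a measured training throughput
    (for example images/s) and must contain an entry for 1 GPU.
    Scaling efficiency is throughput(n) / (n * throughput(1)).
    """
    base = throughput_by_gpus[1]
    best = 1
    for n in sorted(throughput_by_gpus):
        efficiency = throughput_by_gpus[n] / (n * base)
        if efficiency >= min_efficiency:
            best = n
    return best


def pack_jobs(gpu_requests, gpus_per_node):
    """First-fit packing of per-job GPU requests so that jobs can share a node.

    Returns one list per node, each holding (job_name, gpus) tuples.
    """
    nodes = []  # jobs placed on each node
    free = []   # GPUs still unallocated on each node
    for job, need in gpu_requests.items():
        for i, avail in enumerate(free):
            if need <= avail:
                nodes[i].append((job, need))
                free[i] -= need
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([(job, need)])
            free.append(gpus_per_node - need)
    return nodes


# Hypothetical throughput profiles (GPU count -> throughput), all sublinear.
profiles = {
    "job_a": {1: 100, 2: 190, 3: 260, 4: 300, 6: 330},
    "job_b": {1: 100, 2: 150, 4: 200},
    "job_c": {1: 50, 2: 98, 4: 150},
}
requests = {job: pick_gpu_count(profile) for job, profile in profiles.items()}
placement = pack_jobs(requests, gpus_per_node=6)
```

With these made-up profiles the jobs request 3, 1, and 2 GPUs, so all three fit on a single 6-GPU node instead of each occupying a node of its own.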

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1649080
Resource Relation:
Conference: The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) - Melbourne, Australia - 5/11/2020 to 5/14/2020
Country of Publication:
United States
Language:
English
