MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
- Virginia Tech, Blacksburg, VA
- Rochester Institute of Technology, Rochester, NY
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN
Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the large-scale, multi-dimensional data associated with DL, especially across the multiple graphics processing units (GPUs) of modern supercomputers, poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures exhibit training-throughput trends that differ from those reported in existing studies. Established DL optimizations, such as larger batch sizes and GPU locality-aware scheduling, do little to improve training throughput on these systems because of their fast CPU-to-GPU interconnects. Additionally, DL training scales sublinearly across multiple GPUs, so simply assigning more GPUs to a job is ineffective. To this end, we design MARBLE, a first-of-its-kind job scheduler that accounts for the non-linear intra-node scalability of DL training to schedule an appropriate number of GPUs per node for each job. By sharing a node's GPUs among multiple DL jobs, MARBLE avoids the low GPU utilization typical of current multi-GPU DL training on HPC systems. Our comprehensive evaluation on the Summit supercomputer shows that MARBLE improves DL training performance by up to 48.3% compared to the popular Platform Load Sharing Facility (LSF) scheduler. Compared to the state-of-the-art DL scheduler Optimus, MARBLE reduces job completion time by up to 47%.
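The abstract's core idea, profiling each job's sublinear intra-node scaling, capping its GPU count where returns diminish, and packing several jobs onto one node's GPUs, can be illustrated with a minimal sketch. This is not the paper's implementation: the threshold, throughput numbers, and function names below are illustrative assumptions (Summit nodes do have 6 GPUs each).

```python
# Hypothetical sketch of a scalability-aware scheduler in the spirit of MARBLE.
# Assumed inputs: per-job profiles of training throughput (samples/s) at each
# intra-node GPU count, which typically show sublinear scaling.

GPUS_PER_NODE = 6  # a Summit node has 6 NVIDIA V100 GPUs

def best_gpu_count(throughput_by_gpus, min_marginal_gain=0.15):
    """Stop adding GPUs once the relative throughput gain per extra GPU
    drops below min_marginal_gain (an assumed knee-point threshold)."""
    best = 1
    for n in range(2, GPUS_PER_NODE + 1):
        if n not in throughput_by_gpus:
            break
        gain = throughput_by_gpus[n] / throughput_by_gpus[best] - 1.0
        if gain / (n - best) < min_marginal_gain:
            break  # diminishing returns: keep the smaller allocation
        best = n
    return best

def pack_jobs(jobs):
    """First-fit packing of (job_id, gpu_demand) pairs so that multiple
    jobs share a node's GPUs instead of each job monopolizing a node."""
    free_gpus = []  # free GPU count per allocated node
    placement = {}
    for job_id, demand in jobs:
        for i, free in enumerate(free_gpus):
            if free >= demand:
                free_gpus[i] -= demand
                placement[job_id] = i
                break
        else:  # no existing node fits; allocate a new one
            free_gpus.append(GPUS_PER_NODE - demand)
            placement[job_id] = len(free_gpus) - 1
    return placement, len(free_gpus)

# Illustrative (made-up) profiles: throughput flattens as GPUs are added.
profiles = {
    "resnet50": {1: 380, 2: 700, 3: 950, 4: 1100, 5: 1180, 6: 1220},
    "lstm":     {1: 150, 2: 260, 3: 330, 4: 360, 5: 375, 6: 380},
}
demands = [(name, best_gpu_count(p)) for name, p in profiles.items()]
placement, n_nodes = pack_jobs(demands)
print(demands, placement, n_nodes)
# -> resnet50 capped at 4 GPUs, lstm at 3; they cannot share one 6-GPU
#    node here, but smaller demands would co-locate and raise utilization.
```

Under these assumptions, a job that scales poorly past 3 GPUs releases the remaining GPUs on its node to other jobs, which is the utilization gain the abstract attributes to GPU sharing.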
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1649080
- Resource Relation:
- Conference: The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, Australia, May 11-14, 2020
- Country of Publication:
- United States
- Language:
- English
Works referenced in this record:
- Going deeper with convolutions (conference, June 2015)
- 167-PFlops Deep Learning for Electron Microscopy: From Learning Physics to Atomic Manipulation (conference, November 2018)
- A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems (journal, May 2016)
- Gradient-based learning applied to document recognition (journal, January 1998)
- Parallelizing Training of Deep Generative Models on Massive Scientific Datasets (conference, September 2019)
- Deep learning at 15PF: supervised and semi-supervised classification for scientific data (conference, January 2017)
- Overlapping Data Transfers with Computation on GPU with Tiles (conference, August 2017)
- The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems (conference, November 2018)
- Exascale Deep Learning for Climate Analytics (conference, November 2018)
- Deep learning (journal, May 2015)
- Deep Residual Learning for Image Recognition (conference, June 2016)
- Optimus (conference, April 2018)
- A Heterogeneity-Aware Task Scheduler for Spark (conference, September 2018)
- Nexus (conference, October 2019)
- Cynthia (conference, August 2019)
- ImageNet: A large-scale hierarchical image database (conference, June 2009)
- Stepping up to Summit (journal, March 2018)
- Scaling a Convolutional Neural Network for Classification of Adjective Noun Pairs with TensorFlow on GPU Clusters (conference, May 2017)
- Profiling DNN Workloads on a Volta-based DGX-1 System (conference, September 2018)
- A survey on deep learning in medical image analysis (journal, December 2017)
- Epidemic failure detection and consensus for extreme parallelism (journal, February 2017)
- SLURM: Simple Linux Utility for Resource Management (book, January 2003)
- Scalable system scheduling for HPC and big data (journal, January 2018)
- A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers (conference, September 2019)
- ImageNet Training in Minutes (conference, January 2018)