OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems

Abstract

Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the multi-dimensional, large-scale data associated with DL, especially across the multiple graphics processing units (GPUs) of modern supercomputers, poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures exhibit different training-throughput trends than those reported in existing studies. Established DL optimizations, such as larger batch sizes and GPU locality-aware scheduling, do little to improve DL training throughput on these systems because of their fast CPU-to-GPU connections. Additionally, DL training on multiple GPUs scales sublinearly, so simply adding more GPUs to a system is ineffective. To this end, we design MARBLE, a first-of-its-kind job scheduler that accounts for the non-linear intra-node scalability of GPUs to schedule an appropriate number of GPUs per node for each job. By sharing the GPU resources of a node among multiple DL jobs, MARBLE avoids the low GPU utilization typical of current multi-GPU DL training on HPC systems. Our comprehensive evaluation on the Summit supercomputer shows that MARBLE improves DL training performance by up to 48.3% compared to the popular Platform Load Sharing Facility (LSF) scheduler, and reduces job completion time by up to 47% compared to Optimus, a state-of-the-art DL scheduler.
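To make the scheduling idea concrete, here is a minimal, hypothetical sketch. The record carries no code, so the throughput numbers, the marginal-gain threshold, the first-fit packing, and every helper name below are assumptions for illustration, not MARBLE's published algorithm. The sketch mirrors the abstract's two observations: per-node DL throughput grows sublinearly with GPU count, so a scheduler can stop allocating GPUs to a job once the marginal gain falls off, and it can then co-locate other jobs on the node's remaining GPUs.

# Illustrative sketch only; all numbers and names are hypothetical.
# Hypothetical measured training throughput (samples/sec) on one node,
# indexed by number of GPUs used by a single job. Note the sublinear scaling.
THROUGHPUT = {1: 100.0, 2: 185.0, 3: 250.0, 4: 295.0, 5: 325.0, 6: 340.0}

GPUS_PER_NODE = 6  # a Summit node has 6 NVIDIA V100 GPUs


def gpus_per_job(min_marginal_gain: float = 0.5) -> int:
    """Pick the largest GPU count whose marginal speedup per added GPU
    is still at least min_marginal_gain of the single-GPU throughput."""
    g = 1
    while g + 1 in THROUGHPUT:
        marginal = THROUGHPUT[g + 1] - THROUGHPUT[g]
        if marginal < min_marginal_gain * THROUGHPUT[1]:
            break  # adding another GPU no longer pays off
        g += 1
    return g


def schedule(jobs: list[str]) -> dict[int, list[tuple[str, int]]]:
    """First-fit packing of jobs onto shared nodes, g GPUs per job."""
    g = gpus_per_job()
    nodes: dict[int, list[tuple[str, int]]] = {}
    free: dict[int, int] = {}  # node id -> free GPUs
    for job in jobs:
        placed = False
        for n, f in free.items():
            if f >= g:  # co-locate on a node with spare GPUs
                nodes[n].append((job, g))
                free[n] -= g
                placed = True
                break
        if not placed:  # open a new node
            n = len(nodes)
            nodes[n] = [(job, g)]
            free[n] = GPUS_PER_NODE - g
    return nodes


if __name__ == "__main__":
    print("GPUs per job:", gpus_per_job())
    for node, placement in schedule(["jobA", "jobB", "jobC"]).items():
        print(f"node {node}: {placement}")

With the hypothetical curve above, the sketch caps each job at three GPUs (the fourth GPU's marginal gain drops below half of the single-GPU throughput) and places two of the three jobs on one shared six-GPU node, using two nodes instead of three, which is the utilization effect the abstract describes.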

Authors:
Han, Jingoo [1]; Rafique, Mustafa [2]; Xu, Luna [1]; Butt, Ali R. [1]; Lim, Seung-Hwan [3]; Vazhkudai, Sudharshan [3]
  1. Virginia Tech, Blacksburg, VA
  2. Rochester Institute of Technology, Rochester, NY
  3. Oak Ridge National Laboratory (ORNL), Oak Ridge, TN
Publication Date:
May 2020
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1649080
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, Australia, 11-14 May 2020
Country of Publication:
United States
Language:
English

Citation Formats

Han, Jingoo, Rafique, Mustafa, Xu, Luna, Butt, Ali R., Lim, Seung-Hwan, and Vazhkudai, Sudharshan. MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems. United States: N. p., 2020. Web. doi:10.1109/CCGrid49817.2020.00-66.
Han, Jingoo, Rafique, Mustafa, Xu, Luna, Butt, Ali R., Lim, Seung-Hwan, & Vazhkudai, Sudharshan. MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems. United States. doi:10.1109/CCGrid49817.2020.00-66.
Han, Jingoo, Rafique, Mustafa, Xu, Luna, Butt, Ali R., Lim, Seung-Hwan, and Vazhkudai, Sudharshan. 2020. "MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems". United States. doi:10.1109/CCGrid49817.2020.00-66. https://www.osti.gov/servlets/purl/1649080.
@article{osti_1649080,
title = {MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems},
author = {Han, Jingoo and Rafique, Mustafa and Xu, Luna and Butt, Ali R. and Lim, Seung-Hwan and Vazhkudai, Sudharshan},
abstractNote = {Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the multi-dimensional, large-scale data associated with DL, especially across the multiple graphics processing units (GPUs) of modern supercomputers, poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures exhibit different training-throughput trends than those reported in existing studies. Established DL optimizations, such as larger batch sizes and GPU locality-aware scheduling, do little to improve DL training throughput on these systems because of their fast CPU-to-GPU connections. Additionally, DL training on multiple GPUs scales sublinearly, so simply adding more GPUs to a system is ineffective. To this end, we design MARBLE, a first-of-its-kind job scheduler that accounts for the non-linear intra-node scalability of GPUs to schedule an appropriate number of GPUs per node for each job. By sharing the GPU resources of a node among multiple DL jobs, MARBLE avoids the low GPU utilization typical of current multi-GPU DL training on HPC systems. Our comprehensive evaluation on the Summit supercomputer shows that MARBLE improves DL training performance by up to 48.3% compared to the popular Platform Load Sharing Facility (LSF) scheduler, and reduces job completion time by up to 47% compared to Optimus, a state-of-the-art DL scheduler.},
doi = {10.1109/CCGrid49817.2020.00-66},
place = {United States},
year = {2020},
month = {5}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
