U.S. Department of Energy
Office of Scientific and Technical Information

DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing

Journal Article · IEEE Transactions on Parallel and Distributed Systems

Cluster schedulers are crucial in high-performance computing (HPC). They determine when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a hierarchical neural network that incorporates special HPC scheduling features such as resource reservation and backfilling. An efficient training strategy is presented to enable DRAS to rapidly learn the target environment. Once given a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workloads change. We implement DRAS in an HPC scheduling platform called CQGym, which provides a common platform allowing users to flexibly evaluate DRAS and other scheduling methods, such as heuristic and optimization approaches. Experiments using CQGym with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 50%.
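To make the reinforcement-learning framing concrete, the sketch below shows a minimal policy-gradient scheduling agent: a small network scores each queued job given its features and the system's free-node count, samples a job to launch, and is updated with a REINFORCE step. This is only an illustrative sketch, not the DRAS hierarchical architecture or the CQGym API; the job features, reward definition, and network sizes are assumptions chosen for the example.

```python
# Minimal policy-gradient scheduling sketch (not the DRAS architecture or CQGym API).
# Assumed job features: [requested_nodes, wait_time, est_runtime]; assumed reward:
# negative wait time of the job selected for launch.
import torch
import torch.nn as nn

class SchedulerPolicy(nn.Module):
    """Scores each queued job given its features plus the system's free-node count."""
    def __init__(self, job_feats: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(job_feats + 1, hidden),  # +1 for the free-node count
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, jobs: torch.Tensor, free_nodes: float) -> torch.Tensor:
        # jobs: (num_jobs, job_feats); broadcast the system state onto every job row
        sys_col = torch.full((jobs.shape[0], 1), free_nodes)
        scores = self.net(torch.cat([jobs, sys_col], dim=1)).squeeze(-1)
        return torch.softmax(scores, dim=0)  # probability of launching each queued job

def select_job(policy: SchedulerPolicy, jobs: torch.Tensor, free_nodes: float):
    """Sample one job to launch and return its index plus the log-probability."""
    probs = policy(jobs, free_nodes)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

# One REINFORCE update on a tiny synthetic queue: nudge the policy toward
# decisions that reduce waiting time.
policy = SchedulerPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

queue = torch.tensor([[64.0, 10.0, 120.0],   # [nodes, wait, est_runtime]
                      [ 8.0, 50.0,  30.0],
                      [16.0,  5.0,  60.0]])
action, log_prob = select_job(policy, queue, free_nodes=32.0)
reward = -queue[action, 1].item()            # toy reward: negative wait of the chosen job
loss = -log_prob * reward                    # REINFORCE gradient estimator
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the paper, the policy is hierarchical and also handles resource reservation and backfilling decisions; the single flat job-selection step above is just the simplest end-to-end example of learning a scheduling policy from interaction.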

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE Office of Science (SC), Basic Energy Sciences (BES), Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-06CH11357; AC02-05CH11231
OSTI ID:
1984484
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 12; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
