U.S. Department of Energy
Office of Scientific and Technical Information

DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing

Journal Article · IEEE Transactions on Parallel and Distributed Systems

Cluster schedulers are crucial in high-performance computing (HPC). They determine when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a hierarchical neural network that incorporates special HPC scheduling features such as resource reservation and backfilling. An efficient training strategy is presented to enable DRAS to rapidly learn the target environment. Once given a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workloads change. We implement DRAS in an HPC scheduling platform called CQGym, which provides a common platform allowing users to flexibly evaluate DRAS and other scheduling methods, such as heuristic and optimization approaches. Experiments using CQGym with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 50%.
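To make the reinforcement-learning framing concrete, the sketch below shows a minimal policy-gradient scheduling agent: a small network scores each queued job given its features and the system's free-node count, samples a job to launch, and is updated with a REINFORCE step. This is only an illustrative sketch, not the DRAS hierarchical architecture or the CQGym API; the job features, reward definition, and network sizes are assumptions chosen for the example.

```python
# Minimal policy-gradient scheduling sketch (not the DRAS architecture or CQGym API).
# Assumed job features: [requested_nodes, wait_time, est_runtime]; assumed reward:
# negative wait time of the job selected for launch.
import torch
import torch.nn as nn

class SchedulerPolicy(nn.Module):
    """Scores each queued job given its features plus the system's free-node count."""
    def __init__(self, job_feats: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(job_feats + 1, hidden),  # +1 for the free-node count
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, jobs: torch.Tensor, free_nodes: float) -> torch.Tensor:
        # jobs: (num_jobs, job_feats); broadcast the system state onto every job row
        sys_col = torch.full((jobs.shape[0], 1), free_nodes)
        scores = self.net(torch.cat([jobs, sys_col], dim=1)).squeeze(-1)
        return torch.softmax(scores, dim=0)  # probability of launching each queued job

def select_job(policy: SchedulerPolicy, jobs: torch.Tensor, free_nodes: float):
    """Sample one job to launch and return its index plus the log-probability."""
    probs = policy(jobs, free_nodes)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

# One REINFORCE update on a tiny synthetic queue: nudge the policy toward
# decisions that reduce waiting time.
policy = SchedulerPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

queue = torch.tensor([[64.0, 10.0, 120.0],   # [nodes, wait, est_runtime]
                      [ 8.0, 50.0,  30.0],
                      [16.0,  5.0,  60.0]])
action, log_prob = select_job(policy, queue, free_nodes=32.0)
reward = -queue[action, 1].item()            # toy reward: negative wait of the chosen job
loss = -log_prob * reward                    # REINFORCE gradient estimator
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the paper, the policy is hierarchical and also handles resource reservation and backfilling decisions; the single flat job-selection step above is just the simplest end-to-end example of learning a scheduling policy from interaction.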

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE Office of Science (SC), Basic Energy Sciences (BES), Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-06CH11357; AC02-05CH11231
OSTI ID:
1984484
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 12; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
