U.S. Department of Energy
Office of Scientific and Technical Information

SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning

Conference ·
 [1];  [1];  [2]
  1. University of North Carolina at Charlotte
  2. ORNL
Improving job execution performance, such as minimizing job waiting time, slowdown, or completion time, is an important goal of HPC batch job schedulers. This goal is often pursued with carefully designed heuristics based on job features such as job size and job duration. However, these heuristics overlook important runtime factors (e.g., cluster availability and waiting job patterns), which vary over time and can invalidate a previously sound scheduling decision. In this study, we propose a new approach that incorporates runtime factors into batch job scheduling to improve job execution performance. The key idea is to add a scheduling inspector on top of the base job scheduler to scrutinize its scheduling decisions. The inspector takes the runtime factors into consideration and determines the fitness of the scheduled job accordingly. It then either accepts the scheduled job or rejects it and asks the base scheduler to try again later. We realize such an inspector, named SchedInspector, using reinforcement learning. Through extensive experiments, we show that SchedInspector can intelligently integrate runtime factors into various batch job scheduling policies, including the state-of-the-art one, to achieve better job execution performance, such as smaller average bounded job slowdown (up to 69% better) or average job waiting time (up to 52% better), across various real-world workloads. We also show that, although rejecting scheduling decisions may leave resources idle and hence affect system utilization, SchedInspector achieves these improvements with marginal impact on system utilization (typically less than 1%). A key advantage of SchedInspector is that it automatically learns to work with and improve existing job scheduling policies without changing them, which makes it promising as a generic enhancer for various batch job scheduling policies.
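The accept-or-reject idea described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `Job`, `ClusterState`, `sjf`, `toy_inspector`, and `inspected_step` are hypothetical, the base policy is plain shortest-job-first, and the inspector is a hand-written rule standing in for the learned reinforcement learning policy. The `bounded_slowdown` helper uses the commonly cited definition of the metric with an assumed 10-second threshold; the abstract does not state which threshold the authors use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    job_id: int
    nodes: int          # nodes requested by the job
    est_runtime: float  # estimated runtime in seconds


@dataclass
class ClusterState:
    free_nodes: int     # a runtime factor visible to the inspector


def sjf(queue: List[Job]) -> Job:
    """Base policy (assumed): shortest job first by estimated runtime."""
    return min(queue, key=lambda j: j.est_runtime)


def toy_inspector(job: Job, queue: List[Job], state: ClusterState) -> bool:
    """Hypothetical hand-written rule, NOT the learned policy.

    Rejects a job that would take more than half of the free nodes
    while other jobs are still waiting; accepts otherwise.
    """
    crowds_out_others = job.nodes > state.free_nodes / 2 and len(queue) > 1
    return not crowds_out_others


def inspected_step(
    base_policy: Callable[[List[Job]], Job],
    inspector: Callable[[Job, List[Job], ClusterState], bool],
    queue: List[Job],
    state: ClusterState,
) -> Optional[Job]:
    """One scheduling step: the base policy proposes, the inspector decides.

    Returns the accepted job (removed from the queue), or None when the
    proposal is rejected and the base scheduler must try again later.
    Resource allocation itself is omitted from this sketch.
    """
    if not queue:
        return None
    candidate = base_policy(queue)
    if inspector(candidate, queue, state):
        queue.remove(candidate)
        return candidate
    return None


def bounded_slowdown(wait: float, run: float, threshold: float = 10.0) -> float:
    """Bounded slowdown of one job; the 10 s threshold is an assumption."""
    return max((wait + run) / max(run, threshold), 1.0)
```

In this sketch the base policy is passed in unchanged, which mirrors the abstract's point that the inspector sits on top of an existing scheduling policy without modifying it. A learned inspector would replace `toy_inspector` with a model that maps the same inputs to an accept or reject decision.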
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1885384
Country of Publication:
United States
Language:
English
