SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning

Zhang, Di; Dai, Dong; Xie, Bing

doi:10.1145/3502181.3531470


Conference · Wed Jun 01 04:00:00 EDT 2022

DOI: https://doi.org/10.1145/3502181.3531470 · OSTI ID: 1885384

Zhang, Di [1]; Dai, Dong [1]; Xie, Bing [2]

[1] University of North Carolina at Charlotte
[2] ORNL

Improving job execution performance, for example by minimizing job waiting time, slowdown, or completion time, is an important goal of HPC batch job schedulers. This goal is often pursued with carefully designed heuristics based on job features such as job size and job duration. However, these heuristics overlook important runtime factors (e.g., cluster availability and waiting job patterns), which vary over time and can invalidate a previously sound scheduling decision. In this study, we propose a new approach that incorporates runtime factors into batch job scheduling to improve job execution performance. The key idea is to add a scheduling inspector on top of the base job scheduler to scrutinize its scheduling decisions. The inspector takes the runtime factors into consideration and determines the fitness of the scheduled job accordingly. It then either accepts the scheduled job or rejects it and asks the base scheduler to try again later. We realize such an inspector, SchedInspector, using reinforcement learning. Through extensive experiments, we show that SchedInspector can integrate runtime factors into various batch job scheduling policies, including the state-of-the-art one, to achieve better job execution performance, such as smaller average bounded job slowdown (up to 69% better) or average job waiting time (up to 52% better), across various real-world workloads. We also show that, although rejecting scheduling decisions may leave resources idle and thus affect system utilization, SchedInspector achieves these improvements with marginal impact on system utilization (typically less than 1%). A key advantage of SchedInspector is that it automatically learns to work with and improve existing job scheduling policies without changing them, which makes it promising as a generic enhancer for various batch job scheduling policies.
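The accept-or-reject loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `Job`, `ClusterState`, `sjf_scheduler`, `toy_inspector`, and `inspect_and_schedule` are hypothetical, the rejection cap `max_rejections` is an assumption added here to avoid starving a job, and the hand-written `toy_inspector` rule merely stands in for the reinforcement-learning policy the paper trains. `bounded_slowdown` uses the definition common in the batch scheduling literature with an assumed 10-second threshold; the record does not state which threshold the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    job_id: int
    requested_nodes: int
    requested_runtime: float  # seconds, as estimated by the user


@dataclass
class ClusterState:
    free_nodes: int
    total_nodes: int


def sjf_scheduler(queue: List[Job]) -> Optional[Job]:
    """Base policy (shortest job first): pick the job with the smallest runtime."""
    return min(queue, key=lambda j: j.requested_runtime) if queue else None


def toy_inspector(job: Job, queue: List[Job], state: ClusterState) -> bool:
    """Placeholder for the learned policy: accept only if the job leaves
    at least one free node. A real inspector would be a trained RL agent."""
    return job.requested_nodes < state.free_nodes


def inspect_and_schedule(
    queue: List[Job],
    state: ClusterState,
    base_scheduler: Callable[[List[Job]], Optional[Job]],
    inspector: Callable[[Job, List[Job], ClusterState], bool],
    rejections: Dict[int, int],
    max_rejections: int = 3,  # assumption: cap rejections per job
) -> Optional[Job]:
    """Ask the base scheduler for a job, then let the inspector accept or reject it.
    Returns the accepted job, or None if the decision was rejected (try again later)."""
    candidate = base_scheduler(queue)
    if candidate is None:
        return None
    forced = rejections.get(candidate.job_id, 0) >= max_rejections
    if forced or inspector(candidate, queue, state):
        queue.remove(candidate)
        return candidate
    rejections[candidate.job_id] = rejections.get(candidate.job_id, 0) + 1
    return None


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Commonly used bounded slowdown: max((wait + run) / max(run, tau), 1)."""
    return max((wait + run) / max(run, tau), 1.0)
```

The base scheduler is passed in as a plain callable, which mirrors the point that the inspector works with an existing policy without changing it: swapping `sjf_scheduler` for another policy requires no change to `inspect_and_schedule`.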


Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1885384

Country of Publication: United States

Language: English

