CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters
The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components of this ecosystem in a variety of scenarios, such as steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by supercomputing infrastructures needs to be complemented with support for scheduling opportunistic on-demand analytics jobs, leading to the problem of efficiently preempting batch jobs with minimum loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-off between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and losing progress due to preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze this trade-off and make decisions in near real time. In a comparison with other state-of-the-art approaches using traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution while achieving high performance and scalability.
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1804062
- Country of Publication: United States
- Language: English
Similar Records
- RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning (Conference · Sun Nov 01 2020 · OSTI ID: 1777791)
- MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems (Conference · Fri May 01 2020 · OSTI ID: 1649080)
- SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning (Conference · Wed Jun 01 2022 · OSTI ID: 1885384)