U.S. Department of Energy
Office of Scientific and Technical Information

Tandem Predictions for HPC Jobs

Conference
Within predictive analytics for High Performance Computing (HPC), the two most prominent tasks are predicting job runtimes and predicting job queue times, both of which can inform HPC users in their everyday decision making. Accurate runtime predictions can help users choose better wallclock times at job submission, decreasing the odds that their jobs wait in queues longer than necessary. Accurate and timely queue time predictions for the available partitions can guide users toward the most favorable partition for running their jobs. This potential is well understood, as shown by the abundance of research studies that propose solutions for these tasks, including work published in the last several years. These tasks appear well suited to Machine Learning (ML) solutions, since there is no shortage of training data: HPC centers run millions of jobs over time. However, having studied the existing research literature and looked for examples in the toolchains supported at exemplar HPC facilities, we surprisingly do not find any practical solutions that are ready to be adopted. We interpret this as a manifestation of the shortage of UX/UI efforts that support HPC analytics, and also as a sign that the research community has not reached a consensus on how to solve these tasks. In this study, we aim to shed new light on the long-standing task of job queue time prediction by exploring the utility of runtime predictions in improving prediction accuracy and, in fact, by predicting these two metrics together, in tandem. In other words, we show how runtime predictions become valuable input to queue time modeling. We challenge the existing approaches to feature engineering for queue time prediction and describe promising results obtained on a large dataset of HPC jobs from a supercomputer at the National Renewable Energy Laboratory.
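The idea of feeding runtime predictions into a queue time model can be illustrated with a toy two-stage sketch. This is not the authors' method or feature set, which the abstract does not specify. Everything in it is an assumption made for illustration: the synthetic jobs, the single FIFO queue in which a job waits for the summed actual runtimes of the jobs ahead of it, the per-user runtime model, and all function names. Stage 1 learns how much of the requested wallclock time each user typically consumes. Stage 2 fits a one-feature linear queue time model twice, once on the summed requested wallclock times of the jobs ahead (baseline) and once on their summed predicted runtimes (tandem).

```python
import random
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mae(model, xs, ys):
    """Mean absolute error of a fitted line on (xs, ys)."""
    a, b = model
    return mean(abs(a + b * x - y) for x, y in zip(xs, ys))


def make_job(rng):
    """Synthetic job (hypothetical data): each user consumes a
    characteristic fraction of the requested wallclock time."""
    user = rng.randrange(5)
    wallclock = rng.uniform(1.0, 24.0)  # requested hours
    ratio = 0.1 + 0.2 * user + rng.uniform(-0.05, 0.05)
    return {"user": user, "wallclock": wallclock, "runtime": ratio * wallclock}


def make_snapshot(rng):
    """Jobs ahead of a new submission and its wait in a toy FIFO queue
    (assumption: wait equals the summed actual runtimes ahead)."""
    ahead = [make_job(rng) for _ in range(rng.randint(1, 10))]
    return ahead, sum(j["runtime"] for j in ahead)


def fit_runtime_model(history):
    """Stage 1: per-user mean of runtime / requested wallclock."""
    ratios = defaultdict(list)
    for job in history:
        ratios[job["user"]].append(job["runtime"] / job["wallclock"])
    return {user: mean(values) for user, values in ratios.items()}


def predict_runtime(model, job):
    return model[job["user"]] * job["wallclock"]


def run_experiment(seed=0):
    rng = random.Random(seed)
    runtime_model = fit_runtime_model([make_job(rng) for _ in range(2000)])
    snapshots = [make_snapshot(rng) for _ in range(1000)]
    train, test = snapshots[:800], snapshots[800:]

    def features(data, value):
        # One feature per snapshot: the summed per-job value of the backlog.
        return [sum(value(j) for j in ahead) for ahead, _ in data]

    def waits(data):
        return [wait for _, wait in data]

    candidates = (
        ("baseline", lambda j: j["wallclock"]),
        ("tandem", lambda j: predict_runtime(runtime_model, j)),
    )
    results = {}
    for name, value in candidates:
        # Stage 2: queue time model fitted on the chosen backlog feature.
        model = fit_line(features(train, value), waits(train))
        results[name] = mae(model, features(test, value), waits(test))
    return results


if __name__ == "__main__":
    for name, err in run_experiment().items():
        print(f"{name}: MAE = {err:.2f} h")
```

In this toy setting the tandem feature yields the lower error because the predicted runtimes absorb the per-user overestimation that requested wallclock times carry. Real schedulers add priorities, backfilling, and multiple partitions, so the size of any gain on production data cannot be read off this sketch.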
Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
USDOE National Renewable Energy Laboratory (NREL)
DOE Contract Number:
AC36-08GO28308
OSTI ID:
2447811
Report Number(s):
NREL/CP-2C00-90228; MainId:92006; UUID:8ce1881e-ec2e-4819-91ea-0cf63bd05770; MainAdminId:73768
Country of Publication:
United States
Language:
English
