U.S. Department of Energy
Office of Scientific and Technical Information

Is Knowledge about Running Applications Helping Improve Runtime Prediction of HPC Jobs?

Conference

High-performance computing (HPC) systems rely on scheduling algorithms to achieve high utilization. These schedulers depend on user estimates of job resource requirements, such as runtime, to determine the optimal scheduling of incoming jobs. These user estimates, however, are prone to error. To mitigate this error, significant research has been directed at providing better estimates of job runtime, usually with machine learning techniques. The performance of these techniques depends on the input features selected. Among the possible features is the primary application used by the job. In a survey of more than 20 papers aimed at improving runtime prediction, only four included primary application as an input feature. This investigation focuses specifically on the value of adding primary application as an input feature. We find that it does improve model performance, especially for jobs with longer runtimes, though the improvement varies with the application used. We recommend further research to determine the cause of this variability and to identify an optimal strategy for employing a mixture of models, some including and some not including primary application as a feature.
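The comparison described in the abstract (the same predictor with and without primary application as an input feature) can be illustrated with a minimal sketch. This is not the authors' method or data: it swaps in a simple group-mean predictor in place of a machine learning model, and every job record, application name (`app_a`, `app_b`), and function name below is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Synthetic, hand-written job records (hypothetical; NOT taken from the paper).
# Each job is (requested_hours, primary_application, actual_runtime_hours).
TRAIN = [
    (4, "app_a", 0.5), (4, "app_a", 0.7),
    (4, "app_b", 3.5), (4, "app_b", 3.9),
    (24, "app_a", 2.0), (24, "app_a", 3.0),
    (24, "app_b", 20.0), (24, "app_b", 22.0),
]
TEST = [
    (4, "app_a", 0.6), (4, "app_b", 3.6),
    (24, "app_a", 2.4), (24, "app_b", 21.0),
]


def without_app(job):
    """Feature key that ignores the primary application."""
    return job[0]


def with_app(job):
    """Feature key that includes the primary application."""
    return (job[0], job[1])


def fit_group_means(jobs, key):
    """Predict runtime as the mean actual runtime of training jobs sharing a key."""
    groups = defaultdict(list)
    for job in jobs:
        groups[key(job)].append(job[2])
    return {k: mean(v) for k, v in groups.items()}


def mean_absolute_error(model, jobs, key, fallback):
    """Average absolute error in hours; unseen keys fall back to a global mean."""
    return mean(abs(model.get(key(job), fallback) - job[2]) for job in jobs)


# Global mean runtime, used only when a test job has a key unseen in training.
fallback = mean(job[2] for job in TRAIN)

baseline = fit_group_means(TRAIN, without_app)
app_aware = fit_group_means(TRAIN, with_app)

mae_baseline = mean_absolute_error(baseline, TEST, without_app, fallback)
mae_app_aware = mean_absolute_error(app_aware, TEST, with_app, fallback)

print(f"MAE without application feature: {mae_baseline:.2f} h")
print(f"MAE with application feature:    {mae_app_aware:.2f} h")
```

On this toy data the application-aware predictor has the lower error because the two hypothetical applications were written to have very different runtimes for the same requested walltime. Real workloads need not behave this way, which is consistent with the abstract's observation that the improvement varies by application.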

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
USDOE National Renewable Energy Laboratory (NREL)
DOE Contract Number:
AC36-08GO28308
OSTI ID:
2242427
Report Number(s):
NREL/CP-2C00-88316; MainId:89091; UUID:ea739d46-7ac7-436c-afdd-baad9f085205; MainAdminID:71321
Resource Relation:
Conference: Presented at PEARC '23: Practice and Experience in Advanced Research Computing, 23-27 July 2023, Portland, Oregon
Country of Publication:
United States
Language:
English
