Title: Machine learning based job status prediction in scientific clusters

Abstract

Large high-performance computing (HPC) systems are built with an increasing number of components: more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. In the measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to guide mitigation procedures for predicted failures.
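The abstract reports three separate figures (share of cases predicted correctly, recall, and precision). As a reading aid, the minimal Python sketch below shows how these metrics are conventionally computed from a confusion matrix when "unsuccessful" is treated as the positive class. It is not the authors' code: the names `ConfusionCounts` and `count_outcomes` are hypothetical, and the job labels in the usage note are made-up illustrative values, not data from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    """Confusion-matrix counts with 'unsuccessful job' as the positive class."""
    tp: int  # unsuccessful jobs correctly predicted as unsuccessful
    fp: int  # successful jobs wrongly predicted as unsuccessful
    fn: int  # unsuccessful jobs wrongly predicted as successful
    tn: int  # successful jobs correctly predicted as successful


def count_outcomes(actual: Sequence[bool], predicted: Sequence[bool]) -> ConfusionCounts:
    """Tally outcomes; True means the job is (or is predicted to be) unsuccessful."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return ConfusionCounts(tp, fp, fn, tn)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all jobs whose status was predicted correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Of the jobs flagged as unsuccessful, the fraction that really were."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Of the truly unsuccessful jobs, the fraction that were flagged."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0
```

For example, `count_outcomes([True, True, False, False, False], [True, False, False, False, True])` yields one true positive, one false positive, one false negative, and two true negatives. Because accuracy also counts correctly predicted successful jobs, it can be much higher than recall when unsuccessful jobs are a minority of all jobs.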

Authors:
 Yoo, Wucherl [1]; Sim, Alex [1]; Wu, Kesheng [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2016-09-01
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1379580
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Proceedings of 2016 SAI Computing Conference, SAI 2016
Additional Journal Information:
Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Decision trees; Prediction methods; Hardware; Reliability; Software; Complexity theory; Prediction algorithms; Job Log Analysis; Job Status Prediction

Citation Formats

Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States: N. p., 2016. Web. doi:10.1109/SAI.2016.7555961.
Yoo, Wucherl, Sim, Alex, & Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States. https://doi.org/10.1109/SAI.2016.7555961
Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. 2016. "Machine learning based job status prediction in scientific clusters". United States. https://doi.org/10.1109/SAI.2016.7555961. https://www.osti.gov/servlets/purl/1379580.
@article{osti_1379580,
title = {Machine learning based job status prediction in scientific clusters},
author = {Yoo, Wucherl and Sim, Alex and Wu, Kesheng},
abstractNote = {Large high-performance computing (HPC) systems are built with an increasing number of components: more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. In the measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to guide mitigation procedures for predicted failures.},
doi = {10.1109/SAI.2016.7555961},
journal = {Proceedings of 2016 SAI Computing Conference, SAI 2016},
place = {United States},
year = {2016},
month = {9}
}

Works referencing / citing this record:

Machine Learning Predictions for Underestimation of Job Runtime on HPC System
book, January 2018