
Title: Machine learning based job status prediction in scientific clusters

Large high-performance computing (HPC) systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. In the measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used in mitigation procedures for predicted failures.
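The abstract reports three figures (99.8%, 83.6% recall, 94.8% precision) that can look inconsistent at first glance. The sketch below shows how such metrics are computed from a confusion matrix and why overall accuracy can sit well above recall when unsuccessful jobs are a small minority. It assumes the 99.8% figure is overall accuracy, which the abstract does not state explicitly. The counts are invented for illustration and are not taken from the paper; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary 'unsuccessful job' classifier (illustrative only)."""
    tp: int  # unsuccessful jobs correctly predicted as unsuccessful
    fp: int  # successful jobs wrongly predicted as unsuccessful
    fn: int  # unsuccessful jobs the classifier missed
    tn: int  # successful jobs correctly predicted as successful


def precision(c: ConfusionCounts) -> float:
    # Of all jobs flagged as unsuccessful, the fraction that really were.
    return c.tp / (c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # Of all truly unsuccessful jobs, the fraction that were flagged.
    return c.tp / (c.tp + c.fn)


def accuracy(c: ConfusionCounts) -> float:
    # Fraction of all jobs, successful or not, that were classified correctly.
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


# Hypothetical counts, NOT the paper's data: 100 unsuccessful jobs out of 10,000.
counts = ConfusionCounts(tp=80, fp=5, fn=20, tn=9895)

if __name__ == "__main__":
    print(f"precision = {precision(counts):.3f}")
    print(f"recall    = {recall(counts):.3f}")
    print(f"accuracy  = {accuracy(counts):.4f}")
```

With these made-up counts, the large number of correctly classified successful jobs dominates accuracy, while recall depends only on the much smaller set of unsuccessful jobs. That is why precision and recall are the more informative figures for a rare-event predictor of this kind.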
Authors:
Yoo, Wucherl [1]; Sim, Alex [1]; Wu, Kesheng [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Grant/Contract Number:
AC02-05CH11231
Type:
Accepted Manuscript
Journal Name:
Proceedings of 2016 SAI Computing Conference, SAI 2016
Additional Journal Information:
Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016
Research Org:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Decision trees; Prediction methods; Hardware; Reliability; Software; Complexity theory; Prediction algorithms; Job Log Analysis; Job Status Prediction
OSTI Identifier:
1379580

Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States: N. p., Web. doi:10.1109/SAI.2016.7555961.
Yoo, Wucherl, Sim, Alex, & Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States. doi:10.1109/SAI.2016.7555961.
Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. 2016. "Machine learning based job status prediction in scientific clusters". United States. doi:10.1109/SAI.2016.7555961. https://www.osti.gov/servlets/purl/1379580.
@article{osti_1379580,
title = {Machine learning based job status prediction in scientific clusters},
author = {Yoo, Wucherl and Sim, Alex and Wu, Kesheng},
abstractNote = {Large high-performance computing (HPC) systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. In the measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used in mitigation procedures for predicted failures.},
doi = {10.1109/SAI.2016.7555961},
journal = {Proceedings of 2016 SAI Computing Conference, SAI 2016},
place = {United States},
year = {2016},
month = {9}
}