skip to main content


Title: Machine learning based job status prediction in scientific clusters

Large high-performance computing systems are built with increasing number of components with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent to the unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to mitigation procedures of predicted failures.
 [1] ;  [1] ;  [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Grant/Contract Number:
Accepted Manuscript
Journal Name:
Proceedings of 2016 SAI Computing Conference, SAI 2016
Additional Journal Information:
Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016
Research Org:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Country of Publication:
United States
97 MATHEMATICS AND COMPUTING; Decision trees; Prediction methods; Hardware; Reliability; Software; Complexity theory; Prediction algorithms; Job Log Analysis; Job Status Prediction
OSTI Identifier: