skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Machine learning based job status prediction in scientific clusters

Journal Article · · Proceedings of 2016 SAI Computing Conference, SAI 2016
 [1];  [1];  [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Large high-performance computing systems are built with increasing number of components with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent to the unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to mitigation procedures of predicted failures.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1379580
Journal Information:
Proceedings of 2016 SAI Computing Conference, SAI 2016, Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016
Country of Publication:
United States
Language:
English

Cited By (1)

Machine Learning Predictions for Underestimation of Job Runtime on HPC System book January 2018