Machine learning based job status prediction in scientific clusters
Abstract
Large high-performance computing (HPC) systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. According to measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high to support mitigation procedures for predicted failures.
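The abstract reports three different figures (fraction of cases predicted correctly, recall, and precision), which can diverge sharply when unsuccessful jobs are a minority class. The following is a minimal sketch of how the three are computed from a confusion matrix. The class name, function names, and all counts are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion matrix where 'positive' means an unsuccessful job."""
    true_pos: int   # unsuccessful jobs predicted as unsuccessful
    false_pos: int  # successful jobs predicted as unsuccessful
    false_neg: int  # unsuccessful jobs predicted as successful
    true_neg: int   # successful jobs predicted as successful


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all jobs whose status was predicted correctly."""
    total = c.true_pos + c.false_pos + c.false_neg + c.true_neg
    return (c.true_pos + c.true_neg) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Of the jobs flagged as unsuccessful, the fraction that really were."""
    flagged = c.true_pos + c.false_pos
    return c.true_pos / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Of the truly unsuccessful jobs, the fraction that were flagged."""
    actual = c.true_pos + c.false_neg
    return c.true_pos / actual if actual else 0.0


# Hypothetical counts (assumption, not data from the paper):
# 10,000 jobs, of which 100 were actually unsuccessful.
counts = ConfusionCounts(true_pos=80, false_pos=5, false_neg=20, true_neg=9895)

print(f"accuracy:  {accuracy(counts):.4f}")
print(f"precision: {precision(counts):.4f}")
print(f"recall:    {recall(counts):.4f}")
```

With a rare positive class, accuracy stays close to 1 even when a noticeable share of unsuccessful jobs is missed, which is why recall and precision are reported alongside it.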
- Authors:
- Yoo, Wucherl; Sim, Alex; Wu, Kesheng
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Publication Date:
- 2016-09-01
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1379580
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Proceedings of 2016 SAI Computing Conference, SAI 2016
- Additional Journal Information:
- Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Decision trees; Prediction methods; Hardware; Reliability; Software; Complexity theory; Prediction algorithms; Job Log Analysis; Job Status Prediction
Citation Formats
Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States: N. p., 2016. Web. doi:10.1109/SAI.2016.7555961.
Yoo, Wucherl, Sim, Alex, & Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States. https://doi.org/10.1109/SAI.2016.7555961
Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. 2016. "Machine learning based job status prediction in scientific clusters". United States. https://doi.org/10.1109/SAI.2016.7555961. https://www.osti.gov/servlets/purl/1379580.
@article{osti_1379580,
title = {Machine learning based job status prediction in scientific clusters},
author = {Yoo, Wucherl and Sim, Alex and Wu, Kesheng},
abstractNote = {Large high-performance computing (HPC) systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. According to measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high to support mitigation procedures for predicted failures.},
doi = {10.1109/SAI.2016.7555961},
journal = {Proceedings of 2016 SAI Computing Conference, SAI 2016},
place = {United States},
year = {2016},
month = {9}
}
Works referencing / citing this record:
Machine Learning Predictions for Underestimation of Job Runtime on HPC System
book, January 2018
- Guo, Jian; Nomura, Akihiro; Barton, Ryan
- Supercomputing Frontiers