Machine learning based job status prediction in scientific clusters

Yoo, Wucherl; Sim, Alex; Wu, Kesheng

doi:10.1109/SAI.2016.7555961


Abstract

Large high-performance computing (HPC) systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. In the measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high to be used in mitigation procedures for predicted failures.
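The abstract reports three separate figures for the predictor: the share of cases predicted correctly, recall, and precision. As a minimal sketch of how such figures relate to each other, the Python snippet below computes them from actual and predicted job statuses. The function name `job_status_metrics`, the 0/1 label encoding, and the toy labels are assumptions made for illustration only; they are not taken from the paper and do not reproduce its data or its Random Forests model.

```python
from typing import Dict, Sequence


def job_status_metrics(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for job status predictions.

    Assumed encoding (hypothetical): 1 = unsuccessful job, 0 = successful job.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # failures correctly flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # successful jobs flagged as failures
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # failures that were missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # successful jobs correctly passed

    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy, made-up labels: 90 successful jobs followed by 10 unsuccessful ones.
actual = [0] * 90 + [1] * 10
# Hypothetical predictions: one false alarm, two missed failures.
predicted = [0] * 89 + [1] * 1 + [1] * 8 + [0] * 2

metrics = job_status_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because unsuccessful jobs are the minority class in this toy data, overall accuracy stays high even when some failures are missed, which is why recall and precision are worth reporting alongside it.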

Authors:

Yoo, Wucherl [1]; Sim, Alex [1]; Wu, Kesheng [1]

[1] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Publication Date: 2016-09-01

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1379580

Grant/Contract Number: AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: Proceedings of 2016 SAI Computing Conference, SAI 2016

Additional Journal Information: Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Decision trees; Prediction methods; Hardware; Reliability; Software; Complexity theory; Prediction algorithms; Job Log Analysis; Job Status Prediction

Citation Formats


Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States: N. p., 2016. Web. doi:10.1109/SAI.2016.7555961.

Yoo, Wucherl, Sim, Alex, & Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States. https://doi.org/10.1109/SAI.2016.7555961

Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. 2016. "Machine learning based job status prediction in scientific clusters". United States. https://doi.org/10.1109/SAI.2016.7555961. https://www.osti.gov/servlets/purl/1379580.

@article{osti_1379580,
  title        = {Machine learning based job status prediction in scientific clusters},
  author       = {Yoo, Wucherl and Sim, Alex and Wu, Kesheng},
  abstractNote = {Large high-performance computing (HPC) systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. In the measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high to be used in mitigation procedures for predicted failures.},
  doi          = {10.1109/SAI.2016.7555961},
  journal      = {Proceedings of 2016 SAI Computing Conference, SAI 2016},
  place        = {United States},
  year         = {2016},
  month        = {9}
}


Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1109/SAI.2016.7555961


Works referencing / citing this record:

Machine Learning Predictions for Underestimation of Job Runtime on HPC System
book, January 2018

Guo, Jian; Nomura, Akihiro; Barton, Ryan
Supercomputing Frontiers
DOI: 10.1007/978-3-319-69953-0_11

Similar Records in DOE PAGES and OSTI.GOV collections:

Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

Thesis/Dissertation Arumugam, Kamesh

Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of them employ irregular algorithms that exhibit data-dependent control-flow and irregular memory accesses. Furthermore, ...
https://doi.org/10.2172/1422715

Full Text Available

Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

Book Yoo, Wucherl ; Koo, Michelle ; Cao, Yu ; ...

Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance involving terabytes or petabytes of workflow data or measurement data of the executions, from complex workflows over a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks or debug the performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework, using state-of-the-...
https://doi.org/10.1007/978-3-319-33742-5_7

Full Text Available

Performance analysis tool for HPC and big data applications on scientific clusters

Book Yoo, W ; Koo, M ; Cao, Y ; ...

Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance involving terabytes or petabytes of workflow data or measurement data of the executions, from complex workflows over a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks or debug the performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework, using state-of-the-...
https://doi.org/10.1007/978-3-319-33742-5_7

Full Text Available

Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows

Conference Singh, Alok ; Altintas, Ilkay ; Schram, Malachi ; ...

In the history of Computer Science, the act of ‘delegation’ has been the greatest multiplier of society’s problem-solving ability. A scientist working on detecting anomalies in a phenomenon does not need to re-invent matrix multiplication techniques to solve her problem. Scientific workflows provide the ultimate ‘delegation’ mechanism, where a domain scientist can completely forget the specifics of ‘how’ her program will execute on a large cluster in an efficient and cost-effective manner and ...
https://doi.org/10.1109/BigData.2018.8622509

MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems

Conference Han, Jingoo ; Rafique, Mustafa ; Xu, Luna ; ...

Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the multi-dimensional large-scale data associated with DL, especially atop extant multiple graphics processing units (GPUs) in modern supercomputers, poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures bring different performance trends in training throughput compared to the existing studies. Existing DL optimizations such as larger batch size and GPU locality-aware scheduling have little effect on improving DL training throughput performance due to fast CPU-to-GPU connections. Additionally, DL training on multiple GPUs scales sublinearly. Thus, simply adding more GPUs to a system is ineffective. To ...
https://doi.org/10.1109/CCGrid49817.2020.00-66

Full Text Available
