OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: Machine learning based job status prediction in scientific clusters

Abstract

Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy is sufficiently high that it can be used to trigger mitigation procedures for predicted failures.
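The record itself contains no code, but the abstract outlines the workflow: monitored job characteristics are fed to a Random Forests classifier, which is then evaluated by recall and precision on the unsuccessful-job class. The minimal Python sketch below illustrates that pipeline only; the feature names (CPU cores, memory, queue wait, runtime) and the synthetic labels are hypothetical stand-ins rather than the paper's actual job-log data, and scikit-learn's RandomForestClassifier is used as a generic implementation of the algorithm.

```python
# Illustrative sketch of Random Forests based job status prediction.
# Feature names and synthetic data are hypothetical placeholders; the paper
# uses monitored job characteristics collected from scientific clusters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_jobs = 10_000

# Hypothetical per-job features: requested CPU cores, requested memory (GB),
# queue wait time (s), and elapsed runtime so far (s).
X = np.column_stack([
    rng.integers(1, 512, n_jobs),    # cpu_cores
    rng.uniform(1, 256, n_jobs),     # mem_gb
    rng.exponential(600, n_jobs),    # queue_wait_s
    rng.exponential(3600, n_jobs),   # runtime_s
])
# Hypothetical binary label: 1 = unsuccessful job status, 0 = successful.
y = (rng.random(n_jobs) < 0.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# The paper reports recall and precision for the unsuccessful (positive) class.
print("recall:    %.3f" % recall_score(y_test, pred))
print("precision: %.3f" % precision_score(y_test, pred))
```

On synthetic random labels the reported scores are of course meaningless; the sketch only shows the shape of the train/predict/score pipeline described in the abstract.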

Authors:
 Yoo, Wucherl [1]; Sim, Alex [1]; Wu, Kesheng [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2016
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1379580
Grant/Contract Number:
AC02-05CH11231
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Proceedings of 2016 SAI Computing Conference, SAI 2016
Additional Journal Information:
Conference: 2016 SAI Computing Conference (SAI), London (United Kingdom), 13-15 Jul 2016
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Decision trees; Prediction methods; Hardware; Reliability; Software; Complexity theory; Prediction algorithms; Job Log Analysis; Job Status Prediction

Citation Formats

Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States: N. p., 2016. Web. doi:10.1109/SAI.2016.7555961.
Yoo, Wucherl, Sim, Alex, & Wu, Kesheng. Machine learning based job status prediction in scientific clusters. United States. doi:10.1109/SAI.2016.7555961.
Yoo, Wucherl, Sim, Alex, and Wu, Kesheng. 2016. "Machine learning based job status prediction in scientific clusters". United States. doi:10.1109/SAI.2016.7555961. https://www.osti.gov/servlets/purl/1379580.
@article{osti_1379580,
title = {Machine learning based job status prediction in scientific clusters},
author = {Yoo, Wucherl and Sim, Alex and Wu, Kesheng},
abstractNote = {Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy is sufficiently high that it can be used to trigger mitigation procedures for predicted failures.},
doi = {10.1109/SAI.2016.7555961},
journal = {Proceedings of 2016 SAI Computing Conference, SAI 2016},
place = {United States},
year = 2016,
month = 9
}

Journal Article: Free Publicly Available Full Text (Publisher's Version of Record)
