DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fault Modeling of Extreme Scale Applications Using Machine Learning

Abstract

Faults are commonplace in large-scale systems, which experience transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
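
The workflow the abstract outlines (inject a multi-bit memory fault, record a fault signature, label the outcome, learn a predictor) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy "application" is a plain sum, the signature is just the highest flipped bit plus the array index, the error criterion is a relative tolerance, and the learner is a one-feature decision stump rather than the supervised algorithms the authors evaluate.

```python
import math
import random
import struct


def flip_bits(value, bit_positions):
    """Flip the given bits (0 = least significant) of an IEEE 754 double."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for b in bit_positions:
        raw ^= 1 << b
    (faulty,) = struct.unpack("<d", struct.pack("<Q", raw))
    return faulty


def toy_application(data):
    """Hypothetical stand-in for an application kernel: a plain sum."""
    return sum(data)


def inject_and_label(data, index, bit_positions, rel_tol=1e-6):
    """Inject a multi-bit fault at data[index]; return 1 (error) or 0 (benign)."""
    golden = toy_application(data)
    faulty_data = list(data)
    faulty_data[index] = flip_bits(data[index], bit_positions)
    observed = toy_application(faulty_data)
    if not math.isfinite(observed):  # NaN or infinity counts as an error
        return 1
    return int(abs(observed - golden) > rel_tol * abs(golden))


def make_dataset(n_samples, seed=0):
    """Build (signature, label) pairs from random 2-bit faults (assumed setup)."""
    rng = random.Random(seed)
    data = [rng.uniform(1.0, 2.0) for _ in range(64)]
    samples = []
    for _ in range(n_samples):
        bits = rng.sample(range(64), 2)
        index = rng.randrange(len(data))
        label = inject_and_label(data, index, bits)
        signature = (max(bits), index)  # hypothetical fault signature
        samples.append((signature, label))
    return samples


def train_stump(samples):
    """Fit a decision stump: predict error when the highest flipped bit >= t."""
    best_t, best_acc = 0, -1.0
    for t in range(65):
        correct = sum(int(sig[0] >= t) == label for sig, label in samples)
        acc = correct / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


if __name__ == "__main__":
    threshold, accuracy = train_stump(make_dataset(2000))
    print(f"predict error when highest flipped bit >= {threshold} "
          f"(training accuracy {accuracy:.3f})")
```

In this toy setting, flips confined to low mantissa bits change the sum by less than the tolerance and are labeled benign, while flips that reach high mantissa, exponent, or sign bits are labeled errors. A real study would replace the sum with instrumented application data structures and use richer signatures and learners.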

Authors:
Vishnu, Abhinav [1]; Dam, Hubertus van [2]; Tallent, Nathan R. [1]; Kerbyson, Darren J. [1]; Hoisie, Adolfy [1]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Brookhaven National Lab. (BNL), Upton, NY (United States)
Publication Date:
Research Org.:
Brookhaven National Lab. (BNL), Upton, NY (United States)
Sponsoring Org.:
USDOE; Laboratory-Directed Research and Development (LDRD)
OSTI Identifier:
1336191
Report Number(s):
BNL-112692-2016-JA
Journal ID: ISSN 1530-2075
Grant/Contract Number:  
SC0012704
Resource Type:
Accepted Manuscript
Journal Name:
Parallel and Distributed Processing Symposium, 2016 IEEE International
Additional Journal Information:
Conference: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL (United States), 23-27 May 2016; Journal ID: ISSN 1530-2075
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Faults; Memory; Design Elements; Exascale; Modeling; Machine Learning; Applications

Citation Formats

Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States: N. p., 2016. Web. doi:10.1109/IPDPS.2016.111.
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., & Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States. doi:10.1109/IPDPS.2016.111.
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. Sun . "Fault Modeling of Extreme Scale Applications Using Machine Learning". United States. doi:10.1109/IPDPS.2016.111. https://www.osti.gov/servlets/purl/1336191.
@article{osti_1336191,
title = {Fault Modeling of Extreme Scale Applications Using Machine Learning},
author = {Vishnu, Abhinav and Dam, Hubertus van and Tallent, Nathan R. and Kerbyson, Darren J. and Hoisie, Adolfy},
abstractNote = {Faults are commonplace in large-scale systems, which experience transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.},
doi = {10.1109/IPDPS.2016.111},
journal = {Parallel and Distributed Processing Symposium, 2016 IEEE International},
number = {},
volume = {},
place = {United States},
year = {2016},
month = {5}
}

