DOE PAGES

Title: Fault Modeling of Extreme Scale Applications Using Machine Learning

Faults are commonplace in large-scale systems. These systems experience a variety of faults, including transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present design elements such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
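The workflow summarized in the abstract (inject a multi-bit fault, record a fault signature, label whether an error results, then learn a predictor) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "application" is a toy array sum, an error is defined as a relative deviation above a hypothetical tolerance `rel_tol`, the fault signature is reduced to a single attribute (the highest flipped bit position), and a one-threshold decision stump stands in for the unspecified supervised learning algorithms. The names `flip_bits`, `causes_error`, `make_samples`, `train_stump`, and `run_experiment` are hypothetical.

```python
import math
import random
import struct


def flip_bits(value, bit_positions):
    """Flip the given bits (0 = least significant) of a float's IEEE 754 double encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for pos in bit_positions:
        raw ^= 1 << pos
    (faulty,) = struct.unpack("<d", struct.pack("<Q", raw))
    return faulty


def causes_error(data, index, bit_positions, rel_tol=1e-6):
    """Toy application (assumption): sum the array and compare against the fault-free sum."""
    golden = sum(data)
    corrupted = list(data)
    corrupted[index] = flip_bits(corrupted[index], bit_positions)
    observed = sum(corrupted)
    if not math.isfinite(observed):
        return True  # NaN or infinity is always treated as an error
    return abs(observed - golden) > rel_tol * abs(golden)


def make_samples(data, count, rng):
    """Inject `count` random 2-bit faults; return (highest flipped bit, error label) pairs."""
    samples = []
    for _ in range(count):
        index = rng.randrange(len(data))
        bits = rng.sample(range(64), 2)  # two distinct bit positions
        samples.append((max(bits), causes_error(data, index, bits)))
    return samples


def train_stump(samples):
    """Pick the threshold t that best fits the rule: predict error iff highest bit >= t."""
    best_t, best_correct = 0, -1
    for t in range(65):
        correct = sum((m >= t) == label for m, label in samples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def run_experiment(seed=0):
    """Train on injected faults, then report the threshold and held-out accuracy."""
    rng = random.Random(seed)
    data = [rng.uniform(1.0, 2.0) for _ in range(64)]
    train = make_samples(data, 400, rng)
    test = make_samples(data, 200, rng)
    threshold = train_stump(train)
    accuracy = sum((m >= threshold) == label for m, label in test) / len(test)
    return threshold, accuracy


if __name__ == "__main__":
    threshold, accuracy = run_experiment()
    print(f"learned threshold bit: {threshold}, held-out accuracy: {accuracy:.2f}")
```

In this toy setting, flips confined to low mantissa bits stay below the tolerance and can be ignored, while flips that reach high mantissa, exponent, or sign bits change the sum enough to count as errors. A realistic fault signature would carry many more system and application attributes than the single one used here.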
Authors:
 [1] ;  [2] ;  [1] ;  [1] ;  [1]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Brookhaven National Lab. (BNL), Upton, NY (United States)
Publication Date:
Report Number(s):
BNL-112692-2016-JA
Journal ID: ISSN 1530-2075
Grant/Contract Number:
SC0012704
Type:
Accepted Manuscript
Journal Name:
Parallel and Distributed Processing Symposium, 2016 IEEE International
Additional Journal Information:
Conference: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL (United States), 23-27 May 2016; Journal ID: ISSN 1530-2075
Research Org:
Brookhaven National Laboratory (BNL), Upton, NY (United States)
Sponsoring Org:
USDOE; Laboratory-Directed Research and Development (LDRD)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Faults; Memory; Design Elements; Exascale; Modeling; Machine Learning; Applications
OSTI Identifier:
1336191