DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fault Modeling of Extreme Scale Applications Using Machine Learning

Abstract

Faults are commonplace in large-scale systems. These systems experience a variety of faults, such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
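The workflow the abstract describes (inject a multi-bit fault, record a fault signature, label the outcome, learn a predictor) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the summation kernel standing in for an application, the signature attributes (`index`, `highest_bit`, `num_bits`), the relative tolerance, and the single-threshold classifier are all hypothetical choices.

```python
import random
import struct


def flip_bits(value, bit_positions):
    """Flip the given bits (0 = least significant) of a float's IEEE-754 double encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for bit in bit_positions:
        raw ^= 1 << bit
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw))
    return corrupted


def toy_kernel(data):
    """Hypothetical stand-in for an application: a plain summation."""
    return sum(data)


def inject_and_label(data, index, bit_positions, rel_tol=1e-6):
    """Inject one multi-bit fault into data[index]; return (signature, is_error)."""
    golden = toy_kernel(data)
    faulty = list(data)
    faulty[index] = flip_bits(data[index], bit_positions)
    result = toy_kernel(faulty)
    # NaN and infinity fail the comparison, so they are labeled as errors.
    is_error = not (abs(result - golden) <= rel_tol * abs(golden))
    signature = {
        "index": index,
        "highest_bit": max(bit_positions),
        "num_bits": len(bit_positions),
    }
    return signature, is_error


def run_campaign(trials, seed=0, bits_per_fault=2):
    """Run a random fault-injection campaign and collect labeled signatures."""
    rng = random.Random(seed)
    data = [1.0 + 0.001 * i for i in range(64)]
    samples = []
    for _ in range(trials):
        index = rng.randrange(len(data))
        bits = rng.sample(range(64), bits_per_fault)
        samples.append(inject_and_label(data, index, bits))
    return samples


def fit_stump(samples):
    """Learn a threshold t: predict 'error' when highest_bit >= t (fewest training mistakes)."""
    best_t, best_mistakes = 0, len(samples) + 1
    for t in range(65):
        mistakes = sum((sig["highest_bit"] >= t) != label for sig, label in samples)
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t


def predict(threshold, signature):
    """Predict whether a fault with this signature leads to an application error."""
    return signature["highest_bit"] >= threshold
```

A typical use would be `samples = run_campaign(500)`, then `t = fit_stump(samples)`, then `predict(t, signature)` for a newly observed fault. A threshold on one attribute is the simplest possible supervised model and is used here only to keep the sketch short; the abstract refers more generally to supervised learning algorithms and ensembles over richer system and application attributes.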

Authors:
 Vishnu, Abhinav [1]; Dam, Hubertus van [2]; Tallent, Nathan R. [1]; Kerbyson, Darren J. [1]; Hoisie, Adolfy [1]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Brookhaven National Lab. (BNL), Upton, NY (United States)
Publication Date:
2016-05-01
Research Org.:
Brookhaven National Lab. (BNL), Upton, NY (United States)
Sponsoring Org.:
USDOE; Laboratory-Directed Research and Development (LDRD)
OSTI Identifier:
1336191
Report Number(s):
BNL-112692-2016-JA
Journal ID: ISSN 1530-2075
Grant/Contract Number:  
SC0012704
Resource Type:
Accepted Manuscript
Journal Name:
Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Additional Journal Information:
Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Conference: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL (United States), 23-27 May 2016; Journal ID: ISSN 1530-2075
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Faults; Memory; Design Elements; Exascale; Modeling; Machine Learning; Applications

Citation Formats

Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States: N. p., 2016. Web. doi:10.1109/IPDPS.2016.111.
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., & Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States. https://doi.org/10.1109/IPDPS.2016.111
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. 2016. "Fault Modeling of Extreme Scale Applications Using Machine Learning". United States. https://doi.org/10.1109/IPDPS.2016.111. https://www.osti.gov/servlets/purl/1336191.
@article{osti_1336191,
title = {Fault Modeling of Extreme Scale Applications Using Machine Learning},
author = {Vishnu, Abhinav and Dam, Hubertus van and Tallent, Nathan R. and Kerbyson, Darren J. and Hoisie, Adolfy},
doi = {10.1109/IPDPS.2016.111},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
place = {United States},
year = {2016},
month = {5}
}

Citation Metrics:
Cited by: 12 works
Citation information provided by
Web of Science

Works referenced in this record:

Measuring architectural vulnerability factors
journal, November 2003


Liquid water: obtaining the right answer for the right reasons
conference, January 2009

  • Aprà, Edoardo; Rendell, Alistair P.; Harrison, Robert J.
  • Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09
  • DOI: 10.1145/1654059.1654127

Soft-LLFI: A Comprehensive Framework for Software Fault Injection
conference, November 2014

  • Aliabadi, Maryam Raiyat; Pattabiraman, Karthik; Bidokhti, Nematollah
  • 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)
  • DOI: 10.1109/ISSREW.2014.114

Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
conference, November 2012

  • Li, Dong; Vetter, Jeffrey S.; Yu, Weikuan
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2012.29

PinADX: an interface for customizable debugging with dynamic instrumentation
conference, January 2012

  • Lueck, Gregory; Patil, Harish; Pereira, Cristiano
  • Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12
  • DOI: 10.1145/2259016.2259032

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
conference, May 2014

  • Guan, Qiang; DeBardeleben, Nathan; Blanchard, Sean
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.128

Quantifying software vulnerability
conference, January 2008

  • Sridharan, Vilas; Kaeli, David R.
  • Proceedings of the 2008 workshop on Radiation effects and fault tolerance in nanometer technologies - WREFT '08
  • DOI: 10.1145/1366224.1366225

A large-scale study of failures in high-performance computing systems
conference, January 2006

  • Schroeder, B.; Gibson, G. A.
  • International Conference on Dependable Systems and Networks (DSN'06)
  • DOI: 10.1109/DSN.2006.5

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2010.18

Fault-tolerant communication runtime support for data-centric programming models
conference, December 2010

  • Vishnu, Abhinav; Van Dam, Huub; De Jong, Wibe
  • 2010 International Conference on High Performance Computing (HiPC)
  • DOI: 10.1109/HIPC.2010.5713195

Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples
journal, December 2010

  • van Dam, Hubertus J. J.; Vishnu, Abhinav; de Jong, Wibe A.
  • Journal of Chemical Theory and Computation, Vol. 7, Issue 1
  • DOI: 10.1021/ct100439u

Quantitatively Modeling Application Resilience with the Data Vulnerability Factor
conference, November 2014

  • Yu, Li; Li, Dong; Mittal, Sparsh
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.62

Correcting soft errors online in LU factorization
conference, January 2013

  • Davies, Teresa; Chen, Zizhong
  • Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13
  • DOI: 10.1145/2493123.2462920

Fault resilience of the algebraic multi-grid solver
conference, January 2012

  • Casas, Marc; de Supinski, Bronis R.; Bronevetsky, Greg
  • Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
  • DOI: 10.1145/2304576.2304590

Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults
conference, January 2013

  • Sridharan, Vilas; Stearley, Jon; DeBardeleben, Nathan
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - SC '13
  • DOI: 10.1145/2503210.2503257

A study of DRAM failures in the field
conference, November 2012

  • Sridharan, Vilas; Liberty, Dean
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2012.13

Fault Tolerance in Message Passing Interface Programs
journal, August 2004

  • Gropp, William; Lusk, Ewing
  • The International Journal of High Performance Computing Applications, Vol. 18, Issue 3
  • DOI: 10.1177/1094342004046045

Characterizing the impact of soft errors on iterative methods in scientific computing
conference, January 2011

  • Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
  • Proceedings of the international conference on Supercomputing - ICS '11
  • DOI: 10.1145/1995896.1995922

Soft error vulnerability of iterative linear algebra methods
conference, January 2008

  • Bronevetsky, Greg; de Supinski, Bronis
  • Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
  • DOI: 10.1145/1375527.1375552