Fault Modeling of Extreme Scale Applications Using Machine Learning
Abstract
Faults are commonplace in large-scale systems. These systems experience a variety of faults, such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error (in which case a recovery algorithm should be invoked), or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
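The workflow the abstract outlines (inject a multi-bit fault, record a fault signature, label the outcome, learn a predictor) can be illustrated with a minimal sketch. This is not the paper's implementation. Every name below is hypothetical, a plain sum stands in for a real application such as NWChem or LULESH, the signature holds only three illustrative attributes, and a one-feature decision stump stands in for the supervised learning algorithms the paper evaluates.

```python
import random
import struct


def flip_bits(value, bit_positions):
    """Flip the given bits (0 = least significant) of a float's IEEE-754 binary64 encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for bit in bit_positions:
        raw ^= 1 << bit
    (faulty,) = struct.unpack("<d", struct.pack("<Q", raw))
    return faulty


def toy_application(data):
    """Hypothetical stand-in for an application kernel: a plain sum."""
    return sum(data)


def run_trial(rng, data, tolerance=1e-6):
    """Inject one 2-bit fault into a random element; return (signature, is_error)."""
    index = rng.randrange(len(data))
    bits = sorted(rng.sample(range(64), 2))
    faulty = list(data)
    faulty[index] = flip_bits(data[index], bits)

    golden = toy_application(data)
    observed = toy_application(faulty)
    # Written as "not (<=)" so that NaN and infinite results count as errors.
    is_error = not (abs(observed - golden) <= tolerance * abs(golden))

    # Assumed fault signature: where the fault landed and which bits flipped.
    signature = {"index": index, "lowest_bit": bits[0], "highest_bit": bits[1]}
    return signature, is_error


def fit_stump(samples):
    """Choose threshold t minimizing training mistakes for: error iff highest_bit >= t."""
    best_t, best_mistakes = 0, len(samples) + 1
    for t in range(65):
        mistakes = sum((sig["highest_bit"] >= t) != label for sig, label in samples)
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t


rng = random.Random(0)
data = [1.0 + rng.random() for _ in range(1000)]  # values in [1, 2)
train = [run_trial(rng, data) for _ in range(2000)]
test = [run_trial(rng, data) for _ in range(500)]

threshold = fit_stump(train)
accuracy = sum((sig["highest_bit"] >= threshold) == label for sig, label in test) / len(test)
print("learned threshold:", threshold, "held-out accuracy:", accuracy)
```

In this toy setting, flips confined to low mantissa bits stay within the tolerance and can be ignored, while flips reaching the high mantissa, exponent, or sign bits are labeled as errors. A real study would use richer signatures and application-specific correctness checks.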
- Authors:
- Vishnu, Abhinav; Dam, Hubertus van; Tallent, Nathan R.; Kerbyson, Darren J.; Hoisie, Adolfy
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Brookhaven National Lab. (BNL), Upton, NY (United States)
- Publication Date:
- 2016-05-01
- Research Org.:
- Brookhaven National Lab. (BNL), Upton, NY (United States)
- Sponsoring Org.:
- USDOE; Laboratory-Directed Research and Development (LDRD)
- OSTI Identifier:
- 1336191
- Report Number(s):
- BNL-112692-2016-JA; Journal ID: ISSN 1530-2075
- Grant/Contract Number:
- SC0012704
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Additional Journal Information:
- Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Conference: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL (United States), 23-27 May 2016; Journal ID: ISSN 1530-2075
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Faults; Memory; Design Elements; Exascale; Modeling; Machine Learning; Applications
Citation Formats
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States: N. p., 2016. Web. doi:10.1109/IPDPS.2016.111.
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., & Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States. https://doi.org/10.1109/IPDPS.2016.111
Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. 2016. "Fault Modeling of Extreme Scale Applications Using Machine Learning". United States. https://doi.org/10.1109/IPDPS.2016.111. https://www.osti.gov/servlets/purl/1336191.
@article{osti_1336191,
title = {Fault Modeling of Extreme Scale Applications Using Machine Learning},
author = {Vishnu, Abhinav and Dam, Hubertus van and Tallent, Nathan R. and Kerbyson, Darren J. and Hoisie, Adolfy},
doi = {10.1109/IPDPS.2016.111},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
place = {United States},
year = {2016},
month = {5}
}