Fault Modeling of Extreme Scale Applications Using Machine Learning

Vishnu, Abhinav; Dam, Hubertus van; Tallent, Nathan R.; Kerbyson, Darren J.; Hoisie, Adolfy

doi:10.1109/IPDPS.2016.111

Abstract

Faults are commonplace in large-scale systems. These systems experience a variety of faults, including transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, which results in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
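The workflow the abstract describes (inject a multi-bit fault, record a fault signature, label the outcome, learn a predictor) can be illustrated with a minimal sketch. This is not the paper's implementation. The toy application (a mean over an array), the signature attributes (`highest_bit`, `lowest_bit`), the error tolerance, and the one-feature decision stump are all assumptions chosen to keep the example self-contained.

```python
import random
import struct


def flip_bits(value, bit_positions):
    """Return a float64 with the given bit positions inverted (0 = least significant)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for pos in bit_positions:
        raw ^= 1 << pos
    (faulty,) = struct.unpack("<d", struct.pack("<Q", raw))
    return faulty


def toy_application(data):
    """Hypothetical stand-in for a real application: the mean of the data."""
    return sum(data) / len(data)


def inject_and_label(data, rng, n_bits=2, tolerance=1e-6):
    """Inject one multi-bit fault, rerun the application, return (signature, is_error)."""
    golden = toy_application(data)
    index = rng.randrange(len(data))
    positions = rng.sample(range(64), n_bits)
    corrupted = list(data)
    corrupted[index] = flip_bits(corrupted[index], positions)
    deviation = abs(toy_application(corrupted) - golden)
    # NaN and infinity fail the comparison, so they are labeled as errors too.
    is_error = not (deviation <= tolerance * abs(golden))
    # Assumed signature: a real one would also carry system and application state.
    signature = {"highest_bit": max(positions), "lowest_bit": min(positions)}
    return signature, is_error


def fit_stump(samples):
    """Learn the threshold t that best fits the rule: error iff highest_bit >= t."""
    best_threshold, best_correct = 0, -1
    for threshold in range(65):
        correct = sum(
            (sig["highest_bit"] >= threshold) == label for sig, label in samples
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict(threshold, signature):
    """Predict whether a fault with this signature leads to an application error."""
    return signature["highest_bit"] >= threshold


def run_experiment(seed=0, n_samples=2000):
    """Generate labeled fault injections, train on 75%, report held-out accuracy."""
    rng = random.Random(seed)
    data = [1.0 + rng.random() for _ in range(1000)]
    samples = [inject_and_label(data, rng) for _ in range(n_samples)]
    split = (3 * n_samples) // 4
    train, test = samples[:split], samples[split:]
    threshold = fit_stump(train)
    accuracy = sum(predict(threshold, s) == y for s, y in test) / len(test)
    return threshold, accuracy


threshold, accuracy = run_experiment(seed=0)
print(f"learned bit threshold: {threshold}, held-out accuracy: {accuracy:.3f}")
```

In this toy setting, flips confined to low mantissa bits change the result by less than the tolerance, while flips that reach the high mantissa, exponent, or sign bits do not, so a single threshold separates most cases. A real application would need a richer signature and a stronger learner.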

Authors:

Vishnu, Abhinav ^[1]; Dam, Hubertus van ^[2]; Tallent, Nathan R. ^[1]; Kerbyson, Darren J. ^[1]; Hoisie, Adolfy ^[1]

^[1] Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
^[2] Brookhaven National Lab. (BNL), Upton, NY (United States)

Publication Date: May 1, 2016

Research Org.: Brookhaven National Lab. (BNL), Upton, NY (United States)

Sponsoring Org.: USDOE; Laboratory-Directed Research and Development (LDRD)

OSTI Identifier: 1336191

Report Number(s): BNL-112692-2016-JA
Journal ID: ISSN 1530-2075

Grant/Contract Number: SC0012704

Resource Type: Accepted Manuscript

Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Additional Journal Information: Conference: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL (United States), 23-27 May 2016

Publisher: IEEE

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Faults; Memory; Design Elements; Exascale; Modeling; Machine Learning; Applications

Citation Formats


Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States: N. p., 2016. Web. doi:10.1109/IPDPS.2016.111.

Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., & Hoisie, Adolfy. Fault Modeling of Extreme Scale Applications Using Machine Learning. United States. https://doi.org/10.1109/IPDPS.2016.111

Vishnu, Abhinav, Dam, Hubertus van, Tallent, Nathan R., Kerbyson, Darren J., and Hoisie, Adolfy. 2016. "Fault Modeling of Extreme Scale Applications Using Machine Learning". United States. https://doi.org/10.1109/IPDPS.2016.111. https://www.osti.gov/servlets/purl/1336191.
                    
@article{osti_1336191,

  title        = {Fault Modeling of Extreme Scale Applications Using Machine Learning},

  author       = {Vishnu, Abhinav and Dam, Hubertus van and Tallent, Nathan R. and Kerbyson, Darren J. and Hoisie, Adolfy},

  abstractNote = {Faults are commonplace in large-scale systems. These systems experience a variety of faults, including transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, which results in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, in which case a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. Lastly, we use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.},

  doi          = {10.1109/IPDPS.2016.111},

  journal      = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},

  place        = {United States},

  year         = {2016},

  month        = {5}

}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1109/IPDPS.2016.111

Citation Metrics:

Cited by: 12 works

Citation information provided by
Web of Science

Works referenced in this record:

Measuring architectural vulnerability factors
journal, November 2003

Mukherjee, S. S.; Weaver, C. T.; Emer, J.
IEEE Micro, Vol. 23, Issue 6
DOI: 10.1109/MM.2003.1261389

Liquid water: obtaining the right answer for the right reasons
conference, January 2009

Aprà, Edoardo; Rendell, Alistair P.; Harrison, Robert J.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09
DOI: 10.1145/1654059.1654127

Soft-LLFI: A Comprehensive Framework for Software Fault Injection
conference, November 2014

Aliabadi, Maryam Raiyat; Pattabiraman, Karthik; Bidokhti, Nematollah
2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)
DOI: 10.1109/ISSREW.2014.114

Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool
conference, November 2012

Li, Dong; Vetter, Jeffrey S.; Yu, Weikuan
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.29

PinADX: an interface for customizable debugging with dynamic instrumentation
conference, January 2012

Lueck, Gregory; Patil, Harish; Pereira, Cristiano
Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12
DOI: 10.1145/2259016.2259032

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
conference, May 2014

Guan, Qiang; Debardeleben, Nathan; Blanchard, Sean
2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2014.128

Quantifying software vulnerability
conference, January 2008

Sridharan, Vilas; Kaeli, David R.
Proceedings of the 2008 workshop on Radiation effects and fault tolerance in nanometer technologies - WREFT '08
DOI: 10.1145/1366224.1366225

A large-scale study of failures in high-performance computing systems
conference, January 2006

Schroeder, B.; Gibson, G. A.
International Conference on Dependable Systems and Networks (DSN'06)
DOI: 10.1109/DSN.2006.5

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2010.18

Fault-tolerant communication runtime support for data-centric programming models
conference, December 2010

Vishnu, Abhinav; Van Dam, Huub; De Jong, Wibe
2010 International Conference on High Performance Computing (HiPC)
DOI: 10.1109/HIPC.2010.5713195

Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples
journal, December 2010

van Dam, Hubertus J. J.; Vishnu, Abhinav; de Jong, Wibe A.
Journal of Chemical Theory and Computation, Vol. 7, Issue 1
DOI: 10.1021/ct100439u

Quantitatively Modeling Application Resilience with the Data Vulnerability Factor
conference, November 2014

Yu, Li; Li, Dong; Mittal, Sparsh
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2014.62

Correcting soft errors online in LU factorization
conference, January 2013

Davies, Teresa; Chen, Zizhong
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13
DOI: 10.1145/2493123.2462920

Fault resilience of the algebraic multi-grid solver
conference, January 2012

Casas, Marc; de Supinski, Bronis R.; Bronevetsky, Greg
Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
DOI: 10.1145/2304576.2304590

Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults
conference, January 2013

Sridharan, Vilas; Stearley, Jon; DeBardeleben, Nathan
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
DOI: 10.1145/2503210.2503257

A study of DRAM failures in the field
conference, November 2012

Sridharan, Vilas; Liberty, Dean
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.13

Fault Tolerance in Message Passing Interface Programs
journal, August 2004

Gropp, William; Lusk, Ewing
The International Journal of High Performance Computing Applications, Vol. 18, Issue 3
DOI: 10.1177/1094342004046045

Characterizing the impact of soft errors on iterative methods in scientific computing
conference, January 2011

Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
Proceedings of the international conference on Supercomputing - ICS '11
DOI: 10.1145/1995896.1995922

Soft error vulnerability of iterative linear algebra methods
conference, January 2008

Bronevetsky, Greg; de Supinski, Bronis
Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
DOI: 10.1145/1375527.1375552
