Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications
- Oak Ridge National Laboratory (ORNL)
- Pacific Northwest National Laboratory (PNNL)
Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both time and resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces both the number and the scale of the fault injection experiments to be performed and provides a validated way to study application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1474572
- Resource Relation:
- Conference: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, D.C., United States of America, May 1-4, 2018
- Country of Publication:
- United States
- Language:
- English