SpotSDC: Revealing the Silent Data Corruption Propagation in High-Performance Computing Systems
- Univ. of Utah, Salt Lake City, UT (United States)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
The trend of rapid technology scaling is expected to make the hardware of high-performance computing (HPC) systems more susceptible to computational errors caused by random bit flips. Some bit flips cause a program to crash or have a minimal effect on the output, but others lead to silent data corruption (SDC), i.e., undetected yet significant output errors. Classical fault injection analysis methods employ uniform sampling of random bit flips during program execution to derive a statistical resiliency profile. However, summarizing such fault injection results with sufficient detail is difficult, and understanding the behavior of the fault-corrupted program remains a challenge. In this article, we introduce SpotSDC, a visualization system that facilitates the analysis of a program's resilience to SDC. SpotSDC provides multiple perspectives, at various levels of detail, on how the impact on the output relates to where in the source code the bit flip occurs, which bit is flipped, and when during the execution it happens. SpotSDC also enables users to study code protection and provides new insights into the behavior of a fault-injected program. Based on lessons learned, we demonstrate how our findings can improve the fault injection campaign method.
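The classical approach mentioned in the abstract (uniform sampling of bit flips, then classifying each run) can be illustrated with a minimal sketch. This is not SpotSDC or the authors' injection tool: the toy `kernel`, the helper names `flip_bit`, `classify`, and `campaign`, the relative-error tolerance, and the use of a non-finite result as a stand-in for a crash are all assumptions made for illustration only.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE 754 float64 encoding of `value`."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def kernel(data, fault=None):
    """Toy program (hypothetical): a running sum.

    `fault` is an optional (step, bit) pair; the accumulator is corrupted
    once, right after the addition at that step.
    """
    acc = 0.0
    for step, x in enumerate(data):
        acc += x
        if fault is not None and fault[0] == step:
            acc = flip_bit(acc, fault[1])
    return acc


def classify(golden: float, observed: float, tol: float = 1e-6) -> str:
    """Label one run. The tolerance is an arbitrary illustrative threshold."""
    if not math.isfinite(observed):
        return "crash"  # assumption: inf/NaN stands in for a real crash
    scale = abs(golden) or 1.0
    return "benign" if abs(observed - golden) / scale <= tol else "sdc"


def campaign(data, trials: int, seed: int = 0) -> dict:
    """Uniformly sample (step, bit) pairs and count the outcome of each run."""
    rng = random.Random(seed)
    golden = kernel(data)  # fault-free reference output
    counts = {"benign": 0, "sdc": 0, "crash": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        counts[classify(golden, kernel(data, fault))] += 1
    return counts


if __name__ == "__main__":
    print(campaign([1.0] * 100, trials=1000))
```

The resulting counts are the kind of aggregate resiliency profile the abstract describes. They say nothing about where in the code, at which bit, or at what time the damaging flips occurred, which is the level of detail the abstract says is hard to summarize.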
- Research Organization:
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- Grant/Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1868154
- Report Number(s):
- LLNL-JRNL-764021; 953990
- Journal Information:
- IEEE Transactions on Visualization and Computer Graphics, Vol. 27, Issue 10; ISSN 1077-2626
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English