
Title: SpotSDC: Revealing the Silent Data Corruption Propagation in High-Performance Computing Systems

Journal Article · IEEE Transactions on Visualization and Computer Graphics

Rapid technology scaling is expected to make the hardware of high-performance computing (HPC) systems more susceptible to computational errors caused by random bit flips. Some bit flips may cause a program to crash or have a minimal effect on the output, but others may lead to silent data corruption (SDC), i.e., undetected yet significant output errors. Classical fault injection analysis methods employ uniform sampling of random bit flips during program execution to derive a statistical resiliency profile. However, summarizing such fault injection results in sufficient detail is difficult, and understanding the behavior of the fault-corrupted program remains a challenge. In this article, we introduce SpotSDC, a visualization system that facilitates the analysis of a program's resilience to SDC. SpotSDC provides multiple perspectives, at various levels of detail, on how the impact on the output depends on where in the source code the bit flip occurs, which bit is flipped, and when during execution it happens. SpotSDC also enables users to study code protection and provides new insights into the behavior of a fault-injected program. Based on lessons learned, we demonstrate how our findings can improve the fault injection campaign method.
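The classical campaign described in the abstract (uniformly sampling when and which bit to flip, then tallying outcomes) can be illustrated with a minimal sketch. This is not the SpotSDC implementation or the authors' injection tool. The toy kernel, the function names (`flip_bit`, `kernel`, `classify`, `campaign`), the relative tolerance, and the use of a non-finite result as a stand-in for a crash are all assumptions made for illustration.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 63 = sign) of an IEEE 754 double."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def kernel(data, fault=None):
    """Hypothetical toy kernel: a running sum of squares.

    `fault` is None for a fault-free run, or a (step, bit) pair naming the
    loop iteration after which one bit of the accumulator is flipped.
    """
    total = 0.0
    for step, x in enumerate(data):
        total += x * x
        if fault is not None and fault[0] == step:
            total = flip_bit(total, fault[1])
    return total


def classify(golden, observed, tolerance=1e-6):
    """Label one run. The tolerance is an arbitrary illustrative threshold."""
    if math.isnan(observed) or math.isinf(observed):
        return "crash"  # assumption: non-finite output stands in for a crash
    if abs(observed - golden) <= tolerance * abs(golden):
        return "masked"  # minimal effect on the output
    return "sdc"  # undetected yet significant output error


def campaign(data, trials, seed=0):
    """Uniformly sample (step, bit) injection sites and tally the outcomes."""
    rng = random.Random(seed)
    golden = kernel(data)
    counts = {"crash": 0, "masked": 0, "sdc": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        counts[classify(golden, kernel(data, fault))] += 1
    return counts


if __name__ == "__main__":
    print(campaign([1.0, 2.0, 3.0, 4.0], trials=1000))
```

The resulting counts are the kind of coarse statistical profile the abstract refers to. They report how often each outcome occurs, but they do not show where, when, or in which bit the damaging flips happened, which is the gap the abstract says SpotSDC addresses.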

Research Organization:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1868154
Report Number(s):
LLNL-JRNL-764021; 953990
Journal Information:
IEEE Transactions on Visualization and Computer Graphics, Vol. 27, Issue 10; ISSN 1077-2626
Publisher:
IEEE
Country of Publication:
United States
Language:
English
