U.S. Department of Energy
Office of Scientific and Technical Information

LADR: low-cost application-level detector for reducing silent output corruptions

Conference

Applications running on future high performance computing (HPC) systems are more likely to experience transient faults because of technology scaling trends: higher circuit density, smaller transistor size, and near-threshold voltage (NTV) operation. A transient fault can corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs). In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with four scientific workloads, and the results show that LADR achieved > 80% fault coverage with only ~3% runtime overhead and ~1% memory overhead. Compared to prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage while reducing runtime overheads by 21% to 75% and memory overheads by 35% to 55% for the evaluated workloads. We believe that this combination of low memory and runtime overheads with attractive detection precision makes LADR a viable approach for assuring correct output from large-scale high performance simulations.
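
The abstract describes anomaly-based detection only at a high level. As a rough illustration of the general idea of watching a state variable for data anomalies, the sketch below flags a value that strays too far from a linear extrapolation of its two most recent accepted values. This is not LADR's algorithm: the predictor, the fixed `tolerance`, and the names `AnomalyDetector` and `scan` are all assumptions made for this example, and the compile-time data-flow analysis that selects which variables to monitor is not modeled.

```python
from collections import deque


class AnomalyDetector:
    """Hypothetical detector for one monitored state variable.

    Assumption: the variable evolves smoothly across time steps, so a
    linear extrapolation of its last two accepted values is a usable
    prediction. This is an illustrative stand-in, not LADR's method.
    """

    def __init__(self, tolerance):
        self.tolerance = tolerance      # largest accepted prediction error
        self.history = deque(maxlen=2)  # last two accepted values

    def check(self, value):
        """Return True if `value` looks anomalous, False otherwise."""
        if len(self.history) < 2:
            # Not enough history to predict yet; accept the value.
            self.history.append(value)
            return False
        older, newer = self.history
        predicted = 2 * newer - older   # linear extrapolation
        anomalous = abs(value - predicted) > self.tolerance
        if not anomalous:
            # Only accepted values feed later predictions, so a flagged
            # value does not pollute the history.
            self.history.append(value)
        return anomalous


def scan(values, tolerance):
    """Return the indices of the values that the detector flags."""
    detector = AnomalyDetector(tolerance)
    return [i for i, v in enumerate(values) if detector.check(v)]


if __name__ == "__main__":
    # A smooth series with one corrupted sample at index 4.
    series = [0.0, 0.1, 0.2, 0.3, 1024.3, 0.4]
    print(scan(series, tolerance=0.05))
```

In a scheme like this, every monitored variable costs one check per time step plus a small history buffer, which is why shrinking the set of monitored variables, as the abstract describes, lowers both runtime and memory overheads.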

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1468063
Resource Relation:
Conference: International Symposium on High-Performance Parallel and Distributed Computing, New York, New York, June 11-15, 2018
Country of Publication:
United States
Language:
English
