Title: LADR: low-cost application-level detector for reducing silent output corruptions

Conference

Applications running on future high performance computing (HPC) systems are more likely to experience transient faults due to technology scaling trends: higher circuit density, smaller transistor size, and near-threshold voltage (NTV) operation. A transient fault can corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs). In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with four scientific workloads, and the results show that LADR achieved > 80% fault coverage with only ~3% runtime overhead and ~1% memory overhead. Compared to prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage while reducing runtime overheads by 21% to 75% and memory overheads by 35% to 55% for the evaluated workloads. We believe that low memory and runtime overheads, coupled with attractive detection precision, make LADR a viable approach for assuring correct output from large-scale high performance simulations.
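To make "watching for data anomalies in state variables" concrete, here is a minimal Python sketch of one generic anomaly check. It is an illustration only and not LADR's actual detector: the class name ChangeBoundDetector, the helper scan, and the slack and warmup parameters are all hypothetical, and the rule used (flag a value whose step-to-step change far exceeds the largest change seen so far) is an assumed stand-in for whatever anomaly model the paper uses. The compile-time data-flow analysis that selects which variables to monitor is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChangeBoundDetector:
    """Hypothetical detector for one monitored state variable.

    Assumption: the variable evolves smoothly across time steps, so a
    step-to-step change much larger than any change seen so far is
    treated as a possible silent data corruption.
    """
    slack: float = 2.0   # tolerated multiple of the largest change seen so far
    warmup: int = 3      # number of changes to observe before flagging anything
    _prev: Optional[float] = None
    _max_delta: float = 0.0
    _seen: int = 0

    def check(self, value: float) -> bool:
        """Return True if `value` looks anomalous, False otherwise."""
        if self._prev is None:
            self._prev = value
            return False
        delta = abs(value - self._prev)
        self._seen += 1
        suspicious = (self._seen > self.warmup
                      and delta > self.slack * self._max_delta)
        if not suspicious:
            # Only trusted values update the history, so one corrupted
            # sample does not widen the accepted bound.
            self._max_delta = max(self._max_delta, delta)
            self._prev = value
        return suspicious


def scan(values: List[float]) -> List[int]:
    """Return the indices of the values that the detector flags."""
    detector = ChangeBoundDetector()
    return [i for i, v in enumerate(values) if detector.check(v)]


if __name__ == "__main__":
    # A smooth series with one corrupted sample at index 5.
    print(scan([1.0, 1.1, 1.2, 1.3, 1.4, 1e6, 1.5]))
```

A check like this stores only a few scalars per monitored variable, which is one way an anomaly-based detector can keep memory overhead small. Note that this simple rule would flag any change at all in a variable that was constant during warm-up, so a practical detector would need a more careful bound.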

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1468063
Resource Relation:
Conference: International Symposium on High-Performance Parallel and Distributed Computing, New York, New York, June 11-15, 2018
Country of Publication:
United States
Language:
English
