
Title: Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications

Abstract

For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because there is no indication that there are errors during the execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 real-world HPC applications and discuss the runtime data features, as well as the impact of the SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges for guaranteeing low false alarms. Experiments show that our detector can detect 80-99.99% of SDCs with a false alarm rate less than 1% of iterations for most cases. The memory cost and detection overhead are reduced to 15% and 6.3%, respectively, for a large majority of applications.
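
The abstract describes detection by predicting runtime data and tuning a detection range from local prediction errors. The following is a minimal sketch of that general idea for a single scalar variable, not the authors' detector: the class name `AdaptiveRangeDetector`, the linear extrapolation, and the parameters `impact_bound`, `window`, and `scale` are all assumptions made for illustration.

```python
from collections import deque


class AdaptiveRangeDetector:
    """Toy one-variable detector: linear extrapolation plus an adaptive range.

    Illustrative only; every name and parameter here is hypothetical.
    """

    def __init__(self, impact_bound, window=5, scale=2.0):
        # Assumed: smallest deviation the user considers influential.
        self.impact_bound = impact_bound
        # Assumed: safety factor applied to recent prediction errors.
        self.scale = scale
        # Last two accepted values, used for linear extrapolation.
        self.history = deque(maxlen=2)
        # Recent absolute prediction errors (the "local runtime data").
        self.errors = deque(maxlen=window)

    def check(self, value):
        """Return True if `value` falls outside the current detection range."""
        if len(self.history) < 2:
            # Not enough data to predict yet; accept the value.
            self.history.append(value)
            return False
        prev2, prev1 = self.history
        predicted = 2 * prev1 - prev2  # linear extrapolation
        error = abs(value - predicted)
        # The range widens when recent predictions were poor, which is
        # what keeps false alarms low on less predictable data.
        radius = self.impact_bound + self.scale * max(self.errors, default=0.0)
        if error > radius:
            return True  # suspicious value: do not learn from it
        self.errors.append(error)
        self.history.append(value)
        return False


if __name__ == "__main__":
    detector = AdaptiveRangeDetector(impact_bound=0.05)
    for step, v in enumerate([0.0, 0.1, 0.2, 0.3, 5.0, 0.4]):
        print(step, v, "ALARM" if detector.check(v) else "ok")
```

In this sketch a flagged value is excluded from the history, so one corrupted sample does not distort later predictions. Deviations smaller than `impact_bound` are deliberately ignored, which mirrors the abstract's point about detecting only influential SDCs.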

Authors:
Di, Sheng; Cappello, Franck
Publication Date:
2016-10-01
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science - Office of Advanced Scientific Computing Research
OSTI Identifier:
1391750
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Journal Article
Resource Relation:
Journal Name: IEEE Transactions on Parallel and Distributed Systems; Journal Volume: 27; Journal Issue: 10
Country of Publication:
United States
Language:
English
Subject:
Exascale HPC; Fault Tolerance; Silent Data Corruption

Citation Formats

Di, Sheng, and Cappello, Franck. Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. United States: N. p., 2016. Web. doi:10.1109/TPDS.2016.2517639.
Di, Sheng, & Cappello, Franck. Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. United States. doi:10.1109/TPDS.2016.2517639.
Di, Sheng, and Cappello, Franck. 2016. "Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications". United States. doi:10.1109/TPDS.2016.2517639.
@article{osti_1391750,
title = {Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications},
author = {Di, Sheng and Cappello, Franck},
abstractNote = {For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous problems because there is no indication that there are errors during the execution. We propose an adaptive impact-driven method that can detect SDCs dynamically. The key contributions are threefold. (1) We carefully characterize 18 real-world HPC applications and discuss the runtime data features, as well as the impact of the SDCs on their execution results. (2) We propose an impact-driven detection model that does not blindly improve the prediction accuracy, but instead detects only influential SDCs to guarantee user-acceptable execution results. (3) Our solution can adapt to dynamic prediction errors based on local runtime data and can automatically tune detection ranges for guaranteeing low false alarms. Experiments show that our detector can detect 80-99.99% of SDCs with a false alarm rate less than 1% of iterations for most cases. The memory cost and detection overhead are reduced to 15% and 6.3%, respectively, for a large majority of applications.},
doi = {10.1109/TPDS.2016.2517639},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 10,
volume = 27,
place = {United States},
year = {2016},
month = {10}
}