OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Comparative Analysis of Soft-Error Detection Strategies: A Case Study with Iterative Methods

Abstract

Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC), an undesirable outcome where invalid results pass for valid ones. This has motivated the design of soft error detectors to minimize SDCs. However, the detectors have been studied under different contexts, making comparative evaluation difficult. In this paper, we present the first comprehensive evaluation of four online soft error detection techniques in detecting the adverse impact of soft errors on iterative methods. We observe that, across five iterative methods, the detectors studied achieve high but not perfect detection rates. To understand the potential for improved detection, we evaluate a machine-learning-based detector that takes as input the runtime features that the individual detectors observe to arrive at their conclusions. Our evaluation demonstrates improved but still far from perfect detection accuracy for the machine-learning-based detectors. This extensive evaluation demonstrates the need for designing error detectors to handle the evolutionary behavior exhibited by iterative solvers.

Authors:
Kestor, Gokcen G. [1]; Mutlu, Burcu [2]; Manzano Franco, Joseph B. [2]; Subasi, Omer [2]; Unsal, Osman [3]; Krishnamoorthy, Sriram [2]
  1. Oak Ridge National Laboratory
  2. BATTELLE (PACIFIC NW LAB)
  3. Barcelona Supercomputing Center
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1572688
Report Number(s):
PNNL-SA-133097
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 15th ACM International Conference on Computing Frontiers (CF 2018), May 8-10, 2018, Ischia, Italy
Country of Publication:
United States
Language:
English
Subject:
Silent data corruption, Soft error detection, Iterative Methods

Citation Formats

Kestor, Gokcen G., Mutlu, Burcu, Manzano Franco, Joseph B., Subasi, Omer, Unsal, Osman, and Krishnamoorthy, Sriram. Comparative Analysis of Soft-Error Detection Strategies: A Case Study with Iterative Methods. United States: N. p., 2018. Web. doi:10.1145/3203217.3203240.
Kestor, Gokcen G., Mutlu, Burcu, Manzano Franco, Joseph B., Subasi, Omer, Unsal, Osman, & Krishnamoorthy, Sriram. Comparative Analysis of Soft-Error Detection Strategies: A Case Study with Iterative Methods. United States. doi:10.1145/3203217.3203240.
Kestor, Gokcen G., Mutlu, Burcu, Manzano Franco, Joseph B., Subasi, Omer, Unsal, Osman, and Krishnamoorthy, Sriram. 2018. "Comparative Analysis of Soft-Error Detection Strategies: A Case Study with Iterative Methods". United States. doi:10.1145/3203217.3203240.
@article{osti_1572688,
title = {Comparative Analysis of Soft-Error Detection Strategies: A Case Study with Iterative Methods},
author = {Kestor, Gokcen G. and Mutlu, Burcu and Manzano Franco, Joseph B. and Subasi, Omer and Unsal, Osman and Krishnamoorthy, Sriram},
abstractNote = {Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC), an undesirable outcome where invalid results pass for valid ones. This has motivated the design of soft error detectors to minimize SDCs. However, the detectors have been studied under different contexts, making comparative evaluation difficult. In this paper, we present the first comprehensive evaluation of four online soft error detection techniques in detecting the adverse impact of soft errors on iterative methods. We observe that, across five iterative methods, the detectors studied achieve high but not perfect detection rates. To understand the potential for improved detection, we evaluate a machine-learning-based detector that takes as input the runtime features that the individual detectors observe to arrive at their conclusions. Our evaluation demonstrates improved but still far from perfect detection accuracy for the machine-learning-based detectors. This extensive evaluation demonstrates the need for designing error detectors to handle the evolutionary behavior exhibited by iterative solvers.},
doi = {10.1145/3203217.3203240},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {5}
}
