OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability

Authors:
Tan, Li [1]; Debardeleben, Nathan A. [1]; Guan, Qiang [1]; Blanchard, Sean P. [1]; Lang, Michael Kenneth [1]
  1. Los Alamos National Laboratory
Publication Date:
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA), Office of Defense Programs (DP) (NA-10)
OSTI Identifier:
1369144
Report Number(s):
LA-UR-17-25325
DOE Contract Number:
AC52-06NA25396
Resource Type:
Conference
Resource Relation:
Conference: DSN 2017; 2017-06-26 - 2017-06-26; Denver, Colorado, United States
Country of Publication:
United States
Language:
English
Subject:
Computer Hardware; Computer Science

Citation Formats

Tan, Li, Debardeleben, Nathan A., Guan, Qiang, Blanchard, Sean P., and Lang, Michael Kenneth. RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability. United States: N. p., 2017. Web.
Tan, Li, Debardeleben, Nathan A., Guan, Qiang, Blanchard, Sean P., & Lang, Michael Kenneth. RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability. United States.
Tan, Li, Debardeleben, Nathan A., Guan, Qiang, Blanchard, Sean P., and Lang, Michael Kenneth. 2017. "RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability". United States. https://www.osti.gov/servlets/purl/1369144.
@article{osti_1369144,
title = {RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability},
author = {Tan, Li and Debardeleben, Nathan A. and Guan, Qiang and Blanchard, Sean P. and Lang, Michael Kenneth},
place = {United States},
year = 2017,
month = 7
}


  • Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is ~2,325% for serial execution and ~1,730% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips in the input and output matrices of a matrix-matrix multiply application, revealing vulnerability of data structures, masking and error propagation. xSim is the very first simulation-based MPI performance tool that supports the injection of both process failures and bit flip faults.
  • With ongoing chip miniaturization and voltage scaling, particle strike-induced soft errors present an increasingly severe threat to the reliability of on-chip caches. In this paper, we present a technique to reduce the vulnerability of caches to soft errors. Our technique uses data compression to reduce the number of vulnerable data bits in the cache and performs selective duplication of more critical data bits to provide extra protection to them. Microarchitectural simulations have shown that our technique is effective in reducing the architectural vulnerability factor (AVF) of the cache and outperforms another technique. For single and dual-core system configurations, the average reduction in AVF is 5.59X and 8.44X, respectively. Also, the implementation and performance overheads of our technique are minimal and it is useful for a broad range of workloads.
  • Devices become increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft errors primarily caused problems for space and high-atmospheric computing applications. Modern architectures now use features so small at sufficiently low voltages that soft errors are becoming significant even at terrestrial altitudes. The soft error vulnerability of iterative linear algebra methods, which many scientific applications use, is a critical aspect of the overall application vulnerability. These methods are often considered invulnerable to many soft errors because they converge from an imprecise solution to a precise one. However, we show that iterative methods can be vulnerable to soft errors, with a high rate of silent data corruptions. We quantify this vulnerability, with algorithms generating up to 8.5% erroneous results when subjected to a single bit-flip. Further, we show that detecting soft errors in an iterative method depends on its detailed convergence properties and requires more complex mechanisms than simply checking the residual. Finally, we explore inexpensive techniques to tolerate soft errors in these methods.
  • Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small at sufficiently low voltages that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
  • Understanding the soft error vulnerability of supercomputer applications is critical as these systems are using ever larger numbers of devices that have decreasing feature sizes and, thus, increasing frequency of soft errors. As many large-scale parallel scientific applications use BLAS and LAPACK linear algebra routines, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. This paper analyzes the vulnerability of these routines to soft errors by characterizing how their outputs are affected by injected errors and by evaluating several techniques for predicting how errors propagate from the input to the output of each routine. The resulting error profiles can be used to understand the fault vulnerability of full applications that use these routines.