U.S. Department of Energy
Office of Scientific and Technical Information

A Tunable, Software-based DRAM Error Detection and Correction Library for HPC

Conference ·
OSTI ID:1042909
 [1];  [2];  [3];  [1]
  1. ORNL
  2. Sandia National Laboratories (SNL)
  3. North Carolina State University
Proposed exascale systems will present considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly because of the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unable to cope with the expected soft error rate, so additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with 50% resource overhead, less than the 100% needed for double modular redundancy.
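The abstract does not describe how on-demand page integrity verification is implemented. The following toy sketch illustrates only the general idea and is not LIBSDC's actual design or API. Everything in it is an assumption made for illustration: the `ProtectedMemory` class, its `read`, `write`, and `lock` methods, the 4096-byte page size, and the use of CRC32 as the checksum. It covers detection only, not correction.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


class ProtectedMemory:
    """Toy model (hypothetical): memory split into pages, one CRC32 per page."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.num_pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
        # Stored checksum for every page, computed while the data is known good.
        self.checksums = [self._crc(p) for p in range(self.num_pages)]
        # Pages already verified since the last lock() call.
        self.verified = set()

    def _crc(self, page):
        start = page * PAGE_SIZE
        return zlib.crc32(self.data[start:start + PAGE_SIZE])

    def _verify(self, page):
        # "On demand": a page is checked only the first time it is touched.
        if page in self.verified:
            return
        if self._crc(page) != self.checksums[page]:
            raise RuntimeError(f"silent data corruption detected in page {page}")
        self.verified.add(page)

    def read(self, addr):
        self._verify(addr // PAGE_SIZE)
        return self.data[addr]

    def write(self, addr, value):
        page = addr // PAGE_SIZE
        self._verify(page)
        self.data[addr] = value
        # Refresh the stored checksum so the legitimate write is not flagged.
        self.checksums[page] = self._crc(page)

    def lock(self):
        # Forget verification state so every page is re-checked on next access.
        self.verified.clear()
```

In this sketch a bit flip injected directly into `data` after `lock()` is reported on the next `read` or `write` that touches the affected page, while untouched pages cost nothing. Recomputing the checksum on every write is a simplification chosen for brevity; a real library would need a cheaper way to track modified pages.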
Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
ORNL LDRD Director's R&D
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1042909
Country of Publication:
United States
Language:
English

Similar Records

An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications
Conference · Wed Dec 31 23:00:00 EST 2014 · OSTI ID:1335909

Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era
Conference · Thu Dec 31 23:00:00 EST 2015 · OSTI ID:1336035

Cooperative Application/OS DRAM Fault Recovery
Technical Report · Mon Apr 30 20:00:00 EDT 2012 · OSTI ID:1044954