A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
Conference · OSTI ID:1042909
- ORNL
- Sandia National Laboratories (SNL)
- North Carolina State University
Proposed exascale systems will present considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly because of the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for the expected soft-error rate, so additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with 50% resource overhead, less than the 100% needed for double modular redundancy.
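The idea of on-demand page integrity verification can be illustrated with a small simulation. The sketch below is not LIBSDC's implementation or API: the class `CheckedMemory`, the exception `CorruptionDetected`, the 4096-byte `PAGE_SIZE`, and the use of CRC32 as the integrity check are all assumptions made for illustration. It models memory as pages that each carry a stored checksum, verifies a page the first time it is touched after being locked, and refreshes checksums when pages are locked again. It shows detection only; correction would additionally need redundant data.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


class CorruptionDetected(Exception):
    """Raised when a page no longer matches its stored checksum."""


class CheckedMemory:
    """Toy model of on-demand page verification (hypothetical, not LIBSDC)."""

    def __init__(self, size):
        self.data = bytearray(size)
        pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
        # One checksum per page, computed while the page is "locked".
        self.checksums = [self._crc(p) for p in range(pages)]
        # Pages that have been verified and are currently accessible.
        self.unlocked = set()

    def _crc(self, page):
        start = page * PAGE_SIZE
        return zlib.crc32(bytes(self.data[start:start + PAGE_SIZE]))

    def _touch(self, addr):
        # Verify a locked page on first access, then mark it unlocked.
        page = addr // PAGE_SIZE
        if page not in self.unlocked:
            if self._crc(page) != self.checksums[page]:
                raise CorruptionDetected(f"page {page} failed verification")
            self.unlocked.add(page)

    def read(self, addr):
        self._touch(addr)
        return self.data[addr]

    def write(self, addr, value):
        self._touch(addr)
        self.data[addr] = value

    def relock(self):
        # Re-hash every unlocked page and lock it again.
        for page in self.unlocked:
            self.checksums[page] = self._crc(page)
        self.unlocked.clear()
```

In this model, a bit flipped in a locked page (for example `mem.data[addr] ^= 0x01` applied directly to the buffer) is caught by the next `read` or `write` that touches that page. How often pages are locked again trades verification cost against the window in which an unlocked page is unprotected, which is one plausible place for the kind of tuning the abstract mentions.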
- Research Organization: Oak Ridge National Laboratory (ORNL)
- Sponsoring Organization: ORNL LDRD Director's R&D
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1042909
- Country of Publication: United States
- Language: English
Similar Records
An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications
Conference · Wed Dec 31 23:00:00 EST 2014 · OSTI ID:1335909
Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era
Conference · Thu Dec 31 23:00:00 EST 2015 · OSTI ID:1336035
Cooperative Application/OS DRAM Fault Recovery
Technical Report · Mon Apr 30 20:00:00 EDT 2012 · OSTI ID:1044954