Havens: Explicit Reliable Memory Regions for HPC Applications
Conference · OSTI ID:1330545
- ORNL
Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to cope with high memory fault rates. In this paper, we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
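The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of a parity-protected region, not the authors' mechanism. It assumes a fixed-size region of 64-bit words guarded by a single XOR parity word that is updated on every write and recomputed on verification. The class name `Haven` and the methods `write`, `read`, and `verify` are hypothetical names chosen for illustration.

```python
from functools import reduce
from operator import xor


class Haven:
    """Hypothetical region of 64-bit words guarded by one XOR parity word."""

    MASK = (1 << 64) - 1

    def __init__(self, num_words):
        self.words = [0] * num_words
        self.parity = 0  # XOR of all words; zero for an all-zero region

    def write(self, index, value):
        value &= self.MASK
        # Incremental update: XOR out the old word, XOR in the new one.
        self.parity ^= self.words[index] ^ value
        self.words[index] = value

    def read(self, index):
        return self.words[index]

    def verify(self):
        # Recompute the parity from scratch and compare with the stored one.
        return reduce(xor, self.words, 0) == self.parity


haven = Haven(4)
haven.write(0, 0xDEADBEEF)
haven.write(2, 42)
ok_before = haven.verify()

# Simulate a transient fault: flip one bit without going through write().
haven.words[2] ^= 1 << 5
ok_after = haven.verify()
```

In this sketch a single parity word only detects that the region was corrupted; locating or repairing the faulty word would need additional redundancy, which is left out here.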
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE Office of Science (SC); ORNL Program Development
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1330545
- Country of Publication: United States
- Language: English
Similar Records
A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
Conference · Sat Dec 31 23:00:00 EST 2011 · OSTI ID:1042909
Checksumming strategies for data in volatile memories
Conference · Tue Sep 09 00:00:00 EDT 2014 · OSTI ID:1236931
Application health monitoring for extreme-scale resiliency using cooperative fault management
Journal Article · Wed Jul 24 20:00:00 EDT 2019 · Concurrency and Computation. Practice and Experience · OSTI ID:1558573