Title: Havens: Explicit Reliable Memory Regions for HPC Applications

Conference · OSTI ID: 1330545

Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in system memory. Existing hardware-based detection and recovery techniques will be inadequate at high memory fault rates. In this paper, we propose a partial memory protection scheme based on region-based memory management. We define regions called havens that provide fault protection for program objects, and we make these regions reliable through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in havens. Unlike algorithm-based fault tolerance techniques, the fault coverage provided by our approach is application agnostic.
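
The abstract does not say how the parity protection is implemented. As a rough illustration of the general idea only, the sketch below keeps a single XOR parity word over a region of integer words and recomputes it to detect corruption. The `Haven` class, its method names, and the single-parity-word layout are assumptions made for this example, not the paper's design or API.

```python
from functools import reduce
from operator import xor


class Haven:
    """Hypothetical parity-protected region of integer words (illustration only)."""

    def __init__(self, n_words):
        self.words = [0] * n_words
        self.parity = 0  # XOR of all words currently stored in the region

    def write(self, index, value):
        # Update parity incrementally: cancel the old word, fold in the new one.
        self.parity ^= self.words[index] ^ value
        self.words[index] = value

    def read(self, index):
        return self.words[index]

    def is_intact(self):
        # Recompute the XOR over the region and compare with the stored parity.
        return reduce(xor, self.words, 0) == self.parity


haven = Haven(4)
haven.write(0, 0xDEADBEEF)
haven.write(1, 42)
ok_before = haven.is_intact()

# Simulate a transient fault: flip one bit without going through write().
haven.words[1] ^= 1 << 3
ok_after = haven.is_intact()
```

In this sketch `ok_before` is true and `ok_after` is false, so the single bit flip is detected. One XOR parity word cannot locate the faulty word, and it misses faults that flip the same bit position in two different words; a practical scheme would need finer-grained or additional redundancy.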

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC); USDOE Laboratory Directed Research and Development (LDRD) Program
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1330545
Resource Relation:
Conference: IEEE High Performance Extreme Computing Conference (HPEC 16), Waltham, MA, USA, September 13-15, 2016
Country of Publication:
United States
Language:
English