skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Investigating an API for resilient exascale computing.

Technical Report ·
DOI:https://doi.org/10.2172/1096503· OSTI ID:1096503

Increased HPC capability comes with increased complexity, part counts, and fault occurrences. In- creasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its bene ts due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive noti cation and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy proposed for lesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single API for fault handling among hardware and OS and application system-wide remains elusive, the e ort increased our understanding of both the mountainous challenges and the promising trailheads. 3

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); University of New Mexico,, Albuquerque, NM
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1096503
Report Number(s):
SAND2013-3790; 463782
Country of Publication:
United States
Language:
English