SciTech Connect

Title: Investigating an API for resilient exascale computing.

Increased HPC capability comes with increased complexity, part counts, and fault occurrences. Increasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its benefits due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive notification and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy proposed for filesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single API for fault handling among hardware and OS and application system-wide remains elusive, the effort increased our understanding of both the mountainous challenges and the promising trailheads.
Authors: ; ; ; ; ;
Publication Date:
OSTI Identifier:1096503
Report Number(s):SAND2013-3790; 463782
DOE Contract Number:AC04-94AL85000
Resource Type:Technical Report
Research Org:University of New Mexico, Albuquerque, NM; Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:USDOE National Nuclear Security Administration (NNSA)
Country of Publication:United States
Language:English
Subject: