Investigating an API for resilient exascale computing.

Stearley, Jon R.; Tomkins, James; VanDyke, John P.; Ferreira, Kurt Brian; Bridges, Patrick

doi:10.2172/1096503

Technical Report · May 1, 2013

DOI: https://doi.org/10.2172/1096503 · OSTI ID: 1096503

Increased HPC capability comes with increased complexity, part counts, and fault occurrences. Increasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its benefits due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive notification and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy proposed for filesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single API for fault handling among hardware and OS and application system-wide remains elusive, the effort increased our understanding of both the mountainous challenges and the promising trailheads.

Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); University of New Mexico, Albuquerque, NM

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: AC04-94AL85000

OSTI ID: 1096503

Report Number(s): SAND2013-3790; 463782

Country of Publication: United States

Language: English

Similar Records

Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. Final report

Technical Report · April 16, 2020

Kramer, William; Jha, Saurabh; Brandt, James; +1 more

Keeping checkpoint/restart viable for exascale systems.

Technical Report · September 1, 2011

Riesen, Rolf E; Bridges, Patrick G; Stearley, Jon R; +6 more

...And Eat it Too: High Read Performance in Write-Optimized HPC I/O Middleware File Formats

Conference · January 1, 2009

Klasky, Scott A; Lofstead, J.; Bent, John; +5 more
