U.S. Department of Energy
Office of Scientific and Technical Information

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques

Conference
Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge stems from the very large number of cores and components and the resulting mean time between failures (MTBF), which is on the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be infeasible at exascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system (the Titan Cray XK7 at ORNL) and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.
Research Organization:
Rutgers Univ., Piscataway, NJ (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
FC02-06ER54857; AC05-00OR22725; SC0007455
OSTI ID:
1567425
Country of Publication:
United States
Language:
English

Similar Records

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Conference · Sat Nov 01 00:00:00 EDT 2014 · SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS · OSTI ID:1567373

Specification of Fenix MPI Fault Tolerance library (V.0.9)
Technical Report · Tue Jun 07 00:00:00 EDT 2016 · OSTI ID:1494327

Node failure resiliency for Uintah without checkpointing
Journal Article · Sat Jun 01 20:00:00 EDT 2019 · Concurrency and Computation. Practice and Experience · OSTI ID:1637354