U.S. Department of Energy
Office of Scientific and Technical Information

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Conference · SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS
DOI: https://doi.org/10.1109/SC.2014.78 · OSTI ID: 1567373
Application resilience is a key challenge that must be addressed to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.
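The recovery flow summarized in the abstract (keep checkpoints in memory, replace the failed process, restore state, resume without restarting the job) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions and is not Fenix's API: the names `Rank`, `checkpoint`, `recover`, and `run` are hypothetical, the "ranks" are plain Python objects rather than MPI processes, and the neighbor-copy ("buddy") scheme is one common way to realize diskless checkpointing, assumed here for illustration. "Implicitly coordinated" is read here as every rank checkpointing at the same iteration boundary, with no separate coordination protocol.

```python
import copy


class Rank:
    """One simulated process holding application state and in-memory checkpoints."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"step": 0, "value": float(rank_id)}
        self.local_ckpt = None   # this rank's own checkpoint (memory only, no disk)
        self.buddy_ckpt = None   # copy of the previous rank's checkpoint


def checkpoint(ranks):
    """All ranks checkpoint at the same iteration boundary (assumed scheme)."""
    n = len(ranks)
    for r in ranks:
        r.local_ckpt = copy.deepcopy(r.state)
    for r in ranks:
        # Store a redundant copy on the next rank so it survives r's failure.
        ranks[(r.rank_id + 1) % n].buddy_ckpt = copy.deepcopy(r.local_ckpt)


def recover(ranks, failed_id):
    """Replace the failed rank and roll every rank back to the last checkpoint."""
    n = len(ranks)
    buddy = ranks[(failed_id + 1) % n]
    replacement = Rank(failed_id)                      # stands in for a re-spawned process
    replacement.state = copy.deepcopy(buddy.buddy_ckpt)
    replacement.local_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    ranks[failed_id] = replacement
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = copy.deepcopy(r.local_ckpt)      # survivors roll back in memory
    # Rebuild the redundant copy that was lost together with the failed rank.
    prev = ranks[(failed_id - 1) % n]
    replacement.buddy_ckpt = copy.deepcopy(prev.local_ckpt)


def step(ranks):
    """Toy stand-in for one application iteration."""
    for r in ranks:
        r.state["step"] += 1
        r.state["value"] += 1.0


def run(num_ranks=4, total_steps=10, ckpt_every=3, fail_at=None):
    """Run the toy application; fail_at maps a step number to the rank that fails."""
    ranks = [Rank(i) for i in range(num_ranks)]
    pending = dict(fail_at or {})
    checkpoint(ranks)                                  # initial checkpoint at step 0
    while ranks[0].state["step"] < total_steps:
        step(ranks)
        s = ranks[0].state["step"]
        if s in pending:
            recover(ranks, pending.pop(s))             # online recovery, job keeps running
            continue
        if s % ckpt_every == 0:
            checkpoint(ranks)
    return [r.state for r in ranks]


if __name__ == "__main__":
    print(run() == run(fail_at={5: 2}))
```

In this toy model a run with an injected failure finishes with the same state as a failure-free run, at the cost of recomputing the steps since the last checkpoint. Real MPI recovery additionally has to detect the failure and repair communicators, which the simulation does not attempt to model.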
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC)
OSTI ID:
1567373
Conference Information:
Journal Name: SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS
Country of Publication:
United States
Language:
English

Similar Records

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
Conference · Mon Aug 01 00:00:00 EDT 2016 · OSTI ID:1567425

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:931501

Failure recovery for bulk synchronous applications with MPI stages
Journal Article · Tue Feb 26 19:00:00 EST 2019 · Parallel Computing · OSTI ID:1784608

Related Subjects