Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth; Chen, Jacqueline; Klasky, Scott; Parashar, Manish

doi:10.1109/SC.2014.78

Conference · Sat Nov 01 04:00:00 EDT 2014 · SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS

DOI: https://doi.org/10.1109/SC.2014.78 · OSTI ID: 1567373

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.
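
The recovery flow the abstract describes (checkpoint in memory, replace the failed process, restore state, resume) can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions and is not Fenix's API: it uses no MPI, the names `Rank`, `checkpoint`, `recover` and `run` are hypothetical, and the ring-neighbor ("buddy") placement of checkpoint copies is an assumption made for illustration. The abstract itself says only that checkpointing is application-driven, diskless, and implicitly coordinated.

```python
import copy


class Rank:
    """One simulated process: live state plus in-memory checkpoints."""

    def __init__(self, rank_id, value):
        self.rank_id = rank_id
        self.state = {"step": 0, "value": value}
        self.local_ckpt = None  # own last checkpoint, kept in memory (diskless)
        self.buddy_ckpt = None  # copy of the left neighbor's checkpoint


def checkpoint(ranks):
    """Each rank saves its state and places a copy with its right neighbor.

    Assumption: ring-neighbor placement, chosen here only for illustration.
    """
    n = len(ranks)
    for i, r in enumerate(ranks):
        r.local_ckpt = copy.deepcopy(r.state)
        ranks[(i + 1) % n].buddy_ckpt = copy.deepcopy(r.state)


def compute(ranks):
    """One application time step (a stand-in for real work)."""
    for r in ranks:
        r.state["step"] += 1
        r.state["value"] += r.rank_id + 1


def recover(ranks, failed):
    """Replace the failed rank with a spare and roll all ranks back."""
    n = len(ranks)
    spare = Rank(failed, 0)
    # The failed rank's checkpoint survives in its right neighbor's memory.
    spare.state = copy.deepcopy(ranks[(failed + 1) % n].buddy_ckpt)
    spare.local_ckpt = copy.deepcopy(spare.state)
    # The spare must again hold the left neighbor's checkpoint copy.
    spare.buddy_ckpt = copy.deepcopy(ranks[(failed - 1) % n].local_ckpt)
    ranks[failed] = spare
    # Survivors roll back to the same checkpoint so all ranks agree.
    for r in ranks:
        if r.rank_id != failed:
            r.state = copy.deepcopy(r.local_ckpt)


def run(num_ranks=4, steps=6, ckpt_every=2, fail_at=None):
    """Run the toy job; fail_at is an optional (step, rank) failure to inject."""
    ranks = [Rank(i, 0) for i in range(num_ranks)]
    checkpoint(ranks)
    pending = fail_at
    while ranks[0].state["step"] < steps:
        compute(ranks)
        step = ranks[0].state["step"]
        if pending is not None and pending[0] == step:
            recover(ranks, pending[1])  # job keeps running; no restart
            pending = None
            continue
        if step % ckpt_every == 0:
            checkpoint(ranks)
    return [r.state["value"] for r in ranks]
```

Calling `run(fail_at=(3, 2))` injects one failure on rank 2 at step 3 and returns the same final values as the failure-free `run()`, at the cost of recomputing the steps since the last checkpoint.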

OSTI does not have a digital full text copy available.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)

Sponsoring Organization: USDOE Office of Science (SC)

OSTI ID: 1567373

Conference Information: Journal Name: SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS

Country of Publication: United States

Language: English

Similar Records

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques

Conference · Mon Aug 01 00:00:00 EDT 2016 · OSTI ID:1567425

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:931501

Failure recovery for bulk synchronous applications with MPI stages

Journal Article · Tue Feb 26 19:00:00 EST 2019 · Parallel Computing · OSTI ID:1784608

Related Subjects

Computer Science
