OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Abstract

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.

Authors:
Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth; Chen, Jacqueline; Klasky, Scott; Parashar, Manish
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567373
Resource Type:
Conference
Journal Name:
SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS
Additional Journal Information:
Conference: Supercomputing Conference, New Orleans, LA, November 16-21, 2014
Country of Publication:
United States
Language:
English
Subject:
Computer Science

Citation Formats

Gamell, Marc, Katz, Daniel S., Kolla, Hemanth, Chen, Jacqueline, Klasky, Scott, and Parashar, Manish. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. United States: N. p., 2014. Web. doi:10.1109/SC.2014.78.
Gamell, Marc, Katz, Daniel S., Kolla, Hemanth, Chen, Jacqueline, Klasky, Scott, & Parashar, Manish. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. United States. doi:10.1109/SC.2014.78.
Gamell, Marc, Katz, Daniel S., Kolla, Hemanth, Chen, Jacqueline, Klasky, Scott, and Parashar, Manish. 2014. "Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis". United States. doi:10.1109/SC.2014.78.
@article{osti_1567373,
title = {Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
author = {Gamell, Marc and Katz, Daniel S. and Kolla, Hemanth and Chen, Jacqueline and Klasky, Scott and Parashar, Manish},
abstractNote = {Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.},
doi = {10.1109/SC.2014.78},
journal = {SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS},
number = {},
volume = {},
place = {United States},
year = {2014},
month = {11}
}

