Title: Fenix A Portable Flexible Fault Tolerance Programming Framework for MPI Applications.

Abstract

Abstract not provided.

Authors:
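Van Der Wijngaart, Rob [1]; Gamell, Marc [2]; Teranishi, Keita; Valenzuela, Eric; Heroux, Michael A.; Parashaar, Manish [2]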
  1. (Intel)
  2. (Rutgers U)
Publication Date:
2016-10-01
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1406166
Report Number(s):
SAND2016-10732C
648564
DOE Contract Number:
AC04-94AL85000
Resource Type:
Conference
Resource Relation:
Conference: Proposed for presentation at the ExaMPI'16 Workshop held November 13, 2016 in Salt Lake City, UT.
Country of Publication:
United States
Language:
English

Citation Formats

Van Der Wijngaart, Rob, Gamell, Marc, Teranishi, Keita, Valenzuela, Eric, Heroux, Michael A., and Parashaar, Manish. Fenix A Portable Flexible Fault Tolerance Programming Framework for MPI Applications. United States: N. p., 2016. Web.
Van Der Wijngaart, Rob, Gamell, Marc, Teranishi, Keita, Valenzuela, Eric, Heroux, Michael A., & Parashaar, Manish. Fenix A Portable Flexible Fault Tolerance Programming Framework for MPI Applications. United States.
Van Der Wijngaart, Rob, Gamell, Marc, Teranishi, Keita, Valenzuela, Eric, Heroux, Michael A., and Parashaar, Manish. 2016. "Fenix A Portable Flexible Fault Tolerance Programming Framework for MPI Applications." United States. https://www.osti.gov/servlets/purl/1406166.
@article{osti_1406166,
title = {Fenix A Portable Flexible Fault Tolerance Programming Framework for MPI Applications.},
author = {Van Der Wijngaart, Rob and Gamell, Marc and Teranishi, Keita and Valenzuela, Eric and Heroux, Michael A. and Parashaar, Manish},
abstractNote = {Abstract not provided.},
place = {United States},
year = {2016},
month = {10}
}


  • Fenix provides APIs that let users add fault tolerance to MPI-based parallel programs in a transparent manner. Fenix-enabled programs can run through process failures during execution by drawing on a pool of spare processes managed by Fenix.
  • Future high-performance computing systems may face frequent failures with their rapid increase in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, which demands fault tolerance support for prevalent MPI applications. Among failure scenarios, process failures are one of the most severe issues as they usually lead to termination of applications. However, the widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that provides support for failure detection, failure notification and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and a molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.
  • This document provides a specification of Fenix, a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. The library consists of two modules. The first, termed process recovery, restores an application to a consistent state after it has suffered a loss of one or more MPI processes (ranks). The second specifies functions the user can invoke to store application data in Fenix-managed redundant storage, and to retrieve it from that storage after process recovery.
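
The process-recovery and data-recovery ideas summarized in the Fenix entries above can be illustrated with a toy model. The following is a minimal Python sketch that simulates, without MPI, how a failed rank might be replaced from a spare pool and how its data might be restored from a redundant copy. The class `ToyJob`, its method names, and the policy of keeping the second copy on the neighbouring rank are illustrative assumptions; they are not part of the Fenix API and do not reflect how Fenix is implemented.

```python
class ToyJob:
    """Toy simulation (not Fenix): logical ranks served by workers, plus idle spares."""

    def __init__(self, n_active, n_spares):
        # rank_to_worker[r] is the id of the worker currently serving logical rank r.
        self.rank_to_worker = list(range(n_active))
        # Idle workers held in reserve to replace failed ones.
        self.spares = list(range(n_active, n_active + n_spares))
        # memory[w] maps owner rank -> data held in worker w's memory.
        self.memory = {w: {} for w in range(n_active + n_spares)}

    def _holders(self, rank):
        # Assumed placement policy: one local copy, one copy on the next rank.
        n = len(self.rank_to_worker)
        return (rank, (rank + 1) % n)

    def store(self, rank, data):
        """Save `data` for `rank` redundantly on two workers."""
        for holder in self._holders(rank):
            self.memory[self.rank_to_worker[holder]][rank] = data

    def fail(self, rank):
        """Simulate loss of the worker serving `rank`, then repair from the spare pool."""
        if not self.spares:
            raise RuntimeError("spare pool exhausted")
        dead = self.rank_to_worker[rank]
        del self.memory[dead]  # everything the dead worker held is lost
        self.rank_to_worker[rank] = self.spares.pop(0)

    def restore(self, rank):
        """Return the first surviving copy of the data stored for `rank`."""
        for holder in self._holders(rank):
            copy = self.memory[self.rank_to_worker[holder]].get(rank)
            if copy is not None:
                return copy
        raise KeyError(f"no surviving copy for rank {rank}")


job = ToyJob(n_active=4, n_spares=1)
for r in range(4):
    job.store(r, f"state-{r}")
job.fail(2)                  # worker 2 is lost; spare worker 4 takes over rank 2
print(job.rank_to_worker)    # [0, 1, 4, 3]
print(job.restore(2))        # state-2, recovered from the copy held by rank 3
```

In this sketch the number of ranks seen by the application stays constant across the failure, which is the property that lets a program continue without shutting down. Note that the lost worker also held the backup copy for rank 1, so a real scheme would need to re-establish redundancy after recovery.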