OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fenix: A Portable Flexible Fault Tolerance Programming Framework for MPI Applications

Abstract

Abstract not provided.

Authors:
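Van Der Wijngaart, Rob [1]; Gamell, Marc [2]; Teranishi, Keita; Valenzuela, Eric; Heroux, Michael A.; Parashar, Manish [2]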
  1. (Intel)
  2. (Rutgers U)
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1406166
Report Number(s):
SAND2016-10732C
648564
DOE Contract Number:
AC04-94AL85000
Resource Type:
Conference
Resource Relation:
Conference: Proposed for presentation at the ExaMPI'16 Workshop held November 13, 2016 in Salt Lake City, UT.
Country of Publication:
United States
Language:
English

Citation Formats

Van Der Wijngaart, Rob, Gamell, Marc, Teranishi, Keita, Valenzuela, Eric, Heroux, Michael A., and Parashar, Manish. Fenix: A Portable Flexible Fault Tolerance Programming Framework for MPI Applications. United States: N. p., 2016. Web.
Van Der Wijngaart, Rob, Gamell, Marc, Teranishi, Keita, Valenzuela, Eric, Heroux, Michael A., & Parashar, Manish. Fenix: A Portable Flexible Fault Tolerance Programming Framework for MPI Applications. United States.
Van Der Wijngaart, Rob, Gamell, Marc, Teranishi, Keita, Valenzuela, Eric, Heroux, Michael A., and Parashar, Manish. 2016. "Fenix: A Portable Flexible Fault Tolerance Programming Framework for MPI Applications." United States. https://www.osti.gov/servlets/purl/1406166.
@article{osti_1406166,
title = {Fenix: A Portable Flexible Fault Tolerance Programming Framework for MPI Applications},
author = {Van Der Wijngaart, Rob and Gamell, Marc and Teranishi, Keita and Valenzuela, Eric and Heroux, Michael A. and Parashar, Manish},
abstractNote = {Abstract not provided.},
place = {United States},
year = 2016
}

Similar Records:
  • Fenix provides APIs that let users add fault tolerance to MPI-based parallel programs transparently. Fenix-enabled programs can run through process failures during execution by drawing on a pool of spare processes managed by Fenix.
  • Recent trends in high-performance computing point towards increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean-time-between-failures (MTBF) ranging from a few days to a few hours. As such, for long-running applications, the ability to efficiently recover from frequent failures is essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. We present results from a computational chemistry application running at scale to show that our techniques provide applications with a high degree of fault tolerance and low (2% to 4%) overhead for 2048 processors.
  • Many practical scientific applications would benefit from a simple checkpointing mechanism to provide automatic restart or recovery in response to faults and failures. CUMULVS is a middleware infrastructure for interacting with parallel scientific simulations to support online visualization and computational steering. The base CUMULVS system has been extended to provide a user-level mechanism for collecting checkpoints in a parallel simulation program. Via the same interface that CUMULVS uses to identify and describe data fields for visualization and parameters for steering, the user application can select the minimal program state necessary to restart or migrate an application task. The CUMULVS run-time system uses this information to efficiently recover fault-tolerant applications by restarting failed tasks. Application tasks can also be migrated, even across heterogeneous architecture boundaries, to achieve load balancing or to improve the task's locality with a required resource. This paper describes the CUMULVS interface for checkpointing, the issues faced in utilizing this interface when developing fault-tolerant and migrating applications, and the direction of future research in this area.
  • MicroQoSCORBA (MQC) is a middleware platform that focuses on embedded applications by providing a very fine level of configurability of its internal orthogonal components. Using this configurability, a developer can generate a customized middleware instantiation that is tailored to both the requirements and constraints of a specific embedded application and the embedded hardware. One of the key components provided by MQC is a set of fault-tolerant mechanisms, which allow for support of applications that require a higher level of reliability. This document provides a detailed description of the algorithms and protocols selected for these mechanisms, along with a discussion of their implementation and incorporation into the MQC platform.