Title: Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance

Conference
OSTI ID:1024715
 [1];  [1];  [2];  [3];  [4];  [5]
  1. ORNL
  2. Lawrence Livermore National Laboratory (LLNL)
  3. Argonne National Laboratory (ANL)
  4. Cray, Inc.
  5. Hewlett-Packard

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue executing even when MPI processes fail. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
DE-AC05-00OR22725
Resource Relation:
Conference: EuroMPI 2011, Santorini, Greece, 18-21 September 2011
Country of Publication:
United States
Language:
English