Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance
- Oak Ridge National Laboratory (ORNL)
- Lawrence Livermore National Laboratory (LLNL)
- Argonne National Laboratory (ANL)
- Cray, Inc.
- Hewlett-Packard
The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue running even if MPI processes fail during execution. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- DE-AC05-00OR22725
- OSTI ID:
- 1024715
- Resource Relation:
- Conference: EuroMPI 2011, Santorini, Greece, September 18-21, 2011
- Country of Publication:
- United States
- Language:
- English