A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
Conference · OSTI ID: 1024714
- ORNL
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the functionality required by the new collective without any additional messages. It also describes error handling mechanisms that preserve the fault tolerance properties while maintaining overall scalability.
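The idea of piggybacking a reduction on a tree-based two-phase commit can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes a binary tree rooted at rank 0, models only the failure-free case, and omits the error handling mechanisms mentioned in the abstract. The names `children` and `agree` are hypothetical helpers introduced for illustration.

```python
from operator import and_


def children(rank, size):
    """Children of `rank` in a binary tree rooted at rank 0 (assumed layout)."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def agree(values, op):
    """Simulate a failure-free two-phase agreement over a binary tree.

    Phase 1 (vote): each process combines its value with its children's
    partial results using `op` and hands the result to its parent.
    Phase 2 (commit): the root's final value travels back down the tree.
    Returns (decided value per rank, number of message rounds).
    """
    size = len(values)

    def reduce_up(rank):
        # The reduction rides on the vote messages, so no extra
        # messages are needed beyond those of the commit protocol.
        partial, depth = values[rank], 0
        for c in children(rank, size):
            child_partial, child_depth = reduce_up(c)
            partial = op(partial, child_partial)
            depth = max(depth, child_depth + 1)
        return partial, depth

    def broadcast_down(rank, value, decided):
        decided[rank] = value
        for c in children(rank, size):
            broadcast_down(c, value, decided)

    result, depth = reduce_up(0)
    decided = [None] * size
    broadcast_down(0, result, decided)
    # One pass up and one pass down: 2 * tree depth rounds, i.e. O(log n).
    return decided, 2 * depth


if __name__ == "__main__":
    # Rank 2 votes 0, so a bitwise AND makes every rank decide 0.
    print(agree([1, 1, 0, 1, 1, 1, 1], and_))
```

In this simplified model, every rank ends with the same reduced value, and the number of rounds grows with the depth of the tree rather than with the number of processes.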
- Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
- Sponsoring Organization: USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
- DOE Contract Number: DE-AC05-00OR22725
- OSTI ID: 1024714
- Resource Relation: Conference: EuroMPI 2011, Santorini, Greece, September 18-21, 2011
- Country of Publication: United States
- Language: English
Similar Records
Preserving Collective Performance Across Process Failure for a Fault Tolerant MPI
Building a Fault Tolerant MPI Application: A Ring Communication Example
Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance