A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
Abstract
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
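As a rough illustration of the idea in the abstract, the following is a minimal, failure-free sketch of how a tree-based two-phase agreement can carry a reduction on its own vote messages. It is not the paper's algorithm: the binary tree layout, the bitwise-AND reduction, and the names `children` and `tree_agree` are all assumptions made for this example, and the error handling the abstract mentions is omitted entirely.

```python
def children(rank, size):
    """Children of `rank` in a binary tree laid over ranks 0..size-1 (assumed layout)."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def tree_agree(local_values):
    """Simulate a failure-free, tree-based two-phase agreement.

    Phase 1 (vote): every rank ANDs its value with its children's partial
    results and passes the result to its parent, so the reduction rides on
    the vote messages. Phase 2 (commit): the root's result is sent back down.
    Returns (decided value per rank, messages sent, tree depth).
    """
    size = len(local_values)
    partial = list(local_values)
    messages = 0

    # Phase 1: visit ranks from the highest down to 1, so each rank's
    # children have already contributed before it reports to its parent.
    for rank in range(size - 1, 0, -1):
        parent = (rank - 1) // 2
        partial[parent] &= partial[rank]
        messages += 1

    # Phase 2: the root (rank 0) holds the agreed value; push it down the tree.
    decided = [None] * size
    decided[0] = partial[0]
    for rank in range(size):
        for child in children(rank, size):
            decided[child] = decided[rank]
            messages += 1

    # Depth of the deepest rank; the number of rounds per phase grows with it.
    depth = size.bit_length() - 1
    return decided, messages, depth


if __name__ == "__main__":
    values = [0b111, 0b101, 0b110, 0b111, 0b111, 0b111, 0b111, 0b111]
    decided, messages, depth = tree_agree(values)
    print(decided, messages, depth)
```

In this sketch each tree edge carries exactly one message per phase, so the reduction adds no messages beyond those the two-phase commit already needs, and the number of communication rounds grows with the tree depth, roughly log2 of the number of ranks.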
- Authors:
- Hursey, Joshua J; Naughton, III, Thomas J; Vallee, Geoffroy R; Graham, Richard L
- ORNL
- Publication Date:
- 2011
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
- Sponsoring Org.:
- USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
- OSTI Identifier:
- 1024714
- DOE Contract Number:
- DE-AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: EuroMPI 2011, Santorini, Greece, September 18-21, 2011
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ALGORITHMS; TOLERANCE; COMPUTERS; COMPUTER CODES
Citation Formats
Hursey, Joshua J, Naughton, III, Thomas J, Vallee, Geoffroy R, and Graham, Richard L. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. United States: N. p., 2011. Web.
Hursey, Joshua J, Naughton, III, Thomas J, Vallee, Geoffroy R, & Graham, Richard L. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. United States.
Hursey, Joshua J, Naughton, III, Thomas J, Vallee, Geoffroy R, and Graham, Richard L. 2011. "A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI". United States.
@article{osti_1024714,
title = {A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI},
author = {Hursey, Joshua J and Naughton, III, Thomas J and Vallee, Geoffroy R and Graham, Richard L},
abstractNote = {The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.},
url = {https://www.osti.gov/biblio/1024714},
place = {United States},
year = {2011},
month = {jan}
}