OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI

Abstract

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
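
The idea of folding a reduction into a tree-based two-phase commit can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the algorithm from the paper: the function name `tree_agree`, the binary-tree layout, the choice of bitwise AND as the reduction, and the premise that every survivor already knows the same set of failed ranks are all hypothetical choices made for illustration. The paper's error handling for failures that occur during the agreement is not modeled.

```python
def tree_agree(flags, failed=frozenset()):
    """Simulate a two-phase agreement over a binary tree of surviving ranks.

    flags  -- dict mapping rank -> integer flag contributed by that rank
    failed -- ranks assumed to be known as failed by every survivor
              (a simplifying assumption of this sketch)

    Returns (decisions, messages, depth): the value each surviving rank
    ends up with, the number of point-to-point messages simulated, and
    the depth of the tree.
    """
    alive = [r for r in sorted(flags) if r not in failed]
    if not alive:
        raise ValueError("no surviving ranks")
    n = len(alive)

    # Tree position i has its parent at position (i - 1) // 2.
    partial = [flags[r] for r in alive]
    messages = 0

    # Phase 1 (vote): each non-root position sends one message to its parent.
    # The reduction (bitwise AND) rides on that same message, so it adds
    # no messages of its own. Visiting positions from last to first
    # guarantees a node has heard from its children before it sends.
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] &= partial[i]
        messages += 1

    decision = partial[0]

    # Phase 2 (commit): the root's decision travels back down the tree,
    # again one message per non-root position.
    decisions = {alive[0]: decision}
    for i in range(1, n):
        decisions[alive[i]] = decision
        messages += 1

    # Depth of the last position in a heap-ordered binary tree: floor(log2(n)).
    depth = n.bit_length() - 1
    return decisions, messages, depth


# Usage: eight ranks, rank 3 has failed, rank 5 clears one bit of the flag.
flags = {rank: 0b11 for rank in range(8)}
flags[5] = 0b01
decisions, messages, depth = tree_agree(flags, failed={3})
```

In this sketch the message count is 2(n - 1) for n survivors, the same as a tree-based two-phase commit without the reduction, and the number of communication rounds grows with the tree depth, which is logarithmic in n.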

Authors:
 [1];  [1];  [1]
  1. ORNL
Publication Date:
2011
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
OSTI Identifier:
1024714
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: EuroMPI 2011, Santorini, Greece, September 18-21, 2011
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ALGORITHMS; TOLERANCE; COMPUTERS; COMPUTER CODES

Citation Formats

MLA:
Hursey, Joshua J, Naughton, III, Thomas J, Vallee, Geoffroy R, and Graham, Richard L. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. United States: N. p., 2011. Web.
APA:
Hursey, Joshua J, Naughton, III, Thomas J, Vallee, Geoffroy R, & Graham, Richard L. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. United States.
Chicago:
Hursey, Joshua J, Naughton, III, Thomas J, Vallee, Geoffroy R, and Graham, Richard L. 2011. "A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI". United States.
@article{osti_1024714,
title = {A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI},
author = {Hursey, Joshua J and Naughton, III, Thomas J and Vallee, Geoffroy R and Graham, Richard L},
abstractNote = {The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.},
url = {https://www.osti.gov/biblio/1024714},
place = {United States},
year = {2011}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
