Preserving Collective Performance Across Process Failure for a Fault Tolerant MPI

Hursey, Joshua J; Graham, Richard L

Conference · Sat Jan 01 00:00:00 EST 2011

OSTI ID:1024713

Hursey, Joshua J [1]; Graham, Richard L [1]

[1] ORNL

Application developers are investigating Algorithm-Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what traditional techniques alone can provide. To keep using High Performance Computing (HPC) systems efficiently in the presence of process failure, applications will depend on libraries that sustain failure-free performance after processes fail. Optimized Message Passing Interface (MPI) collective operations are a critical component of many scalable HPC applications, yet most collective algorithms cannot handle process failure. Next-generation MPI implementations must provide fault-aware versions of these algorithms that sustain performance across process failure. This paper discusses the design and implementation of fault-aware collective algorithms for tree-structured communication patterns. Three design approaches (rerouting, lookup avoiding, and rebalancing) are described and analyzed for their performance impact relative to a similar fault-unaware collective algorithm. The analysis shows that the rerouting approach degrades performance by up to a factor of four, while the rebalancing approach can bring performance within 1% of the fault-unaware algorithm. The paper also introduces a set of run-through stabilization semantics being developed by the MPI Forum's Fault Tolerance Working Group to support ABFT, and it underscores the care needed when designing new fault-aware collective algorithms for fault-tolerant MPI implementations.
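The abstract names the design approaches without detailing them, so the following is only an illustrative sketch of how rerouting and rebalancing might differ on a tree-structured pattern. It is not the authors' implementation. The binary tree layout, the function names, and the assumption that the root (rank 0) has not failed are all hypothetical choices made for this example; the lookup-avoiding approach is not shown because this record does not describe it.

```python
def tree_parent(rank):
    """Parent of `rank` in a binary tree rooted at rank 0 (None for the root)."""
    return None if rank == 0 else (rank - 1) // 2


def rerouted_parent(rank, failed):
    """Rerouting (illustrative): keep the original tree shape and walk up
    past failed ancestors until a live one is found.
    Assumes rank 0 is alive."""
    parent = tree_parent(rank)
    while parent is not None and parent in failed:
        parent = tree_parent(parent)
    return parent


def rebalanced_parents(size, failed):
    """Rebalancing (illustrative): rebuild a dense binary tree over the
    surviving ranks only, so no node inherits extra children."""
    alive = [r for r in range(size) if r not in failed]
    return {
        r: (None if i == 0 else alive[(i - 1) // 2])
        for i, r in enumerate(alive)
    }
```

Under these assumptions, for a seven-process tree in which rank 1 has failed, `rerouted_parent` maps ranks 3 and 4 to rank 0, so rank 0 ends up with three children instead of two. `rebalanced_parents(7, {1})` instead yields a tree in which every surviving rank has at most two children. That difference in fan-out is one plausible reason a rerouted tree could perform worse than a rebalanced one.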

OSTI does not have a digital full-text copy available.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1024713

Resource Relation: Conference: 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), held in conjunction with the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Anchorage, AK, USA, May 16-20, 2011

Country of Publication: United States

Language: English

Similar Records

Building a Fault Tolerant MPI Application: A Ring Communication Example

Conference · Sat Jan 01 00:00:00 EST 2011 · OSTI ID:1024713

Hursey, Joshua J; Graham, Richard L

Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance

Conference · Sat Jan 01 00:00:00 EST 2011 · OSTI ID:1024713

Hursey, Joshua J; Graham, Richard L; Bronevetsky, Greg; +3 more

The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications

Conference · Sun Jan 01 00:00:00 EST 2012 · OSTI ID:1024713

Graham, Richard L; Hursey, Joshua J; Vallee, Geoffroy R; +1 more

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
ALGORITHMS
COMMUNICATIONS
DESIGN
EFFICIENCY
IMPLEMENTATION
PERFORMANCE
PROCESSING
PROGRAMMING
STABILIZATION
TOLERANCE
TREES
