OSTI.GOV
U.S. Department of Energy Office of Scientific and Technical Information

Title: Evaluating and extending user-level fault tolerance in MPI applications

Journal Article · International Journal of High Performance Computing Applications
 [1];  [1];  [1];  [1];  [1];  [1];  [2]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM, yet questions about its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application, to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs that offers better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1342070
Report Number(s):
LLNL-JRNL-663434
Journal Information:
International Journal of High Performance Computing Applications, Vol. 30, Issue 3; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 22 works
Citation information provided by Web of Science

Cited By (2)

Reinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications journal August 2018
A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers journal July 2018