Evaluating and extending user-level fault tolerance in MPI applications
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM, yet questions about its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application, to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for the more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs, with better applicability and support than ULFM for application-level recovery mechanisms such as global rollback.
- Research Organization:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE
- Grant/Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1342070
- Report Number(s):
- LLNL-JRNL-663434
- Journal Information:
- International Journal of High Performance Computing Applications, Vol. 30, Issue 3; ISSN 1094-3420
- Publisher:
- SAGE
- Country of Publication:
- United States
- Language:
- English
Similar Records
A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
Specification of Fenix MPI Fault Tolerance library (V.0.9)