DOE PAGES

Title: Evaluating and extending user-level fault tolerance in MPI applications

The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.
Authors:
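Laguna, Ignacio [1]; Richards, David F. [1]; Gamblin, Todd [1]; Schulz, Martin [1]; de Supinski, Bronis R. [1]; Mohror, Kathryn [1]; Pritchard, Howard [2]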
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Publication Date:
Report Number(s):
LLNL-JRNL-663434
Journal ID: ISSN 1094-3420
Grant/Contract Number:
AC52-07NA27344
Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 30; Journal Issue: 3; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Research Org:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; MPI; fault tolerance; failure recovery models; checkpointing; molecular dynamics simulation
OSTI Identifier:
1342070

Laguna, Ignacio, Richards, David F., Gamblin, Todd, Schulz, Martin, de Supinski, Bronis R., Mohror, Kathryn, and Pritchard, Howard. Evaluating and extending user-level fault tolerance in MPI applications. United States: N. p., 2016. Web. doi:10.1177/1094342015623623.
Laguna, Ignacio, Richards, David F., Gamblin, Todd, Schulz, Martin, de Supinski, Bronis R., Mohror, Kathryn, & Pritchard, Howard. (2016). Evaluating and extending user-level fault tolerance in MPI applications. United States. doi:10.1177/1094342015623623.
Laguna, Ignacio, Richards, David F., Gamblin, Todd, Schulz, Martin, de Supinski, Bronis R., Mohror, Kathryn, and Pritchard, Howard. 2016. "Evaluating and extending user-level fault tolerance in MPI applications". United States. doi:10.1177/1094342015623623. https://www.osti.gov/servlets/purl/1342070.
@article{osti_1342070,
title = {Evaluating and extending user-level fault tolerance in MPI applications},
author = {Laguna, Ignacio and Richards, David F. and Gamblin, Todd and Schulz, Martin and de Supinski, Bronis R. and Mohror, Kathryn and Pritchard, Howard},
abstractNote = {The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.},
doi = {10.1177/1094342015623623},
journal = {International Journal of High Performance Computing Applications},
number = 3,
volume = 30,
place = {United States},
year = {2016},
month = {1}
}