DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Evaluating and extending user-level fault tolerance in MPI applications

Abstract

The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experiences on using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.

Authors:
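Laguna, Ignacio [1]; Richards, David F. [1]; Gamblin, Todd [1]; Schulz, Martin [1]; de Supinski, Bronis R. [1]; Mohror, Kathryn [1]; Pritchard, Howard [2]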
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Publication Date:
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1342070
Report Number(s):
LLNL-JRNL-663434
Journal ID: ISSN 1094-3420
Grant/Contract Number:  
AC52-07NA27344
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 30; Journal Issue: 3; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; MPI; fault tolerance; failure recovery models; checkpointing; molecular dynamics simulation

Citation Formats

Laguna, Ignacio, Richards, David F., Gamblin, Todd, Schulz, Martin, de Supinski, Bronis R., Mohror, Kathryn, and Pritchard, Howard. Evaluating and extending user-level fault tolerance in MPI applications. United States: N. p., 2016. Web. doi:10.1177/1094342015623623.
Laguna, Ignacio, Richards, David F., Gamblin, Todd, Schulz, Martin, de Supinski, Bronis R., Mohror, Kathryn, & Pritchard, Howard. Evaluating and extending user-level fault tolerance in MPI applications. United States. doi:10.1177/1094342015623623.
Laguna, Ignacio, Richards, David F., Gamblin, Todd, Schulz, Martin, de Supinski, Bronis R., Mohror, Kathryn, and Pritchard, Howard. 2016. "Evaluating and extending user-level fault tolerance in MPI applications". United States. doi:10.1177/1094342015623623. https://www.osti.gov/servlets/purl/1342070.
@article{osti_1342070,
title = {Evaluating and extending user-level fault tolerance in MPI applications},
author = {Laguna, Ignacio and Richards, David F. and Gamblin, Todd and Schulz, Martin and de Supinski, Bronis R. and Mohror, Kathryn and Pritchard, Howard},
abstractNote = {The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experiences on using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.},
doi = {10.1177/1094342015623623},
journal = {International Journal of High Performance Computing Applications},
number = 3,
volume = 30,
place = {United States},
year = {2016},
month = {1}
}

Citation Metrics:
Cited by: 1 work
Citation information provided by
Web of Science