OSTI.GOV · U.S. Department of Energy, Office of Scientific and Technical Information

Title: Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications

Technical Report · DOI: https://doi.org/10.2172/1258538 · OSTI ID: 1258538
 [1];  [2];  [2];  [2];  [2]
  1. Univ. of Chicago, IL (United States). Dept. of Computer Science
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Future high-performance computing systems may face frequent failures as their scale and complexity rapidly increase. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, and prevalent MPI applications therefore need fault tolerance support. Among failure scenarios, process failures are among the most severe because they usually terminate the application. However, widely used MPI implementations do not provide fault tolerance mechanisms. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that supports failure detection, failure notification, and recovery. Specifically, FTA-MPI uses a try/catch model that enables failure localization and transparent recovery from process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and the molecular dynamics code CoMD, and show that FTA-MPI offers high programmability and enables convenient, flexible recovery from process failures.

Research Organization:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC52-07NA27344
OSTI ID:
1258538
Report Number(s):
LLNL-TR-692704
Country of Publication:
United States
Language:
English

Similar Records

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
Journal Article · Aug 14, 2018 · Concurrency and Computation. Practice and Experience · OSTI ID:1258538

Preserving Collective Performance Across Process Failure for a Fault Tolerant MPI
Conference · Jan 1, 2011 · OSTI ID:1258538

Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance
Conference · Jan 1, 2011 · OSTI ID:1258538
