
SciTech Connect

Title: Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications

As high-performance computing systems grow rapidly in scale and complexity, they may face frequent failures. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, and it demands fault tolerance support for the prevalent MPI applications. Among failure scenarios, process failures are among the most severe because they usually terminate the application. However, widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that provides support for failure detection, failure notification, and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and the molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.
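The try/catch idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the FTA-MPI interface: it uses no MPI at all, and the names ProcessFailure, exchange, run, and fail_at are hypothetical, introduced here for illustration. It shows a communication step wrapped in a try block, a handler that learns which rank failed (localization), and a retry of the same step after a stand-in recovery action.

```python
class ProcessFailure(Exception):
    """Hypothetical exception that carries the rank of the failed process."""

    def __init__(self, rank):
        super().__init__(f"process {rank} failed")
        self.rank = rank


def exchange(step, num_ranks, fail_at):
    """Simulated communication step (no real MPI involved).

    Raises ProcessFailure if any rank is scheduled to fail at this step.
    """
    for rank in range(num_ranks):
        if fail_at.get(rank) == step:
            raise ProcessFailure(rank)


def run(num_ranks, num_steps, fail_at):
    """Run num_steps simulated steps, recovering from injected failures.

    fail_at maps a rank to the step at which that rank fails once.
    Returns the number of completed steps and a log of (step, rank) recoveries.
    """
    fail_at = dict(fail_at)  # work on a copy so the caller's dict is untouched
    recovered = []
    step = 0
    while step < num_steps:
        try:
            exchange(step, num_ranks, fail_at)
            step += 1  # the step completed, move on
        except ProcessFailure as failure:
            # Localization: the handler knows exactly which rank failed.
            # Recovery stand-in: assume a replacement process takes over,
            # so the rank is no longer scheduled to fail.
            del fail_at[failure.rank]
            recovered.append((step, failure.rank))
            # The loop then retries the same step.
    return step, recovered


if __name__ == "__main__":
    steps_done, log = run(num_ranks=4, num_steps=5, fail_at={2: 3})
    print(steps_done, log)
```

In this sketch the application code outside the try block never sees the failure, which is one way to read "transparent recovery"; how FTA-MPI actually detects failures and restores process state is described in the report itself and is not modeled here.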
Authors:
 [1] ;  [2] ;  [2] ;  [2] ;  [2]
  1. Univ. of Chicago, IL (United States). Dept. of Computer Science
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
OSTI Identifier:
1258538
Report Number(s):
LLNL--TR-692704
DOE Contract Number:
AC52-07NA27344
Resource Type:
Technical Report
Research Org:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING