Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications
- Univ. of Chicago, IL (United States). Dept. of Computer Science
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Future high-performance computing systems may face frequent failures as they rapidly grow in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, and it demands fault tolerance support for the prevalent MPI applications. Of the possible failure scenarios, process failures are among the most severe because they usually terminate the application. However, widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that supports failure detection, failure notification, and recovery. Specifically, FTA-MPI uses a try/catch model that enables failure localization and transparent recovery from process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and the molecular dynamics code CoMD, and show that it offers high programmability and enables convenient, flexible recovery from process failures.
- Research Organization: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC52-07NA27344
- OSTI ID: 1258538
- Report Number(s): LLNL-TR-692704
- Country of Publication: United States
- Language: English