The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications
- ORNL
Exascale targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is an MPI standard that lacks support for ABFT. The Run-Through Stabilization (RTS) proposal, under consideration for MPI 3, allows an application to continue execution when processes fail. The requirements of scalable, fault tolerant MPI implementations and applications will stress the capabilities of many system services. System services must evolve to efficiently support such applications and libraries in the presence of system component failures. This paper discusses how the RTS proposal impacts system services, highlighting specific requirements. Early experimentation results from Cray systems at ORNL using prototype MPI and runtime implementations are presented. Additionally, this paper outlines fault tolerance techniques targeted at leadership class applications.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- DE-AC05-00OR22725
- OSTI ID:
- 1049912
- Resource Relation:
- Conference: Cray User Group (CUG), Stuttgart, Germany, 20120429, 20120503
- Country of Publication:
- United States
- Language:
- English
Similar Records
Building a Fault Tolerant MPI Application: A Ring Communication Example
Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques