Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems
Abstract
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit Ethernet, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
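The abstract describes a decentralized membership service without giving its mechanics. As a rough, hypothetical illustration of the idea (a toy simulation, not the authors' protocol; `Node`, `rebuild_views`, and the simple propagation loop are all assumptions made here), the following shows how surviving peers might converge on a consistent view of live ranks after a crash:

```python
# Toy simulation of decentralized membership reconfiguration.
# Each node keeps a view (list of live ranks); after a crash, the
# survivors converge on the same updated view without a coordinator.

class Node:
    def __init__(self, rank, world_size):
        self.rank = rank
        self.alive = True
        self.view = list(range(world_size))  # initially everyone is live

def rebuild_views(nodes, failed_rank):
    """Mark `failed_rank` dead, then circulate the updated view so every
    surviving node ends up holding the same membership list."""
    for n in nodes:
        if n.rank == failed_rank:
            n.alive = False
    survivors = [n for n in nodes if n.alive]
    new_view = [n.rank for n in survivors]
    for n in survivors:          # propagation modeled as a plain loop here;
        n.view = list(new_view)  # the paper's protocol does this peer-to-peer
    return new_view

nodes = [Node(r, 8) for r in range(8)]
print(rebuild_views(nodes, 3))   # [0, 1, 2, 4, 5, 6, 7]
```

In a real system the propagation step is the expensive part; the paper's contribution is making that step scale, and adapting it to the network topology, rather than funneling membership updates through a central coordinator.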
- Authors:
- Varma, Jyothish S.; Wang, Chao; Mueller, Frank (North Carolina State University (NCSU), Raleigh)
- Engelmann, Christian; Scott, Steven L. (ORNL)
- Publication Date:
- 2006-01-01
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1003542
- DOE Contract Number:
- AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: 20th ACM International Conference on Supercomputing (ICS) 2006, Cairns, Australia, June 28 - July 1, 2006
- Country of Publication:
- United States
- Language:
- English
- Subject:
- Reliability; high-performance computing; node failure; message passing; group communication; scalability
Citation Formats
Varma, Jyothish S., Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Steven L. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. United States: N. p., 2006.
Web. doi:10.1145/1183401.1183433.
Varma, Jyothish S., Wang, Chao, Mueller, Frank, Engelmann, Christian, & Scott, Steven L. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. United States. doi:10.1145/1183401.1183433.
Varma, Jyothish S., Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Steven L. 2006.
"Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems". United States.
doi:10.1145/1183401.1183433.
@inproceedings{osti_1003542,
title = {Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems},
author = {Varma, Jyothish S. and Wang, Chao and Mueller, Frank and Engelmann, Christian and Scott, Steven L},
abstractNote = {Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit Ethernet, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.},
doi = {10.1145/1183401.1183433},
booktitle = {20th ACM International Conference on Supercomputing (ICS) 2006, Cairns, Australia},
place = {United States},
year = {2006}
}
Similar Records
-
Exascale-targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is an MPI standard that lacks support for ABFT. The Run-Through Stabilization (RTS) proposal, under consideration for MPI 3, allows an application to continue execution when processes fail. The requirements of scalable, fault-tolerant MPI implementations and applications will stress the capabilities of many system services. System services must evolve to efficiently support such applications ...
-
Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems
The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, the mean time between failures will range from a few minutes to a few tens of minutes, making the crash of a processor the common case instead of a rarity. Parallel applications running on those large machines will need to simultaneously survive crashes and maintain high productivity. To achieve that, fault tolerance techniques will have to go beyond checkpoint/restart, which requires all processors to roll back in case of a failure. Incorporating ...
-
A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
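The blurb above describes piggybacking an agreement on a reduction. As a hypothetical sketch of that shape (a plain-Python simulation under assumptions made here, not the paper's algorithm): each rank contributes a local success flag, the flags are ANDed up a binomial tree (the vote-collection phase), and the root's decision is pushed back down (the commit phase), giving O(log n) depth with no messages beyond the reduction itself.

```python
# Sketch: agreement expressed as a two-phase tree reduction.
# Phase 1 ANDs local flags up a binomial tree to rank 0;
# phase 2 broadcasts rank 0's decision (commit/abort) to all ranks.

def tree_agree(flags):
    """Simulate the two phases over len(flags) ranks; returns the
    decision as seen by every rank (all identical by construction)."""
    n = len(flags)
    vals = list(flags)
    # Phase 1: binomial-tree reduce; at each step, children at odd
    # multiples of `step` fold their subtree's AND into their parent.
    step = 1
    while step < n:
        for child in range(step, n, 2 * step):
            parent = child - step
            vals[parent] = vals[parent] and vals[child]
        step *= 2
    decision = vals[0]           # rank 0 now holds the global AND
    # Phase 2: broadcast the decision back down (modeled as a fill).
    return [decision] * n

print(tree_agree([True, True, True, True]))   # all vote yes -> all commit
print(tree_agree([True, False, True, True]))  # one failure  -> all abort
```

A real MPI version would express phase 1 as `MPI_Reduce` with `MPI_LAND` and phase 2 as `MPI_Bcast` (or both as one `MPI_Allreduce`); the fault-tolerant variant in the paper additionally has to survive ranks dying mid-protocol, which this sketch does not model.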