U.S. Department of Energy
Office of Scientific and Technical Information

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

Conference
Reliability is an increasing challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding fault tolerance that reconfigures the working nodes so that communication and computation can progress. However, existing approaches fall short in scalability and in keeping reconfiguration overhead small within the fault-tolerant layer. This paper contributes a scalable approach to reconfiguring the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows reconfiguration response times on the order of hundreds of microseconds using MPI over Blue Gene/L and single-digit milliseconds using TCP over Gigabit. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1003542
Country of Publication:
United States
Language:
English

Similar Records

Preserving Collective Performance Across Process Failure for a Fault Tolerant MPI
Conference · Fri Dec 31 23:00:00 EST 2010 · OSTI ID:1024713

ScalaTrace: Tracing, Analysis and Modeling of HPC Codes at Scale
Conference · Wed Mar 31 00:00:00 EDT 2010 · OSTI ID:1009216

Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance
Conference · Fri Dec 31 23:00:00 EST 2010 · OSTI ID:1024715