OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

Abstract

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding fault tolerance that reconfigures the working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and low reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfiguring the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit Ethernet, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify the experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
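The record reproduces only the abstract, so the protocol's details are not given here. As a rough illustration of the core idea, a decentralized service in which every node keeps a local view of active peers and a failure notice propagates peer-to-peer until all surviving views converge, the following Python sketch simulates one plausible scheme. The binary fan-out, the in-process message model, and all names are illustrative assumptions, not the authors' design.

from collections import deque

def propagate_failure(world_size: int, failed: int, detector: int):
    # One local membership view per rank; initially every rank is alive.
    views = {r: set(range(world_size)) for r in range(world_size)}

    # The detector drops the failed rank from its own view, then the
    # notice spreads peer-to-peer with binary fan-out (O(log n) rounds),
    # with no central coordinator involved.
    views[detector].discard(failed)
    notified = {detector}
    pending = deque([detector])

    while pending:
        sender = pending.popleft()
        # Forward the failure notice to up to two peers that have not
        # heard it yet (deterministic order keeps the run reproducible).
        for peer in sorted(views[sender] - notified - {failed})[:2]:
            views[peer].discard(failed)
            notified.add(peer)
            pending.append(peer)

    survivors = set(range(world_size)) - {failed}
    # Every surviving node now holds the same, consistent view.
    assert all(views[r] == survivors for r in survivors)
    return views

if __name__ == "__main__":
    views = propagate_failure(world_size=16, failed=5, detector=3)
    print("surviving views agree on:", sorted(views[3]))

In the paper's setting the notices would travel over MPI or TCP rather than in-process calls, and the fan-out could be matched to the machine's network topology, which is the adaptation the abstract mentions for further performance gains.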

Authors:
Varma, Jyothish S. [1]; Wang, Chao [1]; Mueller, Frank [1]; Engelmann, Christian [2]; Scott, Steven L. [2]
  1. North Carolina State University (NCSU), Raleigh
  2. Oak Ridge National Laboratory (ORNL)
Publication Date:
2006
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1003542
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 20th ACM International Conference on Supercomputing (ICS) 2006, Cairns, Australia, June 28 - July 1, 2006
Country of Publication:
United States
Language:
English
Subject:
Reliability; high-performance computing; node failure; message passing; group communication; scalability

Citation Formats

Varma, Jyothish S., Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Steven L. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. United States: N. p., 2006. Web. doi:10.1145/1183401.1183433.
Varma, Jyothish S., Wang, Chao, Mueller, Frank, Engelmann, Christian, & Scott, Steven L. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. United States. doi:10.1145/1183401.1183433.
Varma, Jyothish S., Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Steven L. 2006. "Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems". United States. doi:10.1145/1183401.1183433.
@inproceedings{osti_1003542,
title = {Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems},
author = {Varma, Jyothish S. and Wang, Chao and Mueller, Frank and Engelmann, Christian and Scott, Steven L.},
abstractNote = {Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding fault tolerance that reconfigures the working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and low reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfiguring the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit Ethernet, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify the experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.},
doi = {10.1145/1183401.1183433},
booktitle = {20th ACM International Conference on Supercomputing (ICS) 2006},
place = {United States},
year = {2006}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
