
Title: Network Fault Tolerance in Open MPI

Abstract

High Performance Computing (HPC) systems are rapidly growing in size and complexity. As a result, transient and persistent network failures can occur on the time scale of application run times, reducing the productive utilization of these systems. The ubiquitous network protocol used to deal with such failures is TCP/IP; however, available implementations of this protocol provide unacceptable performance for HPC system users and do not provide the high-bandwidth, low-latency communications of modern interconnects. This paper describes methods used to provide protection against several network errors, such as dropped packets, corrupt packets, and loss of network interfaces, while maintaining high-performance communications. Micro-benchmark experiments using vendor-supplied TCP/IP and O/S bypass low-level communications stacks over InfiniBand and Myrinet are used to demonstrate the high-performance characteristics of our protocol. The NAS Parallel Benchmarks are used to demonstrate the scalability and the minimal performance impact of this protocol. The micro-benchmarks show that providing higher data reliability decreases performance by up to 30% relative to unprotected communications, but provides performance improvements of a factor of four over TCP/IP running over InfiniBand DDR. The NAS Parallel Benchmarks show virtually no impact of the data reliability protocol on overall run-time.

Authors:
 Shipman, Galen [1]; Graham, Richard L [2]; Bosilca, George [3]
  1. Los Alamos National Laboratory (LANL)
  2. ORNL
  3. University of Tennessee, Knoxville (UTK)
Publication Date:
2007
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1073641
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Euro-Par 2007, Rennes, France, August 28-31, 2007
Country of Publication:
United States
Language:
English

Citation Formats

Shipman, Galen, Graham, Richard L, and Bosilca, George. Network Fault Tolerance in Open MPI. United States: N. p., 2007. Web.
Shipman, Galen, Graham, Richard L, & Bosilca, George. Network Fault Tolerance in Open MPI. United States.
Shipman, Galen, Graham, Richard L, and Bosilca, George. 2007. "Network Fault Tolerance in Open MPI". United States.
@article{osti_1073641,
title = {Network Fault Tolerance in Open MPI},
author = {Shipman, Galen and Graham, Richard L and Bosilca, George},
abstractNote = {High Performance Computing (HPC) systems are rapidly growing in size and complexity. As a result, transient and persistent network failures can occur on the time scale of application run times, reducing the productive utilization of these systems. The ubiquitous network protocol used to deal with such failures is TCP/IP; however, available implementations of this protocol provide unacceptable performance for HPC system users and do not provide the high-bandwidth, low-latency communications of modern interconnects. This paper describes methods used to provide protection against several network errors, such as dropped packets, corrupt packets, and loss of network interfaces, while maintaining high-performance communications. Micro-benchmark experiments using vendor-supplied TCP/IP and O/S bypass low-level communications stacks over InfiniBand and Myrinet are used to demonstrate the high-performance characteristics of our protocol. The NAS Parallel Benchmarks are used to demonstrate the scalability and the minimal performance impact of this protocol. The micro-benchmarks show that providing higher data reliability decreases performance by up to 30% relative to unprotected communications, but provides performance improvements of a factor of four over TCP/IP running over InfiniBand DDR. The NAS Parallel Benchmarks show virtually no impact of the data reliability protocol on overall run-time.},
place = {United States},
year = {2007}
}

