Network Fault Tolerance in Open MPI
Abstract
High Performance Computing (HPC) systems are rapidly growing in size and complexity. As a result, transient and persistent network failures can occur on the time scale of application run times, reducing the productive utilization of these systems. The ubiquitous network protocol used to deal with such failures is TCP/IP; however, available implementations of this protocol provide unacceptable performance for HPC system users and do not deliver the high-bandwidth, low-latency communications of modern interconnects. This paper describes methods used to provide protection against several network errors, such as dropped packets, corrupt packets, and loss of network interfaces, while maintaining high-performance communications. Micro-benchmark experiments using vendor-supplied TCP/IP and O/S-bypass low-level communication stacks over InfiniBand and Myrinet are used to demonstrate the high-performance characteristics of our protocol. The NAS Parallel Benchmarks are used to demonstrate the scalability and the minimal performance impact of this protocol. The micro-benchmarks show that providing higher data reliability decreases performance by up to 30% relative to unprotected communications, but provides a factor-of-four performance improvement over TCP/IP running over InfiniBand DDR. The NAS Parallel Benchmarks show virtually no impact of the data reliability protocol on overall run time.
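To make the kind of protection described above concrete, the sketch below shows, in plain C, one way a sender-side reliability layer can checksum each fragment and retransmit when corruption is detected. It is a minimal illustration under assumed names (checksum32, reliable_send, lossy_transmit) with a simulated lossy wire; it is not Open MPI's actual data-reliability code.

```c
/*
 * Illustrative sketch (not Open MPI source): checksum each fragment and
 * retransmit on a negative acknowledgement, in the spirit of the
 * protocol described in the abstract. The "network" is a lossy
 * in-process function so the example is self-contained.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FRAG_SIZE   64
#define MAX_RETRIES 8

/* Simple 32-bit checksum over a fragment (stand-in for a real CRC). */
static uint32_t checksum32(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = ((sum << 1) | (sum >> 31)) ^ buf[i];   /* rotate-and-xor */
    return sum;
}

/* Fault-injecting "wire": occasionally flips a byte in transit. */
static void lossy_transmit(const uint8_t *src, uint8_t *dst, size_t len)
{
    memcpy(dst, src, len);
    if (rand() % 4 == 0)                 /* roughly 25% corruption rate */
        dst[rand() % len] ^= 0xFF;
}

/* Receiver side: accept the fragment only if the checksum matches. */
static int receiver_accept(const uint8_t *frag, size_t len, uint32_t csum)
{
    return checksum32(frag, len) == csum;   /* 1 = ACK, 0 = NACK */
}

/* Sender side: retransmit until the receiver ACKs or retries run out. */
static int reliable_send(const uint8_t *frag, size_t len)
{
    uint32_t csum = checksum32(frag, len);
    uint8_t wire[FRAG_SIZE];

    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        lossy_transmit(frag, wire, len);
        if (receiver_accept(wire, len, csum)) {
            printf("fragment delivered on attempt %d\n", attempt);
            return 0;
        }
        printf("corrupt fragment detected, retransmitting (attempt %d)\n",
               attempt);
    }
    return -1;   /* persistent failure: a real stack would fail over */
}

int main(void)
{
    uint8_t frag[FRAG_SIZE];
    for (int i = 0; i < FRAG_SIZE; i++)
        frag[i] = (uint8_t)i;

    return reliable_send(frag, sizeof frag) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The same retransmission pattern extends naturally to the other failure modes in the abstract: a dropped packet triggers the retry path via a timeout instead of a checksum mismatch, and a persistently failing path is removed from use in favor of a surviving network interface.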
- Authors:
- Los Alamos National Laboratory (LANL)
- ORNL
- University of Tennessee, Knoxville (UTK)
- Publication Date: 2007
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1073641
- DOE Contract Number:
- DE-AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: Euro-Par 2007, Rennes, France, 28-31 August 2007
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Shipman, Galen, Graham, Richard L., and Bosilca, George. Network Fault Tolerance in Open MPI. United States: N. p., 2007. Web.
Shipman, Galen, Graham, Richard L., & Bosilca, George. Network Fault Tolerance in Open MPI. United States. 2007.
Shipman, Galen, Graham, Richard L., and Bosilca, George. 2007. "Network Fault Tolerance in Open MPI". United States.
@article{osti_1073641,
title = {Network Fault Tolerance in Open MPI},
author = {Shipman, Galen and Graham, Richard L and Bosilca, George},
abstractNote = {High Performance Computing (HPC) systems are rapidly growing in size and complexity. As a result, transient and persistent network failures can occur on the time scale of application run times, reducing the productive utilization of these systems. The ubiquitous network protocol used to deal with such failures is TCP/IP; however, available implementations of this protocol provide unacceptable performance for HPC system users and do not deliver the high-bandwidth, low-latency communications of modern interconnects. This paper describes methods used to provide protection against several network errors, such as dropped packets, corrupt packets, and loss of network interfaces, while maintaining high-performance communications. Micro-benchmark experiments using vendor-supplied TCP/IP and O/S-bypass low-level communication stacks over InfiniBand and Myrinet are used to demonstrate the high-performance characteristics of our protocol. The NAS Parallel Benchmarks are used to demonstrate the scalability and the minimal performance impact of this protocol. The micro-benchmarks show that providing higher data reliability decreases performance by up to 30% relative to unprotected communications, but provides a factor-of-four performance improvement over TCP/IP running over InfiniBand DDR. The NAS Parallel Benchmarks show virtually no impact of the data reliability protocol on overall run time.},
place = {United States},
year = {2007}
}
LA-MPI is a high-performance, network-fault-tolerant implementation of MPI designed for terascale clusters that are inherently unreliable due to their very large number of system components and to trade-offs between cost and performance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction that makes LA-MPI highly portable, gives high performance through message striping, and most importantly provides the basis for network fault tolerance. Finally we include some performance numbers for the Quadrics and UDP network paths.
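As a rough illustration of the striping idea mentioned above, the following self-contained C sketch round-robins fragments of a single message across several network paths and keeps sending over whatever paths remain healthy. The path names and the stripe_send helper are assumptions made for the example; this is not LA-MPI source code.

```c
/*
 * Illustrative sketch: round-robin striping of one message across
 * several network paths. Each path here is just a byte counter; a real
 * data-path layer would hand fragments to distinct NICs and re-stripe
 * around any path that reports errors.
 */
#include <stdio.h>
#include <stddef.h>

#define NUM_PATHS 3
#define FRAGMENT  1024          /* bytes pushed to a path per round */

struct net_path {
    const char *name;
    size_t      bytes_sent;
    int         healthy;        /* 0 = excluded from striping */
};

/* Stripe `len` bytes over all healthy paths, FRAGMENT bytes at a time. */
static void stripe_send(struct net_path *paths, int npaths, size_t len)
{
    int p = 0;
    while (len > 0) {
        /* Skip paths marked faulty; fault tolerance comes from the
         * ability to keep sending on whatever paths remain. */
        while (!paths[p].healthy)
            p = (p + 1) % npaths;

        size_t chunk = len < FRAGMENT ? len : FRAGMENT;
        paths[p].bytes_sent += chunk;
        len -= chunk;
        p = (p + 1) % npaths;
    }
}

int main(void)
{
    struct net_path paths[NUM_PATHS] = {
        { "quadrics0", 0, 1 },
        { "quadrics1", 0, 1 },
        { "udp0",      0, 1 },
    };

    stripe_send(paths, NUM_PATHS, 1 << 20);      /* 1 MiB message */

    paths[1].healthy = 0;                        /* simulate a path failure */
    stripe_send(paths, NUM_PATHS, 1 << 20);      /* re-striped send */

    for (int i = 0; i < NUM_PATHS; i++)
        printf("%-9s %zu bytes\n", paths[i].name, paths[i].bytes_sent);
    return 0;
}
```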
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism …
Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance
The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault resilient MPI applications. The mission of the MPI Forum's Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal. This proposal allows an application to continue execution even if MPI processes fail during execution. The discussion introduces the implications on point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.
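The application-level pattern the proposal targets can be sketched with standard MPI alone: set MPI_ERRORS_RETURN so failures surface as return codes, then keep working when an operation involving a failed peer reports an error. The example below assumes a fault-aware MPI that returns an error instead of aborting the job; the proposal's own recovery interfaces are not shown, since only standard MPI calls are used here.

```c
/*
 * Sketch of run-through style execution: failures on point-to-point
 * operations are reported to the caller, which notes them and continues,
 * rather than the whole job being aborted.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Default behaviour is MPI_ERRORS_ARE_FATAL; returning error codes to
     * the caller is the precondition for any run-through recovery. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Ring exchange with the two neighbors; matched sends and receives. */
    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    int token = rank, incoming = -1;
    int rc = MPI_Sendrecv(&token, 1, MPI_INT, right, 0,
                          &incoming, 1, MPI_INT, left, 0,
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rc != MPI_SUCCESS) {
        /* A run-through capable MPI would let us keep using the
         * communicator minus the failed peer; here we simply record the
         * error and carry on with local work. */
        fprintf(stderr, "rank %d: neighbor exchange failed (rc=%d)\n",
                rank, rc);
    } else {
        printf("rank %d received token %d from rank %d\n",
               rank, incoming, left);
    }

    MPI_Finalize();
    return 0;
}
```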
Scalable distributed consensus to support MPI fault tolerance.
As system sizes increase, the amount of time in which an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. In order to support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus. This is used, for example, to collectively decide on a set of failed processes. This paper describes a scalable, distributed consensus algorithm that is used to support new MPI fault-tolerance features proposed by the MPI 3 Forum's fault-tolerance working group. The algorithm was implemented and evaluated on a …
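For intuition about what the consensus operation decides, here is a deliberately simplified C/MPI sketch in which every rank contributes its local suspicion list and all ranks agree on the union via a flat MPI_Allreduce. This only illustrates the operation's semantics; the paper's contribution is a scalable algorithm that also tolerates failures during the agreement itself, which this sketch does not attempt.

```c
/*
 * Simplified illustration of distributed consensus on a set of failed
 * processes: agree on the union of every rank's locally detected
 * failures. MPI_Allreduce is used purely for clarity; it is neither
 * scalable in the paper's sense nor itself fault tolerant.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local_failed[r] = 1 if this process has locally detected rank r as
     * failed (e.g. via a broken connection). For the example, rank 0
     * pretends to have detected a failure of the last rank. */
    int *local_failed  = calloc(size, sizeof(int));
    int *agreed_failed = calloc(size, sizeof(int));
    if (rank == 0 && size > 1)
        local_failed[size - 1] = 1;

    /* Consensus = union of everyone's local suspicions: an element-wise
     * logical OR across all ranks. */
    MPI_Allreduce(local_failed, agreed_failed, size, MPI_INT,
                  MPI_LOR, MPI_COMM_WORLD);

    if (rank == 0)
        for (int r = 0; r < size; r++)
            if (agreed_failed[r])
                printf("all ranks agree: rank %d is failed\n", r);

    free(local_failed);
    free(agreed_failed);
    MPI_Finalize();
    return 0;
}
```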