OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

Technical Report · DOI: https://doi.org/10.2172/1105002 · OSTI ID: 1105002

The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research aimed at providing end-to-end, system-wide fault tolerance for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility of and response to faults have typically been limited to the particular hardware and software subsystems in which the faults are initially observed. Little fault information is shared across subsystems, which allows little flexibility or control on a system-wide basis and makes it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. For example, communication link failures can be seen by a network library but are not directly visible to the job scheduler, and node failures can be detected by system monitoring software but are not inherently visible to the resource manager. If the network libraries or monitoring software could share information about such faults, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, we have focused our efforts over the course of this project on making Open MPI more robust to failures by supporting various fault tolerance techniques and by using fault information exchange and coordination between MPI and the HPC system software stack, from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.

Research Organization:
Andrew Lumsdaine, Indiana University
Sponsoring Organization:
USDOE
DOE Contract Number:
FC02-06ER25750
OSTI ID:
1105002
Report Number(s):
IU-Lumsdaine-25750
Country of Publication:
United States
Language:
English

Similar Records

Coordinated Fault-Tolerance for High-Performance Computing Final Project Report
Technical Report · Thu Jul 28 00:00:00 EDT 2011 · OSTI ID:1105002

Coordinated Fault Tolerance for High-Performance Computing
Technical Report · Mon Apr 08 00:00:00 EDT 2013 · OSTI ID:1105002

OVIS 3.2 user's guide.
Technical Report · Fri Oct 01 00:00:00 EDT 2010 · OSTI ID:1105002

Related Subjects