
Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

Technical Report · DOI: https://doi.org/10.2172/1105002 · OSTI ID: 1105002

The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with the goal of providing end-to-end, system-wide fault tolerance for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes: visibility and response to faults have typically been limited to the particular hardware and software subsystems in which the faults are initially observed. Because little fault information is shared across subsystems, there is little flexibility or control on a system-wide basis, making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications.

As an example, consider communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If the network libraries or monitoring software shared information about such faults, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed.

As a founding member and one of the lead developers of the Open MPI project, we focused our efforts over the course of this project on making Open MPI more robust to failures by supporting various fault tolerance techniques, and on using fault information exchange and coordination between MPI and the HPC system software stack, from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.
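To make the stovepipe problem concrete, the sketch below shows the generic MPI-level mechanism on which such coordination can be built: an application replaces the default MPI_ERRORS_ARE_FATAL error handler so that communication failures are reported back to the application instead of aborting the job, at which point the fault information could be forwarded to other system software. This is a minimal illustration in C of standard MPI error-handler usage, not the project's actual implementation; the notify_resource_manager() helper is hypothetical and stands in for whatever publication path (resource manager, job scheduler, or monitoring bus) a coordinated system would provide.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical hook: in a coordinated system this would publish the fault
   to the resource manager or job scheduler so that failed nodes or links
   could be excluded from further job allocations. */
static void notify_resource_manager(int rank, int error_code)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(error_code, msg, &len);
    fprintf(stderr, "rank %d observed fault: %s\n", rank, msg);
}

/* Invoked by the MPI library when an operation on the communicator fails. */
static void comm_errhandler(MPI_Comm *comm, int *error_code, ...)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    notify_resource_manager(rank, *error_code);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so that failures
       raise the handler and return an error code instead of aborting. */
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(comm_errhandler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    /* ... application communication goes here; a failed send or receive
       now triggers comm_errhandler rather than terminating the job ... */

    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}

Making such reported faults survivable inside MPI itself (rather than merely observable) is the harder problem that the fault tolerance techniques pursued in this project address.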

Research Organization:
Indiana Univ., Bloomington, IN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
FC02-06ER25750
OSTI ID:
1105002
Report Number(s):
IU-Lumsdaine-25750
Country of Publication:
United States
Language:
English

Similar Records

Coordinated Fault Tolerance for High-Performance Computing
Technical Report · April 8, 2013 · OSTI ID: 1072982

Simple Linux Utility for Resource Management
Software · March 10, 2008 · OSTI ID: 1307519

Simple Linux Utility for Resource Management
Software · March 8, 2008 · OSTI ID: code-883
