Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

Lumsdaine, Andrew

doi:10.2172/1105002

Technical Report · March 8, 2013

DOI: https://doi.org/10.2172/1105002 · OSTI ID: 1105002

The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research aimed at providing end-to-end fault tolerance on a system-wide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility of and response to faults have typically been limited to the particular hardware and software subsystems in which the faults are initially observed. Little fault information is shared across subsystems, which leaves little flexibility or control on a system-wide basis and makes it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. For example, communication link failures can be seen by a network library but are not directly visible to the job scheduler, and node failures can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, we have focused our efforts over the course of this project on making Open MPI more robust to failures by supporting various fault tolerance techniques and by using fault information exchange and coordination between MPI and the rest of the HPC system software stack, from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.

Research Organization: Andrew Lumsdaine, Indiana University

Sponsoring Organization: USDOE

DOE Contract Number: FC02-06ER25750

OSTI ID: 1105002

Report Number(s): IU-Lumsdaine-25750

Country of Publication: United States

Language: English

Similar Records

Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

Technical Report · July 28, 2011 · OSTI ID:1105002

Panda, Dhabaleswar Kumar; Beckman, Pete

Coordinated Fault Tolerance for High-Performance Computing

Technical Report · April 8, 2013 · OSTI ID:1105002

Dongarra, Jack; Bosilca, George

OVIS 3.2 user's guide.

Technical Report · October 1, 2010 · OSTI ID:1105002

Mayo, Jackson R; Gentile, Ann C; Brandt, James M; +5 more

Related Subjects

97 MATHEMATICS AND COMPUTING
