OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

Technical Report · DOI: https://doi.org/10.2172/1104503 · OSTI ID: 1104503

With the Coordinated Infrastructure for Fault Tolerance Systems project (CIFTS, as the original project came to be called), our aim has been to understand and tackle the following broad research questions. Their answers will help the HEC community analyze and shape the direction of research in fault tolerance and resiliency on future high-end leadership systems.

Will availability of global fault information, obtained by fault information exchange between the different HEC software on a system, allow individual system software to better detect, diagnose, and adaptively respond to faults? If fault awareness is raised throughout the system through fault information exchange, is it possible to get all system software working together to provide more comprehensive end-to-end fault management on the system? What fault-tolerance features are missing from widely used HEC system software today, such that their absence would inhibit that software from taking advantage of systemwide global fault information? What are the practical limitations of a systemwide approach to end-to-end fault management based on fault awareness and coordination? What mechanisms, tools, and technologies are needed to bring about fault awareness and coordination of responses on a leadership-class system? What standards, outreach, and community interaction are needed for adoption of the concept of fault awareness and coordination for fault management on future systems?

Keeping our overall objectives in mind, the CIFTS team has taken a parallel fourfold approach. Our central goal was to design and implement a lightweight, scalable infrastructure with a simple, standardized interface to allow communication of fault-related information through the system and to facilitate coordinated responses. This work led to the development of the Fault Tolerance Backplane (FTB) publish-subscribe API specification, together with a reference implementation and several experimental implementations on top of existing publish-subscribe tools. We enhanced the intrinsic fault-tolerance capabilities of representative implementations of a variety of key HPC software subsystems and integrated them with the FTB. The targeted software subsystems included MPI communication libraries, checkpoint/restart libraries, resource managers and job schedulers, and system monitoring tools. Leveraging this infrastructure, as well as developing and using additional tools, we have examined issues associated with expanded, end-to-end fault response from both system and application viewpoints. From the standpoint of system operations, we have investigated log and root cause analysis, anomaly detection and fault prediction, and generalized notification mechanisms. Our applications work has included libraries for fault-tolerant linear algebra, application frameworks for coupled multiphysics applications, and external frameworks to support monitoring and response for general applications. Our final goal was to engage the high-end computing community to increase awareness of tools and issues around coordinated end-to-end fault management.

Research Organization:
The Ohio State Univ., Columbus, OH (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Contributing Organization:
Argonne National Laboratory, The Ohio State University, Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory, Indiana University, and University of Tennessee
DOE Contract Number:
FC02-06ER25749
OSTI ID:
1104503
Report Number(s):
DOE-OSU-25749-Final
Country of Publication:
United States
Language:
English

Similar Records

Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report
Technical Report · March 8, 2013 · OSTI ID:1104503

CIFTS: A coordinated infrastructure for fault-tolerant systems.
Conference · January 1, 2009 · OSTI ID:1104503

Coordinated Fault Tolerance for High-Performance Computing
Technical Report · April 8, 2013 · OSTI ID:1104503
