Coordinated Fault-Tolerance for High-Performance Computing Final Project Report
- The Ohio State Univ., Columbus, OH (United States); The Ohio State University
- The Ohio State Univ., Columbus, OH (United States)
With the Coordinated Infrastructure for Fault Tolerance Systems (CIFTS) project, as the original project came to be called, our aim has been to understand and tackle two broad research questions. Their answers will help the high-end computing (HEC) community analyze and shape the direction of research in fault tolerance and resiliency on future high-end leadership systems. First, will the availability of global fault information, obtained by exchanging fault information among the different HEC software components on a system, allow each component to better detect, diagnose, and adaptively respond to faults? Second, if fault awareness is raised throughout the system through such fault information exchange, is it possible to get all system software working together to provide more comprehensive, end-to-end fault management on the system?
- Research Organization:
- The Ohio State Univ., Columbus, OH (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Contributing Organization:
- Argonne National Laboratory, The Ohio State University, Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory, Indiana University and University of Tennessee
- DOE Contract Number:
- FC02-06ER25749
- OSTI ID:
- 1104503
- Report Number(s):
- DOE-OSU--25749-Final
- Country of Publication:
- United States
- Language:
- English