CIFTS: A coordinated infrastructure for fault-tolerant systems

Gupta, R; Beckman, P; Park, B H; Lusk, E; Hargrove, P; Geist, A; Panda, D K; Lumsdaine, A; Dongarra, J; ORNL; LBNL

Conference · Wed Dec 31 23:00:00 EST 2008

OSTI ID:982645

In the next few years SciDAC applications will utilize petascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This leap of two orders of magnitude in scale from today's typical systems is causing a critical gap in fault management of these systems. The fault management issues for these emerging systems are well beyond the scope of today's common infrastructure and practice. Currently, systems software components for large-scale machines remain largely independent in their fault awareness and notification strategies. Faults can arise not just from the hardware but also at the OS, middleware, library, and application levels. Petascale applications that are evolving to utilize these platforms face many new challenges. With the CIFTS initiative, we aim to provide a coordinated infrastructure that will enable fault-tolerant systems to adapt, in a holistic manner, to faults occurring in the operating environment. Our approach is threefold: (1) design a reference implementation of a fault awareness and notification backplane that provides common, uniform event handling and notification mechanisms for fault-aware libraries and middleware; (2) create an interface specification that allows libraries, runtime systems, and applications to connect to and use the fault-tolerance backplane; and (3) extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.

OSTI does not have a digital full text copy available.

Research Organization: Argonne National Laboratory (ANL)

Sponsoring Organization: SC

DOE Contract Number: AC02-06CH11357

OSTI ID: 982645

Report Number(s): ANL/MCS/CP-64768

Country of Publication: United States

Language: English

Similar Records

Coordinated Fault Tolerance for High-Performance Computing

Technical Report · Mon Apr 08 00:00:00 EDT 2013 · OSTI ID:1072982

2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.

Technical Report · Sat Jan 31 23:00:00 EST 2009 · OSTI ID:971988

Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

Technical Report · Thu Jul 28 00:00:00 EDT 2011 · OSTI ID:1104503

Related Subjects

99 GENERAL AND MISCELLANEOUS
CRITICAL MASS
DESIGN
ENVIRONMENT
FACE
IMPLEMENTATION
INTERFACES
LEVELS
LIBRARIES
MANAGEMENT
MEETINGS
PARALLEL PROCESSING
SPECIFICATIONS
USES
