OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: CIFTS : A coordinated infrastructure for fault-tolerant systems.

Abstract

In the next few years SciDAC applications will utilize petascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This leap of two orders of magnitude in scale from today's typical systems is causing a critical gap in fault management of these systems. The fault management issues for these emerging systems are well beyond the scope of today's common infrastructure and practice. Currently, systems software components for large-scale machines remain largely independent in their fault awareness and notification strategies. Faults can arise not just from the hardware but also from the OS, middleware, libraries, and application levels. Petascale applications that are evolving to utilize these platforms face many new challenges. With the CIFTS initiative, we aim to provide a coordinated infrastructure that will enable Fault Tolerant Systems to adapt to faults occurring in the operating environment in a holistic manner. Our approach will be to design a reference implementation of a fault awareness and notification backplane to provide common uniform event handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, runtime systems, and applications to connect to and use the fault-tolerance backplane; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.

Authors:
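Gupta, R.; Beckman, P.; Park, B. H.; Lusk, E.; Hargrove, P.; Geist, A.; Panda, D. K.; Lumsdaine, A.; Dongarra, J.; ORNL; LBNL; Ohio State Univ.; Indiana Univ.; Univ. of Tennessee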
Publication Date:
2009
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
982645
Report Number(s):
ANL/MCS/CP-64768
TRN: US201015%%1255
DOE Contract Number:
DE-AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 38th International Conference on Parallel Processing (ICPP-09); Sep. 22, 2009 - Sep. 25, 2009; Vienna, Austria
Country of Publication:
United States
Language:
ENGLISH
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; CRITICAL MASS; DESIGN; ENVIRONMENT; FACE; IMPLEMENTATION; INTERFACES; LEVELS; LIBRARIES; MANAGEMENT; MEETINGS; PARALLEL PROCESSING; SPECIFICATIONS; USES

Citation Formats

Gupta, R., Beckman, P., Park, B. H., Lusk, E., Hargrove, P., Geist, A., Panda, D. K., Lumsdaine, A., Dongarra, J., ORNL, LBNL, Ohio State Univ., Indiana Univ., and Univ. of Tennessee. CIFTS : A coordinated infrastructure for fault-tolerant systems. United States: N. p., 2009. Web.
Gupta, R., Beckman, P., Park, B. H., Lusk, E., Hargrove, P., Geist, A., Panda, D. K., Lumsdaine, A., Dongarra, J., ORNL, LBNL, Ohio State Univ., Indiana Univ., & Univ. of Tennessee. (2009). CIFTS : A coordinated infrastructure for fault-tolerant systems. United States.
Gupta, R., Beckman, P., Park, B. H., Lusk, E., Hargrove, P., Geist, A., Panda, D. K., Lumsdaine, A., Dongarra, J., ORNL, LBNL, Ohio State Univ., Indiana Univ., and Univ. of Tennessee. 2009. "CIFTS : A coordinated infrastructure for fault-tolerant systems." United States.
@article{osti_982645,
title = {CIFTS : A coordinated infrastructure for fault-tolerant systems.},
author = {Gupta, R. and Beckman, P. and Park, B. H. and Lusk, E. and Hargrove, P. and Geist, A. and Panda, D. K. and Lumsdaine, A. and Dongarra, J. and {ORNL} and {LBNL} and {Ohio State Univ.} and {Indiana Univ.} and {Univ. of Tennessee}},
abstractNote = {In the next few years SciDAC applications will utilize petascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This leap of two orders of magnitude in scale from today's typical systems is causing a critical gap in fault management of these systems. The fault management issues for these emerging systems are well beyond the scope of today's common infrastructure and practice. Currently, systems software components for large-scale machines remain largely independent in their fault awareness and notification strategies. Faults can arise not just from the hardware but also from the OS, middleware, libraries, and application levels. Petascale applications that are evolving to utilize these platforms face many new challenges. With the CIFTS initiative, we aim to provide a coordinated infrastructure that will enable Fault Tolerant Systems to adapt to faults occurring in the operating environment in a holistic manner. Our approach will be to design a reference implementation of a fault awareness and notification backplane to provide common uniform event handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, runtime systems, and applications to connect to and use the fault-tolerance backplane; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.},
place = {United States},
year = {2009},
month = {Jan}
}

  • The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with the goal of providing end-to-end fault tolerance on a system-wide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility of and response to faults have typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility or control on a system-wide basis and making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or consider faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, we have focused our efforts over the course of this project on making Open MPI more robust to failures by supporting various fault tolerance techniques and by using fault information exchange and coordination between MPI and the HPC system software stack, from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.
  • Fault-tolerant computers usually involve parallel architectures where the computation of a particular task is duplicated and a consensus result is taken. More recently it has been realized that not all tasks in a schedule require the full fault tolerance provided by the parallel redundancy, and as a consequence architectures have been developed that dynamically reconfigure themselves to improve the throughput of less sensitive tasks by utilizing the parallelism. A new language is presented for programming this type of system. It has properties similar to those of OCCAM and Pascal-M and is suitable for real-time use. 27 references.
  • A distributed programmable control system design architecture geared to the distinct needs and requirements of power generation, especially on conversion and modernization projects, is discussed. The proposed applications of this architecture are independent of any specific manufacturer or components and are intended to provide a conceptual overview of considerations in specifying or designing analog (modulating) or digital sequencing systems. The distinct system architecture allows a maximum degree of standardization and uniformity of power plant control components while satisfying the unique reliability and availability requirements of specialized control systems. Unique treatment of selective, distributed modular back-up to provide improved availability and on-line maintenance capability is covered extensively. Combined with the reliability factor is the flexibility needed for future growth and change through the utilization of software for diagnostics, data highways for communication, and color graphics for display.
  • This book presents the papers given at a symposium on fault tolerant computers. Topics considered at the symposium included fault-tolerant multiprocessors, computer architecture, communication network models, error detection schemes, failure probabilities, software errors and failures, algorithms, concurrent processing systems, repair, memory devices, graph models, nuclear power plant applications, multipipeline architecture, analog systems, integrated circuits, Prolog, and digital systems.
  • Rule-based systems operating in an embedded environment where internal variables may be corrupted during their execution as a result of transient faults must be able to recover automatically. Given a rule-based program p with bounded response time, the problem is to derive a self-stabilizing program q that implements p with the constraint that q must also have bounded response time. We first present an approach for solving this problem for a class of EQL rule-based programs with bounded response time. Then we extend this transformation approach to make a class of real-time MRL rule-based systems self-stabilizing. As a more expressive superset of EQL, MRL allows existentially quantified as well as universally quantified variables (simple or macro), making it comparable in expressive power to OPS5 and CLIPS.