OSTI.GOV U.S. Department of Energy
Office of Scientific and Technical Information

Title: SFT: Scalable Fault Tolerance

Abstract

In this paper we present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, requiring no changes to user applications. Our technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and on TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs; and it supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
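
To make the distinction between full and incremental checkpoints concrete, here is a minimal user-level sketch. It is not TICK's implementation: TICK works inside the Linux kernel, whereas this toy model tracks dirty pages by hand in Python. The names `PagedMemory`, `full_checkpoint`, `incremental_checkpoint`, and `restore` are hypothetical and chosen only for illustration.

```python
class PagedMemory:
    """Toy address space split into fixed-size pages, with dirty-page tracking."""

    def __init__(self, num_pages, page_size=4096):
        self.page_size = page_size
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.dirty = set()  # indices of pages written since the last checkpoint

    def write(self, addr, data):
        # Every write marks the touched page(s) as dirty.
        for offset, byte in enumerate(data):
            page, pos = divmod(addr + offset, self.page_size)
            self.pages[page][pos] = byte
            self.dirty.add(page)

    def read(self, addr, length):
        out = bytearray()
        for a in range(addr, addr + length):
            page, pos = divmod(a, self.page_size)
            out.append(self.pages[page][pos])
        return bytes(out)


def full_checkpoint(mem):
    """Copy every page, regardless of whether it changed."""
    snapshot = {i: bytes(p) for i, p in enumerate(mem.pages)}
    mem.dirty.clear()
    return snapshot


def incremental_checkpoint(mem):
    """Copy only the pages modified since the previous checkpoint."""
    snapshot = {i: bytes(mem.pages[i]) for i in sorted(mem.dirty)}
    mem.dirty.clear()
    return snapshot


def restore(mem, full, increments):
    """Rebuild memory from a full checkpoint plus later increments, in order."""
    for snapshot in [full, *increments]:
        for i, content in snapshot.items():
            mem.pages[i][:] = content
    mem.dirty.clear()


mem = PagedMemory(num_pages=4, page_size=16)
mem.write(0, b"hello")
base = full_checkpoint(mem)          # saves all 4 pages
mem.write(20, b"world")
inc = incremental_checkpoint(mem)    # saves only the page that changed
mem.write(0, b"XXXXX")               # simulated corruption after the checkpoint
restore(mem, base, [inc])
```

In this model an incremental checkpoint costs time proportional to the number of modified pages rather than to the size of the address space, which is the usual motivation for the technique.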

Authors:
Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod
Publication Date:
2006-04-15
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
918857
Report Number(s):
PNNL-SA-52256
KJ0101030; TRN: US200820%%27
DOE Contract Number:
AC05-76RL01830
Resource Type:
Journal Article
Resource Relation:
Journal Name: Operating Systems Review, 40(2):55 - 62; Journal Volume: 40; Journal Issue: 2
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER ARCHITECTURE; DESIGN; MEMORY MANAGEMENT; ERRORS; MITIGATION; T CODES

Citation Formats

Petrini, Fabrizio, Nieplocha, Jarek, and Tipparaju, Vinod. SFT: Scalable Fault Tolerance. United States: N. p., 2006. Web. doi:10.1145/1131322.1131336.
Petrini, Fabrizio, Nieplocha, Jarek, & Tipparaju, Vinod. (2006). SFT: Scalable Fault Tolerance. United States. doi:10.1145/1131322.1131336.
Petrini, Fabrizio, Nieplocha, Jarek, and Tipparaju, Vinod. 2006. "SFT: Scalable Fault Tolerance". United States. doi:10.1145/1131322.1131336.
@article{osti_918857,
title = {SFT: Scalable Fault Tolerance},
author = {Petrini, Fabrizio and Nieplocha, Jarek and Tipparaju, Vinod},
abstractNote = {In this paper we present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, requiring no changes to user applications. Our technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and on TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs; and it supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.},
doi = {10.1145/1131322.1131336},
journal = {Operating Systems Review, 40(2):55 - 62},
number = 2,
volume = 40,
place = {United States},
year = {2006},
month = {4}
}
  • In the last couple of decades, the massive computational power provided by the most modern supercomputers has made it possible to simulate higher-order computational chemistry methods that were previously considered intractable. As system sizes continue to increase, the computational chemistry domain continues to extend this trend using parallel computing with programming models such as Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models such as Global Arrays. The ever-increasing scale of these supercomputers comes at the cost of a reduced mean time between failures (MTBF), currently on the order of days and projected to be on the order of hours for upcoming extreme-scale systems. While traditional disk-based checkpointing methods are ubiquitous for storing intermediate solutions, they suffer from the high overhead of writing and recovering from checkpoints. In practice, checkpointing itself often brings the system down. Clearly, methods beyond checkpointing are imperative for handling the worsening problem of shrinking MTBF. In this paper, we address this challenge by designing and implementing an efficient fault-tolerant version of the coupled cluster method with NWChem, using in-memory data redundancy. We present the challenges associated with our design, including an efficient data storage model, maintenance of at least one consistent data copy, and the recovery process. Our performance evaluation without faults shows that the current design exhibits negligible overhead. In the presence of a fault, the proposed design incurs negligible overhead in comparison to the state-of-the-art implementation without faults.
  • Hypercube multiprocessors have recently offered a cost-effective and feasible approach to supercomputing through parallelism at the processor level by directly connecting a large number of low-cost processors with local memories, which communicate by message passing instead of shared variables. This paper discusses the design of a fault-tolerant hypercube multiprocessor architecture. Most of the recently proposed schemes for fault tolerance in parallel architectures address mainly the issue of reconfiguring a parallel architecture once a faulty processor is identified. The schemes assume the existence of an off-line diagnosis strategy that locates the faulty processor. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection.
  • The goal of task allocation in a set of interconnected processors (computers) is to maximize the efficient use of resources and reduce the job turnaround time. The authors present a simple but effective method for allocating tasks in multicomputer systems that minimizes the interprocessor communication cost subject to resource limitations defined by the system and the designer. They demonstrate the effectiveness of task allocation and reallocation for hardware fault tolerance by applying the methods to different examples and to practical communications-network multiprocessor systems. 27 refs.
  • This paper introduces an assertion scheme based on backward error analysis for error detection in algorithms that solve dense systems of linear equations, Ax = b. Unlike previous methods, this Backward Error Assertion Model is specifically designed to operate in an environment of floating-point arithmetic subject to round-off errors, and it can be easily instrumented in a Watchdog processor environment. The complexity of verifying assertions is O(n^2), compared to the O(n^3) complexity of algorithms solving Ax = b. Unlike other proposed error detection methods, this assertion model does not require any encoding of the matrix A. Experimental results under various error models are presented to validate the effectiveness of this assertion scheme. 22 refs.
  • The early study of fault tolerance in efficient sorting networks only achieved single-fault tolerance. By eliminating critical comparators, Rudolph presented a 1-fault tolerant design of the balanced sorting network (BSN) at the cost of one redundant stage of N/2 comparators and two permuters external to the network. In this paper, we show, however, that 1-fault tolerance of BSN can be achieved without introducing redundancy and external permuters. Furthermore, we provide solutions to the open question of how to achieve multiple-fault tolerance in BSN. We analyze the problem from a higher level by introducing a new concept of critical stages, and find that all stages in previous designs are critical. A 2-fault tolerant design of BSN is then discovered after eliminating its critical stages. The new design has a similar network architecture (i.e., a multistage network with the output recirculated back to the input) and the same hardware cost as Rudolph's, but it has many distinguishing features. (1) It becomes 3-fault tolerant by duplicating the redundant stage. (2) It can be generalized to a (k + 1)-fault tolerant design if k ≥ 1 redundant stages are added; the resulting network has no critical stages but has critical (k + 1)-tuples of stages. Sorting would fail if and only if all stages of a critical (k + 1)-tuple are faulty. (3) It can be extended to new topologies with an arbitrary number of stages, without external permuters, that may achieve an arbitrary degree of fault tolerance. The performance analysis shows that the new designs achieve much higher probabilities of correct sorting in the presence of faulty comparators than the previously reported designs. 21 refs.
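
The record above on error detection for Ax = b notes that an assertion can be verified in O(n^2) while the solve itself costs O(n^3). A minimal sketch of that idea follows. The record does not give the exact assertion used, so this example assumes the standard normwise backward error, ||b - Ax|| / (||A|| ||x|| + ||b||) in the infinity norm, and compares it against a small multiple of machine epsilon. The function names and the `slack` parameter are hypothetical.

```python
import sys


def backward_error(A, x, b):
    """Normwise backward error of a candidate solution x to Ax = b (infinity norm).

    Costs O(n^2): one matrix-vector product plus a pass over A for its norm.
    """
    n = len(A)
    residual = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm_r = max(abs(r) for r in residual)
    norm_A = max(sum(abs(a) for a in row) for row in A)  # max absolute row sum
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    denom = norm_A * norm_x + norm_b
    return norm_r / denom if denom else 0.0


def assertion_holds(A, x, b, slack=100.0):
    """Accept x if its backward error is consistent with round-off alone.

    `slack` is an assumed safety factor, not a value taken from the paper.
    """
    threshold = slack * len(A) * sys.float_info.epsilon
    return backward_error(A, x, b) <= threshold


A = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
good = [0.1, 0.6]   # exact solution of the system, up to rounding
bad = [0.1, 0.7]    # same vector with one corrupted component
print(assertion_holds(A, good, b), assertion_holds(A, bad, b))
```

Because the check tolerates residuals of the size that round-off alone can produce, it avoids flagging a correctly computed floating-point solution as faulty, while a corrupted component produces a backward error many orders of magnitude above the threshold.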