Title: SFT: Scalable Fault Tolerance

Abstract

In this paper we present a new technology that we are developing within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continued execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, meaning that user applications require no changes. Our technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and on TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and requires no changes to user code or system libraries. It is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs. It supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
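
The abstract distinguishes full checkpoints from incremental ones. As a rough illustration of that distinction only, the following user-space Python sketch tracks which pages were written since the last checkpoint and saves just those. It is a toy model under stated assumptions, not TICK's kernel-level implementation: the class `CheckpointedMemory`, the function `restore`, and the explicit `write()` bookkeeping are all hypothetical names introduced here, and a kernel module would presumably detect modified pages through the memory-management hardware rather than an explicit call.

```python
# Toy user-space model of full vs. incremental checkpointing.
# Assumption: memory is a list of fixed-size pages and every write goes
# through write(), which records the page as dirty. None of these names
# come from TICK; they are hypothetical and for illustration only.

PAGE_SIZE = 4096


class CheckpointedMemory:
    def __init__(self, num_pages):
        # All pages start zero-filled; bytes objects are immutable, so a
        # checkpoint can hold references to them without copying.
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()

    def write(self, page_index, data):
        """Replace one whole page and remember that it changed."""
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page")
        self.pages[page_index] = bytes(data)
        self.dirty.add(page_index)

    def full_checkpoint(self):
        """Save every page and reset dirty tracking."""
        self.dirty.clear()
        return dict(enumerate(self.pages))

    def incremental_checkpoint(self):
        """Save only the pages written since the previous checkpoint."""
        delta = {i: self.pages[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(num_pages, full, deltas):
    """Rebuild memory from a full checkpoint plus later deltas, in order."""
    image = dict(full)
    for delta in deltas:
        image.update(delta)
    mem = CheckpointedMemory(num_pages)
    mem.pages = [image[i] for i in range(num_pages)]
    return mem
```

In this model an incremental checkpoint costs space proportional to the number of pages written since the last checkpoint, whereas a full checkpoint always costs the whole address space. Recovery replays the most recent full checkpoint followed by each later delta in order.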

Authors:
Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod
Publication Date:
2006-04-15
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
918857
Report Number(s):
PNNL-SA-52256
KJ0101030; TRN: US200820%%27
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Journal Article
Resource Relation:
Journal Name: Operating Systems Review, 40(2):55 - 62; Journal Volume: 40; Journal Issue: 2
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER ARCHITECTURE; DESIGN; MEMORY MANAGEMENT; ERRORS; MITIGATION; T CODES

Citation Formats

Petrini, Fabrizio, Nieplocha, Jarek, and Tipparaju, Vinod. SFT: Scalable Fault Tolerance. United States: N. p., 2006. Web. doi:10.1145/1131322.1131336.
Petrini, Fabrizio, Nieplocha, Jarek, & Tipparaju, Vinod. (2006). SFT: Scalable Fault Tolerance. United States. doi:10.1145/1131322.1131336.
Petrini, Fabrizio, Nieplocha, Jarek, and Tipparaju, Vinod. 2006. "SFT: Scalable Fault Tolerance". United States. doi:10.1145/1131322.1131336.
@article{osti_918857,
title = {SFT: Scalable Fault Tolerance},
author = {Petrini, Fabrizio and Nieplocha, Jarek and Tipparaju, Vinod},
abstractNote = {In this paper we present a new technology that we are developing within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continued execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, meaning that user applications require no changes. Our technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and on TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and requires no changes to user code or system libraries. It is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs. It supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.},
doi = {10.1145/1131322.1131336},
journal = {Operating Systems Review},
pages = {55--62},
number = 2,
volume = 40,
place = {United States},
year = {2006},
month = apr
}