SFT: Scalable Fault Tolerance

Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod

doi:10.1145/1131322.1131336


Journal Article · April 15, 2006 · Operating Systems Review, 40(2):55-62

DOI: https://doi.org/10.1145/1131322.1131336 · OSTI ID: 918857


This paper presents a technology under development within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault tolerance at the operating system level. The major design goals are dynamic reallocation of resources so that execution can continue in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, meaning that user applications require no changes. The technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and on TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and requires no changes to user code or system libraries. It is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs. It supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
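The difference between the full and incremental checkpoints mentioned in the abstract can be illustrated with a toy, user-space model. The sketch below is not the TICK kernel module and makes no claim about how TICK tracks memory; the class `PagedMemory`, the functions, and the 4096-byte page size are all hypothetical names and values chosen for illustration. It assumes that a full checkpoint copies every page, while an incremental checkpoint copies only the pages written since the previous checkpoint.

```python
"""Toy page-level model of full vs. incremental checkpointing (illustrative only)."""

PAGE_SIZE = 4096  # bytes per page; an assumed value, not taken from the paper


class PagedMemory:
    """Simulated address space that records which pages were written."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()  # page indices modified since the last checkpoint

    def write(self, page_index, data):
        if len(data) > PAGE_SIZE:
            raise ValueError("data does not fit in one page")
        # Pad to a whole page and mark the page as dirty.
        self.pages[page_index] = data.ljust(PAGE_SIZE, b"\0")
        self.dirty.add(page_index)


def full_checkpoint(mem):
    """Copy every page and reset dirty tracking."""
    snapshot = dict(enumerate(mem.pages))
    mem.dirty.clear()
    return snapshot


def incremental_checkpoint(mem):
    """Copy only the pages written since the previous checkpoint."""
    delta = {i: mem.pages[i] for i in sorted(mem.dirty)}
    mem.dirty.clear()
    return delta


def restore(base, deltas):
    """Rebuild the page list from a full checkpoint plus deltas, oldest first."""
    pages = dict(base)
    for delta in deltas:
        pages.update(delta)
    return [pages[i] for i in range(len(pages))]


mem = PagedMemory(num_pages=8)
mem.write(0, b"initial state")
base = full_checkpoint(mem)            # copies all 8 pages

mem.write(3, b"step 1")
delta1 = incremental_checkpoint(mem)   # copies only page 3

mem.write(3, b"step 2")
mem.write(5, b"more data")
delta2 = incremental_checkpoint(mem)   # copies pages 3 and 5

recovered = restore(base, [delta1, delta2])
```

In this model the saving comes from the size of each delta: when only a few pages change between checkpoints, an incremental checkpoint writes far less data than a full one, at the cost of replaying the deltas in order during recovery.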


Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-76RL01830

OSTI ID: 918857

Report Number(s): PNNL-SA-52256; KJ0101030; TRN: US200820%%27

Journal Information: Operating Systems Review, Vol. 40, Issue 2, pp. 55-62

Country of Publication: United States

Language: English

Similar Records

System-level fault-tolerance in large-scale parallel machines with buffered coscheduling

Conference · January 1, 2004 · OSTI ID: 918857

Petrini, F; Davis, Kei; Sancho, J C

Lightweight storage and overlay networks for fault tolerance.

Technical Report · January 1, 2010 · OSTI ID: 918857

Oldfield, Ron A

HPC application fault-tolerance using transparent redundant computation.

Conference · August 1, 2009 · OSTI ID: 918857

Riesen, Rolf E; Laros, III, James H; Pedretti, Kevin Thomas Tauke; +3 more

Related Subjects

97 MATHEMATICS AND COMPUTING
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
COMPUTER ARCHITECTURE
DESIGN
MEMORY MANAGEMENT
ERRORS
MITIGATION
T CODES
