OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Evaluation Of Fault-Tolerant Policies Using Simulation

Abstract

Various mechanisms for fault tolerance (FT) are used today to reduce the impact of failures on application execution. In the case of system failure, the standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for proactive FT). However, each of these mechanisms creates overhead on application execution, and this overhead becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work. To decrease this overhead, researchers try both to optimize existing FT mechanisms and to implement new FT policies, for instance by combining reactive and proactive approaches to decrease the number of checkpoints that must be performed during the application's execution. However, no solutions currently exist that enable the evaluation of these FT approaches through simulation; instead, experiments must be run on real platforms. This increases complexity and limits experimentation with alternative solutions. This paper presents a simulation framework that evaluates different FT mechanisms and policies. The framework uses system failure logs for the simulation, with a default behavior based on logs taken from the ASCI White system at Lawrence Livermore National Laboratory. We evaluate the accuracy of our simulator by comparing simulated results with those taken from experiments done on a 32-node compute cluster. Such a simulator can therefore be used to develop new FT policies and/or to tune existing policies.

Authors:
Tikotekar, Anand A [1]; Vallee, Geoffroy R [1]; Naughton, III, Thomas J [1]; Scott, Stephen L [1]
  1. ORNL
Publication Date:
2007-01-01
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
965827
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE Cluster 2007, Austin, TX, USA, 2007-09-17 to 2007-09-21
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTERS; ERRORS; EVALUATION; SIMULATORS; FAILURE MODE ANALYSIS

Citation Formats

Tikotekar, Anand A, Vallee, Geoffroy R, Naughton, III, Thomas J, and Scott, Stephen L. Evaluation Of Fault-Tolerant Policies Using Simulation. United States: N. p., 2007. Web.
Tikotekar, Anand A, Vallee, Geoffroy R, Naughton, III, Thomas J, & Scott, Stephen L. Evaluation Of Fault-Tolerant Policies Using Simulation. United States.
Tikotekar, Anand A, Vallee, Geoffroy R, Naughton, III, Thomas J, and Scott, Stephen L. 2007. "Evaluation Of Fault-Tolerant Policies Using Simulation". United States.
@article{osti_965827,
title = {Evaluation Of Fault-Tolerant Policies Using Simulation},
author = {Tikotekar, Anand A and Vallee, Geoffroy R and Naughton, III, Thomas J and Scott, Stephen L},
abstractNote = {Various mechanisms for fault tolerance (FT) are used today to reduce the impact of failures on application execution. In the case of system failure, the standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for proactive FT). However, each of these mechanisms creates overhead on application execution, and this overhead becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work. To decrease this overhead, researchers try both to optimize existing FT mechanisms and to implement new FT policies, for instance by combining reactive and proactive approaches to decrease the number of checkpoints that must be performed during the application's execution. However, no solutions currently exist that enable the evaluation of these FT approaches through simulation; instead, experiments must be run on real platforms. This increases complexity and limits experimentation with alternative solutions. This paper presents a simulation framework that evaluates different FT mechanisms and policies. The framework uses system failure logs for the simulation, with a default behavior based on logs taken from the ASCI White system at Lawrence Livermore National Laboratory. We evaluate the accuracy of our simulator by comparing simulated results with those taken from experiments done on a 32-node compute cluster. Such a simulator can therefore be used to develop new FT policies and/or to tune existing policies.},
place = {United States},
year = {2007},
month = {1}
}
