Supporting Fault-Tolerance for Time-Critical Events in Distributed Environments
Qian Zhu, Gagan Agrawal
Department of Computer Science and Engineering
Ohio State University, Columbus OH 43210
{zhuq,agrawal}@cse.ohio-state.edu
Abstract
In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous
and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function
while meeting the time deadline. Our first contribution in this paper is a multi-objective optimization algorithm for scheduling the
application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also
maximizing the success rate, which is the probability of finishing execution without failures. However, for the cases where failures
do occur, we have developed a hybrid failure-recovery scheme to ensure that the application can complete within the pre-specified
time interval. Our experimental results show that our scheduling algorithm can achieve better benefit when compared to several
heuristic-based greedy scheduling algorithms, while incurring negligible overhead. Benefit is further improved when we apply
the hybrid failure-recovery scheme, and the success rate becomes 100%.
1 Introduction
Grid or utility based computing models allow flexible use of resources by applications. Resource discovery and resource
allocation are among the problems most widely studied in grid computing [8, 29, 12, 25]. A key problem faced by applications
executing in a grid computing environment is the inherent unreliability of the resources. As one considers a variety of commodity
resources as part of a grid, resource failures could occur during the execution of an application. Thus, resource allocation in a

  

Source: Agrawal, Gagan - Department of Computer Science and Engineering, Ohio State University

 

Collections: Computer Technologies and Information Sciences