Summary: Supporting Fault-Tolerance in Streaming Grid
Applications
Qian Zhu Liang Chen Gagan Agrawal
Department of Computer Science and Engineering
Ohio State University
Columbus, OH, 43210
{zhuq,chenlia,agrawal}@cse.ohio-state.edu
Abstract-- This paper considers the problem of supporting and
efficiently implementing fault-tolerance for tightly-coupled and
pipelined applications, especially streaming applications, in a grid
environment. We provide an alternative to basic checkpointing
and use the notion of Light-weight Summary Structure (LSS) to
enable efficient failure-recovery. The idea behind LSS is that at
certain points during the execution of a processing stage, the
state of the program can be summarized by a small amount
of memory. Storing copies of the LSS is therefore enough to enable
failure-recovery, which keeps the overhead of fault-tolerance low. Our
work can be viewed as an optimization and adaptation of the
idea of application-level checkpointing to a different execution
environment, and for a different class of applications.
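
The LSS idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the `Summary` record, the `AveragingStage` class, the batch boundary used as the summarization point, and the `fail_at` parameter are all hypothetical names chosen for illustration. The sketch also assumes that the upstream stage can replay items from the offset recorded in the summary.

```python
import copy
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical LSS: a small record that captures the stage's state."""
    next_offset: int = 0  # index of the first item not yet in the summary
    count: int = 0        # number of items folded into the running total
    total: float = 0.0    # running sum of the items seen so far


class AveragingStage:
    """Toy processing stage that computes the mean of a stream."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.state = Summary()
        # Stands in for a copy of the LSS kept on another node (assumption).
        self.saved = Summary()

    def process(self, stream, fail_at=None):
        i = self.state.next_offset
        while i < len(stream):
            if fail_at is not None and i == fail_at:
                raise RuntimeError("simulated node failure")
            self.state.count += 1
            self.state.total += stream[i]
            i += 1
            self.state.next_offset = i
            # At a batch boundary the whole state fits in the summary,
            # so only this small record is copied, not the full process image.
            if i % self.batch_size == 0:
                self.saved = copy.copy(self.state)
        if self.state.count == 0:
            return 0.0
        return self.state.total / self.state.count

    def recover(self):
        # Roll back to the last stored summary; items after it are replayed.
        self.state = copy.copy(self.saved)


def run_with_failure(stream, fail_at):
    """Run the stage, inject one failure, recover from the LSS, and finish."""
    stage = AveragingStage()
    try:
        return stage.process(stream, fail_at=fail_at)
    except RuntimeError:
        stage.recover()
        return stage.process(stream)
```

In this sketch a failure at item 6 loses only the work done since the last batch boundary (items 4 and 5), which is redone after recovery. The stored record stays the same size however long the stream runs, which is the property the abstract relies on for low overhead.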