U.S. Department of Energy
Office of Scientific and Technical Information

Reliable and Efficient Distributed Checkpointing System for Grid Environments

Journal Article · Journal of Grid Computing
 [1];  [1];  [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation the owners experience is tolerable. Guest jobs can be evicted unpredictably, however, which leads to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers, but in geographically distributed clusters this may incur high checkpoint transfer latencies. Here we present a distributed checkpointing system called FALCON that uses available disk resources of the FGCS machines as shared checkpoint repositories. Because an unavailable storage host may lead to loss of checkpoint data, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with FALCON in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
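The abstract does not describe the failure model or the prediction algorithm in detail. As a rough illustration only, the general idea of preferring storage hosts that are likely to stay available could be sketched as below. This is not FALCON's algorithm: the `StorageHost` type, the empirical survival estimate, the bandwidth tie-breaker, and the `min_reliability` threshold are all hypothetical assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StorageHost:
    # All fields are hypothetical; the abstract does not specify them.
    name: str
    uptimes_hours: List[float]  # lengths of past availability intervals
    bandwidth_mbps: float       # measured bandwidth to this host


def survival_probability(host: StorageHost, age_hours: float,
                         horizon_hours: float) -> float:
    """Empirical estimate of P(uptime > age + horizon | uptime > age)."""
    alive_now = [u for u in host.uptimes_hours if u > age_hours]
    if not alive_now:
        return 0.0  # no history supports the host surviving this long
    alive_later = [u for u in alive_now if u > age_hours + horizon_hours]
    return len(alive_later) / len(alive_now)


def choose_repositories(hosts: List[StorageHost], ages: Dict[str, float],
                        horizon_hours: float, k: int,
                        min_reliability: float = 0.5) -> List[str]:
    """Return up to k host names, most reliable first.

    Hosts below min_reliability (an assumed cutoff) are skipped;
    ties in reliability are broken by higher bandwidth.
    """
    scored = []
    for host in hosts:
        p = survival_probability(host, ages[host.name], horizon_hours)
        if p >= min_reliability:
            scored.append((p, host.bandwidth_mbps, host.name))
    scored.sort(reverse=True)
    return [name for _, _, name in scored[:k]]
```

In this sketch a host with high bandwidth but a history of short availability intervals is skipped in favor of slower hosts that are more likely to remain reachable until the checkpoint is needed, which mirrors the trade-off the abstract raises between transfer latency and loss of checkpoint data.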
Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF); Purdue Research Foundation
DOE Contract Number:
AC52-07NA27344
OSTI ID:
1772312
Report Number(s):
LLNL-JRNL--649440; 769461
Journal Information:
Journal of Grid Computing, Vol. 12, Issue 4; ISSN 1570-7873
Publisher:
Springer
Country of Publication:
United States
Language:
English
