Reliable and Efficient Distributed Checkpointing System for Grid Environments
Journal Article · Journal of Grid Computing
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. Here we present a distributed checkpointing system called FALCON that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with FALCON in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
- Research Organization:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF); Purdue Research Foundation
- DOE Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1772312
- Report Number(s):
- LLNL-JRNL--649440; 769461
- Journal Information:
- Journal Name: Journal of Grid Computing; Journal Volume: 12; Journal Issue: 4; ISSN 1570-7873
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
Similar Records
Efficient Checkpointing of Virtual Machines using Virtual Machine Introspection
Conference · 2013 · OSTI ID:1134179
Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems
Conference · 2013 · OSTI ID:1130431
Using the Sirocco File System for high-bandwidth checkpoints.
Technical Report · 2012 · OSTI ID:1039010