Template based parallel checkpointing in a massively parallel computer system
Abstract
A method and apparatus are provided for a template-based parallel checkpoint save for a massively parallel supercomputer system, using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, giving faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high-speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
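The comparison against a template can be illustrated with a minimal single-process sketch. It is not the patented method: it assumes fixed-size, aligned blocks instead of rsync's rolling checksums, it omits the network broadcast, and the names `BLOCK_SIZE`, `block_checksums`, `make_delta`, and `restore` are hypothetical. It only shows the general idea of storing, per node, the blocks that differ from the template together with their checksums, compressed losslessly.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size; the abstract does not specify one


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut data into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_checksums(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Checksum every block of the template checkpoint."""
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data, size)]


def make_delta(template_sums: list, node_data: bytes, size: int = BLOCK_SIZE) -> dict:
    """Keep only the node's blocks whose checksum differs from the template."""
    changed = {}
    for index, block in enumerate(split_blocks(node_data, size)):
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(template_sums) or template_sums[index] != digest:
            # Store the checksum alongside the losslessly compressed block.
            changed[index] = (digest, zlib.compress(block))
    return {"length": len(node_data), "blocks": changed}


def restore(template: bytes, delta: dict, size: int = BLOCK_SIZE) -> bytes:
    """Rebuild a node's checkpoint from the template plus its stored blocks."""
    template_blocks = split_blocks(template, size)
    block_count = -(-delta["length"] // size)  # ceiling division
    out = []
    for index in range(block_count):
        if index in delta["blocks"]:
            digest, packed = delta["blocks"][index]
            block = zlib.decompress(packed)
            if hashlib.sha256(block).hexdigest() != digest:
                raise ValueError(f"checksum mismatch in block {index}")
        else:
            # Unchanged block: reuse the template's copy.
            block = template_blocks[index]
        out.append(block)
    return b"".join(out)[:delta["length"]]
```

In this sketch, a node whose state differs from the template in a single block stores only that one block, so the checkpoint size grows with the amount of change and not with the total memory per node.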
- Inventors:
- Archer, Charles Jens; Inglett, Todd Alan
- Rochester, MN
- Issue Date:
- 2009-01-13
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 985865
- Patent Number(s):
- 7478278
- Application Number:
- 11/106,010
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- B519700
- Resource Type:
- Patent
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Archer, Charles Jens, and Inglett, Todd Alan. Template based parallel checkpointing in a massively parallel computer system. United States: N. p., 2009. Web.
Archer, Charles Jens, & Inglett, Todd Alan. Template based parallel checkpointing in a massively parallel computer system. United States.
Archer, Charles Jens, and Inglett, Todd Alan. 2009. "Template based parallel checkpointing in a massively parallel computer system". United States. https://www.osti.gov/servlets/purl/985865.
@article{osti_985865,
title = {Template based parallel checkpointing in a massively parallel computer system},
author = {Archer, Charles Jens and Inglett, Todd Alan},
abstractNote = {A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2009},
month = {1}
}