Template based parallel checkpointing in a massively parallel computer system

Patent · Tue Jan 13 04:00:00 EST 2009

OSTI ID:985865

Archer, Charles Jens (Rochester, MN); Inglett, Todd Alan (Rochester, MN)

A method and apparatus for a template-based parallel checkpoint save on a massively parallel supercomputer system, using a parallel variation of the rsync protocol together with network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, which speeds up checkpointing and increases the efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high-speed interconnect that can perform broadcast communication. The checkpoint contains a set of small blocks of actual data, with their corresponding checksums, from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
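
The comparison against a template described in the abstract can be illustrated with a minimal single-process sketch. It is not the patented method: it omits the broadcast and the parallel rsync variation, compares blocks at fixed offsets rather than with rsync's rolling checksum, and every name in it (`BLOCK_SIZE`, `block_checksums`, `make_delta`, `restore`) is hypothetical. The choice of SHA-256 for checksums and zlib for lossless compression is likewise an assumption; the abstract names neither.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size; the abstract does not specify one


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_checksums(data, size=BLOCK_SIZE):
    """Checksum every block of the template (SHA-256 is an assumption)."""
    return [hashlib.sha256(block).digest() for block in split_blocks(data, size)]


def make_delta(node_data, template_sums, size=BLOCK_SIZE):
    """Keep only the blocks of one node's checkpoint that differ from the template."""
    changed = {}
    for index, block in enumerate(split_blocks(node_data, size)):
        checksum = hashlib.sha256(block).digest()
        if index >= len(template_sums) or template_sums[index] != checksum:
            # Store the checksum with the losslessly compressed block.
            changed[index] = (checksum, zlib.compress(block))
    return {"length": len(node_data), "blocks": changed}


def restore(template, delta, size=BLOCK_SIZE):
    """Rebuild a node's checkpoint from the template plus its stored blocks."""
    template_blocks = split_blocks(template, size)
    block_count = -(-delta["length"] // size)  # ceiling division
    rebuilt = []
    for index in range(block_count):
        if index in delta["blocks"]:
            checksum, packed = delta["blocks"][index]
            block = zlib.decompress(packed)
            if hashlib.sha256(block).digest() != checksum:
                raise ValueError(f"block {index} failed its checksum")
        else:
            block = template_blocks[index]
        rebuilt.append(block)
    return b"".join(rebuilt)


# Usage: a node whose memory differs from the template in a single byte
# stores one block instead of four.
template = bytes(range(256)) * 64          # 16384 bytes, i.e. four blocks
node = bytearray(template)
node[5000] ^= 0xFF                          # change one byte inside block 1
node = bytes(node)

delta = make_delta(node, block_checksums(template))
assert sorted(delta["blocks"]) == [1]
assert restore(template, delta) == node
```

In this sketch the saving comes from the unchanged blocks never being stored; only their absence from the delta records that they match the template.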

Research Organization: International Business Machines Corporation (Armonk, NY)

Sponsoring Organization: USDOE

Assignee: International Business Machines Corporation (Armonk, NY)

Patent Number(s): 7,478,278

Application Number: 11/106,010

Country of Publication: United States

Language: English

References (6)

Petrini, F.; Davis, K.; Sancho, J. C. System-level fault-tolerance in large-scale parallel machines with buffered coscheduling. 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings. Conference, January 2004. https://doi.org/10.1109/IPDPS.2004.1303239
Chiueh, Tzi-Cker; Deng, Peitao. Evaluation of checkpoint mechanisms for massively parallel machines. Proceedings of Annual Symposium on Fault Tolerant Computing. Conference, June 1996. https://doi.org/10.1109/FTCS.1996.534622
Bosilca, G.; Bouteiller, A.; Cappello, F. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. ACM/IEEE SC 2002 Conference (SC'02). Conference, January 2002. https://doi.org/10.1109/SC.2002.10048
Plank, J. S. ickp: a consistent checkpointer for multicomputers. IEEE Parallel & Distributed Technology: Systems & Applications, Vol. 2, Issue 2. Journal, July 1994. https://doi.org/10.1109/88.311574
Cummings, D.; Alkalaj, L. Checkpoint/rollback in a distributed system using coarse-grained dataflow. Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing. Conference, January 1994. https://doi.org/10.1109/FTCS.1994.315619
Chen, Yuqun; Plank, James S.; Li, Kai. CLIP: a checkpointing tool for message-passing parallel programs. Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '97. Conference, January 1997. https://doi.org/10.1145/509593.509626

Related Subjects

97 MATHEMATICS AND COMPUTING
