Template based parallel checkpointing in a massively parallel computer system
Patent · OSTI ID:985865
- Rochester, MN
A method and apparatus for a template-based parallel checkpoint save for a massively parallel supercomputer system, using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, which allows faster checkpointing and increases the efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high-speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks, with their corresponding checksums, from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
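The comparison against a template described in the abstract can be illustrated with a minimal sketch. It is not the patented method: it compares fixed-size blocks by position rather than using rsync's rolling checksum, and it does not model the network broadcast or the parallel execution across nodes. The block size, the SHA-256 checksum, the zlib compression, and every function name below are assumptions chosen for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size; the record does not specify one


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def checksum(block):
    """Checksum of one block (SHA-256 is an assumption, not from the record)."""
    return hashlib.sha256(block).hexdigest()


def make_delta(node_data, template, block_size=BLOCK_SIZE):
    """Keep only the blocks of node_data that differ from the template."""
    template_sums = [checksum(b) for b in split_blocks(template, block_size)]
    changed = {}
    for index, block in enumerate(split_blocks(node_data, block_size)):
        digest = checksum(block)
        if index >= len(template_sums) or digest != template_sums[index]:
            # Store the checksum with the losslessly compressed block.
            changed[index] = (digest, zlib.compress(block))
    return {"length": len(node_data), "blocks": changed}


def restore(delta, template, block_size=BLOCK_SIZE):
    """Rebuild a node's checkpoint data from the template plus its delta."""
    template_blocks = split_blocks(template, block_size)
    block_count = -(-delta["length"] // block_size)  # ceiling division
    out = []
    for index in range(block_count):
        if index in delta["blocks"]:
            digest, packed = delta["blocks"][index]
            block = zlib.decompress(packed)
            if checksum(block) != digest:
                raise ValueError(f"checksum mismatch in block {index}")
        else:
            block = template_blocks[index]
        out.append(block)
    return b"".join(out)


if __name__ == "__main__":
    template = bytes(range(256)) * 64        # 16384 bytes, four blocks
    node = bytearray(template)
    node[5000] ^= 0xFF                       # change one byte in block 1
    delta = make_delta(bytes(node), template)
    print(sorted(delta["blocks"]))           # indices of the stored blocks
    print(restore(delta, template) == bytes(node))
```

In this sketch a node whose memory matches the template stores nothing beyond its length, which is the source of the savings the abstract claims; how the real system selects blocks and distributes the template is not stated in this record.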
- Research Organization: International Business Machines Corporation (Armonk, NY)
- Sponsoring Organization: USDOE
- Assignee: International Business Machines Corporation (Armonk, NY)
- Patent Number(s): 7,478,278
- Application Number: 11/106,010
- OSTI ID: 985865
- Country of Publication: United States
- Language: English
Similar Records
The Scalable Checkpoint/Restart Library
Software · Sun Feb 22 19:00:00 EST 2009 · OSTI ID:code-1155
Parallel checksumming of data chunks of a shared data object using a log-structured file system
Patent · Tue Sep 06 00:00:00 EDT 2016 · OSTI ID:1320885
Cloud object store for checkpoints of high performance computing applications using decoupling middleware
Patent · Tue Apr 19 00:00:00 EDT 2016 · OSTI ID:1247993