OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Template based parallel checkpointing in a massively parallel computer system

Abstract

A method and apparatus provide a template based parallel checkpoint save for a massively parallel supercomputer system, using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, which speeds up checkpointing and increases the efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks, with their corresponding checksums, from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
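The comparison against a template described above can be illustrated with a minimal single-process sketch. It is not the patented method: it assumes fixed-offset blocks rather than rsync's rolling checksums, does not model the network broadcast, and every name in it (`BLOCK_SIZE`, `build_checkpoint`, `restore_node`) is hypothetical. It only shows the general idea of storing, per node, the blocks whose checksums differ from the template, with identical blocks shared across nodes and compressed.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, not taken from the patent


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def checksum(block: bytes) -> str:
    """Checksum used to compare a node's block with the template's block."""
    return hashlib.sha256(block).hexdigest()


def build_checkpoint(template: bytes, node_states: dict[int, bytes],
                     block_size: int = BLOCK_SIZE):
    """Return (store, per_node) holding only blocks that differ from the template.

    store    maps checksum -> compressed block, shared by all nodes.
    per_node maps node id -> (state length, [(block index, checksum), ...]).
    """
    template_sums = [checksum(b) for b in split_blocks(template, block_size)]
    store: dict[str, bytes] = {}
    per_node: dict[int, tuple[int, list[tuple[int, str]]]] = {}
    for node_id, state in node_states.items():
        diffs = []
        for index, block in enumerate(split_blocks(state, block_size)):
            digest = checksum(block)
            if index < len(template_sums) and template_sums[index] == digest:
                continue  # block matches the template, nothing to save
            store.setdefault(digest, zlib.compress(block))  # lossless compression
            diffs.append((index, digest))
        per_node[node_id] = (len(state), diffs)
    return store, per_node


def restore_node(template: bytes, store: dict[str, bytes], entry,
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild one node's state from the template plus its saved blocks."""
    length, diffs = entry
    n_blocks = -(-length // block_size)  # ceiling division
    blocks = split_blocks(template, block_size)[:n_blocks]
    blocks += [b""] * (n_blocks - len(blocks))  # placeholders past the template
    for index, digest in diffs:
        blocks[index] = zlib.decompress(store[digest])
    return b"".join(blocks)[:length]
```

In this sketch, a node whose state equals the template contributes no blocks at all, and several nodes that changed the same block in the same way share one stored copy, which is where the reduction in transmitted and stored data would come from.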

Inventors:
Archer, Charles Jens [1]; Inglett, Todd Alan [1]
  1. Rochester, MN
Publication Date:
2009-01-13
Research Org.:
International Business Machines Corporation (Armonk, NY)
Sponsoring Org.:
USDOE
OSTI Identifier:
985865
Patent Number(s):
7,478,278
Application Number:
11/106,010
Assignee:
International Business Machines Corporation (Armonk, NY)
DOE Contract Number:  
B519700
Resource Type:
Patent
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Archer, Charles Jens, and Inglett, Todd Alan. Template based parallel checkpointing in a massively parallel computer system. United States: N. p., 2009. Web.
Archer, Charles Jens, & Inglett, Todd Alan. Template based parallel checkpointing in a massively parallel computer system. United States.
Archer, Charles Jens, and Inglett, Todd Alan. 2009. "Template based parallel checkpointing in a massively parallel computer system". United States. https://www.osti.gov/servlets/purl/985865.
@article{osti_985865,
title = {Template based parallel checkpointing in a massively parallel computer system},
author = {Archer, Charles Jens and Inglett, Todd Alan},
abstractNote = {A method and apparatus provide a template based parallel checkpoint save for a massively parallel supercomputer system, using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, which speeds up checkpointing and increases the efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks, with their corresponding checksums, from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.},
place = {United States},
year = {2009},
month = {1}
}
