DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Checkpointing using compute node health information

Abstract

A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.
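The partner-selection step in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that per-node failure probabilities for the time interval have already been derived from health information, that node failures are independent (so the composite probability of a pair is the product of the two probabilities), and that a greedy pairing of the most failure-prone node with the least failure-prone one is acceptable. The function names `composite_failure_probability`, `pair_nodes`, and `copy_checkpoints` are hypothetical.

```python
from typing import Dict, List, Tuple


def composite_failure_probability(p_first: float, p_partner: float) -> float:
    """Probability that both nodes fail in the interval.

    Assumption (not from the record): failures are independent,
    so the composite probability is the product.
    """
    return p_first * p_partner


def pair_nodes(failure_prob: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (first_node, partner_node) pairs.

    Greedy heuristic chosen for illustration: rank nodes by failure
    probability and pair the riskiest with the safest, working inward.
    With an odd node count the middle node is left unpaired.
    """
    ranked = sorted(failure_prob, key=failure_prob.get)  # safest first
    pairs = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi:
        pairs.append((ranked[hi], ranked[lo]))
        lo += 1
        hi -= 1
    return pairs


def copy_checkpoints(
    pairs: List[Tuple[str, str]], checkpoints: Dict[str, bytes]
) -> Dict[str, Dict[str, bytes]]:
    """Copy each first node's checkpoint to its partner.

    Returns a mapping partner -> {first_node: checkpoint_bytes};
    a real system would transfer the data over the interconnect.
    """
    replicas: Dict[str, Dict[str, bytes]] = {}
    for first, partner in pairs:
        replicas.setdefault(partner, {})[first] = checkpoints[first]
    return replicas
```

Under the independence assumption, any partner whose failure probability is below 1 yields a composite probability lower than the first node's own, which matches the condition stated in the abstract. Pairing the riskiest node with the safest one simply makes that reduction as large as possible for the nodes most likely to fail.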

Inventors:
Andrade Costa, Carlos Henrique; Park, Yoonho; Cher, Chen-Yong; Rosenburg, Bryan S.; Ryu, Kyung
Issue Date:
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1632496
Patent Number(s):
10545839
Application Number:
15/853,343
Assignee:
International Business Machines Corporation (Armonk, NY)
DOE Contract Number:  
B599858
Resource Type:
Patent
Resource Relation:
Patent File Date: 12/22/2017
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. Checkpointing using compute node health information. United States: N. p., 2020. Web.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., & Ryu, Kyung. Checkpointing using compute node health information. United States.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. Tue . "Checkpointing using compute node health information". United States. https://www.osti.gov/servlets/purl/1632496.
@article{osti_1632496,
title = {Checkpointing using compute node health information},
author = {Andrade Costa, Carlos Henrique and Park, Yoonho and Cher, Chen-Yong and Rosenburg, Bryan S. and Ryu, Kyung},
abstractNote = {A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.},
place = {United States},
year = {2020},
month = {1}
}
