Checkpointing using compute node health information
Abstract
A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.
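The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The `Node` class, the `failure_rate_per_hour` health metric, the exponential failure model, and the independence assumption behind the composite probability are all hypothetical choices made here for illustration; the abstract does not specify any of them.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    failure_rate_per_hour: float  # hypothetical health metric (assumption)
    checkpoints: dict = field(default_factory=dict)


def failure_probability(node, interval_hours):
    # Assumed exponential failure model; the source does not name a model.
    return 1.0 - math.exp(-node.failure_rate_per_hour * interval_hours)


def composite_failure_probability(p_first, p_second):
    # Assumes independent failures: the checkpoint is lost only if both
    # the first node and its partner fail within the interval.
    return p_first * p_second


def select_partner(first, candidates, interval_hours):
    # Pick the candidate that minimizes the composite failure probability,
    # accepting it only if the composite is below the first node's own value.
    p_first = failure_probability(first, interval_hours)
    best, best_composite = None, p_first
    for node in candidates:
        if node is first:
            continue
        p_node = failure_probability(node, interval_hours)
        composite = composite_failure_probability(p_first, p_node)
        if composite < best_composite:
            best, best_composite = node, composite
    return best, best_composite


def copy_checkpoint(first, partner, checkpoint):
    # Store the first node's checkpoint data on the partner node.
    partner.checkpoints[first.name] = checkpoint
```

A caller would compute the partner once per time interval, for example `partner, p = select_partner(nodes[0], nodes, 1.0)`, and then call `copy_checkpoint(nodes[0], partner, data)`. Under the independence assumption, any partner with a failure probability below 1 lowers the composite value, so the sketch prefers the healthiest candidate.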
- Inventors:
- Andrade Costa, Carlos Henrique; Park, Yoonho; Cher, Chen-Yong; Rosenburg, Bryan S.; Ryu, Kyung
- Issue Date:
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1632496
- Patent Number(s):
- 10545839
- Application Number:
- 15/853,343
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- DOE Contract Number:
- B599858
- Resource Type:
- Patent
- Resource Relation:
- Patent File Date: 12/22/2017
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. Checkpointing using compute node health information. United States: N. p., 2020. Web.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., & Ryu, Kyung. Checkpointing using compute node health information. United States.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. 2020. "Checkpointing using compute node health information". United States. https://www.osti.gov/servlets/purl/1632496.
@article{osti_1632496,
title = {Checkpointing using compute node health information},
author = {Andrade Costa, Carlos Henrique and Park, Yoonho and Cher, Chen-Yong and Rosenburg, Bryan S. and Ryu, Kyung},
abstractNote = {A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2020},
month = {1}
}
Works referenced in this record:
Controller placement for fast failover in the split architecture
patent, August 2014
- Tatipamula, Mallik; Beheshti-Zavareh, Neda; Zhang, Ying
- US Patent Document 8,804,490
Planning a reliable migration in a limited stability virtualized environment
patent, September 2014
- Glikson, Alexander; Israel, Assaf
- US Patent Document 8,826,272
Job scheduling on a multiprocessing system based on reliability and performance rankings of processors and weighted effect of detected errors
patent, October 2014
- Shivanna, Suhas; Krishnapuram Ranganathan, Karthik
- US Patent Document 8,875,142
Automated high resiliency system pool
patent, February 2015
- Sloma, Andrew J.; Triebenbach, Jonathan L.
- US Patent Document 8,959,223
Monitoring device, radio communication system, failure prediction method and non-temporary computer-readable medium in which a program is stored
patent, February 2018
- Kitahara, Yoshinori
- US Patent Document 9,900,791
Method and system for defining an efficient and reliable meshing of CP-CP sessions in an advanced peer to peer network
patent, March 2003
- Giroir, Didier
- US Patent Document 6,535,923
Methods, apparatus and system for selective duplication of subtasks
patent, March 2016
- Andrade Costa, Carlos H.; Cher, Chen-Yong; Park, Yoonho
- US Patent Document 9,298,553
Host swap hypervisor that provides high availability for a host of virtual machines
patent, March 2017
- Cao, Bin; Chen, Jim C.; Somers, Lauren A.
- US Patent Document 9,606,878
Increasing resilience of a network service
patent, October 2014
- Banerjee, Dipyaman; Madduri, Venkateswara R.; Srivatsa, Mudhakar
- US Patent Document 8,869,035
System and method for dependent failure-aware allocation of distributed data-processing systems
patent, February 2012
- Bansal, Nikhil; Bhagwan, Ranjita; Park, Yoonho
- US Patent Document 8,122,281