DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Checkpointing using compute node health information

Abstract

A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.
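
The partner-selection step described in the abstract can be illustrated with a minimal Python sketch. The record does not say how health information is mapped to a failure probability or how the composite probability is computed, so everything specific here is an assumption: the `Node` type, the `failure_rate_per_hour` health metric, the exponential failure model, the independence assumption behind the product rule, and the in-memory `partner_store` are all hypothetical and are not taken from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    failure_rate_per_hour: float  # hypothetical health metric (assumption)


def failure_probability(node: Node, interval_hours: float) -> float:
    """P(node fails within the interval), assuming an exponential failure model."""
    return 1.0 - math.exp(-node.failure_rate_per_hour * interval_hours)


def composite_failure_probability(p_first: float, p_second: float) -> float:
    """P(both nodes fail within the interval), assuming independent failures."""
    return p_first * p_second


def select_partner(first: Node, candidates: list, interval_hours: float):
    """Pick the candidate whose composite failure probability with `first`
    is lowest, and strictly below the failure probability of `first` alone.
    Returns None if no candidate satisfies that condition."""
    p_first = failure_probability(first, interval_hours)
    best, best_p = None, p_first
    for cand in candidates:
        if cand is first:
            continue  # a node cannot be its own partner
        p_cand = failure_probability(cand, interval_hours)
        p_both = composite_failure_probability(p_first, p_cand)
        if p_both < best_p:
            best, best_p = cand, p_both
    return best


def copy_checkpoint(checkpoint: bytes, owner: Node, partner_store: dict) -> None:
    """Stand-in for copying checkpoint data to the partner node's storage."""
    partner_store[owner.name] = bytes(checkpoint)


nodes = [Node("n0", 0.01), Node("n1", 0.001), Node("n2", 0.05)]
partner = select_partner(nodes[0], nodes, interval_hours=1.0)
store_on_partner = {}
copy_checkpoint(b"state-of-n0", nodes[0], store_on_partner)
print(partner.name)  # n1
```

Under the independence assumption the product of two probabilities is below the first node's probability whenever the partner's probability is less than 1, so the sketch additionally prefers the healthiest candidate. A real system would likely also account for correlated failures, such as nodes sharing a rack or power supply.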

Inventors:
Andrade Costa, Carlos Henrique; Park, Yoonho; Cher, Chen-Yong; Rosenburg, Bryan S.; Ryu, Kyung
Issue Date:
01/28/2020
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1632496
Patent Number(s):
10545839
Application Number:
15/853,343
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
B599858
Resource Type:
Patent
Resource Relation:
Patent File Date: 12/22/2017
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. Checkpointing using compute node health information. United States: N. p., 2020. Web.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., & Ryu, Kyung. Checkpointing using compute node health information. United States.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. 2020. "Checkpointing using compute node health information". United States. https://www.osti.gov/servlets/purl/1632496.
@article{osti_1632496,
title = {Checkpointing using compute node health information},
author = {Andrade Costa, Carlos Henrique and Park, Yoonho and Cher, Chen-Yong and Rosenburg, Bryan S. and Ryu, Kyung},
abstractNote = {A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.},
place = {United States},
year = {2020},
month = {1}
}

Works referenced in this record:

Design and modeling of a non-blocking checkpointing system
conference, November 2012

  • Sato, Kento; Maruyama, Naoya; Mohror, Kathryn
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • https://doi.org/10.1109/SC.2012.46

Controller placement for fast failover in the split architecture
patent, August 2014


Multilevel Diskless Checkpointing
journal, April 2013


Circuit Failure Prediction and Its Application to Transistor Aging
conference, May 2007


Methods, Apparatus and System for Selective Duplication of Subtasks
patent-application, August 2015


Availability Prediction Method for High Availability Cluster
patent-application, June 2009


Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design
conference, January 2012

  • Hwang, Andy A.; Stefanovici, Ioan A.; Schroeder, Bianca
  • Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '12
  • https://doi.org/10.1145/2150976.2150989

Dynamic Configuration of Processor Core Banks
patent-application, January 2008


A tunable holistic resiliency approach for high-performance computing systems
conference, February 2009

  • Scott, Stephen L.; Engelmann, Christian; Vallée, Geoffroy R.
  • Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
  • https://doi.org/10.1145/1504176.1504227

A study of DRAM failures in the field
conference, November 2012

  • Sridharan, Vilas; Liberty, Dean
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • https://doi.org/10.1109/SC.2012.13

Toward Exascale Resilience
journal, September 2009


Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • https://doi.org/10.1109/SC.2010.18

Automated high resiliency system pool
patent, February 2015


Method of Predicting Availability of a System
patent-application, July 2008


Apparatus and Program Storage Device for Providing Triad Copy of Storage Data
patent-application, March 2009


Maintaining Routing Consistency within a Rendezvous Federation
patent-application, January 2008


Methods, apparatus and system for selective duplication of subtasks
patent, March 2016


Increasing resilience of a network service
patent, October 2014