Checkpointing using compute node health information
Abstract
A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.
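The partner-selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Node` type, the function names, and the example probabilities are hypothetical, and the composite failure probability is taken to be the product of the two individual probabilities, which assumes the nodes fail independently (the abstract does not say how the composite value is computed).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    # Hypothetical record: the failure probability for the predetermined
    # time interval is assumed to have been derived from health information.
    name: str
    failure_probability: float


def composite_failure_probability(a: Node, b: Node) -> float:
    """Probability that both nodes fail in the interval.

    ASSUMPTION: failures are independent, so the probabilities multiply.
    """
    return a.failure_probability * b.failure_probability


def select_partner(first: Node, candidates: List[Node]) -> Optional[Node]:
    """Pick the candidate that minimizes the composite failure probability.

    A candidate is accepted only if the composite probability is strictly
    less than the first node's own failure probability.
    """
    best: Optional[Node] = None
    for candidate in candidates:
        if candidate.name == first.name:
            continue  # a node cannot be its own partner
        composite = composite_failure_probability(first, candidate)
        if composite >= first.failure_probability:
            continue  # pairing would not reduce the risk of losing the checkpoint
        if best is None or composite < composite_failure_probability(first, best):
            best = candidate
    return best


if __name__ == "__main__":
    nodes = [Node("n0", 0.20), Node("n1", 0.05), Node("n2", 0.10)]
    partner = select_partner(nodes[0], nodes)
    print(partner)
```

Under the independence assumption, any partner with a failure probability below 1 lowers the composite value, so the sketch simply prefers the healthiest available node. Once a partner is chosen, checkpoint information would be copied from the first node to it.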
- Inventors:
- Andrade Costa, Carlos Henrique; Park, Yoonho; Cher, Chen-Yong; Rosenburg, Bryan S.; Ryu, Kyung
- Issue Date:
- 01/28/2020
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1632496
- Patent Number(s):
- 10545839
- Application Number:
- 15/853,343
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- B599858
- Resource Type:
- Patent
- Resource Relation:
- Patent File Date: 12/22/2017
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. Checkpointing using compute node health information. United States: N. p., 2020. Web.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., & Ryu, Kyung. Checkpointing using compute node health information. United States.
Andrade Costa, Carlos Henrique, Park, Yoonho, Cher, Chen-Yong, Rosenburg, Bryan S., and Ryu, Kyung. 2020. "Checkpointing using compute node health information". United States. https://www.osti.gov/servlets/purl/1632496.
@article{osti_1632496,
title = {Checkpointing using compute node health information},
author = {Andrade Costa, Carlos Henrique and Park, Yoonho and Cher, Chen-Yong and Rosenburg, Bryan S. and Ryu, Kyung},
abstractNote = {A method is disclosed, as well as an associated apparatus and computer program product, for checkpointing using a plurality of communicatively coupled compute nodes. The method comprises acquiring health information for a first node of the plurality of compute nodes, and determining a first failure probability for the first node using the health information. The first failure probability corresponds to a predetermined time interval. The method further comprises selecting a second node of the plurality of compute nodes as a partner node for the first node. The second node has a second failure probability for the time interval. A composite failure probability of the first node and the second node is less than the first failure probability. The method further comprises copying checkpoint information from the first node to the partner node.},
place = {United States},
year = {2020},
month = {1}
}
Works referenced in this record:
Design and modeling of a non-blocking checkpointing system
conference, November 2012
- Sato, Kento; Maruyama, Naoya; Mohror, Kathryn
- 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
Controller placement for fast failover in the split architecture
patent, August 2014
- Tatipamula, Mallik; Beheshti-Zavareh, Neda; Zhang, Ying
- US Patent Document 8,804,490
Multilevel Diskless Checkpointing
journal, April 2013
- Hakkarinen, D.
- IEEE Transactions on Computers, Vol. 62, Issue 4
Circuit Failure Prediction and Its Application to Transistor Aging
conference, May 2007
- Agarwal, Mridul; Paul, Bipul C.; Zhang, Ming
- 25th IEEE VLSI Test Symposium (VTS'07)
Methods, Apparatus and System for Selective Duplication of Subtasks
patent-application, August 2015
- Andrade Costa, Carlos H.; Cher, Chen-Yong; Park, Yoonho
- US Patent Application 14/176083; 20150227426
Availability Prediction Method for High Availability Cluster
patent-application, June 2009
- Lee, Yong-Ju; Min, Ok-Gee; Kim, Chang-Soo
- US Patent Application 12/184707; 20090150717
Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design
conference, January 2012
- Hwang, Andy A.; Stefanovici, Ioan A.; Schroeder, Bianca
- Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '12
Host swap hypervisor that provides high availability for a host of virtual machines
patent, March 2017
- Cao, Bin; Chen, Jim C.; Somers, Lauren A.
- US Patent Document 9,606,878
Dynamic Configuration of Processor Core Banks
patent-application, January 2008
- Apparao, Padmashree K.; Velhal, Ravindra V.
- US Patent Application 11/479573; 20080005538
A tunable holistic resiliency approach for high-performance computing systems
conference, February 2009
- Scott, Stephen L.; Engelmann, Christian; Vallée, Geoffroy R.
- Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A study of DRAM failures in the field
conference, November 2012
- Sridharan, Vilas; Liberty, Dean
- 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
Planning a reliable migration in a limited stability virtualized environment
patent, September 2014
- Glikson, Alexander; Israel, Assaf
- US Patent Document 8,826,272
Toward Exascale Resilience
journal, September 2009
- Cappello, Franck; Geist, Al; Gropp, Bill
- The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
Job scheduling on a multiprocessing system based on reliability and performance rankings of processors and weighted effect of detected errors
patent, October 2014
- Shivanna, Suhas; Krishnapuram Ranganathan, Karthik
- US Patent Document 8,875,142
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010
- Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
- 2010 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
Automated high resiliency system pool
patent, February 2015
- Sloma, Andrew J.; Triebenbach, Jonathan L.
- US Patent Document 8,959,223
Method of Predicting Availability of a System
patent-application, July 2008
- Narayan, Ranjani; Varadarajan, Keshavan; Natanasabapathy, Gautham
- US Patent Application 11/885475; 20080168314
Monitoring device, radio communication system, failure prediction method and non-temporary computer-readable medium in which a program is stored
patent, February 2018
- Kitahara, Yoshinori
- US Patent Document 9,900,791
Apparatus and Program Storage Device for Providing Triad Copy of Storage Data
patent-application, March 2009
- Benhase, Michael T.; Hartuno, Michael H.; Hsu, Yu-Cheng
- US Patent Application 12/272645; 20090077414
Maintaining Routing Consistency within a Rendezvous Federation
patent-application, January 2008
- Kakivaya, Gopala Krishna R.; Hashna, Richard L.; Xun, Lu
- US Patent Application 11/549332; 20080005624
Method and system for defining an efficient and reliable meshing of CP-CP sessions in an advanced peer to peer network
patent, March 2003
- Giroir, Didier
- US Patent Document 6,535,923
Methods, apparatus and system for selective duplication of subtasks
patent, March 2016
- Andrade Costa, Carlos H.; Cher, Chen-Yong; Park, Yoonho
- US Patent Document 9,298,553
Increasing resilience of a network service
patent, October 2014
- Banerjee, Dipyaman; Madduri, Venkateswara R.; Srivatsa, Mudhakar
- US Patent Document 8,869,035
System and method for dependent failure-aware allocation of distributed data-processing systems
patent, February 2012
- Bansal, Nikhil; Bhagwan, Ranjita; Park, Yoonho
- US Patent Document 8,122,281