Checkpoint triggering in a computer system
Abstract
According to an aspect, a method for triggering creation of a checkpoint in a computer system includes executing a task in a processing node and determining whether it is time to read a monitor associated with a metric of the task. The monitor is read to determine a value of the metric based on determining that it is time to read the monitor. A threshold for triggering creation of the checkpoint is determined based on the metric. A monitoring block size is determined for the checkpoint. A checkpoint interval is determined based on the monitoring block size, a checkpoint bandwidth, and a failure rate of the computer system. Based on determining that the value of the metric has crossed the threshold and the checkpoint interval has elapsed, the checkpoint including state data of the task is created to enable restarting execution of the task upon a restart operation.
- Inventors:
- Issue Date:
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1489807
- Patent Number(s):
- 10089181
- Application Number:
- 15/194,884
- Assignee:
- INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
- Patent Classifications (CPCs):
-
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- B599858
- Resource Type:
- Patent
- Resource Relation:
- Patent File Date: 2016 Jun 28
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Cher, Chen-Yong. Checkpoint triggering in a computer system. United States: N. p., 2018.
Web.
Cher, Chen-Yong. Checkpoint triggering in a computer system. United States.
Cher, Chen-Yong. Tue .
"Checkpoint triggering in a computer system". United States. https://www.osti.gov/servlets/purl/1489807.
@article{osti_1489807,
title = {Checkpoint triggering in a computer system},
author = {Cher, Chen-Yong},
abstractNote = {According to an aspect, a method for triggering creation of a checkpoint in a computer system includes executing a task in a processing node and determining whether it is time to read a monitor associated with a metric of the task. The monitor is read to determine a value of the metric based on determining that it is time to read the monitor. A threshold for triggering creation of the checkpoint is determined based on the metric. A monitoring block size is determined for the checkpoint. A checkpoint interval is determined based on the monitoring block size, a checkpoint bandwidth, and a failure rate of the computer system. Based on determining that the value of the metric has crossed the threshold and the checkpoint interval has elapsed, the checkpoint including state data of the task is created to enable restarting execution of the task upon a restart operation.},
doi = {},
journal = {},
number = ,
volume = ,
place = {United States},
year = {2018},
month = {10}
}