DOE Patents: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Checkpointing for a hybrid computing node

Abstract

According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.
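The sequence in the abstract (execute, checkpoint into accelerator-local memory, resume, transfer to the main processor while execution continues) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented implementation: the `Accelerator` and `MainProcessor` classes, the `run_with_checkpoint` helper, and the use of a Python thread to stand in for an asynchronous device-to-host copy are all hypothetical names and choices introduced here for illustration.

```python
import copy
import threading


class Accelerator:
    """Hypothetical stand-in for a processing accelerator with local memory."""

    def __init__(self):
        self.state = {"step": 0, "acc": 0}  # live state of the running task
        self.local_checkpoint = None        # checkpoint kept in local memory

    def run_steps(self, n):
        # Toy task: a running sum of the step counter.
        for _ in range(n):
            self.state["step"] += 1
            self.state["acc"] += self.state["step"]

    def create_checkpoint(self):
        # The task is paused while the snapshot is written locally.
        self.local_checkpoint = copy.deepcopy(self.state)
        return self.local_checkpoint

    def restart_from(self, checkpoint):
        # Restart operation: reload the task state from a checkpoint.
        self.state = copy.deepcopy(checkpoint)


class MainProcessor:
    """Hypothetical stand-in for the main processor of the hybrid node."""

    def __init__(self):
        self.checkpoint = None

    def receive(self, snapshot):
        # Models the transfer of state data into main-processor memory.
        self.checkpoint = copy.deepcopy(snapshot)


def run_with_checkpoint(acc, host, steps_before, steps_after):
    acc.run_steps(steps_before)          # execute the task
    snapshot = acc.create_checkpoint()   # checkpoint in local memory
    transfer = threading.Thread(target=host.receive, args=(snapshot,))
    transfer.start()                     # transfer begins in the background
    acc.run_steps(steps_after)           # task resumes during the transfer
    transfer.join()                      # wait until the host copy is complete


acc, host = Accelerator(), MainProcessor()
run_with_checkpoint(acc, host, steps_before=3, steps_after=2)
print(host.checkpoint)  # state captured at the checkpoint
print(acc.state)        # state after the task continued
```

Because the task only mutates its live state and never the local snapshot, the background copy reads stable data without pausing the task a second time. On real hardware the same role would typically be played by an asynchronous device-to-host copy, but that mapping is an assumption of this sketch.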

Inventors:
Cher, Chen-Yong
Issue Date:
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1241311
Patent Number(s):
9280383
Application Number:
14/302,921
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
B599858
Resource Type:
Patent
Resource Relation:
Patent File Date: 2014 Jun 12
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Cher, Chen-Yong. Checkpointing for a hybrid computing node. United States: N. p., 2016. Web.
Cher, Chen-Yong. Checkpointing for a hybrid computing node. United States.
Cher, Chen-Yong. 2016. "Checkpointing for a hybrid computing node". United States. https://www.osti.gov/servlets/purl/1241311.
@article{osti_1241311,
title = {Checkpointing for a hybrid computing node},
author = {Cher, Chen-Yong},
abstractNote = {According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.},
url = {https://www.osti.gov/servlets/purl/1241311},
place = {United States},
year = {2016},
month = {3}
}
