DOE Patents
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Local rollback for fault-tolerance in parallel computing systems

Abstract

A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.
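The control flow described in the abstract can be sketched in software. The following Python sketch is illustrative only and is not the patented hardware mechanism: the names `run_interval`, `SoftError`, and `IntervalResult` are hypothetical, a copied dictionary stands in for state buffered in the cache memory device, and an instruction is assumed to return `True` when it causes an unrecoverable condition (for example, a side effect that cannot be undone).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class SoftError(Exception):
    """Hypothetical stand-in for an error detected during the interval."""


# Assumption: an instruction mutates the speculative state and returns True
# if it caused an unrecoverable condition, False otherwise.
Instruction = Callable[[Dict[str, int]], bool]


@dataclass
class IntervalResult:
    committed: bool          # True if the interval finished without error
    attempts: int            # number of times the interval was run
    state: Dict[str, int]    # committed state, or the original state on failure


def run_interval(state: Dict[str, int],
                 instructions: List[Instruction],
                 max_attempts: int = 3) -> IntervalResult:
    """Run one local rollback interval, restarting it on recoverable errors."""
    for attempt in range(1, max_attempts + 1):
        speculative = dict(state)   # buffered copy; `state` is never mutated
        unrecoverable = False
        error = False
        for instr in instructions:
            try:
                if instr(speculative):
                    unrecoverable = True
            except SoftError:
                error = True
                break
        if not error:
            # No error occurred: the buffered writes become the new state.
            return IntervalResult(True, attempt, speculative)
        if unrecoverable:
            # Error plus unrecoverable condition: local rollback is not
            # possible, so the caller must use another recovery path.
            return IntervalResult(False, attempt, dict(state))
        # Error without an unrecoverable condition: discard the buffered
        # writes and restart the interval from the original state.
    return IntervalResult(False, max_attempts, dict(state))
```

In this sketch a restart is cheap because only the buffered copy is discarded. The `max_attempts` bound is an added assumption so that a persistent error cannot cause an endless loop.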

Inventors:
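 Blumrich, Matthias A [1];  Chen, Dong [1];  Gara, Alan [1];  Giampapa, Mark E [1];  Heidelberger, Philip [1];  Ohmacht, Martin [1];  Steinmacher-Burow, Burkhard [2];  Sugavanam, Krishnan [1]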
  1. Yorktown Heights, NY
  2. Boeblingen, DE
Issue Date:
2012 Jan 24
Research Org.:
International Business Machines Corp., Armonk, NY (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1034619
Patent Number(s):
8103910
Application Number:
12/696,780
Assignee:
International Business Machines Corporation (Armonk, NY)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:  
B554331
Resource Type:
Patent
Resource Relation:
Patent File Date: 2010 Jan 29
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Blumrich, Matthias A, Chen, Dong, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Ohmacht, Martin, Steinmacher-Burow, Burkhard, and Sugavanam, Krishnan. Local rollback for fault-tolerance in parallel computing systems. United States: N. p., 2012. Web.
Blumrich, Matthias A, Chen, Dong, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Ohmacht, Martin, Steinmacher-Burow, Burkhard, & Sugavanam, Krishnan. (2012). Local rollback for fault-tolerance in parallel computing systems. United States.
Blumrich, Matthias A, Chen, Dong, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Ohmacht, Martin, Steinmacher-Burow, Burkhard, and Sugavanam, Krishnan. 2012. "Local rollback for fault-tolerance in parallel computing systems". United States. https://www.osti.gov/servlets/purl/1034619.
@article{osti_1034619,
title = {Local rollback for fault-tolerance in parallel computing systems},
author = {Blumrich, Matthias A and Chen, Dong and Gara, Alan and Giampapa, Mark E and Heidelberger, Philip and Ohmacht, Martin and Steinmacher-Burow, Burkhard and Sugavanam, Krishnan},
abstractNote = {A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.},
place = {United States},
year = {2012},
month = jan
}

Works referenced in this record:

Overview of the IBM Blue Gene/P project
journal, January 2008

Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors
journal, September 2005