Local rollback for fault-tolerance in parallel computing systems
Abstract
A control logic device performs a local rollback in a parallel supercomputing system that includes at least one cache memory device. The control logic device determines a local rollback interval and runs at least one instruction in that interval. While the instruction runs, the device evaluates whether an unrecoverable condition occurs during the interval, and it checks whether an error occurs during the local rollback. If an error occurs and no unrecoverable condition occurred during the interval, the device restarts the local rollback interval.
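The control flow in the abstract can be illustrated with a small software sketch. This is not the patented hardware logic; it only models the decision rule (restart the interval when an error is seen and no unrecoverable condition was hit). The names `run_local_rollback_interval`, `error_detected`, `unrecoverable`, and `max_restarts`, the use of a copied dictionary to stand in for speculative cache state, and the "escalate" outcome are all assumptions made for illustration.

```python
import copy
from typing import Callable, Dict, List

State = Dict[str, int]
Instruction = Callable[[State], None]


def run_local_rollback_interval(
    state: State,
    instructions: List[Instruction],
    error_detected: Callable[[], bool],
    unrecoverable: Callable[[], bool],
    max_restarts: int = 3,  # hypothetical retry bound, not from the record
) -> str:
    """Run one rollback interval; return "committed" or "escalate"."""
    for _attempt in range(max_restarts + 1):
        # A private copy stands in for speculative state that the cache
        # would hold until the interval ends (an assumption of this sketch).
        speculative = copy.deepcopy(state)
        hit_unrecoverable = False
        for instr in instructions:
            instr(speculative)
            # Track whether any step made a local rollback impossible.
            hit_unrecoverable = hit_unrecoverable or unrecoverable()
        if not error_detected():
            # No error: publish the interval's results.
            state.clear()
            state.update(speculative)
            return "committed"
        if hit_unrecoverable:
            # Error plus unrecoverable condition: local rollback cannot help.
            return "escalate"
        # Error without an unrecoverable condition: discard the copy
        # and restart the interval from the untouched state.
    return "escalate"
```

In this sketch, a caller that receives "escalate" would fall back to some coarser recovery mechanism; what that mechanism is lies outside what the abstract states.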
- Inventors:
- Yorktown Heights, NY
- Boeblingen, DE
- Issue Date:
- Research Org.:
- International Business Machines Corp., Armonk, NY (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1034619
- Patent Number(s):
- 8103910
- Application Number:
- 12/696,780
- Assignee:
- International Business Machines Corporation (Armonk, NY)
- Patent Classifications (CPCs):
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- B554331
- Resource Type:
- Patent
- Resource Relation:
- Patent File Date: 2010 Jan 29
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Blumrich, Matthias A, Chen, Dong, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Ohmacht, Martin, Steinmacher-Burow, Burkhard, and Sugavanam, Krishnan. Local rollback for fault-tolerance in parallel computing systems. United States: N. p., 2012. Web.
Blumrich, Matthias A, Chen, Dong, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Ohmacht, Martin, Steinmacher-Burow, Burkhard, & Sugavanam, Krishnan. Local rollback for fault-tolerance in parallel computing systems. United States.
Blumrich, Matthias A, Chen, Dong, Gara, Alan, Giampapa, Mark E, Heidelberger, Philip, Ohmacht, Martin, Steinmacher-Burow, Burkhard, and Sugavanam, Krishnan. 2012. "Local rollback for fault-tolerance in parallel computing systems". United States. https://www.osti.gov/servlets/purl/1034619.
@article{osti_1034619,
title = {Local rollback for fault-tolerance in parallel computing systems},
author = {Blumrich, Matthias A and Chen, Dong and Gara, Alan and Giampapa, Mark E and Heidelberger, Philip and Ohmacht, Martin and Steinmacher-Burow, Burkhard and Sugavanam, Krishnan},
abstractNote = {A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.},
place = {United States},
year = {2012},
month = {1}
}
Works referenced in this record:
- Garzarán, María Jesús; Prvulovic, Milos; Llabería, José María. Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. ACM Transactions on Architecture and Code Optimization, Vol. 2, Issue 3 (journal article, September 2005)