U.S. Department of Energy
Office of Scientific and Technical Information

Error recovery in shared memory multiprocessors using private caches

Journal Article · IEEE Transactions on Parallel and Distributed Systems; (USA)
DOI: https://doi.org/10.1109/71.80134 · OSTI ID: 6569614
; ;  [1]
  1. Computer Systems Group, Coordinated Science Lab., Univ. of Illinois, Urbana, IL (US)
This paper examines the problem of recovering from transient processor faults in shared memory multiprocessor systems. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors caused by faulty processors by restarting from the checkpointed computation state. New implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing the loss in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are presented.
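The general idea behind cache-based checkpointing can be illustrated with a small toy model. The sketch below is not the paper's protocol: the abstract gives no implementation details, so the class name `CheckpointingCache`, its methods, and the policies shown are all hypothetical. It assumes a write-back private cache in which shared memory always holds the last checkpointed state, lines modified since the checkpoint stay in the cache, and a checkpoint is forced whenever a modified line would otherwise leave the cache (on eviction, or when another processor requests it). Checkpoint identifiers and recovery stacks are not modeled, and the checkpoint here simply flushes all modified lines, which a real design would likely avoid.

```python
class CheckpointingCache:
    """Toy private write-back cache; dirty lines hold uncommitted state.

    Hypothetical illustration only, not the scheme from the paper.
    """

    def __init__(self, memory, capacity=4):
        self.memory = memory        # shared memory: holds the checkpoint state
        self.capacity = capacity    # maximum number of cached lines
        self.lines = {}             # addr -> cached value
        self.dirty = set()          # addrs modified since the last checkpoint
        self.registers = {}         # live processor state
        self.saved_registers = {}   # register snapshot taken at the checkpoint
        self.checkpoints = 0        # number of checkpoints taken so far

    def checkpoint(self):
        # Commit: write modified lines to memory and snapshot the registers.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()
        self.saved_registers = dict(self.registers)
        self.checkpoints += 1

    def rollback(self):
        # Recover: discard uncommitted lines and restore the registers.
        # Memory is untouched because it never saw uncommitted data.
        for addr in self.dirty:
            del self.lines[addr]
        self.dirty.clear()
        self.registers = dict(self.saved_registers)

    def read(self, addr):
        if addr not in self.lines:
            self._make_room()
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        if addr not in self.lines:
            self._make_room()
        self.lines[addr] = value
        self.dirty.add(addr)

    def snoop_read(self, addr):
        # Another processor wants this line. Checkpoint first so the reader
        # only ever sees committed data and never has to roll back with us.
        if addr in self.dirty:
            self.checkpoint()
        return self.memory.get(addr, 0)

    def _make_room(self):
        if len(self.lines) < self.capacity:
            return
        clean = [a for a in self.lines if a not in self.dirty]
        if not clean:
            # Every line is dirty: commit before anything can be evicted.
            self.checkpoint()
            clean = list(self.lines)
        del self.lines[clean[0]]
```

In this model a rollback only invalidates local cache lines and reloads a register snapshot, which is one way to read the abstract's claims of rapid recovery and no rollback propagation; the cost is the extra checkpoints forced by evictions and remote reads.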
OSTI ID:
6569614
Journal Information:
Journal Name: IEEE Transactions on Parallel and Distributed Systems; (USA); Vol. 1:2; ISSN ITDSE; ISSN 1045-9219
Country of Publication:
United States
Language:
English