Error recovery in shared memory multiprocessors using private caches
Journal Article
· IEEE Transactions on Parallel and Distributed Systems; (USA)
- Computer Systems Group, Coordinated Science Lab., Univ. of Illinois, Urbana, IL (US)
This paper examines the problem of recovering from processor transient faults in shared memory multiprocessor systems. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. New implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides for rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
- OSTI ID:
- 6569614
- Journal Information:
- Journal Name: IEEE Transactions on Parallel and Distributed Systems; (USA); Vol. 1:2; ISSN ITDSE; ISSN 1045-9219
- Country of Publication:
- United States
- Language:
- English
Similar Records
Cache-based error recovery for shared memory multiprocessor system
Conference
·
Sat Oct 31 23:00:00 EST 1987
·
OSTI ID:5576304
Error recovery in parallel systems of pipelined processors with caches
Conference
·
Fri Dec 30 23:00:00 EST 1994
·
OSTI ID:98916
Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems
Technical Report
·
Fri May 08 00:00:00 EDT 1992
·
OSTI ID:7026260