Cache-based error recovery for shared memory multiprocessor system
Conference · OSTI ID: 5576304
The problem of recovering from processor failures in shared memory multiprocessor systems is examined. A cache-based checkpointing scheme is developed whose checkpointing algorithm guarantees that a consistent global state is always maintained. Processes can recover from errors caused by a faulty processor by restarting from the saved consistent computation state. Checkpoints do not propagate: when a process p takes a checkpoint, no other process is forced to join p in the checkpoint. The recovery algorithm allows only those processes that encounter errors to perform rollback recovery, while unaffected processes on fault-free processors continue normal execution. The checkpointing and recovery schemes are shown to be easily integrated into standard bus-based cache coherence protocols. An analytical model is used to estimate the checkpointing frequency and the performance degradation incurred by the checkpointing scheme during normal execution.
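As a rough illustration of the general idea behind cache-based checkpointing, the toy model below treats main memory as holding only checkpointed state, while each processor's dirty cache lines hold everything written since its last checkpoint. This is a sketch under stated assumptions, not the algorithm from the record: the class `Processor`, its methods, and in particular the rule that a processor checkpoints before another processor reads one of its dirty lines are hypothetical choices made here to show one way a process could checkpoint and roll back without involving any other process.

```python
class Processor:
    """Toy model (hypothetical): dirty cache lines hold state written since the last checkpoint."""

    def __init__(self, memory):
        self.memory = memory        # shared main memory; assumed to hold checkpointed state only
        self.dirty = {}             # address -> value written since the last checkpoint
        self.registers = {}         # live register state
        self.saved_registers = {}   # register snapshot taken at the last checkpoint

    def write(self, addr, value):
        # Writes stay in the private cache until the next checkpoint.
        self.dirty[addr] = value

    def read(self, addr):
        # Prefer the local dirty copy, otherwise fall back to checkpointed memory.
        return self.dirty.get(addr, self.memory.get(addr))

    def checkpoint(self):
        # Commit dirty lines to memory and snapshot the registers.
        self.memory.update(self.dirty)
        self.dirty.clear()
        self.saved_registers = dict(self.registers)

    def rollback(self):
        # Discard uncommitted writes and restore the last register snapshot.
        self.dirty.clear()
        self.registers = dict(self.saved_registers)

    def serve_remote_read(self, addr):
        # Assumption: checkpoint before exposing a dirty line, so the reader
        # never depends on state that a later rollback could undo.
        if addr in self.dirty:
            self.checkpoint()
        return self.memory.get(addr)
```

In this sketch, calling `rollback()` on one `Processor` touches only its own dirty lines and registers, so other processors sharing the same `memory` dictionary continue unaffected.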
- Research Organization: Illinois Univ., Urbana (USA)
- Report Number(s): N-88-11398; NASA-CR-181470; NAS-1.26:181470; CONF-8706255-
- Country of Publication: United States
- Language: English
Similar Records
- Error recovery in shared memory multiprocessors using private caches
  Journal Article · Sat Mar 31 23:00:00 EST 1990 · IEEE Transactions on Parallel and Distributed Systems; (USA) · OSTI ID: 6569614
- Error recovery in parallel systems of pipelined processors with caches
  Conference · Fri Dec 30 23:00:00 EST 1994 · OSTI ID: 98916
- Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems
  Technical Report · Fri May 08 00:00:00 EDT 1992 · OSTI ID: 7026260