OSTI.GOV · U.S. Department of Energy, Office of Scientific and Technical Information

Title: Cache-based error recovery for shared memory multiprocessor system

Conference · OSTI ID: 5576304

The problem of recovering from processor failures in shared memory multiprocessor systems is examined. A cache-based checkpointing scheme is developed, using a checkpointing algorithm that guarantees a consistent global state is always maintained. Processes can recover from errors caused by a faulty processor by restarting from the saved consistent computation state. Checkpoint propagation poses no difficulty: when a process p takes a checkpoint, no other process is forced to join p in the checkpoint. The recovery algorithm requires only those processes that encounter errors to perform rollback recovery, while unaffected processes on fault-free processors continue normal execution. The checkpointing and recovery schemes are shown to integrate easily into standard bus-based cache coherence protocols. An analytical model is used to estimate the checkpointing frequency and the performance degradation incurred by the checkpointing scheme during normal execution.
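
The two properties stated in the abstract (a checkpoint by p does not force any other process to checkpoint, and only the failed process rolls back) can be illustrated with a toy model. The sketch below is not the authors' algorithm. It assumes that a private write-back cache buffers all writes made since the last checkpoint, and that a processor checkpoints just before one of its dirty values would become visible to another processor. That trigger condition, and the names `Memory`, `Processor`, `regs`, and `peers`, are hypothetical and chosen only for illustration.

```python
class Memory:
    """Shared memory. In this toy model it only ever holds checkpointed data."""
    def __init__(self):
        self.data = {}


class Processor:
    """Toy processor with a private write-back cache (hypothetical model)."""
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        self.dirty = {}        # addr -> value written since the last checkpoint
        self.regs = 0          # stand-in for the processor's register state
        self.saved_regs = 0    # register state captured at the last checkpoint
        self.checkpoints = 0   # number of checkpoints taken so far

    def write(self, addr, value):
        # Writes stay in the private cache until the next checkpoint.
        self.dirty[addr] = value
        self.regs += 1

    def read(self, addr, peers=()):
        if addr in self.dirty:
            return self.dirty[addr]
        # Assumed trigger: if another cache owns a dirty copy, that owner
        # checkpoints before its value becomes visible to this processor.
        for p in peers:
            if p is not self and addr in p.dirty:
                p.checkpoint()
        return self.memory.data.get(addr)

    def checkpoint(self):
        # Local action only: flush dirty lines and save registers.
        self.memory.data.update(self.dirty)
        self.dirty.clear()
        self.saved_regs = self.regs
        self.checkpoints += 1

    def rollback(self):
        # Discard uncheckpointed writes and restore the saved registers.
        self.dirty.clear()
        self.regs = self.saved_regs


if __name__ == "__main__":
    mem = Memory()
    p, q = Processor("p", mem), Processor("q", mem)
    p.write(0x10, 1)
    q.write(0x20, 2)
    q.read(0x10, peers=[p, q])   # forces p, and only p, to checkpoint
    q.rollback()                 # q fails and recovers; p is unaffected
    print(p.checkpoints, q.checkpoints, mem.data)
```

In this model, shared memory never contains a value that its writer could later undo, so a rollback by one processor cannot invalidate anything another processor has read. That is why no other process needs to join a checkpoint or a rollback here. How the actual scheme maps onto bus-based coherence protocol states is not given in the abstract.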

Research Organization: Illinois Univ., Urbana (USA)
OSTI ID: 5576304
Report Number(s): N-88-11398; NASA-CR-181470; NAS-1.26:181470; CONF-8706255-
Resource Relation: Conference: FTCS 18 conference, Tokyo, Japan, 27 Jun 1987
Country of Publication: United States
Language: English