U.S. Department of Energy
Office of Scientific and Technical Information

Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems

Technical Report · OSTI ID:7026260
This paper describes a checkpoint comparison and optimistic execution technique for error detection and recovery in distributed and parallel systems. The approach is based on lookahead execution and rollback validation. It uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. Two schemes derived from this strategy are analyzed and compared with triplication and voting, and with two common backward recovery methods. The impact of checkpoint time, checkpoint validation time, and process restart time is also examined. An implementation on a Sun NFS network with six benchmark programs is presented. Compared with classic checkpointing and rollback techniques, our strategy provides rapid recovery and requires, on average, fewer processors than standard replication and voting methods. This strategy is useful in systems where spare processors are available at the time of recovery.
Keywords: fault-tolerant computing, checkpointing, error detection, error recovery.
Research Organization:
Illinois Univ., Urbana, IL (United States). Coordinated Science Lab.
OSTI ID:
7026260
Report Number(s):
AD-A-251925/4/XAB; CNN: N00014-91-J-1283
Country of Publication:
United States
Language:
English

Similar Records

Error recovery in shared memory multiprocessors using private caches
Journal Article · Sat Mar 31 23:00:00 EST 1990 · IEEE Transactions on Parallel and Distributed Systems; (USA) · OSTI ID:6569614

Cache-based error recovery for shared memory multiprocessor system
Conference · Sat Oct 31 23:00:00 EST 1987 · OSTI ID:5576304

Coping with silent and fail-stop errors at scale by combining replication and checkpointing
Journal Article · Fri Nov 30 23:00:00 EST 2018 · Journal of Parallel and Distributed Computing · OSTI ID:1475194