
Title: Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems

Technical Report
OSTI ID:7026260

This paper describes a checkpoint comparison and optimistic execution technique for error detection and recovery in distributed and parallel systems. The approach is based on lookahead execution and rollback validation. It uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. Two schemes derived from this strategy are analyzed and compared with triplication and voting, and with two common backward recovery methods. The impact of checkpoint time, checkpoint validation time, and process restart time is also examined. An implementation on a Sun NFS network with six benchmark programs is presented. Compared with classic checkpointing and rollback techniques, our strategy provides rapid recovery and requires, on average, fewer processors than standard replication and voting methods. This strategy is useful in systems where spare processors are available at the time of recovery.

Keywords: fault-tolerant computing, checkpointing, error detection, error recovery.
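
The detection-and-recovery idea in the abstract (duplicated tasks, checkpoint comparison, and a validation run from the last agreed checkpoint) can be illustrated with a minimal sequential sketch. This is not the report's implementation: the names `digest`, `run_interval`, `checkpoint_interval`, the `fault_in` switch, and the `"acc"` state key are all hypothetical, and the concurrent lookahead execution that the report relies on is not modeled here.

```python
import hashlib
import pickle


def digest(state):
    """Fingerprint a checkpoint so replicas can be compared cheaply."""
    return hashlib.sha256(pickle.dumps(sorted(state.items()))).hexdigest()


def run_interval(step, state, faulty=False):
    """Run one checkpoint interval on a copy of the given state.

    `faulty` is a hypothetical fault-injection switch that corrupts
    the "acc" field to imitate a transient processor error.
    """
    new_state = step(dict(state))
    if faulty:
        new_state["acc"] += 1  # injected error (illustration only)
    return new_state


def checkpoint_interval(step, committed, fault_in=None):
    """Execute one interval on two replicas and compare checkpoints.

    On a mismatch, a third run (standing in for a spare processor)
    re-executes from the last committed checkpoint to decide which
    replica's checkpoint is valid.
    """
    a = run_interval(step, committed, faulty=(fault_in == "A"))
    b = run_interval(step, committed, faulty=(fault_in == "B"))
    if digest(a) == digest(b):
        return a, "match"

    # Mismatch detected: validate against a fresh run from the old checkpoint.
    v = run_interval(step, committed)
    if digest(v) == digest(a):
        return a, "recovered:A"
    if digest(v) == digest(b):
        return b, "recovered:B"
    # No two runs agree: fall back to the last committed checkpoint.
    return dict(committed), "rollback"


def step(s):
    """Toy deterministic task: accumulate a running sum."""
    s["acc"] += s["i"]
    s["i"] += 1
    return s


start = {"acc": 0, "i": 1}
clean_state, clean_status = checkpoint_interval(step, start)
fixed_state, fixed_status = checkpoint_interval(step, start, fault_in="A")
```

In this sketch the validation run happens after the mismatch is found; in an optimistic scheme of the kind the abstract describes, the replicas would keep executing ahead while validation proceeds, which is where spare processors become useful.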

Research Organization:
Illinois Univ., Urbana, IL (United States). Coordinated Science Lab.
Report Number(s):
AD-A-251925/4/XAB; CNN: N00014-91-J-1283
Country of Publication:
United States
Language:
English