Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems
This paper describes a checkpoint comparison and optimistic execution technique for error detection and recovery in distributed and parallel systems. The approach is based on lookahead execution and rollback validation. It uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. Two schemes derived from this strategy are analyzed and compared with triplication and voting, and with two common backward recovery methods. The impact of checkpoint time, checkpoint validation time, and process restart time is also examined. An implementation on a Sun NFS network with six benchmark programs is presented. Compared with classic checkpointing and rollback techniques, the strategy provides rapid recovery, and it requires, on average, fewer processors than standard replication and voting methods. The strategy is useful in systems where spare processors are available at the time of recovery.
Keywords: fault-tolerant computing, checkpointing, error detection, error recovery.
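The abstract's core idea, detecting errors by comparing checkpoints from replicated tasks and using a spare processor to decide which replica is correct, can be illustrated with a small sketch. This is not the report's implementation: the function names (`checkpoint_digest`, `run_interval`, `duplex_with_spare`), the use of a hash over a pickled state, and the injected `faults` are all assumptions made for illustration, and the sketch omits lookahead execution, in which replicas would keep running ahead while the spare validates.

```python
import hashlib
import pickle


def checkpoint_digest(state):
    """Hash a serialized checkpoint so replicas can be compared cheaply.

    Assumption: the state pickles deterministically (e.g. ints, tuples).
    """
    return hashlib.sha256(pickle.dumps(state)).hexdigest()


def run_interval(step, state, fault=None):
    """Run one checkpoint interval; `fault` (hypothetical) corrupts the result."""
    new_state = step(state)
    return fault(new_state) if fault else new_state


def duplex_with_spare(step, state, faults=(None, None)):
    """Run two replicas, compare their checkpoints, use a spare on mismatch.

    Returns (validated_state, error_detected).
    """
    a = run_interval(step, state, faults[0])
    b = run_interval(step, state, faults[1])
    if checkpoint_digest(a) == checkpoint_digest(b):
        return a, False  # checkpoints agree: no error detected

    # Mismatch: re-execute the interval on a spare processor, starting
    # from the last validated checkpoint, and keep the replica it matches.
    c = run_interval(step, state)
    for candidate in (a, b):
        if checkpoint_digest(candidate) == checkpoint_digest(c):
            return candidate, True
    raise RuntimeError("no two checkpoints agree; roll back to last checkpoint")


if __name__ == "__main__":
    step = lambda s: s + 1
    print(duplex_with_spare(step, 0))
    print(duplex_with_spare(step, 0, faults=(lambda s: s + 100, None)))
```

In the fault-free case only two processors are used, and the third is needed only when the checkpoints disagree, which is consistent with the abstract's claim of needing fewer processors on average than triplication and voting.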
- Research Organization: Illinois Univ., Urbana, IL (United States). Coordinated Science Lab.
- OSTI ID: 7026260
- Report Number(s): AD-A-251925/4/XAB; CNN: N00014-91-J-1283
- Country of Publication: United States
- Language: English
Similar Records
Error recovery in shared memory multiprocessors using private caches
Cache-based error recovery for shared memory multiprocessor system
Related Subjects
DISTRIBUTED DATA PROCESSING
ERRORS
FAULT TOLERANT COMPUTERS
PARALLEL PROCESSING
COMPARATIVE EVALUATIONS
COMPUTER NETWORKS
DETECTION
OPTIMIZATION
STANDARDS
VALIDATION
COMPUTERS
DATA PROCESSING
DIGITAL COMPUTERS
EVALUATION
PROCESSING
PROGRAMMING
TESTING
990200* - Mathematics & Computers