HPC application fault-tolerance using transparent redundant computation.
Conference · OSTI ID:971418
As the core counts of HPC machines continue to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques for ensuring progress across faults, such as coordinated checkpoint-restart, are unsuitable for machines of this scale because of their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability that uses transparent, rank-level redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.
- Research Organization: Sandia National Laboratories
- Sponsoring Organization: USDOE
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 971418
- Report Number(s): SAND2009-5267C
- Country of Publication: United States
- Language: English
Similar Records
- rMPI · Software · Aug 24, 2010 · OSTI ID:1231486
- rMPI : increasing fault resiliency in a message-passing environment. · Technical Report · Apr 1, 2011 · OSTI ID:1012733
- Fault-tolerance for exascale systems. · Conference · Aug 1, 2010 · OSTI ID:1028416