
Title: HPC application fault-tolerance using transparent redundant computation.

Conference · OSTI ID: 971418

As the core counts of HPC machines continue to grow, fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques for ensuring progress across faults, such as coordinated checkpoint-restart, are unsuitable for machines of this scale because of their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability that uses transparent, rank-level redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

Research Organization:
Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC04-94AL85000
OSTI ID:
971418
Report Number(s):
SAND2009-5267C; TRN: US201004%%5
Resource Relation:
Conference: Proposed for presentation at the International Conference for High Performance Computing, Networking, Storage, and Analysis held November 14-20, 2009 in Portland, OR.
Country of Publication:
United States
Language:
English