A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models
Recent trends in high-performance computing point toward increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. For long-running applications, the ability to recover efficiently from frequent failures is therefore essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. We present results from a computational chemistry application running at scale to show that our techniques provide applications with a high degree of fault tolerance at low overhead (2% to 4%) on 2048 processors.
- Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1010472
- Report Number(s): PNNL-SA-75835; KJ0402000; TRN: US201107%%174
- Resource Relation: Conference: Proceedings of the 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2011), February 9-11, 2011, Ayia Napa, Cyprus, pp. 24-31
- Country of Publication: United States
- Language: English
Similar Records
2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.
Fault-tolerance for exascale systems.