A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models

Ali, Nawab; Krishnamoorthy, Sriram; Govind, Niranjan; Palmer, B J

Conference · February 9, 2011

OSTI ID:1010472

Recent trends in high-performance computing point towards increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean-time-between-failures (MTBF) ranging from a few days to a few hours. As such, for long-running applications, the ability to efficiently recover from frequent failures is essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. We present results from a computational chemistry application running at scale to show that our techniques provide applications with a high degree of fault tolerance and low (2%–4%) overhead for 2048 processors.

Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-76RL01830

Report Number(s): PNNL-SA-75835; KJ0402000; TRN: US201107%%174

Resource Relation: Conference: Proceedings of the 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2011), February 9-11, 2011, Ayia Napa, Cyprus, 24-31

Country of Publication: United States

Language: English

Similar Records

Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples

Journal Article · January 11, 2011 · Journal of Chemical Theory and Computation, 7(1):66-75 · OSTI ID:1010472

van Dam, Hubertus JJ; Vishnu, Abhinav; De Jong, Wibe A

2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.

Technical Report · February 1, 2009 · OSTI ID:1010472

Katz, D S; Daly, J; DeBardeleben, N; +12 more

Fault-tolerance for exascale systems.

Conference · August 1, 2010 · OSTI ID:1010472

Riesen, Rolf E; Varela, Maria Ruiz; Ferreira, Kurt Brian

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
CHEMISTRY
COMMUNICATIONS
PERFORMANCE
PROCESSING
PROGRAMMING
RELIABILITY
STORAGE
TOLERANCE
Fault tolerance
NWChem
