
Title: Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples

Journal Article · Journal of Chemical Theory and Computation, 7(1):66-75
DOI: https://doi.org/10.1021/ct100439u · OSTI ID: 1002163

In the last couple of decades, the massive computational power provided by the most modern supercomputers has enabled the simulation of higher-order computational chemistry methods previously considered intractable. As system sizes continue to increase, the computational chemistry domain continues to escalate this trend using parallel computing with programming models such as the Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models such as Global Arrays. The ever-increasing scale of these supercomputers comes at the cost of a reduced mean time between failures (MTBF), currently on the order of days and projected to be on the order of hours for upcoming extreme-scale systems. While traditional disk-based checkpointing methods are ubiquitous for storing intermediate solutions, they suffer from the high overhead of writing and recovering from checkpoints. In practice, checkpointing itself often brings the system down. Clearly, methods beyond checkpointing are imperative for handling the worsening problem of decreasing MTBF. In this paper, we address this challenge by designing and implementing an efficient fault-tolerant version of the coupled cluster method with NWChem, using in-memory data redundancy. We present the challenges associated with our design, including an efficient data storage model, maintenance of at least one consistent data copy, and the recovery process. Our performance evaluation without faults shows that the current design exhibits negligible overhead. In the presence of a fault, the proposed design incurs negligible overhead in comparison to the state-of-the-art implementation without faults.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1002163
Report Number(s):
PNNL-SA-74406; KP1704020; TRN: US201102%%615
Journal Information:
Journal of Chemical Theory and Computation, Vol. 7, Issue 1, pp. 66-75
Country of Publication:
United States
Language:
English