Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples
In the last couple of decades, the massive computational power provided by modern supercomputers has enabled simulations with higher-order computational chemistry methods previously considered intractable. As system sizes continue to increase, the computational chemistry domain continues to escalate this trend using parallel computing with programming models such as the Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models such as Global Arrays. The ever-increasing scale of these supercomputers comes at the cost of a reduced mean time between failures (MTBF), currently on the order of days and projected to be on the order of hours for upcoming extreme-scale systems. While traditional disk-based checkpointing methods are ubiquitous for storing intermediate solutions, they suffer from the high overhead of writing and recovering from checkpoints. In practice, checkpointing itself often brings the system down. Clearly, methods beyond checkpointing are imperative for handling the worsening problem of decreasing MTBF. In this paper, we address this challenge by designing and implementing an efficient fault-tolerant version of the coupled cluster method with NWChem, using in-memory data redundancy. We present the challenges associated with our design, including an efficient data storage model, maintenance of at least one consistent data copy, and the recovery process. Our performance evaluation without faults shows that the current design exhibits negligible overhead. In the presence of a fault, the proposed design incurs negligible overhead in comparison to the state-of-the-art implementation without faults.
- Research Organization:
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1002163
- Report Number(s):
- PNNL-SA-74406; KP1704020; TRN: US201102%%615
- Journal Information:
- Journal of Chemical Theory and Computation, Vol. 7, Issue 1, pp. 66-75
- Country of Publication:
- United States
- Language:
- English
Similar Records
Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
Scalable Heterogeneous Execution of a Coupled-Cluster Model with Perturbative Triples