skip to main content

SciTech ConnectSciTech Connect

Title: A Case for Soft Error Detection and Correction in Computational Chemistry

High performance computing platforms are expected to deliver 10(18) floating operations per second by the year 2022 through the deployment of millions of cores. Even if every core is highly reliable the sheer number of the them will mean that the mean time between failures will become so short that most applications runs will suffer at least one fault. In particular soft errors caused by intermittent incorrect behavior of the hardware are a concern as they lead to silent data corruption. In this paper we investigate the impact of soft errors on optimization algorithms using Hartree-Fock as a particular example. Optimization algorithms iteratively reduce the error in the initial guess to reach the intended solution. Therefore they may intuitively appear to be resilient to soft errors. Our results show that this is true for soft errors of small magnitudes but not for large errors. We suggest error detection and correction mechanisms for different classes of data structures. The results obtained with these mechanisms indicate that we can correct more than 95% of the soft errors at moderate increases in the computational cost.
Authors:
; ;
Publication Date:
OSTI Identifier:
1094941
Report Number(s):
PNNL-SA-96159
DOE Contract Number:
AC05-76RL01830
Resource Type:
Journal Article
Resource Relation:
Journal Name: Journal of Chemical Theory and Computation, 9(9):3995-4005
Research Org:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
Computational chemistry; high performance computing; Hartree-Fock