Title: New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

Conference

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors in general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation and allows adaptive control of the checksum overhead. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme.
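
The abstract does not spell out the New-Sum encoding itself, so the following is only a minimal sketch of the general idea it builds on: a classic column-sum checksum for matrix-vector multiplication, combined with a trivial per-iteration checkpoint and rollback. It is not the paper's scheme. The function names (`column_checksum`, `checksum_ok`, `protected_solve`), the Richardson-style update, the step size `omega`, the tolerance, and the fault-injection hook `fault_at` are all assumptions made for illustration. The sketch relies on the invariant sum(A x) = (1^T A) x, and as written it only catches corruption of the product vector, not of the matrix or the input vector.

```python
def column_checksum(A):
    """Return c = 1^T A, the vector of column sums of A (computed once)."""
    n_cols = len(A[0])
    return [sum(row[j] for row in A) for j in range(n_cols)]


def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def checksum_ok(c, x, y, tol=1e-9):
    """Check the invariant sum(y) == c . x up to a relative tolerance."""
    expected = sum(cj * xj for cj, xj in zip(c, x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return abs(expected - actual) <= tol * scale


def protected_solve(A, b, x0, steps, omega=0.2, fault_at=None):
    """Iterate x <- x + omega * (b - A x), verifying each product.

    fault_at is a hypothetical test hook: at that iteration the first
    product is corrupted once to simulate a soft error.
    """
    c = column_checksum(A)            # encoding, done before iterating
    x = list(x0)
    detected = 0
    for k in range(steps):
        checkpoint = list(x)          # trivial per-iteration checkpoint
        inject = (k == fault_at)
        while True:
            y = matvec(A, x)
            if inject:
                y[0] += 1.0           # simulated soft error in the product
                inject = False
            if checksum_ok(c, x, y):
                break
            detected += 1
            x = list(checkpoint)      # roll back, then recompute
        x = [xi + omega * (bi - yi) for xi, bi, yi in zip(x, b, y)]
    return x, detected


if __name__ == "__main__":
    # Small diagonally dominant system whose exact solution is [1, 2, 3].
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [6.0, 12.0, 14.0]
    x, detected = protected_solve(A, b, [0.0, 0.0, 0.0], 100, fault_at=5)
    print(x, detected)
```

In this sketch the verification costs one extra dot product per iteration, and the checksum vector is computed once up front. A practical design would checkpoint less often and would also need to guard the matrix and vectors held in memory, which this example does not attempt.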

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1322529
Report Number(s):
PNNL-SA-117061; KJ0403000
Resource Relation:
Conference: Proceedings of the 25th ACM International Symposium on High-Performance and Distributed Computing (HPDC 2016), May 31-June 4, 2016, Kyoto, Japan, pp. 43-55
Country of Publication:
United States
Language:
English

Similar Records

Resiliency in numerical algorithm design for extreme scale simulations
Journal Article · Dec 10, 2021 · International Journal of High Performance Computing Applications · OSTI ID:1322529

Algorithm-Based Fault Tolerance for Convolutional Neural Networks
Journal Article · Dec 31, 2020 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1322529

Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms
Conference · Dec 1, 2018 · OSTI ID:1322529