New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors in general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation and allows adaptive control of the checksum overhead. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme.
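The abstract does not spell out the New-Sum encoding itself, so the following is only a minimal sketch of the classical checksum idea that ABFT schemes for matrix-vector multiplication build on, not the authors' method. It assumes an all-ones checksum vector c and checks the identity (c^T A) x = c^T (A x) after computing y = A x. The function names `matvec`, `column_checksum`, and `verify`, and the tolerance `tol`, are hypothetical and chosen for illustration.

```python
def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def column_checksum(A):
    """Checksum row c^T A for the all-ones vector c (assumed encoding)."""
    return [sum(col) for col in zip(*A)]


def verify(checksum_row, x, y, tol=1e-9):
    """Return True when (c^T A) x matches c^T y within a relative tolerance."""
    expected = sum(c * xi for c, xi in zip(checksum_row, x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return abs(expected - actual) <= tol * scale


# Small illustrative system (values are made up for the example).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]

checksum_row = column_checksum(A)  # computed once, before iterating
y = matvec(A, x)
ok_clean = verify(checksum_row, x, y)

# Simulate a soft error corrupting one entry of the result.
y_bad = list(y)
y_bad[1] += 0.5
ok_faulty = verify(checksum_row, x, y_bad)
```

In an iterative solver, a failed check of this kind would be the point at which a checkpoint/rollback scheme restores the last verified state and repeats the affected iterations.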
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1322529
- Report Number(s): PNNL-SA-117061; KJ0403000
- Country of Publication: United States
- Language: English
Similar Records
Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms
Conference · Fri Nov 30 23:00:00 EST 2018 · OSTI ID: 1493144
Algorithm-Based Fault Tolerance for Convolutional Neural Networks
Journal Article · Wed Dec 30 19:00:00 EST 2020 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID: 1775093
Checksumming strategies for data in volatile memories
Conference · Tue Sep 09 00:00:00 EDT 2014 · OSTI ID: 1236931