SciTech Connect

Title: New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme.
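For readers unfamiliar with checksum-based ABFT, the following is a minimal sketch of the classic checksum invariant for a matrix-vector product y = Ax: the sum of the entries of y should equal (1^T A) x. It is not the New-Sum encoding itself, which the abstract does not specify; the function names (`column_checksum`, `matvec`, `checksum_ok`), the tolerance, and the sample data are all assumptions made for illustration.

```python
import math


def column_checksum(A):
    """Return c with c[j] = sum_i A[i][j], i.e. c = 1^T A (computed once per matrix)."""
    return [math.fsum(col) for col in zip(*A)]


def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [math.fsum(a * b for a, b in zip(row, x)) for row in A]


def checksum_ok(c, x, y, rel_tol=1e-9):
    """Check the invariant sum(y) == c . x up to a relative tolerance.

    rel_tol is an assumed threshold; a real solver would tune it to
    separate round-off from genuine soft errors.
    """
    expected = math.fsum(cj * xj for cj, xj in zip(c, x))
    actual = math.fsum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return abs(expected - actual) <= rel_tol * scale


# Hypothetical sample data, chosen only to exercise the check.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]

c = column_checksum(A)
y = matvec(A, x)
clean = checksum_ok(c, x, y)          # error-free product passes the check

y_bad = list(y)
y_bad[1] += 1e-3                      # simulate a soft error in one output entry
corrupted = checksum_ok(c, x, y_bad)  # corrupted product fails the check
```

In an iterative method, a failed check of this kind would typically trigger a rollback to the last checkpoint, which is the role the abstract assigns to the checkpoint/rollback scheme.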
Authors:
; ; ; ; ; ; ;
Publication Date:
OSTI Identifier:
1322529
Report Number(s):
PNNL-SA-117061
KJ0403000
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 25th ACM International Symposium on High-Performance and Distributed Computing (HPDC 2016), May 31-June 4, 2016, Kyoto, Japan, 43-55
Publisher:
ACM, NEW YORK, New York
Research Org:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
SMT; Memory Hierarchy; Instantaneous Footprint; SMTAware Optimization; Locality; Performance Tools