Algorithm-Based Fault Tolerance for Convolutional Neural Networks
Journal Article · IEEE Transactions on Parallel and Distributed Systems
- Univ. of California, Riverside, CA (United States)
- Argonne National Lab. (ANL), Lemont, IL (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Convolutional neural networks (CNNs) are becoming increasingly important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Ensuring the stability of the CNN inference process against soft errors is therefore of critical importance. Traditional fault tolerance methods are not suitable for CNN inference: error-correcting code cannot protect computational components, instruction duplication incurs high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on protecting the CNN inference process against soft errors as efficiently as possible, making three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and thoroughly analyze their fault protection ability and runtime. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain high detection/correction ability with limited total runtime overhead. (3) We evaluate our approach on ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation handles soft errors with very limited runtime overhead (4% to 8% in both error-free and error-injected situations).
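The core invariant behind checksum-based ABFT for convolution can be sketched briefly: because convolution is linear in the filter weights, the sum of the per-filter output channels must equal a single convolution of the input with the element-wise sum of all filters; any mismatch signals a soft error. The following is a minimal illustrative sketch of that filter-checksum idea, not the authors' implementation (the function names and the naive convolution loop are our own):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Naive "valid" 2D cross-correlation, single input channel.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def abft_conv(image, filters):
    # Compute one output channel per filter.
    outputs = np.stack([conv2d_valid(image, f) for f in filters])
    # Checksum filter: element-wise sum of all filters.
    checksum_out = conv2d_valid(image, filters.sum(axis=0))
    # By linearity, the channel-wise sum of the outputs must match
    # the convolution with the checksum filter; a mismatch flags an error.
    ok = np.allclose(outputs.sum(axis=0), checksum_out)
    return outputs, ok

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
filters = rng.standard_normal((4, 3, 3))
outs, ok = abft_conv(img, filters)   # ok is True in the error-free case
```

Detection here rests on a single checksum equality; the paper's full schemes build additional checksums and a correction workflow on top of invariants of this kind, independent of how the convolution itself is implemented.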
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States). Laboratory Computing Resource Center (LCRC)
- Sponsoring Organization:
- National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC)
- Grant/Contract Number:
- AC02-06CH11357
- OSTI ID:
- 1775093
- Journal Information:
- IEEE Transactions on Parallel and Distributed Systems, Vol. 32, Issue 7; ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
References
- ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators | conference | June 2018
- Bit-Flip Attack: Crushing Neural Network With Progressive Bit Search | conference | October 2019
- Algorithm-Based Fault Tolerance for Matrix Operations | journal | June 1984
- Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods | conference | January 2013
- Silent Data Corruption Resilient Two-sided Matrix Factorizations | conference | January 2017
- MaxNVM | conference | October 2019
- An Efficient Bit-Flip Resilience Optimization Method for Deep Neural Networks | conference | March 2019
- cuDNN: Efficient Primitives for Deep Learning | preprint | January 2014
- An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software | preprint | January 2017
- Deep Learning in Drug Discovery | journal | December 2015
- Artificial convolution neural network for medical image pattern recognition | journal | January 1995
- A versatile method of discrete convolution and FFT (DC-FFT) for contact analyses | journal | August 2000
- A survey of power and energy efficient techniques for high performance numerical linear algebra operations | journal | December 2014
- ImageNet: A large-scale hierarchical image database | conference | June 2009
- Fast Algorithms for Convolutional Neural Networks | conference | June 2016
- Deep Residual Learning for Image Recognition | conference | June 2016
- YOLO9000: Better, Faster, Stronger | conference | July 2017
- A practical characterization of a NASA SpaceCube application through fault emulation and laser testing | conference | June 2013
- Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing | conference | May 2015
- Uncertainty Estimation for Deep Neural Object Detectors in Safety-Critical Applications | conference | November 2018
- Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks | journal | January 2017
- Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer | conference | November 2016
- Fault Tolerant One-sided Matrix Decompositions on Heterogeneous Systems with GPUs | conference | November 2018
- Selective replication | journal | December 2009
- Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach | conference | January 2013
- DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning | conference | January 2014
- New-Sum: A Novel Online ABFT Scheme For General Iterative Methods | conference | January 2016
- Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra | conference | May 2016
- Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory | conference | May 2016
- Correcting soft errors online in fast fourier transform | conference | January 2017
- Understanding error propagation in deep learning neural network (DNN) accelerators and applications | conference | November 2017
- Ares | conference | June 2018
- Improving performance of iterative methods by lossy checkpointing | conference | January 2018
- FT-iSort | conference | November 2019
- DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression | conference | June 2019
- Dris-3 | conference | June 2019
- Sensitivity based Error Resilient Techniques for Energy Efficient Deep Neural Network Accelerators | conference | June 2019
- Building Robust Machine Learning Systems | conference | June 2019
- Tsm2 | conference | June 2019
- Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity | conference | August 2020
- Addressing failures in exascale computing | journal | March 2014
- CANDLE/Supervisor: a workflow framework for machine learning applied to cancer research | journal | December 2018
- Neural Network Methods for Natural Language Processing | journal | April 2017
Similar Records
- New-Sum: A Novel Online ABFT Scheme For General Iterative Methods · Conference · 2016 · OSTI ID: 1322529
- Parallel reduction to hessenberg form with algorithm-based fault tolerance · Conference · 2013 · 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) · OSTI ID: 1567343
- In-Place Zero-Space Memory Protection for CNN · Conference · 2019 · OSTI ID: 1606858