U.S. Department of Energy
Office of Scientific and Technical Information

Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Journal Article · IEEE Transactions on Parallel and Distributed Systems
  1. Univ. of California, Riverside, CA (United States)
  2. Argonne National Lab. (ANL), Lemont, IL (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Convolutional neural networks (CNNs) are becoming increasingly important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Ensuring the stability of the CNN inference process against soft errors is therefore of critical importance. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and thoroughly analyze their fault protection ability and runtime. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%-8% in both error-free and error-injected situations).
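To illustrate the kind of checksum technique the abstract refers to, below is a minimal sketch of one filter-checksum scheme for a convolution layer. It exploits the linearity of convolution: convolving the input once with the element-wise sum of all filters must equal the per-pixel sum of all the individual filter outputs, so a single extra convolution detects a corrupted output channel. The function names (`conv2d`, `abft_conv_layer`), the tolerance, and the single-channel setting are assumptions for illustration, not the paper's actual schemes or API.

```python
import numpy as np

def conv2d(x, w):
    # Plain "valid" 2D cross-correlation of one image with one kernel.
    h, wid = x.shape
    k = w.shape[0]
    out = np.zeros((h - k + 1, wid - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def abft_conv_layer(x, filters, eps=1e-6):
    # Run every filter, then verify the per-pixel sum of all outputs
    # against one convolution with the checksum filter (sum of filters).
    # By linearity of convolution the two agree in a fault-free run,
    # so any mismatch beyond floating-point noise flags a soft error.
    outputs = np.stack([conv2d(x, w) for w in filters])
    checksum_out = conv2d(x, sum(filters))  # one extra convolution
    fault = bool(np.any(np.abs(outputs.sum(axis=0) - checksum_out) > eps))
    return outputs, fault
```

The overhead is one additional convolution per layer regardless of the number of filters, which is consistent with the low-overhead goal stated above; the paper's full workflow combines several such schemes and also handles correction, which this sketch does not attempt.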
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States). Laboratory Computing Resource Center (LCRC)
Sponsoring Organization:
National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1775093
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 32, Issue 7; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English