U.S. Department of Energy
Office of Scientific and Technical Information

Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Journal Article · IEEE Transactions on Parallel and Distributed Systems
  1. Univ. of California, Riverside, CA (United States)
  2. Argonne National Lab. (ANL), Lemont, IL (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Convolutional neural networks (CNNs) are becoming increasingly important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Ensuring the stability of the CNN inference process against soft errors is therefore of critical importance. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and thoroughly analyze their fault protection ability and runtime. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%-8% in both error-free and error-injected situations).
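To illustrate the kind of checksum technique the abstract refers to, below is a minimal sketch of one filter-checksum scheme for a convolution layer. It exploits the linearity of convolution: convolving the input once with the element-wise sum of all filters must equal the per-pixel sum of all the individual filter outputs, so a single extra convolution detects a corrupted output channel. The function names (`conv2d`, `abft_conv_layer`), the tolerance, and the single-channel setting are assumptions for illustration, not the paper's actual schemes or API.

```python
import numpy as np

def conv2d(x, w):
    # Plain "valid" 2D cross-correlation of one image with one kernel.
    h, wid = x.shape
    k = w.shape[0]
    out = np.zeros((h - k + 1, wid - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def abft_conv_layer(x, filters, eps=1e-6):
    # Run every filter, then verify the per-pixel sum of all outputs
    # against one convolution with the checksum filter (sum of filters).
    # By linearity of convolution the two agree in a fault-free run,
    # so any mismatch beyond floating-point noise flags a soft error.
    outputs = np.stack([conv2d(x, w) for w in filters])
    checksum_out = conv2d(x, sum(filters))  # one extra convolution
    fault = bool(np.any(np.abs(outputs.sum(axis=0) - checksum_out) > eps))
    return outputs, fault
```

The overhead is one additional convolution per layer regardless of the number of filters, which is consistent with the low-overhead goal stated above; the paper's full workflow combines several such schemes and also handles correction, which this sketch does not attempt.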
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States). Laboratory Computing Resource Center (LCRC)
Sponsoring Organization:
National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1775093
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 32, Issue 7; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English