DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming

Abstract

FPGA-based heterogeneous computing platforms, thanks to their extreme logic reconfigurability, have emerged as strong contenders for the computing fabric of modern AI. As a result, various FPGA-based accelerators for deep CNNs—the key driver of modern AI—have been proposed, owing to their high performance, reconfigurability, and fast development cycle. The general consensus among researchers is that, although an FPGA-based accelerator can achieve much higher energy efficiency, its raw computing performance lags behind that of GPUs with similar logic density. In this paper, we develop an alternative methodology for efficiently implementing CNNs on FPGAs that outperforms GPUs in both power consumption and performance. Our key idea is a scalable hardware architecture and circuit design for large-scale CNNs that leverages a stochastic computing principle. Specifically, it offers three major performance advantages. First, all key components of our deep CNN are designed and implemented to compute stochastically, achieving excellent computing performance and energy efficiency. Second, because the proposed architecture enables stream-mode computing, every stage can process partial results from preceding stages, incurring no unnecessary latency due to data dependency. Finally, our FPGA-based deep CNN also provides superior hardware scalability compared with conventional FPGA implementations by reducing the bandwidth requirement between layers. The results show that the proposed architecture significantly outperforms all previous FPGA-based deep CNN implementation approaches, achieving 1.58x more GOPS, 6.42x more GOPS/Slice, and 10.92x more GOPS/W than a state-of-the-art CNN architecture. The top-5 accuracy of the stochastic VGG-16 CNN is 86.77 percent at a frame rate of 18.91 fps.
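The stochastic computing principle the abstract refers to can be sketched in a few lines of software. In unipolar stochastic computing, a value in [0, 1] is encoded as the probability of a 1 appearing in a random bitstream, so multiplication reduces to a single bitwise AND gate per stream pair. The sketch below is an illustrative software model only; the function names are ours, and the paper's actual FPGA circuits are far more involved:

```python
import random

def to_bitstream(p, n, rng):
    # Encode a probability p in [0, 1] as a unipolar stochastic
    # bitstream of length n: each bit is 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_bitstream(bits):
    # Decode a bitstream back to a value: the fraction of 1s.
    return sum(bits) / len(bits)

def stochastic_multiply(a_bits, b_bits):
    # In unipolar stochastic computing, multiplication of two
    # independent streams is just a bitwise AND.
    return [x & y for x, y in zip(a_bits, b_bits)]

rng = random.Random(0)
n = 100_000
a, b = 0.6, 0.5
product = from_bitstream(
    stochastic_multiply(to_bitstream(a, n, rng), to_bitstream(b, n, rng))
)
# product approximates a * b = 0.3; the error shrinks as n grows
```

This trade of a hardware multiplier for a single AND gate (at the cost of long bitstreams and approximate results) is what makes the approach so area- and energy-efficient on FPGA fabric.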

Authors:
Alawad, Mohammed [1]; Lin, Mingjie [2]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Univ. of Central Florida, Orlando, FL (United States)
Publication Date:
2018-12-12
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1493138
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Multi-Scale Computing Systems
Additional Journal Information:
Journal Volume: 4; Journal Issue: 4; Journal ID: ISSN 2372-207X
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; convolutional neural network; FPGA; stochastic computing

Citation Formats

Alawad, Mohammed, and Lin, Mingjie. Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming. United States: N. p., 2018. Web. doi:10.1109/TMSCS.2018.2886266.
Alawad, Mohammed, & Lin, Mingjie. Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming. United States. https://doi.org/10.1109/TMSCS.2018.2886266
Alawad, Mohammed, and Lin, Mingjie. Wed Dec 12, 2018. "Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming". United States. https://doi.org/10.1109/TMSCS.2018.2886266. https://www.osti.gov/servlets/purl/1493138.
@article{osti_1493138,
title = {Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming},
author = {Alawad, Mohammed and Lin, Mingjie},
abstractNote = {FPGA-based heterogeneous computing platforms, thanks to their extreme logic reconfigurability, have emerged as strong contenders for the computing fabric of modern AI. As a result, various FPGA-based accelerators for deep CNNs—the key driver of modern AI—have been proposed, owing to their high performance, reconfigurability, and fast development cycle. The general consensus among researchers is that, although an FPGA-based accelerator can achieve much higher energy efficiency, its raw computing performance lags behind that of GPUs with similar logic density. In this paper, we develop an alternative methodology for efficiently implementing CNNs on FPGAs that outperforms GPUs in both power consumption and performance. Our key idea is a scalable hardware architecture and circuit design for large-scale CNNs that leverages a stochastic computing principle. Specifically, it offers three major performance advantages. First, all key components of our deep CNN are designed and implemented to compute stochastically, achieving excellent computing performance and energy efficiency. Second, because the proposed architecture enables stream-mode computing, every stage can process partial results from preceding stages, incurring no unnecessary latency due to data dependency. Finally, our FPGA-based deep CNN also provides superior hardware scalability compared with conventional FPGA implementations by reducing the bandwidth requirement between layers. The results show that the proposed architecture significantly outperforms all previous FPGA-based deep CNN implementation approaches, achieving 1.58x more GOPS, 6.42x more GOPS/Slice, and 10.92x more GOPS/W than a state-of-the-art CNN architecture. The top-5 accuracy of the stochastic VGG-16 CNN is 86.77 percent at a frame rate of 18.91 fps.},
doi = {10.1109/TMSCS.2018.2886266},
journal = {IEEE Transactions on Multi-Scale Computing Systems},
number = 4,
volume = 4,
place = {United States},
year = {2018},
month = {12}
}