U.S. Department of Energy
Office of Scientific and Technical Information

Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning

Journal Article · IEEE Transactions on Parallel and Distributed Systems
  1. Shenzhen Institutes of Advanced Technology (China). High Performance Computing Center
  2. National Inst. of Advanced Industrial Science and Technology (AIST), Tokyo (Japan). AIST/TokyoTech Open Innovation Lab.
  3. Johannes Gutenberg Univ., Mainz (Germany). High Performance Computing Center
  4. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  5. Shandong Univ., Jinan (China)
  6. Tencent, Shenzhen (China)
In this work, we present FastConv, a template-based, open-source code auto-generation library that automatically generates high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming convolution layers of convolutional neural networks. ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs. This leads to the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple tuned kernel variants suited to tall-and-skinny matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order scheduling, packing strategies, access patterns, and online/offline computations. Auto-tuning searches this parameter configuration space for the best performance on a given target architecture and problem size. Layer-wise experiments on the VGG-16 model confirm a 1.25x performance gain from tuning the Winograd library. Integrated comparisons show speedups of 1.02x to 1.40x, 1.14x to 2.17x, and 1.22x to 2.48x over NNPACK, Arm NN, and FeatherCNN, respectively, on the Kunpeng 920 in all but a few cases. Furthermore, problem-size performance-portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over the Winograd implementations of NNPACK and the Arm NN inference engine on the Kunpeng 920. CPU performance-portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on the Kunpeng 920, Snapdragon 835, 855, and 888, Apple M1, and AWS Graviton2, respectively.
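The starting point for FastConv is the Winograd minimal-filtering algorithm of Lavin and Gray ("Fast Algorithms for Convolutional Neural Networks" in the references below), which trades multiplications for cheap additions via input, filter, and output transforms. As a minimal illustration of that underlying algorithm only — not FastConv's actual generated code, whose kernel shapes and tilings are exactly what the templates and auto-tuner vary — here is a 1-D F(2,3) sketch in C++:

```cpp
// Minimal 1-D Winograd F(2,3) sketch (after Lavin & Gray): compute
// m = 2 outputs of a 3-tap filter g over a 4-element input tile d
// using 4 multiplies instead of the naive 6.
//   Y = A^T * [ (G g) .* (B^T d) ]
#include <array>
#include <cstdio>

std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    // Filter transform U = G g, with G = [1 0 0; .5 .5 .5; .5 -.5 .5; 0 0 1]
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];

    // Input transform V = B^T d
    float v0 = d[0] - d[2];
    float v1 = d[1] + d[2];
    float v2 = d[2] - d[1];
    float v3 = d[1] - d[3];

    // Element-wise product in the transform domain (the 4 multiplies)
    float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;

    // Output transform Y = A^T m, with A^T = [1 1 1 0; 0 1 -1 -1]
    return { m0 + m1 + m2, m1 - m2 - m3 };
}

int main() {
    std::array<float, 4> d = {1.f, 2.f, 3.f, 4.f};
    std::array<float, 3> g = {1.f, 0.f, -1.f};
    auto y = winograd_f23(d, g);
    // Direct convolution gives y[i] = sum_k d[i+k]*g[k] = -2, -2
    std::printf("winograd: %.1f %.1f\n", y[0], y[1]);
    return 0;
}
```

In the 2-D case used by real convolution layers, the element-wise stage becomes a batched multiplication of many small, often tall-and-skinny matrices across tiles and channels, which is why the kernel shapes, cache tiles, loop orders, and packing strategies that FastConv auto-tunes dominate the achieved performance.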
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
National Key R&D Program of China; National Natural Science Foundation of China (NSFC); USDOE
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1863284
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 5, Issue 1; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English

References (27)

Performance, Design, and Autotuning of Batched GEMM for GPUs book June 2016
Internet of Things (IoT): A vision, architectural elements, and future directions journal September 2013
IoT security: Review, blockchain solutions, and open challenges journal May 2018
Fast Algorithms for Convolutional Neural Networks conference June 2016
Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs conference February 2019
ARMv8-A next-generation vector architecture for HPC conference August 2016
A Memory-aware Performance Optimization of Tensor Programs for Embedded Devices conference November 2020
High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps conference August 2015
Fd-Mobilenet: Improved Mobilenet with a Fast Downsampling Strategy conference October 2018
Anatomy of High-Performance Many-Threaded Matrix Multiplication
  • Smith, Tyler M.; Geijn, Robert van de; Smelyanskiy, Mikhail
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS) https://doi.org/10.1109/IPDPS.2014.110
conference May 2014
Cache-aware Roofline model: Upgrading the loft journal January 2014
LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation
  • Heinecke, Alexander; Henry, Greg; Hutchinson, Maxwell
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2016.83
conference November 2016
Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures
  • Georganas, Evangelos; Avancha, Sasikanth; Banerjee, Kunal
  • SC18: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2018.00069
conference November 2018
Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels journal January 2020
FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures journal March 2020
Anatomy of high-performance matrix multiplication journal May 2008
Roofline: an insightful visual performance model for multicore architectures journal April 2009
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines journal June 2013
Modelling the ARMv8 architecture, operationally: concurrency and ISA
  • Flur, Shaked; Gray, Kathryn E.; Pulte, Christopher
  • POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages https://doi.org/10.1145/2837614.2837615
conference January 2016
Optimizing N-dimensional, winograd-based convolution for manycore CPUs
  • Jia, Zhen; Zlateski, Aleksandar; Durand, Fredo
  • PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming https://doi.org/10.1145/3178487.3178496
conference February 2018
Optimizing Deep Learning Workloads on ARM GPU with TVM
  • Zheng, Lanmin; Chen, Tianqi
  • ReQuEST '18: Proceedings of the 1st Reproducible Quality-Efficient Systems Tournament on Co-designing Pareto-efficient Deep Learning https://doi.org/10.1145/3229762.3229764
conference June 2018
Optimizing batched winograd convolution on GPUs
  • Yan, Da; Wang, Wei; Chu, Xiaowen
  • PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming https://doi.org/10.1145/3332466.3374520
conference February 2020
I/O lower bounds for auto-tuning of convolutions in CNNs
  • Zhang, Xiaoyang; Xiao, Junmin; Tan, Guangming
  • PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming https://doi.org/10.1145/3437801.3441609
conference February 2021
Fast Optimisation of Convolutional Neural Network Inference using System Performance Models
  • Mulder, Rik; Radu, Valentin; Dubach, Christophe
  • EuroSys '21: Proceedings of the 1st Workshop on Machine Learning and Systems https://doi.org/10.1145/3437984.3458840
conference April 2021
Minimizing GPU Kernel Launch Overhead in Deep Learning Inference on Mobile GPUs
  • Kim, Sumin; Oh, Seunghwan; Yi, Youngmin
  • HotMobile '21: Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications https://doi.org/10.1145/3446382.3448606
conference February 2021
LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores
  • Yang, Weiling; Fang, Jianbin; Dong, Dezun
  • SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/3458817.3476217
conference November 2021
Batched matrix computations on hardware accelerators based on GPUs journal April 2014

Similar Records

autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures
Conference · November 2024 · OSTI ID: 2480030

A Generalized Framework for Auto-tuning Stencil Computations
Conference · May 2009 · OSTI ID: 962935

A Generalized Framework for Auto-tuning Stencil Computations
Conference · August 2009 · OSTI ID: 1407077