U.S. Department of Energy
Office of Scientific and Technical Information

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

Journal Article · IEEE Transactions on Parallel and Distributed Systems
  1. Korea Aerospace Univ., Gyeonggi (Korea, Republic of)
  2. Georgia Inst. of Technology, Atlanta, GA (United States)
  3. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Finally, our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators.
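The abstract describes mapping a tiled GEMM onto a spatial accelerator by choosing a dataflow (loop order) and tile sizes. A minimal sketch of the underlying tiled loop nest is below; the function name and tile-size parameters (`Tm`, `Tn`, `Tk`) are illustrative assumptions, not the framework's actual interface.

```python
import numpy as np

def tiled_gemm(A, B, Tm=4, Tn=4, Tk=4):
    """Compute C = A @ B one (Tm x Tn x Tk) tile at a time.

    Illustrative sketch only: on a spatial accelerator, each tile's
    work would be spread across the PE array, and the order of the
    three tile loops (the "dataflow") determines which operand stays
    resident in the buffer hierarchy for reuse.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for m0 in range(0, M, Tm):          # tile rows of A / C
        for n0 in range(0, N, Tn):      # tile columns of B / C
            for k0 in range(0, K, Tk):  # reduction tiles
                # NumPy slicing clamps at the edge, so ragged
                # (non-divisible) dimensions are handled for free.
                C[m0:m0+Tm, n0:n0+Tn] += (
                    A[m0:m0+Tm, k0:k0+Tk] @ B[k0:k0+Tk, n0:n0+Tn]
                )
    return C
```

A mapper such as the one the paper proposes would search over tile sizes and loop orders like these, scoring each candidate with an analytical runtime/energy model rather than executing it.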
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
NA0003525
OSTI ID:
1820407
Report Number(s):
SAND2021-9925J; 698422
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 4; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English

References (29)

SUMMA: scalable universal matrix multiplication algorithm journal April 1997
Performance, Design, and Autotuning of Batched GEMM for GPUs book June 2016
Automated empirical optimizations of software and the ATLAS project journal January 2001
A survey of direct methods for sparse linear systems journal May 2016
A high-performance, low-power linear algebra core
  • Pedram, Ardavan; Gerstlauer, Andreas; Geijn, Robert A. van de
  • 2011 IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), ASAP 2011 - 22nd IEEE International Conference on Application-specific Systems, Architectures and Processors https://doi.org/10.1109/ASAP.2011.6043234
conference September 2011
Deep Residual Learning for Image Recognition conference June 2016
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training conference February 2020
Understanding the Impact of On-chip Communication on DNN Accelerator Performance conference November 2019
Domain-specific library generation for parallel software and hardware platforms
  • Franchetti, Franz; Voronenko, Yevgen; Milder, Peter A.
  • 2008 IEEE International Parallel & Distributed Processing Symposium, 2008 IEEE International Symposium on Parallel and Distributed Processing https://doi.org/10.1109/IPDPS.2008.4536398
conference April 2008
mRNA: Enabling Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator conference March 2019
Timeloop: A Systematic Approach to DNN Accelerator Evaluation conference March 2019
Self-Adapting Linear Algebra Algorithms and Software journal February 2005
Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks journal January 2017
Accelerating Scientific Applications With SambaNova Reconfigurable Dataflow Architecture journal March 2021
A Hardware–Software Blueprint for Flexible Deep Learning Specialization journal September 2019
Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures journal December 2012
A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference conference June 2018
Anatomy of high-performance matrix multiplication journal May 2008
High-performance implementation of the level-3 BLAS journal July 2008
ShiDianNao: shifting vision processing closer to the sensor
  • Du, Zidong; Fasthuber, Robert; Chen, Tianshi
  • ISCA '15: The 42nd Annual International Symposium on Computer Architecture, Proceedings of the 42nd Annual International Symposium on Computer Architecture https://doi.org/10.1145/2749469.2750389
conference June 2015
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark journal September 1998
Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks journal October 2016
In-Datacenter Performance Analysis of a Tensor Processing Unit conference January 2017
Rethinking NoCs for Spatial Neural Network Accelerators
  • Kwon, Hyoukjun; Samajdar, Ananda; Krishna, Tushar
  • NOCS '17: International Symposium on Networks-on-Chip, Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip https://doi.org/10.1145/3130218.3130230
conference October 2017
A coordinated tiling and batching framework for efficient GEMM on GPUs
  • Li, Xiuhong; Liang, Yun; Yan, Shengen
  • PPoPP '19: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming https://doi.org/10.1145/3293883.3295734
conference February 2019
MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects journal November 2018
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach
  • Kwon, Hyoukjun; Chatarasi, Prasanth; Pellauer, Michael
  • MICRO '52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture https://doi.org/10.1145/3352460.3358252
conference October 2019
dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators journal October 2019
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
  • Yang, Xuan; Gao, Mingyu; Liu, Qiaoyi
  • ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems https://doi.org/10.1145/3373376.3378514
conference March 2020

Similar Records

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication
Technical Report · July 14, 2021 · OSTI ID: 1808019

autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures
Conference · November 1, 2024 · OSTI ID: 2480030

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators
Conference · October 18, 2021 · OSTI ID: 1972822