There is growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. Their efficiency comes from employing optimized dataflow strategies (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) to maximize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) of a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Finally, our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators.
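For readers unfamiliar with the terminology in the abstract, the sketch below illustrates what a "mapping" (tile sizes plus dataflow) for a tiled GEMM looks like in software form. This is a minimal illustrative sketch in Python, not the authors' framework or cost model; the tile-size parameters (TM, TN, TK), function names, and the particular loop order are assumptions chosen for the example.

# Minimal sketch of a tiled GEMM loop nest (illustrative only, not the
# paper's framework). The tile sizes (TM, TN, TK) and the ordering of the
# outer loops stand in for the "mapping" (tile sizes + dataflow) that the
# paper's framework searches over for a given accelerator and workload.

import numpy as np

def tiled_gemm(A, B, TM=32, TN=32, TK=32):
    """Compute C = A @ B with a simple three-level tiling.

    Each (i0, j0, k0) iteration works on a TM x TK tile of A and a
    TK x TN tile of B -- the granularity at which a spatial accelerator
    would stage operands in its local buffers.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)

    # The outer loop order (here: i -> j -> k) is one choice of temporal
    # dataflow; permuting these loops changes which operand is reused
    # most heavily across consecutive tile iterations.
    for i0 in range(0, M, TM):
        for j0 in range(0, N, TN):
            for k0 in range(0, K, TK):
                a_tile = A[i0:i0 + TM, k0:k0 + TK]
                b_tile = B[k0:k0 + TK, j0:j0 + TN]
                # On an accelerator this partial product would be spread
                # spatially across the PE array; here it is just a matmul.
                C[i0:i0 + TM, j0:j0 + TN] += a_tile @ b_tile
    return C

# Quick check against a reference implementation.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((96, 80))
    B = rng.standard_normal((80, 64))
    assert np.allclose(tiled_gemm(A, B, TM=16, TN=32, TK=20), A @ B)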
Moon, Gordon Euhyun, et al. "Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication." IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 4, Aug. 2021. https://doi.org/10.1109/TPDS.2021.3104240
Moon, Gordon Euhyun, Kwon, Hyoukjun, Jeong, Geonhwa, et al., "Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication," IEEE Transactions on Parallel and Distributed Systems 33, no. 4 (2021), https://doi.org/10.1109/TPDS.2021.3104240
@article{osti_1820407,
author = {Moon, Gordon Euhyun and Kwon, Hyoukjun and Jeong, Geonhwa and Chatarasi, Prasanth and Rajamanickam, Sivasankaran and Krishna, Tushar},
title = {Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication},
doi = {10.1109/TPDS.2021.3104240},
url = {https://www.osti.gov/biblio/1820407},
journal = {IEEE Transactions on Parallel and Distributed Systems},
issn = {1045-9219},
number = {4},
volume = {33},
place = {United States},
publisher = {IEEE},
year = {2021},
month = aug}
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
NA0003525
OSTI ID:
1820407
Report Number(s):
SAND--2021-9925J; 698422
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 4; ISSN 1045-9219