autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures
Conference · OSTI ID: 2480030
- Tokyo Institute of Technology, Japan
- Chinese Academy of Sciences (CAS)
- AIST, Japan
- Tencent AI
- RIKEN Center for Computational Science
- Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
- ORNL
- Shenzhen Institutes of Advanced Technology
This paper presents an open-source library that pushes the limits of performance portability for irregular General Matrix Multiplication (GEMM) on widely used Arm architectures. Our library, autoGEMM, is designed to support a wide range of Arm processors, from edge devices to HPC-grade CPUs. autoGEMM generates optimized kernels for various hardware configurations by automatically combining fragments of auto-generated micro-kernels that employ hand-written optimizations to maximize computational efficiency. We optimize the kernel pipeline by tuning register reuse and overlapping data loads/stores with computation. In addition, we use a dynamic tiling scheme to generate balanced tile shapes. Finally, we position autoGEMM on top of the TVM framework, where our dynamic tiling scheme prunes the search space so that TVM can identify the optimal combination of code-optimization parameters. Evaluations on five different classes of Arm chips demonstrate the advantages of autoGEMM. For small matrices, autoGEMM achieves 98% of peak performance and up to a 2.0x speedup over state-of-the-art libraries such as LIBXSMM and LibShalom. For irregular matrices (i.e., tall-and-skinny and long rectangular matrices), autoGEMM is 1.3-2.0x faster than widely used libraries such as OpenBLAS and Eigen. autoGEMM is publicly available at: https://github.com/wudu98/autoGEMM.
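To make the dynamic tiling idea concrete, the sketch below shows one way balanced tile shapes can be derived for irregular (e.g., tall-and-skinny) inputs. This is an illustrative plain-C outline under our own assumptions, not autoGEMM's actual implementation: the helper name `pick_tile`, the tile-size caps (64 and 256), and the scalar `micro_kernel` are hypothetical stand-ins for the library's auto-generated, architecture-specific micro-kernels.

```c
#include <stddef.h>

/* Hypothetical helper: split dimension `n` into tiles no larger than
 * `max_tile`, keeping all tiles within one element of each other so that
 * no outer-loop iteration runs a badly under-filled micro-kernel. */
static size_t pick_tile(size_t n, size_t max_tile) {
    size_t num_tiles = (n + max_tile - 1) / max_tile;  /* ceil(n / max_tile) */
    return (n + num_tiles - 1) / num_tiles;            /* balanced tile size */
}

/* Placeholder for an architecture-specific, register-blocked kernel
 * (e.g., NEON or SVE); written here as plain C for clarity. */
static void micro_kernel(size_t mt, size_t nt, size_t kt,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double *C, size_t ldc) {
    for (size_t i = 0; i < mt; ++i)
        for (size_t j = 0; j < nt; ++j)
            for (size_t p = 0; p < kt; ++p)
                C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
}

/* Illustrative tiled GEMM driver: C += A * B, row-major,
 * A is m x k, B is k x n, C is m x n. */
void gemm_tiled(size_t m, size_t n, size_t k,
                const double *A, const double *B, double *C) {
    /* Balanced tile shapes instead of fixed blocking: a tall-skinny input
     * (large m, small n) gets many near-equal row panels and a single
     * column panel, avoiding a pathologically small remainder tile. */
    size_t mt = pick_tile(m, 64);
    size_t nt = pick_tile(n, 64);
    size_t kt = pick_tile(k, 256);

    for (size_t i = 0; i < m; i += mt)
        for (size_t j = 0; j < n; j += nt)
            for (size_t p = 0; p < k; p += kt) {
                size_t mb = (i + mt <= m) ? mt : m - i;
                size_t nb = (j + nt <= n) ? nt : n - j;
                size_t kb = (p + kt <= k) ? kt : k - p;
                micro_kernel(mb, nb, kb,
                             &A[i * k + p], k,
                             &B[p * n + j], n,
                             &C[i * n + j], n);
            }
}
```

For example, a 1000x8 tall-and-skinny matrix would be split into near-equal row panels rather than a run of full tiles plus a tiny remainder, so each micro-kernel invocation stays close to fully occupied; the same balancing is what lets the scheme prune the tile-shape search space handed to TVM.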
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 2480030
- Country of Publication: United States
- Language: English
Similar Records
Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning
Journal Article · January 2022 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID: 1863284

Revisiting Temporal Blocking Stencil Optimizations
Conference · June 2023 · OSTI ID: 1994670

A high-performance implementation of atomistic spin dynamics simulations on x86 CPUs
Journal Article · July 2023 · Computer Physics Communications · OSTI ID: 2308837