Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors
Abstract
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. In this work, we first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms show significant speedups over existing libraries in the majority of cases, while different algorithms dominate the other scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill the in-depth evaluation results into a recipe that gives the best SpGEMM algorithm for a target scenario. We build performance models for the hash-table and heap-based algorithms, which support the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6X.
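The hash-table approach described in the abstract builds on the classic Gustavson row-by-row formulation: each row of the output is accumulated in a hash table keyed by column index, and the abstract's finding is that emitting that row without sorting its column indices avoids a costly per-row sort. The following is a minimal illustrative sketch of that general idea in plain Python over CSR arrays; it is not the paper's KNL-optimized implementation, and all names here are our own.

```python
def spgemm_hash(A, B):
    """Multiply two sparse matrices given in CSR form.

    A and B are (indptr, indices, data) triples. Each output row is
    accumulated in a dict (hash table), so column indices are emitted
    unsorted -- mirroring the paper's observation that skipping the
    per-row sort speeds up hash-table-based SpGEMM.
    """
    a_ptr, a_idx, a_val = A
    b_ptr, b_idx, b_val = B
    c_ptr, c_idx, c_val = [0], [], []
    n_rows = len(a_ptr) - 1
    for i in range(n_rows):
        acc = {}  # hash-table accumulator for row i of C
        # Gustavson: for each nonzero A[i,k], scale row k of B into acc.
        for t in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[t], a_val[t]
            for u in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[u]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[u]
        for j, v in acc.items():  # unsorted emission of row i
            c_idx.append(j)
            c_val.append(v)
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

In an optimized many-core kernel the dict would be replaced by a preallocated open-addressing table sized from an upper bound on the row's flop count, with rows distributed across threads; this sketch only shows the accumulation pattern.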
- Authors: Nagasaka, Yusuke; Matsuoka, Satoshi; Azad, Ariful; Buluç, Aydın
- Author Affiliations:
- Tokyo Inst. of Technology (Japan)
- RIKEN Center for Computational Science, Kobe (Japan)
- Indiana Univ., Bloomington, IN (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Publication Date: August 30, 2019
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); Japan Science and Technology Agency (JST); USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1559813
- Alternate Identifier(s):
- OSTI ID: 1692084
- Grant/Contract Number:
- AC02-05CH11231; JPMJCR1303; JPMJCR1687
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Parallel Computing
- Additional Journal Information:
- Journal Volume: 90; Journal Issue: C; Journal ID: ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Sparse matrix; SpGEMM; Intel KNL
Citation Formats
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. United States: N. p., 2019. Web. doi:10.1016/j.parco.2019.102545.
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, & Buluç, Aydın. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. United States. https://doi.org/10.1016/j.parco.2019.102545
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. 2019. "Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors". United States. https://doi.org/10.1016/j.parco.2019.102545. https://www.osti.gov/servlets/purl/1559813.
@article{osti_1559813,
title = {Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors},
author = {Nagasaka, Yusuke and Matsuoka, Satoshi and Azad, Ariful and Buluç, Aydın},
abstractNote = {Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. In this work, we firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. We build the performance model for hash-table and heap-based algorithms, which supports the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6X.},
doi = {10.1016/j.parco.2019.102545},
journal = {Parallel Computing},
number = {C},
volume = {90},
place = {United States},
year = {2019},
month = {aug}
}
Works referenced in this record:
HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018
- Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
- Nucleic Acids Research, Vol. 46, Issue 6
Near linear time algorithm to detect community structures in large-scale networks
journal, September 2007
- Raghavan, Usha Nandini; Albert, Réka; Kumara, Soundar
- Physical Review E, Vol. 76, Issue 3
Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid
journal, January 2016
- Ballard, Grey; Siefert, Christopher; Hu, Jonathan
- SIAM Journal on Scientific Computing, Vol. 38, Issue 3
Solvers for $\mathcal{O}(N)$ Electronic Structure in the Strong Scaling Limit
journal, January 2016
- Bock, Nicolas; Challacombe, Matt; Kalé, Laxmikant V.
- SIAM Journal on Scientific Computing, Vol. 38, Issue 1
Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978
- Gustavson, Fred G.
- ACM Transactions on Mathematical Software, Vol. 4, Issue 3
Sparse Matrices in MATLAB: Design and Implementation
journal, January 1992
- Gilbert, John R.; Moler, Cleve; Schreiber, Robert
- SIAM Journal on Matrix Analysis and Applications, Vol. 13, Issue 1
Optimizing Sparse Matrix-Matrix Multiplication for the GPU
journal, October 2015
- Dalton, Steven; Olson, Luke; Bell, Nathan
- ACM Transactions on Mathematical Software, Vol. 41, Issue 4
GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging
journal, January 2015
- Gremse, Felix; Höfter, Andreas; Schwen, Lars Ole
- SIAM Journal on Scientific Computing, Vol. 37, Issue 1
ViennaCL - Linear Algebra Library for Multi- and Many-Core Architectures
journal, January 2016
- Rupp, Karl; Tillet, Philippe; Rudolf, Florian
- SIAM Journal on Scientific Computing, Vol. 38, Issue 5
Benchmarking optimization software with performance profiles
journal, January 2002
- Dolan, Elizabeth D.; Moré, Jorge J.
- Mathematical Programming, Vol. 91, Issue 2
The Combinatorial BLAS: design, implementation, and applications
journal, May 2011
- Buluç, Aydın; Gilbert, John R.
- The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision and Processing on GPUs
journal, August 2017
- Mueller-Roemer, J. S.; Altenhofen, C.; Stork, A.
- Computer Graphics Forum, Vol. 36, Issue 5
Works referencing / citing this record:
The parallelism motifs of genomic data analysis
journal, January 2020
- Yelick, Katherine; Buluç, Aydın; Awan, Muaaz
- Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking
preprint, January 2020
- Gu, Zhixiang; Moreira, Jose; Edelsohn, David
- arXiv
Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices
preprint, January 2020
- Selvitopi, Oguz; Ekanayake, Saliya; Guidi, Giulia
- arXiv
Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale
preprint, January 2020
- Hussain, Md Taufique; Selvitopi, Oguz; Buluç, Aydin
- arXiv