DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. In this work, we first identify and mitigate multiple bottlenecks in memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Unlike previous studies, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table-based and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while other algorithms dominate the remaining scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill these in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario, and we build performance models for the hash-table-based and heap-based algorithms that support the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application of 2.6X.
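To make the hash-table-based approach described in the abstract concrete, below is a minimal, single-threaded C++ sketch of Gustavson-style SpGEMM with a per-row hash accumulator. This is not the paper's implementation: the authors' KNL kernels use a custom hash table plus memory-management and thread-scheduling optimizations, and the Csr struct, spgemm_hash function, and use of std::unordered_map here are stand-ins introduced purely for illustration. The sort_rows flag mirrors the abstract's finding that hash-based SpGEMM is markedly faster when the output rows need not be sorted.

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal CSR matrix: rowptr has nrows + 1 entries.
struct Csr {
    int64_t nrows = 0, ncols = 0;
    std::vector<int64_t> rowptr;
    std::vector<int64_t> colidx;
    std::vector<double>  val;
};

// Gustavson-style SpGEMM, C = A * B, with a hash accumulator per output row.
// A simplified illustration only; the paper's kernels replace unordered_map
// with a tuned open-addressing table and parallelize over rows.
Csr spgemm_hash(const Csr& A, const Csr& B, bool sort_rows) {
    Csr C;
    C.nrows = A.nrows;
    C.ncols = B.ncols;
    C.rowptr.assign(A.nrows + 1, 0);

    std::unordered_map<int64_t, double> acc;  // column index -> partial sum
    for (int64_t i = 0; i < A.nrows; ++i) {
        acc.clear();
        // Row i of C: for each nonzero a_ik of A, scatter a_ik * B(k, :).
        for (int64_t ia = A.rowptr[i]; ia < A.rowptr[i + 1]; ++ia) {
            const int64_t k = A.colidx[ia];
            const double  a = A.val[ia];
            for (int64_t ib = B.rowptr[k]; ib < B.rowptr[k + 1]; ++ib)
                acc[B.colidx[ib]] += a * B.val[ib];
        }
        // Emit the accumulated row. Skipping the sort is the cheap path the
        // abstract highlights: when callers tolerate unsorted column indices
        // within each row, the per-row std::sort disappears entirely.
        std::vector<std::pair<int64_t, double>> row(acc.begin(), acc.end());
        if (sort_rows)
            std::sort(row.begin(), row.end());
        for (const auto& [col, v] : row) {
            C.colidx.push_back(col);
            C.val.push_back(v);
        }
        C.rowptr[i + 1] = static_cast<int64_t>(C.colidx.size());
    }
    return C;
}

The sketch also shows why the sorted/unsorted distinction matters: the accumulation phase touches each floating-point multiply once, so the per-row sort can dominate for rows with many accumulated nonzeros.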

Authors:
 Nagasaka, Yusuke [1]; Matsuoka, Satoshi [2]; Azad, Ariful [3]; Buluç, Aydın [4]
  1. Tokyo Inst. of Technology (Japan)
  2. RIKEN Center for Computational Science, Kobe (Japan)
  3. Indiana Univ., Bloomington, IN (United States)
  4. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2019-08-30
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); Japan Science and Technology Agency (JST); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1559813
Alternate Identifier(s):
OSTI ID: 1692084
Grant/Contract Number:  
AC02-05CH11231; JPMJCR1303; JPMJCR1687
Resource Type:
Accepted Manuscript
Journal Name:
Parallel Computing
Additional Journal Information:
Journal Volume: 90; Journal Issue: C; Journal ID: ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Sparse matrix; SpGEMM; Intel KNL

Citation Formats

Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. United States: N. p., 2019. Web. doi:10.1016/j.parco.2019.102545.
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, & Buluç, Aydın. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors. United States. https://doi.org/10.1016/j.parco.2019.102545
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. 2019. "Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors". United States. https://doi.org/10.1016/j.parco.2019.102545. https://www.osti.gov/servlets/purl/1559813.
@article{osti_1559813,
title = {Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors},
author = {Nagasaka, Yusuke and Matsuoka, Satoshi and Azad, Ariful and Buluç, Aydın},
abstractNote = {Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. In this work, we firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. We build the performance model for hash-table and heap-based algorithms, which supports the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6X.},
doi = {10.1016/j.parco.2019.102545},
journal = {Parallel Computing},
number = {C},
volume = {90},
place = {United States},
year = {2019},
month = {aug}
}

Citation Metrics:
Cited by: 16 works
Citation information provided by
Web of Science

Works referenced in this record:

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018

  • Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
  • Nucleic Acids Research, Vol. 46, Issue 6
  • DOI: 10.1093/nar/gkx1313

Near linear time algorithm to detect community structures in large-scale networks
journal, September 2007

  • Raghavan, Usha Nandini; Albert, Réka; Kumara, Soundar
  • Physical Review E, Vol. 76, Issue 3
  • DOI: 10.1103/PhysRevE.76.036106

Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid
journal, January 2016

  • Ballard, Grey; Siefert, Christopher; Hu, Jonathan
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 3
  • DOI: 10.1137/15M1028807

Solvers for O(N) Electronic Structure in the Strong Scaling Limit
journal, January 2016

  • Bock, Nicolas; Challacombe, Matt; Kalé, Laxmikant V.
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 1
  • DOI: 10.1137/140974602

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978

  • Gustavson, Fred G.
  • ACM Transactions on Mathematical Software, Vol. 4, Issue 3
  • DOI: 10.1145/355791.355796

Sparse Matrices in MATLAB: Design and Implementation
journal, January 1992

  • Gilbert, John R.; Moler, Cleve; Schreiber, Robert
  • SIAM Journal on Matrix Analysis and Applications, Vol. 13, Issue 1
  • DOI: 10.1137/0613024

Optimizing Sparse Matrix-Matrix Multiplication for the GPU
journal, October 2015

  • Dalton, Steven; Olson, Luke; Bell, Nathan
  • ACM Transactions on Mathematical Software, Vol. 41, Issue 4
  • DOI: 10.1145/2699470

GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging
journal, January 2015

  • Gremse, Felix; Höfter, Andreas; Schwen, Lars Ole
  • SIAM Journal on Scientific Computing, Vol. 37, Issue 1
  • DOI: 10.1137/130948811

ViennaCL—Linear Algebra Library for Multi- and Many-Core Architectures
journal, January 2016

  • Rupp, Karl; Tillet, Philippe; Rudolf, Florian
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1026419

Benchmarking optimization software with performance profiles
journal, January 2002

  • Dolan, Elizabeth D.; Moré, Jorge J.
  • Mathematical Programming, Vol. 91, Issue 2
  • DOI: 10.1007/s101070100263

The Combinatorial BLAS: design, implementation, and applications
journal, May 2011

  • Buluç, Aydın; Gilbert, John R.
  • The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
  • DOI: 10.1177/1094342011403516

Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision and Processing on GPUs
journal, August 2017

  • Mueller-Roemer, J. S.; Altenhofen, C.; Stork, A.
  • Computer Graphics Forum, Vol. 36, Issue 5
  • DOI: 10.1111/cgf.13245

Works referencing / citing this record:

The parallelism motifs of genomic data analysis
journal, January 2020

  • Yelick, Katherine; Buluç, Aydın; Awan, Muaaz
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
  • DOI: 10.1098/rsta.2019.0394

Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices
preprint, January 2020


Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale
preprint, January 2020