OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big-data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We investigate their performance together with other publicly available codes. Unlike prior evaluations, ours also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table- and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate the remaining scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill the in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario. In conclusion, a critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
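The heap-based shared-memory algorithm mentioned in the abstract can be illustrated with a small sketch: each output row of C = A·B is a k-way merge of the sorted rows of B selected by the nonzeros of the corresponding row of A. The following is a hypothetical, simplified Python illustration of this general technique, not the paper's optimized kernel; the function name and the list-of-(index, value)-pairs row format are assumptions made for the example.

```python
import heapq

def spgemm_row_heap(a_row, b_rows):
    """Compute one row of C = A @ B by a heap-based k-way merge (a sketch).

    a_row:  list of (j, a_ij) nonzeros in one row of A.
    b_rows: b_rows[j] is the column-sorted list of (k, b_jk) nonzeros
            in row j of B.
    Returns the column-sorted nonzeros of the corresponding row of C.
    Because the merge consumes sorted inputs, the output row comes out
    sorted for free -- the trade-off against hash-based accumulation,
    which is faster when sorted output is not required.
    """
    heap = []     # entries: (column k, stream id, position in that row)
    streams = []  # one (B row, scaling factor a_ij) stream per nonzero of a_row
    for sid, (j, a_ij) in enumerate(a_row):
        row = b_rows[j]
        streams.append((row, a_ij))
        if row:
            heapq.heappush(heap, (row[0][0], sid, 0))
    out = []
    while heap:
        k, sid, pos = heapq.heappop(heap)
        row, a_ij = streams[sid]
        val = a_ij * row[pos][1]
        if out and out[-1][0] == k:
            out[-1] = (k, out[-1][1] + val)  # same column: accumulate
        else:
            out.append((k, val))
        if pos + 1 < len(row):               # advance this stream
            heapq.heappush(heap, (row[pos + 1][0], sid, pos + 1))
    return out
```

In a parallel implementation, independent output rows would be distributed across threads, which is the parallelization pattern the paper studies.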

Authors:
 Nagasaka, Yusuke [1];  Matsuoka, Satoshi [2];  Azad, Ariful [3];  Buluç, Aydın [3]
  1. Tokyo Institute of Technology, Tokyo (Japan)
  2. RIKEN Center for Computational Science, Kobe (Japan)
  3. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1454499
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Conference
Resource Relation:
Conference: ICPP '18: Proceedings of the 47th International Conference on Parallel Processing Companion, Eugene, OR (United States), 13-16 Aug 2018; Related Information: Also at https://arxiv.org/abs/1804.01698
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures. United States: N. p., 2018. Web. doi:10.1145/3229710.3229720.
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, & Buluç, Aydın. High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures. United States. doi:10.1145/3229710.3229720.
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. "High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures". United States. doi:10.1145/3229710.3229720. https://www.osti.gov/servlets/purl/1454499.
@article{osti_1454499,
title = {High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures},
author = {Nagasaka, Yusuke and Matsuoka, Satoshi and Azad, Ariful and Buluç, Aydın},
abstractNote = {Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big-data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We investigate their performance together with other publicly available codes. Unlike prior evaluations, ours also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table- and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate the remaining scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill the in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario. In conclusion, a critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.},
doi = {10.1145/3229710.3229720},
place = {United States},
year = {2018},
month = {8}
}


Figures / Tables:

Figure 1: Pseudo code of Gustavson's row-wise SpGEMM algorithm. The "in parallel" keyword does not exist in the original algorithm but is used here to illustrate the common parallelization pattern of this algorithm used by all known implementations.
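As a concrete (hypothetical) rendering of the pseudo code described in the caption, here is a sequential Python sketch of Gustavson's row-wise SpGEMM over CSR arrays, using a hash-table (Python dict) accumulator per output row; the loop over rows i is the one the caption marks "in parallel". The function name and signature are assumptions for illustration, not the paper's implementation.

```python
def spgemm_gustavson(a_indptr, a_indices, a_data,
                     b_indptr, b_indices, b_data):
    """Row-wise SpGEMM in CSR form, following Gustavson's algorithm.

    For each row i of A, scatter a_ij * B[j, :] into a hash-table
    accumulator, then gather the accumulated nonzeros into row i of C.
    Returns (c_indptr, c_indices, c_data) for C = A @ B in CSR form.
    """
    c_indptr, c_indices, c_data = [0], [], []
    n_rows = len(a_indptr) - 1
    for i in range(n_rows):  # "in parallel" over rows in the paper
        acc = {}  # column index -> partial sum (hash-table accumulator)
        for jj in range(a_indptr[i], a_indptr[i + 1]):
            j, a_ij = a_indices[jj], a_data[jj]
            for kk in range(b_indptr[j], b_indptr[j + 1]):
                k = b_indices[kk]
                acc[k] = acc.get(k, 0.0) + a_ij * b_data[kk]
        # Sorting column indices within the row is optional; the paper's
        # key finding is that hash-based SpGEMM is significantly faster
        # when the output rows are allowed to stay unsorted.
        for k in sorted(acc):
            c_indices.append(k)
            c_data.append(acc[k])
        c_indptr.append(len(c_indices))
    return c_indptr, c_indices, c_data
```

A real kernel would additionally need a symbolic phase (or an upper-bound estimate) to preallocate the output, which is where the paper's memory-management optimizations come in.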


Works referenced in this record:

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018

  • Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
  • Nucleic Acids Research, Vol. 46, Issue 6
  • DOI: 10.1093/nar/gkx1313

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
conference, May 2014

  • Liu, Weifeng; Vinter, Brian
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.47

Parallel SimRank computation on large graphs with iterative aggregation
conference, January 2010

  • He, Guoming; Feng, Haijun; Li, Cuiping
  • Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '10
  • DOI: 10.1145/1835804.1835874

Sparse matrix-matrix multiplication on modern architectures
conference, December 2012

  • Matam, Kiran; Krishna Bharadwaj Indarapu, Siva Rama; Kothapalli, Kishore
  • 2012 19th International Conference on High Performance Computing (HiPC)
  • DOI: 10.1109/HiPC.2012.6507483

ViennaCL---Linear Algebra Library for Multi- and Many-Core Architectures
journal, January 2016

  • Rupp, Karl; Tillet, Philippe; Rudolf, Florian
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1026419

Performance-portable sparse matrix-matrix multiplication for many-core architectures
conference, May 2017

  • Deveci, Mehmet; Trott, Christian; Rajamanickam, Sivasankaran
  • 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/IPDPSW.2017.8

GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging
journal, January 2015

  • Gremse, Felix; Höfter, Andreas; Schwen, Lars Ole
  • SIAM Journal on Scientific Computing, Vol. 37, Issue 1
  • DOI: 10.1137/130948811

Solvers for $\mathcal{O} (N)$ Electronic Structure in the Strong Scaling Limit
journal, January 2016

  • Bock, Nicolas; Challacombe, Matt; Kalé, Laxmikant V.
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 1
  • DOI: 10.1137/140974602

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
journal, January 2016

  • Azad, Ariful; Ballard, Grey; Buluç, Aydin
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 6
  • DOI: 10.1137/15M104253X

Parallel Triangle Counting and Enumeration Using Matrix Algebra
conference, May 2015

  • Azad, Ariful; Buluc, Aydin; Gilbert, John
  • 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW)
  • DOI: 10.1109/IPDPSW.2015.75

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978

  • Gustavson, Fred G.
  • ACM Transactions on Mathematical Software, Vol. 4, Issue 3
  • DOI: 10.1145/355791.355796

Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid
journal, January 2016

  • Ballard, Grey; Siefert, Christopher; Hu, Jonathan
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 3
  • DOI: 10.1137/15M1028807

The Combinatorial BLAS: design, implementation, and applications
journal, May 2011

  • Buluç, Aydın; Gilbert, John R.
  • The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
  • DOI: 10.1177/1094342011403516

High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU
conference, August 2017

  • Nagasaka, Yusuke; Nukada, Akira; Matsuoka, Satoshi
  • 2017 46th International Conference on Parallel Processing (ICPP)
  • DOI: 10.1109/ICPP.2017.19

R-MAT: A Recursive Model for Graph Mining
conference, 2004

  • Chakrabarti, Deepayan; Zhan, Yiping; Faloutsos, Christos
  • Proceedings of the 2004 SIAM International Conference on Data Mining
  • DOI: 10.1137/1.9781611972740.43

Benchmarking optimization software with performance profiles
journal, January 2002

  • Dolan, Elizabeth D.; Moré, Jorge J.
  • Mathematical Programming, Vol. 91, Issue 2
  • DOI: 10.1007/s101070100263

Exploiting accelerators for efficient high dimensional similarity search
conference, January 2016

  • Agrawal, Sandeep R.; Dee, Christopher M.; Lebeck, Alvin R.
  • Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '16
  • DOI: 10.1145/2851141.2851144

Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision and Processing on GPUs
journal, August 2017

  • Mueller-Roemer, J. S.; Altenhofen, C.; Stork, A.
  • Computer Graphics Forum, Vol. 36, Issue 5
  • DOI: 10.1111/cgf.13245

Works referencing / citing this record:

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication
journal, January 2019

  • Liu, Junhong; He, Xin; Liu, Weifeng
  • International Journal of Parallel Programming, Vol. 47, Issue 3
  • DOI: 10.1007/s10766-018-0604-8