High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures
Abstract
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. While many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We investigate their performance together with other publicly available codes. Unlike prior evaluations, ours also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table- and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate other scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill these in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario. In conclusion, a critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
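To illustrate the idea behind the hash-table-based approach described in the abstract, the following is a minimal Python sketch of Gustavson-style row-by-row SpGEMM, using a dict as the per-row accumulator. This is not the paper's optimized KNL implementation (which uses carefully sized hash tables, vectorization, and thread scheduling); the matrix format and function name here are illustrative assumptions.

```python
def spgemm_hash(A, B):
    """Gustavson-style row-by-row SpGEMM with a hash-table accumulator.

    A, B: sparse matrices given as lists of rows, each row a dict {col: val}.
    Returns C = A @ B in the same format.
    """
    C = []
    for a_row in A:
        acc = {}  # hash-table accumulator for one output row of C
        for k, a_val in a_row.items():      # nonzero A[i, k]
            for j, b_val in B[k].items():   # nonzero B[k, j]
                # accumulate the partial product into C[i, j]
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        C.append(acc)
    return C

# Example with two 2x2 sparse matrices:
A = [{0: 1.0, 1: 2.0}, {1: 3.0}]
B = [{1: 4.0}, {0: 5.0}]
C = spgemm_hash(A, B)  # C[0] = {1: 4.0, 0: 10.0}, C[1] = {0: 15.0}
```

Note that the dict accumulator leaves each output row unsorted by column index, which connects to the paper's finding: skipping the final per-row sort is exactly where the hash-table variant gains its performance edge.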
 Authors:
 Nagasaka, Yusuke; Matsuoka, Satoshi; Azad, Ariful; Buluç, Aydın
 Tokyo Institute of Technology, Tokyo (Japan)
 RIKEN Center for Computational Science, Kobe (Japan)
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
 Publication Date:
 August 2018
 Research Org.:
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC21)
 OSTI Identifier:
 1454499
 DOE Contract Number:
 AC02-05CH11231
 Resource Type:
 Conference
 Resource Relation:
 Conference: ICPP '18: Proceedings of the 47th International Conference on Parallel Processing Companion, Eugene, OR (United States), 13-16 Aug 2018; Related Information: Also at https://arxiv.org/abs/1804.01698
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING
Citation Formats
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures. United States: N. p., 2018.
Web. doi:10.1145/3229710.3229720.
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, & Buluç, Aydın. High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures. United States. doi:10.1145/3229710.3229720.
Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın. "High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures". United States. doi:10.1145/3229710.3229720. https://www.osti.gov/servlets/purl/1454499.
@article{osti_1454499,
title = {High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures},
author = {Nagasaka, Yusuke and Matsuoka, Satoshi and Azad, Ariful and Buluç, Aydın},
abstractNote = {Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. While many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We investigate their performance together with other publicly available codes. Unlike prior evaluations, ours also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table- and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate other scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill these in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario. In conclusion, a critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.},
doi = {10.1145/3229710.3229720},
place = {United States},
year = {2018},
month = {8}
}
Works referenced in this record:
HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018
 Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
 Nucleic Acids Research, Vol. 46, Issue 6
An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
conference, May 2014
 Liu, Weifeng; Vinter, Brian
 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
Parallel SimRank computation on large graphs with iterative aggregation
conference, January 2010
 He, Guoming; Feng, Haijun; Li, Cuiping
 Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining  KDD '10
Sparse matrix-matrix multiplication on modern architectures
conference, December 2012
 Matam, Kiran; Krishna Bharadwaj Indarapu, Siva Rama; Kothapalli, Kishore
 2012 19th International Conference on High Performance Computing (HiPC)
ViennaCL - Linear Algebra Library for Multi- and Many-Core Architectures
journal, January 2016
 Rupp, Karl; Tillet, Philippe; Rudolf, Florian
 SIAM Journal on Scientific Computing, Vol. 38, Issue 5
Performance-portable sparse matrix-matrix multiplication for many-core architectures
conference, May 2017
 Deveci, Mehmet; Trott, Christian; Rajamanickam, Sivasankaran
 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW), 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging
journal, January 2015
 Gremse, Felix; Höfter, Andreas; Schwen, Lars Ole
 SIAM Journal on Scientific Computing, Vol. 37, Issue 1
Solvers for $\mathcal{O} (N)$ Electronic Structure in the Strong Scaling Limit
journal, January 2016
 Bock, Nicolas; Challacombe, Matt; Kalé, Laxmikant V.
 SIAM Journal on Scientific Computing, Vol. 38, Issue 1
Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
journal, January 2016
 Azad, Ariful; Ballard, Grey; Buluç, Aydin
 SIAM Journal on Scientific Computing, Vol. 38, Issue 6
Parallel Triangle Counting and Enumeration Using Matrix Algebra
conference, May 2015
 Azad, Ariful; Buluc, Aydin; Gilbert, John
 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW)
Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978
 Gustavson, Fred G.
 ACM Transactions on Mathematical Software, Vol. 4, Issue 3
Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid
journal, January 2016
 Ballard, Grey; Siefert, Christopher; Hu, Jonathan
 SIAM Journal on Scientific Computing, Vol. 38, Issue 3
The Combinatorial BLAS: design, implementation, and applications
journal, May 2011
 Buluç, Aydın; Gilbert, John R.
 The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU
conference, August 2017
 Nagasaka, Yusuke; Nukada, Akira; Matsuoka, Satoshi
 2017 46th International Conference on Parallel Processing (ICPP)
R-MAT: A Recursive Model for Graph Mining
conference, December 2013
 Chakrabarti, Deepayan; Zhan, Yiping; Faloutsos, Christos
 Proceedings of the 2004 SIAM International Conference on Data Mining
Benchmarking optimization software with performance profiles
journal, January 2002
 Dolan, Elizabeth D.; Moré, Jorge J.
 Mathematical Programming, Vol. 91, Issue 2
Exploiting accelerators for efficient high dimensional similarity search
conference, January 2016
 Agrawal, Sandeep R.; Dee, Christopher M.; Lebeck, Alvin R.
 Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming  PPoPP '16
Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision and Processing on GPUs
journal, August 2017
 Mueller-Roemer, J. S.; Altenhofen, C.; Stork, A.
 Computer Graphics Forum, Vol. 36, Issue 5
Works referencing / citing this record:
Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication
journal, January 2019
 Liu, Junhong; He, Xin; Liu, Weifeng
 International Journal of Parallel Programming, Vol. 47, Issue 3