Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Azad, Ariful; Ballard, Grey; Buluç, Aydin; Demmel, James; Grigori, Laura; Schwartz, Oded; Toledo, Sivan; Williams, Samuel

doi:10.1137/15M104253X

Abstract

Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.

Authors:

Azad, Ariful [1]; Ballard, Grey [2]; Buluç, Aydin [1]; Demmel, James [3]; Grigori, Laura [4]; Schwartz, Oded [5]; Toledo, Sivan [6]; Williams, Samuel [1]

[1] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[2] Wake Forest Univ., Winston Salem, NC (United States)
[3] Univ. of California, Berkeley, CA (United States)
[4] French Inst. for Research in Computer Science and Automation (INRIA), Paris (France)
[5] Hebrew Univ. of Jerusalem (Israel)
[6] Tel Aviv Univ., Ramat Aviv (Israel)

Publication Date: 2016-11-08

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1512883

Alternate Identifier(s): OSTI ID: 1378775

Report Number(s): SAND2015-8837J
Journal ID: ISSN 1064-8275; 664883

Grant/Contract Number: AC04-94AL85000; AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: SIAM Journal on Scientific Computing

Additional Journal Information: Journal Volume: 38; Journal Issue: 6; Journal ID: ISSN 1064-8275

Publisher: SIAM

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; parallel computing; numerical linear algebra; sparse matrix-matrix multiplication; 2.5D algorithms; 3D algorithms; multithreading; SpGEMM; 2D decomposition; graph algorithms

Citation Formats


Azad, Ariful, Ballard, Grey, Buluç, Aydin, Demmel, James, Grigori, Laura, Schwartz, Oded, Toledo, Sivan, and Williams, Samuel. Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication. United States: N. p., 2016. Web. doi:10.1137/15M104253X.

Azad, Ariful, Ballard, Grey, Buluç, Aydin, Demmel, James, Grigori, Laura, Schwartz, Oded, Toledo, Sivan, & Williams, Samuel. Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication. United States. https://doi.org/10.1137/15M104253X

Azad, Ariful, Ballard, Grey, Buluç, Aydin, Demmel, James, Grigori, Laura, Schwartz, Oded, Toledo, Sivan, and Williams, Samuel. 2016. "Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication". United States. https://doi.org/10.1137/15M104253X. https://www.osti.gov/servlets/purl/1512883.

@article{osti_1512883,
  title        = {Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication},
  author       = {Azad, Ariful and Ballard, Grey and Buluç, Aydin and Demmel, James and Grigori, Laura and Schwartz, Oded and Toledo, Sivan and Williams, Samuel},
  abstractNote = {Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.},
  doi          = {10.1137/15M104253X},
  journal      = {SIAM Journal on Scientific Computing},
  number       = 6,
  volume       = 38,
  place        = {United States},
  year         = {2016},
  month        = {11}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record: https://doi.org/10.1137/15M104253X

Citation Metrics:

Cited by: 53 works

Citation information provided by Web of Science

Figures / Tables:

Algorithm 1: Column-wise formulation of serial matrix multiplication

All figures and tables (25 total)
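
The caption of Algorithm 1 above names a column-wise formulation of serial matrix multiplication. The record does not reproduce the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' code: each column of C = A B is built as a linear combination of the columns of A selected by the nonzeros in the matching column of B. The dict-of-columns storage, the function name `spgemm_columnwise`, and the small test matrices are all assumptions made for illustration.

```python
from collections import defaultdict


def spgemm_columnwise(A_cols, B_cols):
    """Multiply sparse matrices stored as {column: {row: value}} (hypothetical layout).

    Column j of C is the sum over nonzeros B[k, j] of B[k, j] * A[:, k].
    """
    C_cols = {}
    for j, b_col in B_cols.items():
        acc = defaultdict(float)  # sparse accumulator for column j of C
        for k, b_kj in b_col.items():
            # Scale column k of A by B[k, j] and merge it into the accumulator.
            for i, a_ik in A_cols.get(k, {}).items():
                acc[i] += a_ik * b_kj
        if acc:
            C_cols[j] = dict(acc)
    return C_cols


# A = [[1, 0], [2, 3]] and B = [[0, 5], [4, 6]], stored column by column.
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0, 1: 6.0}}
C = spgemm_columnwise(A, B)
print(C)
```

Only nonzero entries are visited, so the work scales with the number of scalar multiplications rather than with the dense matrix dimensions. A production code would use a compressed column format rather than nested dictionaries.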

Works referenced in this record:

Simultaneous Input and Output Matrix Partitioning for Outer-Product–Parallel Sparse Matrix-Matrix Multiplication
journal, January 2014

Akbudak, Kadir; Aykanat, Cevdet
SIAM Journal on Scientific Computing, Vol. 36, Issue 5
DOI: 10.1137/13092589X

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods
journal, January 2012

Bell, Nathan; Dalton, Steven; Olson, Luke N.
SIAM Journal on Scientific Computing, Vol. 34, Issue 4
DOI: 10.1137/110838844

An Optimized Sparse Approximate Matrix Multiply for Matrices with Decay
journal, January 2013

Bock, Nicolas; Challacombe, Matt
SIAM Journal on Scientific Computing, Vol. 35, Issue 1
DOI: 10.1137/120870761

Sparse matrix multiplication: The distributed block-compressed sparse row library
journal, May 2014

Borštnik, Urban; VandeVondele, Joost; Weber, Valéry
Parallel Computing, Vol. 40, Issue 5-6
DOI: 10.1016/j.parco.2014.03.012

The Combinatorial BLAS: design, implementation, and applications
journal, May 2011

Buluç, Aydın; Gilbert, John R.
The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
DOI: 10.1177/1094342011403516

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
journal, January 2012

Buluç, Aydin; Gilbert, John R.
SIAM Journal on Scientific Computing, Vol. 34, Issue 4
DOI: 10.1137/110848244

Collective communication: theory, practice, and experience
journal, January 2007

Chan, Ernie; Heimlich, Marcel; Purkayastha, Avi
Concurrency and Computation: Practice and Experience, Vol. 19, Issue 13
DOI: 10.1002/cpe.1206

Optimizing Sparse Matrix—Matrix Multiplication for the GPU
journal, October 2015

Dalton, Steven; Olson, Luke; Bell, Nathan
ACM Transactions on Mathematical Software, Vol. 41, Issue 4
DOI: 10.1145/2699470

Parallel Matrix and Graph Algorithms
journal, November 1981

Dekel, Eliezer; Nassimi, David; Sahni, Sartaj
SIAM Journal on Computing, Vol. 10, Issue 4
DOI: 10.1137/0210049

Sparse Matrices in MATLAB: Design and Implementation
journal, January 1992

Gilbert, John R.; Moler, Cleve; Schreiber, Robert
SIAM Journal on Matrix Analysis and Applications, Vol. 13, Issue 1
DOI: 10.1137/0613024

A Unified Framework for Numerical and Combinatorial Computing
journal, March 2008

Gilbert, John R.; Reinhardt, Steve; Shah, Viral B.
Computing in Science & Engineering, Vol. 10, Issue 2
DOI: 10.1109/MCSE.2008.45

GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging
journal, January 2015

Gremse, Felix; Höfter, Andreas; Schwen, Lars Ole
SIAM Journal on Scientific Computing, Vol. 37, Issue 1
DOI: 10.1137/130948811

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978

Gustavson, Fred G.
ACM Transactions on Mathematical Software, Vol. 4, Issue 3
DOI: 10.1145/355791.355796

An overview of the Trilinos project
journal, September 2005

Heroux, Michael A.; Phipps, Eric T.; Salinger, Andrew G.
ACM Transactions on Mathematical Software, Vol. 31, Issue 3
DOI: 10.1145/1089014.1089021

Communication lower bounds for distributed-memory matrix multiplication
journal, September 2004

Irony, Dror; Toledo, Sivan; Tiskin, Alexander
Journal of Parallel and Distributed Computing, Vol. 64, Issue 9
DOI: 10.1016/j.jpdc.2004.03.021

Exascale Computing Trends: Adjusting to the "New Normal" for Computer Architecture
journal, November 2013

Kogge, Peter; Shalf, John
Computing in Science & Engineering, Vol. 15, Issue 6
DOI: 10.1109/MCSE.2013.95

Density Functional and Density Matrix Method Scaling Linearly with the Number of Atoms
journal, April 1996

Kohn, W.
Physical Review Letters, Vol. 76, Issue 17
DOI: 10.1103/PhysRevLett.76.3168

Analyzing Scalability of Parallel Algorithms and Architectures
journal, September 1994

Kumar, V. P.; Gupta, A.
Journal of Parallel and Distributed Computing, Vol. 22, Issue 3
DOI: 10.1006/jpdc.1994.1099

Towards Extreme-Scale Simulations for Low Mach Fluids with Second-Generation Trilinos
journal, December 2014

Lin, Paul; Bettencourt, Matthew; Domino, Stefan
Parallel Processing Letters, Vol. 24, Issue 04
DOI: 10.1142/S0129626414420055

A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors
journal, November 2015

Liu, Weifeng; Vinter, Brian
Journal of Parallel and Distributed Computing, Vol. 85
DOI: 10.1016/j.jpdc.2015.06.010

A Simple Parallel Algorithm for the Maximal Independent Set Problem
journal, November 1986

Luby, Michael
SIAM Journal on Computing, Vol. 15, Issue 4
DOI: 10.1137/0215074

Parallel processing of filtered queries in attributed semantic graphs
journal, May 2015

Lugowski, Adam; Kamil, Shoaib; Buluç, Aydın
Journal of Parallel and Distributed Computing, Vol. 79-80
DOI: 10.1016/j.jpdc.2014.08.010

Sparse Matrix-Matrix Products Executed Through Coloring
journal, January 2015

McCourt, Michael; Smith, Barry; Zhang, Hong
SIAM Journal on Matrix Analysis and Applications, Vol. 36, Issue 1
DOI: 10.1137/13093426X

Works referencing / citing this record:

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication
journal, January 2019

Liu, Junhong; He, Xin; Liu, Weifeng
International Journal of Parallel Programming, Vol. 47, Issue 3
DOI: 10.1007/s10766-018-0604-8

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
posted_content, March 2020

Guidi, Giulia; Ellis, Marquita; Rokhsar, Daniel
DOI: 10.1101/464420

Numerical algorithms for high-performance computational science
journal, January 2020

Dongarra, Jack; Grigori, Laura; Higham, Nicholas J.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
DOI: 10.1098/rsta.2019.0066

The parallelism motifs of genomic data analysis
journal, January 2020

Yelick, Katherine; Buluç, Aydın; Awan, Muaaz
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
DOI: 10.1098/rsta.2019.0394

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
book, January 2021

Guidi, Giulia; Ellis, Marquita; Rokhsar, Daniel
SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21)
DOI: 10.1137/1.9781611976830.12

Similar Records in DOE PAGES and OSTI.GOV collections:

Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors

Journal Article: Nagasaka, Yusuke; Matsuoka, Satoshi; Azad, Ariful; ... - Parallel Computing

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. In this work, we firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. ...
Cited by 16
https://doi.org/10.1016/j.parco.2019.102545
Full Text Available

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures

Conference: Nagasaka, Yusuke; Matsuoka, Satoshi; Azad, Ariful; ...

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is vastly used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. While many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We ...
https://doi.org/10.1145/3229710.3229720
Full Text Available

Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication

Conference: Koanantakool, Penporn; Azad, Ariful; Buluc, Aydin; ... - Proceedings - 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only both dense or both sparse matrix operands. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which - while suboptimal in dense-dense and sparse-sparse cases - outperform the 2D and 3D variants both ...
https://doi.org/10.1109/ipdps.2016.117
Full Text Available

A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm

Journal Article: Azad, Ariful; Buluc, Aydin - Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)

We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. ...
Cited by 21
https://doi.org/10.1109/IPDPS.2017.76
Full Text Available

Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid

Technical Report: Ballard, Grey Malone; Hu, Jonathan Joseph; Siefert, Christopher

We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient one for all of the matrix-matrix multiplications involved. By using an alternative algorithm, we show that the communication costs are reduced (in theory and practice), and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that we can reduce communication by a factor of up to 5.4 for a model problem, and we ...
https://doi.org/10.2172/1504845
Full Text Available
