DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Abstract

Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
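The core idea behind 3D (or 2.5D) formulations can be illustrated, loosely, by splitting the inner dimension k of C = AB across several "layers": each layer multiplies its slice of A's columns by the matching slice of B's rows, and the partial products are then summed. The sketch below is a sequential Python simulation under stated assumptions, not the authors' implementation: the names `local_spgemm`, `layered_spgemm`, and `num_layers`, and the dict-of-columns storage, are hypothetical choices made for illustration. In a real distributed code each layer would itself run on a 2D process grid and the final summation would be a communication step across layers.

```python
from typing import Dict, List

# Hypothetical stand-in for compressed sparse column storage:
# column index -> {row index -> value}.
SparseCols = Dict[int, Dict[int, float]]


def local_spgemm(A: SparseCols, B: SparseCols) -> SparseCols:
    """Serial column-wise kernel: C(:,j) = sum over k of A(:,k) * B(k,j)."""
    C: SparseCols = {}
    for j, b_col in B.items():
        acc: Dict[int, float] = {}  # sparse accumulator for column j of C
        for k, b_kj in b_col.items():
            for i, a_ik in A.get(k, {}).items():
                acc[i] = acc.get(i, 0.0) + a_ik * b_kj
        if acc:
            C[j] = acc
    return C


def layered_spgemm(A: SparseCols, B: SparseCols,
                   inner_dim: int, num_layers: int) -> SparseCols:
    """Simulate a 3D-style split of the inner dimension k into num_layers layers."""
    # Contiguous, roughly equal ranges of k, one range per layer.
    bounds = [inner_dim * l // num_layers for l in range(num_layers + 1)]
    partials: List[SparseCols] = []
    for l in range(num_layers):
        lo, hi = bounds[l], bounds[l + 1]
        # Layer l owns columns lo..hi-1 of A and rows lo..hi-1 of B.
        A_l = {k: col for k, col in A.items() if lo <= k < hi}
        B_l = {j: {k: v for k, v in col.items() if lo <= k < hi}
               for j, col in B.items()}
        # In a distributed setting these per-layer products run concurrently.
        partials.append(local_spgemm(A_l, B_l))
    # Reduction step: sum the partial products from all layers.
    C: SparseCols = {}
    for P in partials:
        for j, col in P.items():
            dst = C.setdefault(j, {})
            for i, v in col.items():
                dst[i] = dst.get(i, 0.0) + v
    return C
```

For any number of layers the result should match the single-layer product; the sketch only demonstrates that the layered decomposition is correct, not how its communication is scheduled.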

Authors:
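Azad, Ariful [1]; Ballard, Grey [2]; Buluç, Aydin [1]; Demmel, James [3]; Grigori, Laura [4]; Schwartz, Oded [5]; Toledo, Sivan [6]; Williams, Samuel [1]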
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Wake Forest Univ., Winston Salem, NC (United States)
  3. Univ. of California, Berkeley, CA (United States)
  4. French Inst. for Research in Computer Science and Automation (INRIA), Paris (France)
  5. Hebrew Univ. of Jerusalem (Israel)
  6. Tel Aviv Univ., Ramat Aviv (Israel)
Publication Date:
November 8, 2016
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1512883
Alternate Identifier(s):
OSTI ID: 1378775
Report Number(s):
SAND2015-8837J
Journal ID: ISSN 1064-8275; 664883
Grant/Contract Number:  
AC04-94AL85000; AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
SIAM Journal on Scientific Computing
Additional Journal Information:
Journal Volume: 38; Journal Issue: 6; Journal ID: ISSN 1064-8275
Publisher:
SIAM
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; parallel computing; numerical linear algebra; sparse matrix-matrix multiplication; 2.5D algorithms; 3D algorithms; multithreading; SpGEMM; 2D decomposition; graph algorithms

Citation Formats

Azad, Ariful, Ballard, Grey, Buluç, Aydin, Demmel, James, Grigori, Laura, Schwartz, Oded, Toledo, Sivan, and Williams, Samuel. Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication. United States: N. p., 2016. Web. doi:10.1137/15M104253X.
Azad, Ariful, Ballard, Grey, Buluç, Aydin, Demmel, James, Grigori, Laura, Schwartz, Oded, Toledo, Sivan, & Williams, Samuel. Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication. United States. https://doi.org/10.1137/15M104253X
Azad, Ariful, Ballard, Grey, Buluç, Aydin, Demmel, James, Grigori, Laura, Schwartz, Oded, Toledo, Sivan, and Williams, Samuel. 2016. "Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication". United States. https://doi.org/10.1137/15M104253X. https://www.osti.gov/servlets/purl/1512883.
@article{osti_1512883,
title = {Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication},
author = {Azad, Ariful and Ballard, Grey and Buluç, Aydin and Demmel, James and Grigori, Laura and Schwartz, Oded and Toledo, Sivan and Williams, Samuel},
abstractNote = {Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.},
doi = {10.1137/15M104253X},
journal = {SIAM Journal on Scientific Computing},
number = 6,
volume = 38,
place = {United States},
year = {2016},
month = nov
}

Citation Metrics:
Cited by: 53 works
Citation information provided by
Web of Science

Figures / Tables:

Algorithm 1: Column-wise formulation of serial matrix multiplication
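A column-wise formulation builds each column j of C = AB as a linear combination of the columns of A selected by the nonzeros of B(:,j), in the style of Gustavson's classic sparse algorithm. The following is a minimal sketch of that idea, not a reproduction of the paper's Algorithm 1: the function name `spgemm_columnwise` and the dict-of-columns layout (a stand-in for compressed sparse column storage) are hypothetical, and numerical cancellations are kept as explicit zeros.

```python
from typing import Dict

# Hypothetical stand-in for compressed sparse column storage:
# column index -> {row index -> value}.
SparseCols = Dict[int, Dict[int, float]]


def spgemm_columnwise(A: SparseCols, B: SparseCols) -> SparseCols:
    """Compute C = A @ B one column at a time: C(:,j) = sum over k of A(:,k) * B(k,j)."""
    C: SparseCols = {}
    for j, b_col in B.items():
        spa: Dict[int, float] = {}  # sparse accumulator for column j of C
        for k, b_kj in b_col.items():
            # Scale column k of A by B(k,j) and merge it into the accumulator.
            for i, a_ik in A.get(k, {}).items():
                spa[i] = spa.get(i, 0.0) + a_ik * b_kj
        if spa:  # store only nonempty columns
            C[j] = spa
    return C


if __name__ == "__main__":
    # A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 0]] in dict-of-columns form.
    A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
    B = {0: {1: 5.0}, 1: {0: 4.0}}
    print(spgemm_columnwise(A, B))  # {0: {1: 15.0}, 1: {0: 4.0, 1: 8.0}}
```

Because each column of C depends only on B(:,j) and on A, the outer loop's iterations are independent, which is one natural place to introduce intranode (multithreaded) parallelism.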
