Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

Hussain, Md Taufique; Selvitopi, Oguz; Buluc, Aydin; Azad, Ariful

doi:10.1109/ipdps49936.2021.00018

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating trillions of nonzeros in the output matrix. Distributed SpGEMM at this extreme scale faces two key challenges: (1) high communication cost and (2) inadequate memory to generate the output. We address these challenges with an integrated communication-avoiding and memory-constrained SpGEMM algorithm that scales to 262,144 cores (more than 1 million hardware threads) and can multiply sparse matrices of any size as long as the inputs and a fraction of the output fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when multiplying large-scale protein-similarity matrices.
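
The memory constraint described in the abstract can be illustrated with a small single-process sketch. This is not the authors' distributed algorithm: it is a simplified example, assuming a row-wise (Gustavson-style) multiplication over plain Python dictionaries, in which the output is produced in column batches so that only one batch of the output is held at a time. The names `spgemm_batched`, `num_batches`, and `consume` are hypothetical and chosen for this example only; communication between processes is not modeled.

```python
from collections import defaultdict


def spgemm_batched(A, B, n_cols, num_batches, consume):
    """Multiply sparse A and B, producing the output in column batches.

    A and B are dicts mapping row index -> {column index: value}.
    n_cols is the number of columns of B (and of the output).
    consume(batch_index, C_batch) is called once per batch; afterwards
    the batch is dropped, so only one batch of output is alive at a time.
    """
    batch_width = -(-n_cols // num_batches)  # ceiling division
    for b in range(num_batches):
        lo = b * batch_width
        hi = min((b + 1) * batch_width, n_cols)
        C_batch = {}
        for i, a_row in A.items():
            acc = defaultdict(float)  # sparse accumulator for output row i
            for k, a_ik in a_row.items():
                for j, b_kj in B.get(k, {}).items():
                    if lo <= j < hi:  # keep only columns of this batch
                        acc[j] += a_ik * b_kj
            if acc:
                C_batch[i] = dict(acc)
        consume(b, C_batch)


# Toy usage: a 2x2 matrix times a 2x3 matrix, output split into 2 batches.
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {0: 4.0, 2: 5.0}, 1: {1: 6.0, 2: 7.0}}

result = {}
batch_sizes = []


def collect(batch_index, C_batch):
    # A real consumer might prune or write out the batch instead of keeping it.
    batch_sizes.append(sum(len(row) for row in C_batch.values()))
    for i, row in C_batch.items():
        result.setdefault(i, {}).update(row)


spgemm_batched(A, B, n_cols=3, num_batches=2, consume=collect)
print(result)
```

In this sketch each batch re-reads the inputs, so using more batches lowers the peak output memory at the cost of repeated passes over A and B. That trade-off is a property of this toy example and is not a statement about the paper's measured behavior.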

Authors:

Hussain, Md Taufique [1]; Selvitopi, Oguz [2]; Buluc, Aydin [2]; Azad, Ariful [1]

[1] Indiana Univ., Bloomington, IN (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Publication Date: May 17, 2021

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA)

OSTI Identifier: 1817306

Grant/Contract Number: AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Additional Journal Information: Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2021; Conference: 2021 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), Portland, OR (United States), 17-21 May 2021; Journal ID: ISSN 1530-2075

Publisher: IEEE

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; proteins; three-dimensional displays; social networking; scientific computing; memory management; genomics; parallel processing; graph theory; mathematics computing; matrix algebra; matrix multiplication; multiprocessing systems; parallel machines; resource allocations; sparse matrices

Citation Formats


Hussain, Md Taufique, Selvitopi, Oguz, Buluc, Aydin, and Azad, Ariful. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. United States: N. p., 2021. Web. doi:10.1109/ipdps49936.2021.00018.



Hussain, Md Taufique, Selvitopi, Oguz, Buluc, Aydin, & Azad, Ariful. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. United States. https://doi.org/10.1109/ipdps49936.2021.00018



Hussain, Md Taufique, Selvitopi, Oguz, Buluc, Aydin, and Azad, Ariful. 2021. "Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale". United States. https://doi.org/10.1109/ipdps49936.2021.00018. https://www.osti.gov/servlets/purl/1817306.



                    
@article{osti_1817306,
  title        = {Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale},
  author       = {Hussain, Md Taufique and Selvitopi, Oguz and Buluc, Aydin and Azad, Ariful},
  abstractNote = {Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating trillions of nonzeros in the output matrix. Distributed SpGEMM at this extreme scale faces two key challenges: (1) high communication cost and (2) inadequate memory to generate the output. We address these challenges with an integrated communication-avoiding and memory-constrained SpGEMM algorithm that scales to 262,144 cores (more than 1 million hardware threads) and can multiply sparse matrices of any size as long as the inputs and a fraction of the output fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when multiplying large-scale protein-similarity matrices.},
  doi          = {10.1109/ipdps49936.2021.00018},
  journal      = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  volume       = {2021},
  place        = {United States},
  year         = {2021},
  month        = {5}
}

