DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in graph, scientific computing, and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating trillions of nonzeros in the output matrix. Distributed SpGEMM at this extreme scale faces two key challenges: (1) high communication cost and (2) inadequate memory to generate the output. We address these challenges with an integrated communication-avoiding and memory-constrained SpGEMM algorithm that scales to 262,144 cores (more than 1 million hardware threads) and can multiply sparse matrices of any size as long as the inputs and a fraction of the output fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when multiplying large-scale protein-similarity matrices.
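
The memory-constrained idea in the abstract (only a fraction of the output needs to fit in memory at once) can be illustrated with a minimal single-process sketch. This is not the authors' distributed algorithm: it involves no communication, and the function name `spgemm_column_batches`, the dict-of-columns storage, and the `batch_size` parameter are assumptions made only for illustration. It forms C = A·B a few output columns at a time, so each batch can be consumed (for example pruned or written out) before the next one is produced.

```python
def spgemm_column_batches(a_cols, b_cols, n_cols, batch_size):
    """Yield (start, batch) pairs for C = A @ B, one column batch at a time.

    a_cols and b_cols map a column index to {row index: value} (hypothetical
    layout chosen for this sketch). Each yielded batch maps an output column j
    to {row index: value} and holds at most batch_size columns of C.
    """
    for start in range(0, n_cols, batch_size):
        batch = {}
        for j in range(start, min(start + batch_size, n_cols)):
            acc = {}  # sparse accumulator for column j of C
            # Column j of C is a linear combination of columns of A,
            # weighted by the nonzeros B[k, j].
            for k, b_kj in b_cols.get(j, {}).items():
                for i, a_ik in a_cols.get(k, {}).items():
                    acc[i] = acc.get(i, 0) + a_ik * b_kj
            if acc:
                batch[j] = acc
        yield start, batch


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], stored column by column.
a_cols = {0: {0: 1}, 1: {0: 2, 1: 3}}
b_cols = {0: {0: 4, 1: 5}, 1: {1: 6}}

# With batch_size=1, only one output column is materialized at a time.
batches = list(spgemm_column_batches(a_cols, b_cols, n_cols=2, batch_size=1))
```

Here `batches` holds the two columns of C = [[14, 12], [15, 18]] as separate batches. In a real setting each batch would be processed and discarded inside the loop rather than collected into a list.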

Authors:
 Hussain, Md Taufique [1]; Selvitopi, Oguz [2]; Buluc, Aydin [2]; Azad, Ariful [1]
  1. Indiana Univ., Bloomington, IN (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2021-05-17
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1817306
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Additional Journal Information:
Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2021; Conference: 2021 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), Portland, OR (United States), 17-21 May 2021; Journal ID: ISSN 1530-2075
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; proteins; three-dimensional displays; social networking; scientific computing; memory management; genomics; parallel processing; graph theory; mathematics computing; matrix algebra; matrix multiplication; multiprocessing systems; parallel machines; resource allocations; sparse matrices

Citation Formats

Hussain, Md Taufique, Selvitopi, Oguz, Buluc, Aydin, and Azad, Ariful. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. United States: N. p., 2021. Web. doi:10.1109/ipdps49936.2021.00018.
Hussain, Md Taufique, Selvitopi, Oguz, Buluc, Aydin, & Azad, Ariful. Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale. United States. https://doi.org/10.1109/ipdps49936.2021.00018
Hussain, Md Taufique, Selvitopi, Oguz, Buluc, Aydin, and Azad, Ariful. 2021. "Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale". United States. https://doi.org/10.1109/ipdps49936.2021.00018. https://www.osti.gov/servlets/purl/1817306.
@article{osti_1817306,
title = {Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale},
author = {Hussain, Md Taufique and Selvitopi, Oguz and Buluc, Aydin and Azad, Ariful},
abstractNote = {Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in graph, scientific computing, and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating trillions of nonzeros in the output matrix. Distributed SpGEMM at this extreme scale faces two key challenges: (1) high communication cost and (2) inadequate memory to generate the output. We address these challenges with an integrated communication-avoiding and memory-constrained SpGEMM algorithm that scales to 262,144 cores (more than 1 million hardware threads) and can multiply sparse matrices of any size as long as the inputs and a fraction of the output fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when multiplying large-scale protein-similarity matrices.},
doi = {10.1109/ipdps49936.2021.00018},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
volume = {2021},
place = {United States},
year = {2021},
month = {May}
}

Works referenced in this record:

Sparse Matrix-Matrix Products Executed Through Coloring
journal, January 2015

  • McCourt, Michael; Smith, Barry; Zhang, Hong
  • SIAM Journal on Matrix Analysis and Applications, Vol. 36, Issue 1
  • DOI: 10.1137/13093426X

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks
journal, January 2018

  • Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
  • Nucleic Acids Research, Vol. 46, Issue 6
  • DOI: 10.1093/nar/gkx1313

Parallel SimRank computation on large graphs with iterative aggregation
conference, January 2010

  • He, Guoming; Feng, Haijun; Li, Cuiping
  • Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '10
  • DOI: 10.1145/1835804.1835874

SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
conference, February 2020

  • Qin, Eric; Samajdar, Ananda; Kwon, Hyoukjun
  • 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • DOI: 10.1109/HPCA47549.2020.00015

Sparse Matrices in MATLAB: Design and Implementation
journal, January 1992

  • Gilbert, John R.; Moler, Cleve; Schreiber, Robert
  • SIAM Journal on Matrix Analysis and Applications, Vol. 13, Issue 1
  • DOI: 10.1137/0613024

Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication
journal, July 1999

  • Catalyurek, U. V.; Aykanat, C.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 10, Issue 7
  • DOI: 10.1109/71.780863

Bandwidth Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking
conference, July 2020

  • Gu, Zhixiang; Moreira, Jose; Edelsohn, David
  • SPAA '20: 32nd ACM Symposium on Parallelism in Algorithms and Architectures, Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures
  • DOI: 10.1145/3350755.3400216

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication
journal, April 2018

  • Akbudak, Kadir; Selvitopi, Oguz; Aykanat, Cevdet
  • ACM Transactions on Parallel Computing, Vol. 4, Issue 3
  • DOI: 10.1145/3155292

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
journal, January 2016

  • Azad, Ariful; Ballard, Grey; Buluç, Aydin
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 6
  • DOI: 10.1137/15M104253X

Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors
journal, December 2019


Parallel Triangle Counting and Enumeration Using Matrix Algebra
conference, May 2015

  • Azad, Ariful; Buluc, Aydin; Gilbert, John
  • 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW)
  • DOI: 10.1109/IPDPSW.2015.75

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition
journal, September 1978

  • Gustavson, Fred G.
  • ACM Transactions on Mathematical Software, Vol. 4, Issue 3
  • DOI: 10.1145/355791.355796

Sparse matrix multiplication: The distributed block-compressed sparse row library
journal, May 2014


The university of Florida sparse matrix collection
journal, November 2011

  • Davis, Timothy A.; Hu, Yifan
  • ACM Transactions on Mathematical Software, Vol. 38, Issue 1
  • DOI: 10.1145/2049662.2049663

Matrix Algebra Framework for Portable, Scalable and Efficient Query Engines for RDF Graphs
conference, March 2019

  • Jamour, Fuad; Abdelaziz, Ibrahim; Chen, Yuanzhao
  • EuroSys '19: Fourteenth EuroSys Conference 2019, Proceedings of the Fourteenth EuroSys Conference 2019
  • DOI: 10.1145/3302424.3303962

Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication
journal, December 2016

  • Ballard, Grey; Druinsky, Alex; Knight, Nicholas
  • ACM Transactions on Parallel Computing, Vol. 3, Issue 3
  • DOI: 10.1145/3015144

Multilevel hypergraph partitioning: applications in VLSI domain
journal, March 1999

  • Karypis, G.; Aggarwal, R.; Kumar, V.
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, Issue 1
  • DOI: 10.1109/92.748202

Parallel hypergraph partitioning for scientific computing
conference, January 2006

  • Devine, K. D.; Boman, E. G.; Heaphy, R. T.
  • Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2006.1639359

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
journal, January 2012

  • Buluç, Aydin; Gilbert, John R.
  • SIAM Journal on Scientific Computing, Vol. 34, Issue 4
  • DOI: 10.1137/110848244

Memory-Efficient Sparse Matrix-Matrix Multiplication by Row Merging on Many-Core Architectures
journal, January 2018

  • Gremse, Felix; Küpper, Kerstin; Naumann, Uwe
  • SIAM Journal on Scientific Computing, Vol. 40, Issue 4
  • DOI: 10.1137/17M1121378

The parallelism motifs of genomic data analysis
journal, January 2020

  • Yelick, Katherine; Buluç, Aydın; Awan, Muaaz
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
  • DOI: 10.1098/rsta.2019.0394

Performance-portable sparse matrix-matrix multiplication for many-core architectures
conference, May 2017

  • Deveci, Mehmet; Trott, Christian; Rajamanickam, Sivasankaran
  • 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW), 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/IPDPSW.2017.8

Scaling betweenness centrality using communication-efficient sparse matrix multiplication
conference, November 2017

  • Solomonik, Edgar; Besta, Maciej; Vella, Flavio
  • SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/3126908.3126971


High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU
conference, August 2017

  • Nagasaka, Yusuke; Nukada, Akira; Matsuoka, Satoshi
  • 2017 46th International Conference on Parallel Processing (ICPP)
  • DOI: 10.1109/icpp.2017.19

Works referencing / citing this record:

The parallelism motifs of genomic data analysis
journal, January 2020

  • Yelick, Katherine; Buluç, Aydın; Awan, Muaaz
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
  • DOI: 10.1098/rsta.2019.0394