A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm
Abstract
We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns. Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types including matrices representing scale-free and high-diameter graphs.
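To make the setting concrete, the following is a minimal sequential sketch of SpMSpV over a CSC (compressed sparse column) matrix. The function name, the dict-based sparse vector, and the accumulator are illustrative assumptions; the paper's parallel, work-efficient algorithm (which avoids having each thread scan the full column list) is not reproduced here. The sketch only shows why the total work can be proportional to the required arithmetic: only columns matching nonzeros of the input vector are ever touched.

```python
# Illustrative sequential SpMSpV sketch; not the paper's parallel algorithm.
# A is m-by-n in CSC form: column j's nonzeros occupy positions
# colptr[j]..colptr[j+1]-1 of rowidx (row indices) and vals (values).
def spmspv(colptr, rowidx, vals, x):
    """Compute y = A*x where x is a sparse vector given as {col: value}.

    Work is proportional to the arithmetic required: only columns j with
    x[j] != 0 are visited, and within each such column only its stored
    nonzeros are traversed.
    """
    y = {}  # sparse accumulator for the (sparse) output vector
    for j, xj in x.items():
        for p in range(colptr[j], colptr[j + 1]):
            i = rowidx[p]
            y[i] = y.get(i, 0.0) + vals[p] * xj
    return y

# Tiny example: A = [[1, 0], [2, 3]] in CSC, x has a single nonzero at index 0.
colptr = [0, 2, 3]
rowidx = [0, 1, 1]
vals = [1.0, 2.0, 3.0]
print(spmspv(colptr, rowidx, vals, {0: 10.0}))  # {0: 10.0, 1: 20.0}
```

The paper's contribution is parallelizing this pattern without losing work efficiency, e.g. without every thread scanning all active columns to find its share of the work.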
- Authors:
- Azad, Ariful; Buluc, Aydin
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Publication Date:
- July 3, 2017
- Research Org.:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1525227
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Additional Journal Information:
- Journal Name: Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2017; Conference: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Orlando, FL (United States), 29 May - 2 Jun 2017; Journal ID: ISSN 1530-2075
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING
Citation Formats
Azad, Ariful, and Buluc, Aydin. A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm. United States: N. p., 2017.
Web. doi:10.1109/IPDPS.2017.76.
Azad, Ariful, & Buluc, Aydin. A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm. United States. https://doi.org/10.1109/IPDPS.2017.76
Azad, Ariful, and Buluc, Aydin. 2017.
"A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm". United States. https://doi.org/10.1109/IPDPS.2017.76. https://www.osti.gov/servlets/purl/1525227.
@article{osti_1525227,
title = {A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm},
author = {Azad, Ariful and Buluc, Aydin},
abstractNote = {We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns. Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types including matrices representing scale-free and high-diameter graphs.},
doi = {10.1109/IPDPS.2017.76},
journal = {Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
volume = {2017},
place = {United States},
year = {2017},
month = {jul}
}