Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides
Abstract
The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.
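The contrast drawn in the abstract can be illustrated with a small serial sketch. This record does not give the paper's kernels, so everything below is an assumption made for illustration: the matrix is lower triangular in CSC form, the function names `level_sets` and `solve_with_counters` are hypothetical, and the second function only emulates, in plain Python, one way the ordering could be enforced during the solve by counting outstanding dependencies per component. It is not the authors' GPU implementation.

```python
from collections import deque


def level_sets(n, col_ptr, row_idx):
    """Preprocessing used by level-set methods: group components so that
    every component in a set depends only on components in earlier sets."""
    level = [0] * n
    for j in range(n):  # ascending order, so level[j] is final when used
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            if i > j:  # x[i] depends on x[j]
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level) + 1)] if n else []
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def solve_with_counters(n, col_ptr, row_idx, val, b):
    """Solve L x = b without level-sets: each component carries a counter
    of unresolved dependencies and is solved once that counter hits zero."""
    pending = [0] * n   # off-diagonal entries per row (the only preprocessing)
    diag = [0.0] * n
    for j in range(n):
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            if i == j:
                diag[j] = val[p]
            else:
                pending[i] += 1

    partial = [0.0] * n  # running sums of L[i, j] * x[j] for solved j
    x = [0.0] * n
    ready = deque(i for i in range(n) if pending[i] == 0)
    while ready:
        j = ready.popleft()
        x[j] = (b[j] - partial[j]) / diag[j]
        # Push the contribution of x[j] to every row that depends on it.
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            if i != j:
                partial[i] += val[p] * x[j]
                pending[i] -= 1
                if pending[i] == 0:
                    ready.append(i)
    return x


# Example: L = [[1,0,0,0],[0,2,0,0],[1,1,1,0],[0,0,2,4]], b = [1,4,6,14]
col_ptr = [0, 2, 4, 6, 7]
row_idx = [0, 2, 1, 2, 2, 3, 3]
val = [1, 1, 2, 1, 1, 2, 4]
print(level_sets(4, col_ptr, row_idx))                            # [[0, 1], [2], [3]]
print(solve_with_counters(4, col_ptr, row_idx, val, [1, 4, 6, 14]))  # [1.0, 2.0, 3.0, 2.0]
```

In this sketch the level-set variant needs a full pass to build the sets and an implicit barrier between them, whereas the counter variant needs only a per-row count. On a GPU the queue would presumably be replaced by threads that wait on the counters and update them atomically; that mapping is an assumption and is not shown here.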
- Authors:
- Liu, Weifeng; Li, Ang; Hogg, Jonathan D.; Duff, Iain S.; Vinter, Brian
- Niels Bohr Institute University of Copenhagen Copenhagen Denmark, Scientific Computing Department STFC Rutherford Appleton Laboratory UK, Department of Computer Science Norwegian University of Science and Technology Trondheim Norway
- Pacific Northwest National Lab Richland USA
- Scientific Computing Department STFC Rutherford Appleton Laboratory UK
- Niels Bohr Institute University of Copenhagen Copenhagen Denmark
- Publication Date:
- 2017-07-30
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1398070
- Grant/Contract Number:
- 66150
- Resource Type:
- Publisher's Accepted Manuscript
- Journal Name:
- Concurrency and Computation. Practice and Experience
- Additional Journal Information:
- Journal Name: Concurrency and Computation. Practice and Experience; Journal Volume: 29; Journal Issue: 21; Journal ID: ISSN 1532-0626
- Publisher:
- Wiley Blackwell (John Wiley & Sons)
- Country of Publication:
- United Kingdom
- Language:
- English
Citation Formats
Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., and Vinter, Brian. Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides. United Kingdom: N. p., 2017. Web. doi:10.1002/cpe.4244.
Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., & Vinter, Brian. Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides. United Kingdom. https://doi.org/10.1002/cpe.4244
Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., and Vinter, Brian. 2017. "Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides". United Kingdom. https://doi.org/10.1002/cpe.4244.
@article{osti_1398070,
title = {Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides},
author = {Liu, Weifeng and Li, Ang and Hogg, Jonathan D. and Duff, Iain S. and Vinter, Brian},
abstractNote = {The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.},
doi = {10.1002/cpe.4244},
journal = {Concurrency and Computation. Practice and Experience},
number = 21,
volume = 29,
place = {United Kingdom},
year = {2017},
month = {7}
}
https://doi.org/10.1002/cpe.4244