DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

Abstract

The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.
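The contrast drawn in the abstract can be illustrated with a small serial sketch. It is not the authors' GPU implementation. It assumes a lower‐triangular matrix in CSR form with a nonzero diagonal, and the function names (`level_sets`, `solve_by_levels`, `solve_by_counters`) are hypothetical. The first solver partitions rows into level‐sets in a preprocessing pass. The second uses per‐component dependency counters, which is one assumed way the ordering could be enforced inside the solution stage itself.

```python
from collections import deque


def level_sets(row_ptr, col_idx):
    """Preprocessing: group rows so that rows in one set are independent."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j != i:  # off-diagonal entry: x[i] depends on x[j], j < i
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def solve_by_levels(row_ptr, col_idx, vals, b):
    """Level-set solve: a parallel version needs a barrier after each set."""
    x = [0.0] * len(b)
    for rows in level_sets(row_ptr, col_idx):
        for i in rows:  # these rows could be processed simultaneously
            acc, diag = b[i], None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if col_idx[k] == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[col_idx[k]]
            x[i] = acc / diag
    return x


def solve_by_counters(row_ptr, col_idx, vals, b):
    """Counter-driven solve (assumed scheme): no level-sets, no barriers.

    Each component keeps a count of unsolved dependencies; a solved
    component pushes its contribution to the rows that depend on it.
    """
    n = len(b)
    diag = [0.0] * n
    below = [[] for _ in range(n)]  # below[j]: (row, value) in column j
    pending = [0] * n               # unsolved dependencies per row
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag[i] = vals[k]
            else:
                below[j].append((i, vals[k]))
                pending[i] += 1
    partial = [0.0] * n
    x = [0.0] * n
    ready = deque(i for i in range(n) if pending[i] == 0)
    while ready:
        j = ready.popleft()
        x[j] = (b[j] - partial[j]) / diag[j]
        for i, v in below[j]:
            partial[i] += v * x[j]
            pending[i] -= 1
            if pending[i] == 0:
                ready.append(i)
    return x


# 4x4 example: rows [2,0,0,0], [1,1,0,0], [0,0,4,0], [1,0,2,2]
row_ptr = [0, 1, 3, 4, 7]
col_idx = [0, 0, 1, 2, 0, 2, 3]
vals = [2, 1, 1, 4, 1, 2, 2]
b = [2, 3, 12, 15]
print(level_sets(row_ptr, col_idx))                # [[0, 2], [1, 3]]
print(solve_by_counters(row_ptr, col_idx, vals, b))  # [1.0, 2.0, 3.0, 4.0]
```

In this sketch the queue stands in for parallel workers. On a GPU the counters would presumably be updated with atomic operations, which is an assumption here rather than something this record states.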

Authors:
Liu, Weifeng [1]; Li, Ang [2]; Hogg, Jonathan D. [3]; Duff, Iain S. [3]; Vinter, Brian [4]
  1. Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark; Scientific Computing Department, STFC Rutherford Appleton Laboratory, UK; Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
  2. Pacific Northwest National Lab, Richland, USA
  3. Scientific Computing Department, STFC Rutherford Appleton Laboratory, UK
  4. Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark
Publication Date:
2017-07-30
Sponsoring Org.:
USDOE
OSTI Identifier:
1398070
Grant/Contract Number:  
66150
Resource Type:
Publisher's Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal Name: Concurrency and Computation. Practice and Experience; Journal Volume: 29; Journal Issue: 21; Journal ID: ISSN 1532-0626
Publisher:
Wiley Blackwell (John Wiley & Sons)
Country of Publication:
United Kingdom
Language:
English

Citation Formats

Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., and Vinter, Brian. Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides. United Kingdom: N. p., 2017. Web. doi:10.1002/cpe.4244.
Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., & Vinter, Brian (2017). Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides. United Kingdom. https://doi.org/10.1002/cpe.4244
Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., and Vinter, Brian. 2017. "Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides". United Kingdom. https://doi.org/10.1002/cpe.4244.
@article{osti_1398070,
title = {Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides},
author = {Liu, Weifeng and Li, Ang and Hogg, Jonathan D. and Duff, Iain S. and Vinter, Brian},
abstractNote = {Summary The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.},
doi = {10.1002/cpe.4244},
journal = {Concurrency and Computation. Practice and Experience},
number = 21,
volume = 29,
place = {United Kingdom},
year = {2017},
month = {7}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.1002/cpe.4244

Citation Metrics:
Cited by: 25 works
Citation information provided by
Web of Science
