U.S. Department of Energy
Office of Scientific and Technical Information

Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.4244 · OSTI ID: 1398070
 [1];  [2];  [3];  [3];  [4]
  1. Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark; Scientific Computing Department, STFC Rutherford Appleton Laboratory, UK; Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
  2. Pacific Northwest National Lab, Richland, USA
  3. Scientific Computing Department, STFC Rutherford Appleton Laboratory, UK
  4. Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark
Summary

The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.
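
The contrast drawn in the summary can be illustrated with a small serial Python sketch. It is not the authors' GPU implementation. The first two functions show the level-set idea the summary describes for existing work: rows of a lower-triangular matrix are grouped so that rows within one set do not depend on each other. The third function, solve_by_counters, is an assumption made here for illustration only: it shows one possible way for the ordering to emerge during the solve itself, using per-row dependency counters, and this record does not state that the paper works this way. All function names, the CSR inputs (row_ptr, col_idx, vals), and the example matrix are hypothetical.

```python
from collections import deque


def level_sets(row_ptr, col_idx):
    """Group rows of a lower-triangular CSR matrix into level sets."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # off-diagonal entry: x[i] depends on x[j]
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def solve_by_levels(row_ptr, col_idx, vals, b):
    """Solve L x = b one level set at a time (serial stand-in)."""
    x = [0.0] * len(b)
    for rows in level_sets(row_ptr, col_idx):
        # Rows in one set are independent of each other; a parallel
        # code would need a synchronization point after each set.
        for i in rows:
            s, diag = b[i], 1.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    s -= vals[k] * x[j]
            x[i] = s / diag
    return x


def solve_by_counters(row_ptr, col_idx, vals, b):
    """Illustrative assumption: ordering driven by dependency counters."""
    n = len(b)
    pending = [0] * n                # unresolved dependencies per row
    users = [[] for _ in range(n)]   # users[j] = rows that need x[j]
    diag = [1.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag[i] = vals[k]
            else:
                pending[i] += 1
                users[j].append((i, vals[k]))
    x = [0.0] * n
    acc = list(b)                    # running right-hand side per row
    ready = deque(i for i in range(n) if pending[i] == 0)
    while ready:
        j = ready.popleft()
        x[j] = acc[j] / diag[j]
        for i, v in users[j]:        # push x[j] to the rows that use it
            acc[i] -= v * x[j]
            pending[i] -= 1
            if pending[i] == 0:
                ready.append(i)
    return x


# Hypothetical 4x4 lower-triangular example in CSR form:
# [[2,0,0,0],[1,1,0,0],[0,0,4,0],[1,0,2,1]]
ROW_PTR = [0, 1, 3, 4, 7]
COL_IDX = [0, 0, 1, 2, 0, 2, 3]
VALS = [2.0, 1.0, 1.0, 4.0, 1.0, 2.0, 1.0]
B = [2.0, 3.0, 8.0, 7.0]
```

For the example matrix, rows 0 and 2 have no dependencies and form the first level set, while rows 1 and 3 form the second. Both solvers return the same solution; they differ only in whether the ordering is computed up front as explicit sets or discovered as components become available.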

Sponsoring Organization:
USDOE
OSTI ID:
1398070
Alternate ID(s):
OSTI ID: 1557091
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 29, Issue 21; ISSN 1532-0626
Publisher:
Wiley Blackwell (John Wiley & Sons)
Country of Publication:
United Kingdom
Language:
English
