Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

Liu, Weifeng; Li, Ang; Hogg, Jonathan D.; Duff, Iain S.; Vinter, Brian

doi:10.1002/cpe.4244

Abstract

The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.
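The record does not reproduce the paper's kernels, so the following is only a minimal serial sketch of the two ideas the abstract contrasts, not the authors' implementation. It assumes a lower-triangular matrix in compressed sparse column (CSC) form with the diagonal entry stored first in each column; the function names `level_sets` and `solve_by_counters`, the array names, and the 4x4 example are all hypothetical. The first function builds the level sets that conventional methods compute in preprocessing; the second solves the system by tracking, for each component, how many of its dependencies are still unresolved, so the ordering emerges during the solve itself.

```python
from collections import deque


def level_sets(n, col_ptr, row_idx):
    """Partition unknowns 0..n-1 of a lower-triangular CSC matrix into level sets."""
    level = [0] * n
    for j in range(n):
        # level[j] is final here: every dependency of x[j] lives in a column < j.
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            if i > j:  # off-diagonal entry L[i, j]: x[i] needs x[j]
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        sets[lvl].append(i)
    return sets


def solve_by_counters(n, col_ptr, row_idx, val, b):
    """Solve L x = b using per-component dependency counters (serial simulation).

    Assumption: the diagonal entry is the first stored entry of each column.
    """
    # Count, for each row i, the off-diagonal entries it must wait for.
    pending = [0] * n
    for j in range(n):
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            pending[row_idx[p]] += 1

    left_sum = [0.0] * n  # accumulated sum of L[i, j] * x[j] for solved j
    x = [0.0] * n
    ready = deque(i for i in range(n) if pending[i] == 0)

    while ready:
        j = ready.popleft()
        x[j] = (b[j] - left_sum[j]) / val[col_ptr[j]]
        # Push the contribution of x[j] to every component that depends on it.
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            i = row_idx[p]
            left_sum[i] += val[p] * x[j]
            pending[i] -= 1
            if pending[i] == 0:
                ready.append(i)
    return x


# Hypothetical example:
#     [2 0 0 0]
# L = [1 1 0 0]    b = L @ [1, 2, 3, 4]
#     [0 0 4 0]
#     [3 0 1 2]
col_ptr = [0, 3, 4, 6, 7]
row_idx = [0, 1, 3, 1, 2, 3, 3]
val = [2.0, 1.0, 3.0, 1.0, 4.0, 1.0, 2.0]
b = [2.0, 3.0, 12.0, 14.0]

sets = level_sets(4, col_ptr, row_idx)
x = solve_by_counters(4, col_ptr, row_idx, val, b)
```

In this example components 0 and 2 have no dependencies and form the first level set, while components 1 and 3 form the second. The counter-driven solve needs only the `pending` array up front, which hints at why such a scheme can have cheaper preprocessing than building level sets. On a GPU the serial queue would presumably be replaced by concurrent workers that wait on the counters, but this record does not describe those details.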

Authors:

Liu, Weifeng ^[1]; Li, Ang ^[2]; Hogg, Jonathan D. ^[3]; Duff, Iain S. ^[3]; Vinter, Brian ^[4]

[1] Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark; Scientific Computing Department, STFC Rutherford Appleton Laboratory, UK; Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
[2] Pacific Northwest National Lab, Richland, USA
[3] Scientific Computing Department, STFC Rutherford Appleton Laboratory, UK
[4] Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark

Publication Date: 2017-07-30

Sponsoring Org.: USDOE

OSTI Identifier: 1398070

Grant/Contract Number: 66150

Resource Type: Publisher's Accepted Manuscript

Journal Name: Concurrency and Computation. Practice and Experience

Additional Journal Information: Journal Volume: 29; Journal Issue: 21; Journal ID: ISSN 1532-0626

Publisher: Wiley Blackwell (John Wiley & Sons)

Country of Publication: United Kingdom

Language: English

Citation Formats


Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., and Vinter, Brian. Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides. United Kingdom: N. p., 2017. Web. doi:10.1002/cpe.4244.

Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., & Vinter, Brian. Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides. United Kingdom. https://doi.org/10.1002/cpe.4244

Liu, Weifeng, Li, Ang, Hogg, Jonathan D., Duff, Iain S., and Vinter, Brian. 2017. "Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides". United Kingdom. https://doi.org/10.1002/cpe.4244.

                    
@article{osti_1398070,

  title        = {Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides},

  author       = {Liu, Weifeng and Li, Ang and Hogg, Jonathan D. and Duff, Iain S. and Vinter, Brian},

  abstractNote = {Summary The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level‐sets or colour‐sets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data‐parallelism, we also develop an adaptive scheme for efficiently processing multiple right‐hand sides in SpTRSM. A comparison with a state‐of‐the‐art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.},

  doi          = {10.1002/cpe.4244},

  journal      = {Concurrency and Computation. Practice and Experience},

  number       = 21,

  volume       = 29,

  place        = {United Kingdom},

  year         = {2017},

  month        = jul

}


Journal Article:

Publisher's Version of Record
https://doi.org/10.1002/cpe.4244

Citation Metrics:

Cited by: 25 works

Citation information provided by
Web of Science

Works referenced in this record:

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors
journal, November 2015

Liu, Weifeng; Vinter, Brian
Parallel Computing, Vol. 49
DOI: 10.1016/j.parco.2015.04.004

Balancing Locality and Concurrency: Solving Sparse Triangular Systems on GPUs
conference, December 2016

Picciau, Andrea; Inggs, Gordon E.; Wickerson, John
2016 IEEE 23rd International Conference on High Performance Computing (HiPC)
DOI: 10.1109/HiPC.2016.030

A Fast Dense Triangular Solve in CUDA
journal, January 2013

Hogg, J. D.
SIAM Journal on Scientific Computing, Vol. 35, Issue 3
DOI: 10.1137/12088358X

Adapting Sparse Triangular Solution to GPUs
conference, September 2012

Suchoski, Brad; Severn, Caleb; Shantharam, Manu
2012 41st International Conference on Parallel Processing Workshops (ICPPW)
DOI: 10.1109/ICPPW.2012.23

An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum
journal, June 2002

Duff, Iain S.; Heroux, Michael A.; Pozo, Roldan
ACM Transactions on Mathematical Software, Vol. 28, Issue 2
DOI: 10.1145/567806.567810

Domain Overlap for Iterative Sparse Triangular Solves on GPUs
book, January 2016

Anzt, Hartwig; Chow, Edmond; Szyld, Daniel B.
Lecture Notes in Computational Science and Engineering
DOI: 10.1007/978-3-319-40528-5_24

Parallel Transposition of Sparse Data Structures
conference, January 2016

Wang, Hao; Liu, Weifeng; Hou, Kaixi
Proceedings of the 2016 International Conference on Supercomputing - ICS '16
DOI: 10.1145/2925426.2926291

CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
conference, January 2015

Liu, Weifeng; Vinter, Brian
Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
DOI: 10.1145/2751205.2751209

Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors
journal, January 1990

Saltz, Joel H.
SIAM Journal on Scientific and Statistical Computing, Vol. 11, Issue 1
DOI: 10.1137/0911008

Numerical Methods for Least Squares Problems
book, January 1996

Björck, Åke
Society for Industrial and Applied Mathematics
DOI: 10.1137/1.9781611971484

A Cross-Platform SpMV Framework on Many-Core Architectures
journal, October 2016

Zhang, Yunquan; Li, Shigang; Yan, Shengen
ACM Transactions on Architecture and Code Optimization, Vol. 13, Issue 4
DOI: 10.1145/2994148

A Fast Tridiagonal Solver for Intel MIC Architecture
conference, May 2016

Wang, Xinliang; Xue, Wei; Zhai, Jidong
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2016.83

Fine-Grained Parallel Incomplete LU Factorization
journal, January 2015

Chow, Edmond; Patel, Aftab
SIAM Journal on Scientific Computing, Vol. 37, Issue 2
DOI: 10.1137/140968896

GPU-accelerated preconditioned iterative linear solvers
journal, October 2012

Li, Ruipeng; Saad, Yousef
The Journal of Supercomputing, Vol. 63, Issue 2
DOI: 10.1007/s11227-012-0825-3

Batched Generation of Incomplete Sparse Approximate Inverses on GPUs
conference, November 2016

Anzt, Hartwig; Chow, Edmond; Huckle, Thomas
2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
DOI: 10.1109/ScalA.2016.011

Callback: efficient synchronization without invalidation with a directory just for spin-waiting
conference, January 2015

Ros, Alberto; Kaxiras, Stefanos
Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15
DOI: 10.1145/2749469.2750405

StreamScan: fast scan algorithms for GPUs without global barrier synchronization
conference, January 2013

Yan, Shengen; Long, Guoping; Zhang, Yunquan
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '13
DOI: 10.1145/2442516.2442539

The University of Florida sparse matrix collection
journal, November 2011

Davis, Timothy A.; Hu, Yifan
ACM Transactions on Mathematical Software, Vol. 38, Issue 1
DOI: 10.1145/2049662.2049663

Sparse triangular solves for ILU revisited: data layout crucial to better performance
journal, December 2010

Smith, Barry; Zhang, Hong
The International Journal of High Performance Computing Applications, Vol. 25, Issue 4
DOI: 10.1177/1094342010389857

Locality-Aware CTA Clustering for Modern GPUs
conference, January 2017

Li, Ang; Song, Shuaiwen Leon; Liu, Weifeng
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '17
DOI: 10.1145/3037697.3037709

Iterative Methods for Sparse Linear Systems
book, January 2003

Saad, Yousef
DOI: 10.1137/1.9780898718003

A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors
journal, November 2015

Liu, Weifeng; Vinter, Brian
Journal of Parallel and Distributed Computing, Vol. 85
DOI: 10.1016/j.jpdc.2015.06.010

Parallel algorithms for solving linear systems with sparse triangular matrices
journal, September 2009

Mayer, Jan
Computing, Vol. 86, Issue 4
DOI: 10.1007/s00607-009-0066-3

SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication
conference, January 2013

Li, Jiajia; Tan, Guangming; Chen, Mingyu
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
DOI: 10.1145/2491956.2462181

STS-k: a multilevel sparse triangular solution scheme for NUMA multicores
conference, January 2015

Kabir, Humayun; Booth, Joshua Dennis; Aupy, Guillaume
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
DOI: 10.1145/2807591.2807667

Dymaxion: optimizing memory access patterns for heterogeneous systems
conference, January 2011

Che, Shuai; Sheaffer, Jeremy W.; Skadron, Kevin
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
DOI: 10.1145/2063384.2063401

The design of MA48: a code for the direct solution of sparse unsymmetric linear systems of equations
journal, June 1996

Duff, I. S.; Reid, J. K.
ACM Transactions on Mathematical Software, Vol. 22, Issue 2
DOI: 10.1145/229473.229476

Fast segmented sort on GPUs
conference, January 2017

Hou, Kaixi; Liu, Weifeng; Wang, Hao
Proceedings of the International Conference on Supercomputing - ICS '17
DOI: 10.1145/3079079.3079105

Structure-adaptive parallel solution of sparse triangular linear systems
journal, October 2014

Totoni, Ehsan; Heath, Michael T.; Kale, Laxmikant V.
Parallel Computing, Vol. 40, Issue 9
DOI: 10.1016/j.parco.2014.06.006

Adaptive and transparent cache bypassing for GPUs
conference, January 2015

Li, Ang; van den Braak, Gert-Jan; Kumar, Akash
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
DOI: 10.1145/2807591.2807606

Solving Sparse Triangular Linear Systems on Parallel Computers
journal, May 1989

Anderson, Edward; Saad, Youcef
International Journal of High Speed Computing, Vol. 01, Issue 01
DOI: 10.1142/S0129053389000056

Fine-Grained Synchronizations and Dataflow Programming on GPUs
conference, January 2015

Li, Ang; van den Braak, Gert-Jan; Corporaal, Henk
Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
DOI: 10.1145/2751205.2751232

Scaling synchronization in multicore programs
journal, October 2016

Morrison, Adam
Communications of the ACM, Vol. 59, Issue 11
DOI: 10.1145/2980987

MiSAR: minimalistic synchronization accelerator with resource overflow management
conference, January 2015

Liang, Ching-Kai; Prvulovic, Milos
Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA '15
DOI: 10.1145/2749469.2750396

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures
conference, January 2015

Scogland, Thomas R. W.; Feng, Wu-chun
Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering - ICPE '15
DOI: 10.1145/2668930.2688048

Iterative Sparse Triangular Solves for Preconditioning
book, January 2015

Anzt, Hartwig; Chow, Edmond; Dongarra, Jack
Lecture Notes in Computer Science
DOI: 10.1007/978-3-662-48096-0_50

Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels
conference, January 2017

Li, Ang; Liu, Weifeng; Kristensen, Mads R. B.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
DOI: 10.1145/3126908.3126931

Similar Records in DOE PAGES and OSTI.GOV collections:

Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Conference Xie, Chenhao ; Chen, Jieyang ; Firoz, Jesun S. ; ...

Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for Sparse Triangular Solver (SpTRSV) which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, thus warrant for efficient memory allocation, data partitioning, and workload distribution as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite linking via ...
https://doi.org/10.1145/3472456.3472478
Parallel beamlet dose calculation via beamlet contexts in a distributed multi‐GPU framework

Journal Article Neph, Ryan ; Ouyang, Cheng ; Neylon, John ; ... - Medical Physics

Purpose: Dose calculation is one of the most computationally intensive, yet essential tasks in the treatment planning process. With the recent interest in automatic beam orientation and arc trajectory optimization techniques, there is a great need for more efficient model‐based dose calculation algorithms that can accommodate hundreds to thousands of beam candidates at once. Foundational work has shown the translation of dose calculation algorithms to graphical processing units (GPUs), lending to remarkable gains in processing efficiency. But these methods provide parallelization of dose for only a single beamlet, serializing the calculation of multiple beamlets and under‐utilizing the potential ...
Cited by 7
https://doi.org/10.1002/mp.13651
Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)

Technical Report Shen, Xipeng

The development of modern processors exhibits two trends that complicate the optimizations of modern software. The first is the increasing sensitivity of processors' throughput to irregularities in computation. With more processors produced through a massive integration of simple cores, future systems will increasingly favor regular data-level parallel computations, but deviate from the needs of applications with complex patterns. Some evidences are already shown on Graphic Processing Units (GPU): Irregular data accesses (e.g., indirect references A[D[i]]) and conditional branches are limiting many GPU applications' performance at a level an order of magnitude lower than the peak of GPU. The second hardware ...
https://doi.org/10.2172/1576175

A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm

Journal Article Azad, Ariful ; Buluc, Aydin - Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS)

We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required.
Cited by 21
https://doi.org/10.1109/IPDPS.2017.76
