Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts

Booth, Joshua Dennis; Ellingwood, Nathan David; Thornquist, Heidi K.; Rajamanickam, Sivasankaran

doi:10.1016/j.parco.2017.06.003

Abstract

Transient simulation in circuit simulation tools, such as SPICE and Xyce, depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grows, the prevalence of multicore architectures enables us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structure that arises in our target problems. It also uses a hierarchical two-dimensional data layout, which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP-based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We also present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides a 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.
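The abstract summarizes performance as a geometric mean speedup over a set of test matrices. As a rough illustration of how such a figure is typically formed, the sketch below averages the logarithms of per-matrix speedups. The function name and the timing values are hypothetical and chosen only for illustration; they are not taken from the paper, and this is not the authors' evaluation script.

```python
import math


def geometric_mean_speedup(baseline_times, solver_times):
    """Geometric mean of the per-matrix speedups baseline_time / solver_time."""
    if len(baseline_times) != len(solver_times) or not baseline_times:
        raise ValueError("need two non-empty timing lists of equal length")
    # Average the logs of the per-matrix ratios, then exponentiate.
    log_sum = sum(math.log(b / s) for b, s in zip(baseline_times, solver_times))
    return math.exp(log_sum / len(baseline_times))


# Hypothetical factorization times in seconds (NOT measurements from the paper).
baseline_times = [8.0, 2.0, 4.0]   # e.g. a serial reference solver
parallel_times = [1.0, 1.0, 1.0]   # e.g. a multithreaded solver
speedup = geometric_mean_speedup(baseline_times, parallel_times)
print(f"geometric mean speedup: {speedup:.2f}x")
```

A geometric mean is the usual choice for ratios such as speedups because a single matrix with a very large speedup does not dominate the summary the way it would in an arithmetic mean.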

Authors:

Booth, Joshua Dennis [1]; Ellingwood, Nathan David [2]; Thornquist, Heidi K. [2]; Rajamanickam, Sivasankaran [2]

[1] Bucknell Univ., Lewisburg, PA (United States)
[2] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Publication Date: June 3, 2017

Research Org.: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA)

OSTI Identifier: 1499033

Alternate Identifier(s): OSTI ID: 1550153

Report Number(s): SAND-2019-2046J
Journal ID: ISSN 0167-8191; 672871

Grant/Contract Number: AC04-94AL85000; NA-0003525

Resource Type: Accepted Manuscript

Journal Name: Parallel Computing

Additional Journal Information: Journal Volume: 68; Journal Issue: C; Journal ID: ISSN 0167-8191

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Parallel LU factorization; Multithreaded solvers; Circuit simulation; Solvers on Intel Xeon Phi

Citation Formats


Booth, Joshua Dennis, Ellingwood, Nathan David, Thornquist, Heidi K., and Rajamanickam, Sivasankaran. Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts. United States: N. p., 2017. Web. doi:10.1016/j.parco.2017.06.003.


Booth, Joshua Dennis, Ellingwood, Nathan David, Thornquist, Heidi K., & Rajamanickam, Sivasankaran. Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts. United States. https://doi.org/10.1016/j.parco.2017.06.003


Booth, Joshua Dennis, Ellingwood, Nathan David, Thornquist, Heidi K., and Rajamanickam, Sivasankaran. 2017. "Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts". United States. https://doi.org/10.1016/j.parco.2017.06.003. https://www.osti.gov/servlets/purl/1499033.


                    
@article{osti_1499033,
  title        = {Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts},
  author       = {Booth, Joshua Dennis and Ellingwood, Nathan David and Thornquist, Heidi K. and Rajamanickam, Sivasankaran},
  abstractNote = {Transient simulation in circuit simulation tools, such as SPICE and Xyce, depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grows, the prevalence of multicore architectures enables us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structure that arises in our target problems. It also uses a hierarchical two-dimensional data layout, which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP-based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We also present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides a 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.},
  doi          = {10.1016/j.parco.2017.06.003},
  journal      = {Parallel Computing},
  number       = {C},
  volume       = {68},
  place        = {United States},
  year         = {2017},
  month        = {6}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (Publisher)

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1016/j.parco.2017.06.003

Citation Metrics:

Cited by: 9 works (citation information provided by Web of Science)

Works referenced in this record:

A survey of direct methods for sparse linear systems
journal, May 2016

Davis, Timothy A.; Rajamanickam, Sivasankaran; Sid-Lakhdar, Wissam M.
Acta Numerica, Vol. 25
DOI: 10.1017/S0962492916000076

SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
journal, June 2003

Li, Xiaoye S.; Demmel, James W.
ACM Transactions on Mathematical Software, Vol. 29, Issue 2
DOI: 10.1145/779359.779361

PARDISO: a high-performance serial and parallel sparse linear solver in semiconductor device simulation
journal, September 2001

Schenk, Olaf; Gärtner, Klaus; Fichtner, Wolfgang
Future Generation Computer Systems, Vol. 18, Issue 1
DOI: 10.1016/S0167-739X(00)00076-5

PaStiX: a high-performance parallel direct solver for sparse symmetric positive definite systems
journal, February 2002

Hénon, P.; Ramet, P.; Roman, J.
Parallel Computing, Vol. 28, Issue 2
DOI: 10.1016/S0167-8191(01)00141-7

A Supernodal Approach to Sparse Partial Pivoting
journal, January 1999

Demmel, James W.; Eisenstat, Stanley C.; Gilbert, John R.
SIAM Journal on Matrix Analysis and Applications, Vol. 20, Issue 3
DOI: 10.1137/S0895479895291765

An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination
journal, January 1999

Demmel, James W.; Gilbert, John R.; Li, Xiaoye S.
SIAM Journal on Matrix Analysis and Applications, Vol. 20, Issue 4
DOI: 10.1137/S0895479897317685

Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems
journal, September 2010

Davis, Timothy A.; Palamadai Natarajan, Ekanathan
ACM Transactions on Mathematical Software, Vol. 37, Issue 3
DOI: 10.1145/1824801.1824814

Sparse Partial Pivoting in Time Proportional to Arithmetic Operations
journal, September 1988

Gilbert, John R.; Peierls, Tim
SIAM Journal on Scientific and Statistical Computing, Vol. 9, Issue 5
DOI: 10.1137/0909058

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
journal, December 2014

Edwards, H. Carter; Trott, Christian R.; Sunderland, Daniel
Journal of Parallel and Distributed Computing, Vol. 74, Issue 12
DOI: 10.1016/j.jpdc.2014.07.003

An Approximate Minimum Degree Ordering Algorithm
journal, October 1996

Amestoy, Patrick R.; Davis, Timothy A.; Duff, Iain S.
SIAM Journal on Matrix Analysis and Applications, Vol. 17, Issue 4
DOI: 10.1137/S0895479894278952

On Algorithms For Permuting Large Entries to the Diagonal of a Sparse Matrix
journal, January 2001

Duff, I. S.; Koster, J.
SIAM Journal on Matrix Analysis and Applications, Vol. 22, Issue 4
DOI: 10.1137/S0895479899358443

Computing the block triangular form of a sparse matrix
journal, December 1990

Pothen, Alex; Fan, Chin-Ju
ACM Transactions on Mathematical Software, Vol. 16, Issue 4
DOI: 10.1145/98267.98287

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
report, January 2016

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery
DOI: 10.2172/1237520

The Role of Elimination Trees in Sparse Factorization
journal, January 1990

Liu, Joseph W. H.
SIAM Journal on Matrix Analysis and Applications, Vol. 11, Issue 1
DOI: 10.1137/0611010

The Theory of Elimination Trees for Sparse Unsymmetric Matrices
journal, January 2005

Eisenstat, Stanley C.; Liu, Joseph W. H.
SIAM Journal on Matrix Analysis and Applications, Vol. 26, Issue 3
DOI: 10.1137/S089547980240563X

Algorithmic Aspects of Vertex Elimination on Directed Graphs
journal, January 1978

Rose, Donald J.; Tarjan, Robert Endre
SIAM Journal on Applied Mathematics, Vol. 34, Issue 1
DOI: 10.1137/0134014

Algorithmic Aspects of Vertex Elimination on Graphs
journal, June 1976

Rose, Donald J.; Tarjan, R. Endre; Lueker, George S.
SIAM Journal on Computing, Vol. 5, Issue 2
DOI: 10.1137/0205021

The University of Florida sparse matrix collection
journal, November 2011

Davis, Timothy A.; Hu, Yifan
ACM Transactions on Mathematical Software, Vol. 38, Issue 1
DOI: 10.1145/2049662.2049663

Works referencing / citing this record:

Preparing sparse solvers for exascale computing
journal, January 2020

Anzt, Hartwig; Boman, Erik; Falgout, Rob
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
DOI: 10.1098/rsta.2019.0053

Similar Records in DOE PAGES and OSTI.GOV collections:

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Technical Report. Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery; ...

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on ...
https://doi.org/10.2172/1237520 (full text available)

An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

Journal Article. Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois-Henry; ... - SIAM Journal on Scientific Computing

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to sevenfold for problems in our test suite. The implementation targets ...
Cited by 68
https://doi.org/10.1137/15M1010117 (full text available)

A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

Journal Article. Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ... - IEEE Transactions on Parallel and Distributed Systems

As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the ...
Cited by 7
https://doi.org/10.1109/TPDS.2016.2630699 (full text available)

Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

Conference. Sitaraman, Hariswaran; Grout, Ray W.

This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via a large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved here ...
https://doi.org/10.2514/6.2017-4410

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Technical Report. Edwards, Harold Carter; Mackey, Greg Edward; Olivier, Stephen Lecler; ...

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and evaluation of this capability with sparse matrix Cholesky factorization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences where each task contains data parallel operations performed on a team of hardware threads. The collection of tasks and dependences forms a directed acyclic graph of tasks, a task DAG. Major challenges of this research and development effort include: portability and performance across multicore CPU, manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect ...
https://doi.org/10.2172/1562647 (full text available)
