Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
Abstract
We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that uses a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks, operating on the block layout. The algorithm-by-blocks approach induces a task graph for the factorization, in which the tasks are related through their data dependences. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features via Kokkos, an open-source framework for manycore platforms. A performance evaluation on both Intel Sandy Bridge and Xeon Phi platforms, using matrices from the University of Florida sparse matrix collection, illustrates the merits of the proposed task-based factorization. Experimental results demonstrate that, for sparse matrices arising from various application problems, our task-parallel implementation delivers about a 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and a 19.2x speedup over serial incomplete Cholesky, which carries no tasking overhead, using 56 threads on the Intel Xeon Phi processor.
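The abstract's algorithm-by-blocks idea can be illustrated with a minimal dense NumPy sketch. This is not the report's implementation (which targets sparse 2D partitioned blocks and schedules the tasks through a Kokkos tasking backend); it only shows the POTRF/TRSM/GEMM task structure that a blocked right-looking Cholesky induces, with each loop body corresponding to one task type in the induced task graph:

```python
import numpy as np

def cholesky_by_blocks(A, nb):
    """Right-looking Cholesky-by-blocks on a dense SPD matrix.

    Each loop body corresponds to one task type in the induced task
    graph: POTRF (factor a diagonal block), TRSM (triangular solve on
    a sub-diagonal block), and SYRK/GEMM (trailing-matrix update).
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # POTRF task: factor the diagonal block in place.
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        Lkk = A[k:ke, k:ke]
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # TRSM task: A[i,k] := A[i,k] * Lkk^{-T}
            A[i:ie, k:ke] = np.linalg.solve(Lkk, A[i:ie, k:ke].T).T
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            for j in range(ke, ie, nb):  # lower triangle only
                je = min(j + nb, n)
                # SYRK/GEMM task: A[i,j] -= A[i,k] * A[j,k]^T
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)
```

In a task-parallel setting, each TRSM task depends on the preceding POTRF, and each update task depends on the two TRSM tasks producing its operands; a runtime can execute independent tasks concurrently once those dependences are satisfied.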
- Authors:
- Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery; Edwards, Harold C.; Olivier, Stephen Lecler
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Publication Date:
- 2016
- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1237520
- Report Number(s):
- SAND-2016-0637R; 619072
- DOE Contract Number:
- AC04-94AL85000
- Resource Type:
- Technical Report
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; sparse factorization; algorithm-by-block; 2D layout; task parallelism
Citation Formats
Kim, Kyungjoo, Rajamanickam, Sivasankaran, Stelle, George Widgery, Edwards, Harold C., and Olivier, Stephen Lecler. Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout. United States: N. p., 2016.
Web. doi:10.2172/1237520.
Kim, Kyungjoo, Rajamanickam, Sivasankaran, Stelle, George Widgery, Edwards, Harold C., & Olivier, Stephen Lecler. Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout. United States. doi:10.2172/1237520.
Kim, Kyungjoo, Rajamanickam, Sivasankaran, Stelle, George Widgery, Edwards, Harold C., and Olivier, Stephen Lecler. 2016.
"Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout". United States.
doi:10.2172/1237520. https://www.osti.gov/servlets/purl/1237520.
@article{osti_1237520,
title = {Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout},
author = {Kim, Kyungjoo and Rajamanickam, Sivasankaran and Stelle, George Widgery and Edwards, Harold C. and Olivier, Stephen Lecler},
abstractNote = {We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that uses a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks, operating on the block layout. The algorithm-by-blocks approach induces a task graph for the factorization, in which the tasks are related through their data dependences. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features via Kokkos, an open-source framework for manycore platforms. A performance evaluation on both Intel Sandy Bridge and Xeon Phi platforms, using matrices from the University of Florida sparse matrix collection, illustrates the merits of the proposed task-based factorization. Experimental results demonstrate that, for sparse matrices arising from various application problems, our task-parallel implementation delivers about a 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and a 19.2x speedup over serial incomplete Cholesky, which carries no tasking overhead, using 56 threads on the Intel Xeon Phi processor.},
doi = {10.2172/1237520},
place = {United States},
year = {2016},
month = {jan}
}
In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced and shown to correspond structurally to the filled graph of the given sparse matrix. The task-scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described, based on a heuristic critical-path scheduling method. This will give an overall …
Shifted incomplete Cholesky factorization
A technique for solving the large sparse linear systems that arise from the application of finite element methods is described. The technique combines an incomplete factorization method called the shifted incomplete Cholesky factorization with the method of generalized conjugate gradients. The shifted incomplete Cholesky factorization produces a splitting of the matrix A that is dependent upon a parameter α. It is shown that, if A is positive definite, then there is some α for which this splitting is possible, and that this splitting is at least as good as the Jacobi splitting. The method is shown to be …
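The shift-and-retry idea can be sketched with a small NumPy example. This is a generic IC(0)-with-diagonal-shift illustration, not the specific splitting analyzed in the work above; the function name, the shift schedule, and the form of the shift (A + α·diag(A)) are assumptions made for the example:

```python
import numpy as np

def shifted_ic0(A, alphas=(0.0, 1e-3, 1e-2, 1e-1, 1.0)):
    """Incomplete Cholesky with a diagonal shift (illustrative sketch).

    Attempts IC(0) -- no fill outside the nonzero pattern of A -- on
    the shifted matrix A + alpha*diag(A), increasing alpha until the
    factorization does not break down (no nonpositive pivot).
    """
    n = A.shape[0]
    pattern = A != 0  # fill is only allowed where A is nonzero
    for alpha in alphas:
        L = np.tril(A + alpha * np.diag(np.diag(A)))
        ok = True
        for k in range(n):
            if L[k, k] <= 0:
                ok = False  # breakdown: retry with a larger shift
                break
            L[k, k] = np.sqrt(L[k, k])
            L[k + 1:, k] /= L[k, k]
            for j in range(k + 1, n):
                # right-looking update of column j, restricted to the
                # original sparsity pattern (the IC(0) drop rule)
                upd = L[j:, k] * L[j, k]
                mask = pattern[j:, j]
                L[j:, j][mask] -= upd[mask]
        if ok:
            return L, alpha
    raise ValueError("no shift in the schedule avoided breakdown")
```

For a dense SPD matrix the pattern imposes no dropping, so the sketch reduces to the exact Cholesky factor at α = 0; for genuinely sparse indefinite-pivot cases, a larger α is selected.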
Highly parallel sparse Cholesky factorization
The paper develops and compares several fine-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a two-dimensional grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key …
Exploiting the memory hierarchy in sequential and parallel sparse Cholesky factorization
Cholesky factorization of large sparse matrices is an extremely important computation, arising in a wide range of domains including linear programming, finite element analysis, and circuit simulation. This thesis investigates crucial issues for obtaining high performance for this computation on sequential and parallel machines with hierarchical memory systems. The thesis begins by providing the first thorough analysis of the interaction between sequential sparse Cholesky factorization methods and memory hierarchies. The authors look at popular existing methods and find that they produce relatively poor memory hierarchy performance. The methods are extended, using blocking techniques, to reuse data in the fast levels …
The design and implementation of the parallel out-of-core ScaLAPACK LU, QR and Cholesky factorization routines
This paper describes the design and implementation of three core factorization routines (LU, QR and Cholesky) included in the out-of-core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. An image of the full matrix is maintained on disk, and the factorization routines transfer sub-matrices into memory. The left-looking, column-oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic. The routines are implemented using a portable I/O interface and utilize high-performance ScaLAPACK factorization routines as in-core computational kernels. The authors present the …
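The left-looking, column-oriented variant mentioned above can be sketched in a dense, in-memory form. In the out-of-core setting each panel would be read from disk once, brought up to date with all previously factored panels, factored, and written back; this hedged NumPy sketch (not the ScaLAPACK code) shows only that update order, for Cholesky:

```python
import numpy as np

def left_looking_cholesky(A, nb):
    """Left-looking blocked Cholesky (illustrative sketch).

    Each panel receives all updates from previously factored panels
    just before it is factored -- the access pattern that lets an
    out-of-core code touch each panel a minimal number of times.
    """
    n = A.shape[0]
    L = np.tril(A)
    for j in range(0, n, nb):
        je = min(j + nb, n)
        # bring panel j up to date with every earlier panel k < j
        for k in range(0, j, nb):
            ke = min(k + nb, n)
            L[j:, j:je] -= L[j:, k:ke] @ L[j:je, k:ke].T
        # factor the diagonal block, then solve the sub-diagonal panel
        L[j:je, j:je] = np.linalg.cholesky(L[j:je, j:je])
        if je < n:
            L[je:, j:je] = np.linalg.solve(L[j:je, j:je],
                                           L[je:, j:je].T).T
    return L
```

Compared with the right-looking order, all reads of panel j are deferred and batched, which is what reduces disk traffic when panels live out of core.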