OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Abstract

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization, whose tasks are related to one another through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using Kokkos, an open-source framework for manycore platforms. A performance evaluation is presented on both Intel Sandy Bridge and Intel Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate the merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over a serial Cholesky implementation that carries no tasking overhead, using 56 threads on the Intel Xeon Phi processor, for sparse matrices arising from various application problems.
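To make the algorithm-by-blocks idea concrete, the sketch below shows how a right-looking incomplete Cholesky can be written as operations on the blocks of a 2D partitioned layout, with each block operation becoming a task whose dependences follow from the blocks it reads and writes. This is a minimal illustration, not the report's Kokkos-based implementation: the Block type, the kernels chol_block/trsm_block/syrk_block/gemm_block, and the nonempty fill pattern are hypothetical placeholders, and OpenMP task dependences stand in for the portable tasking API described in the report. Fill is controlled here only at block granularity, as a simplification.

#include <vector>

struct Block { /* dense storage for one retained sub-block (omitted) */ };

// Hypothetical per-block kernels (dense BLAS/LAPACK-like operations).
void chol_block(Block &Akk) { /* dense Cholesky: Akk = Lkk * Lkk^T */ }
void trsm_block(const Block &Lkk, Block &Aik) { /* Aik = Aik * inv(Lkk)^T */ }
void syrk_block(const Block &Lik, Block &Aii) { /* Aii -= Lik * Lik^T */ }
void gemm_block(const Block &Lik, const Block &Ljk, Block &Aij) { /* Aij -= Lik * Ljk^T */ }

// A: nb*nb array of blocks (row-major); nonempty marks blocks kept by the fill pattern.
void ichol_by_blocks(Block *A, int nb, const std::vector<bool> &nonempty) {
  #pragma omp parallel
  #pragma omp single
  for (int k = 0; k < nb; ++k) {
    #pragma omp task depend(inout: A[k * nb + k])
    chol_block(A[k * nb + k]);                            // factor diagonal block (k,k)

    for (int i = k + 1; i < nb; ++i) {
      if (!nonempty[i * nb + k]) continue;                // block dropped by the fill pattern
      #pragma omp task depend(in: A[k * nb + k]) depend(inout: A[i * nb + k])
      trsm_block(A[k * nb + k], A[i * nb + k]);           // solve for sub-diagonal block (i,k)
    }

    for (int i = k + 1; i < nb; ++i) {
      if (!nonempty[i * nb + k]) continue;
      #pragma omp task depend(in: A[i * nb + k]) depend(inout: A[i * nb + i])
      syrk_block(A[i * nb + k], A[i * nb + i]);           // update diagonal block (i,i)
      for (int j = k + 1; j < i; ++j) {
        if (!nonempty[j * nb + k] || !nonempty[i * nb + j]) continue;
        #pragma omp task depend(in: A[i * nb + k], A[j * nb + k]) depend(inout: A[i * nb + j])
        gemm_block(A[i * nb + k], A[j * nb + k], A[i * nb + j]);  // update block (i,j)
      }
    }
  }
}

Each kernel call reads and writes whole blocks, so the depend clauses reproduce the task graph induced by the factorization; blocks outside the fill pattern are simply skipped, which is what distinguishes the incomplete factorization from its dense counterpart.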

Authors:
 Kim, Kyungjoo [1]; Rajamanickam, Sivasankaran [1]; Stelle, George Widgery [1]; Edwards, Harold C. [1]; Olivier, Stephen Lecler [1]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date: 2016-01-01
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1237520
Report Number(s):
SAND-2016-0637R
619072
DOE Contract Number:
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; sparse factorization; algorithm-by-block; 2D layout; task parallelism

Citation Formats

Kim, Kyungjoo, Rajamanickam, Sivasankaran, Stelle, George Widgery, Edwards, Harold C., and Olivier, Stephen Lecler. Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout. United States: N. p., 2016. Web. doi:10.2172/1237520.
Kim, Kyungjoo, Rajamanickam, Sivasankaran, Stelle, George Widgery, Edwards, Harold C., & Olivier, Stephen Lecler. Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout. United States. doi:10.2172/1237520.
Kim, Kyungjoo, Rajamanickam, Sivasankaran, Stelle, George Widgery, Edwards, Harold C., and Olivier, Stephen Lecler. 2016. "Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout". United States. doi:10.2172/1237520. https://www.osti.gov/servlets/purl/1237520.
@techreport{osti_1237520,
title = {Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout},
author = {Kim, Kyungjoo and Rajamanickam, Sivasankaran and Stelle, George Widgery and Edwards, Harold C. and Olivier, Stephen Lecler},
abstractNote = {We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization, whose tasks are related to one another through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using Kokkos, an open-source framework for manycore platforms. A performance evaluation is presented on both Intel Sandy Bridge and Intel Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate the merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over a serial Cholesky implementation that carries no tasking overhead, using 56 threads on the Intel Xeon Phi processor, for sparse matrices arising from various application problems.},
doi = {10.2172/1237520},
institution = {Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)},
number = {SAND-2016-0637R},
place = {United States},
year = {2016},
month = {jan}
}

  • In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task-scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described. It is based on a heuristic critical-path scheduling method (a generic sketch of such list scheduling follows this list). This gives an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.
  • A technique for solving the large sparse linear systems that arise from the application of finite element methods is described. The technique combines an incomplete factorization method, called the shifted incomplete Cholesky factorization, with the method of generalized conjugate gradients. The shifted incomplete Cholesky factorization produces a splitting of the matrix A that depends on a parameter α (one common form of such a splitting is sketched after this list). It is shown that, if A is positive definite, then there is some α for which this splitting is possible, and that this splitting is at least as good as the Jacobi splitting. The method is shown to be more efficient on a set of test problems than either direct methods or explicit iteration schemes.
  • The paper develops and compares several fine-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a two-dimensional grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. The paper also presents a performance model and uses it to analyze the algorithms, finding that asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a complicated problem.
  • Cholesky factorization of large sparse matrices is an extremely important computation, arising in a wide range of domains including linear programming, finite element analysis, and circuit simulation. This thesis investigates crucial issues for obtaining high performance for this computation on sequential and parallel machines with hierarchical memory systems. The thesis begins by providing the first thorough analysis of the interaction between sequential sparse Cholesky factorization methods and memory hierarchies. The authors look at popular existing methods and find that they produce relatively poor memory hierarchy performance. The methods are extended, using blocking techniques, to reuse data in the fast levels of the memory hierarchy. This increased reuse is shown to provide a three-fold speedup over popular existing approaches (e.g., SPARSPAK) on modern workstations. The thesis then considers the use of blocking techniques in parallel sparse factorization. The authors first describe parallel methods they have developed that are natural extensions of the sequential approach described above. These methods distribute panels (sets of contiguous columns with nearly identical non-zero structures) among the processors. The thesis shows that for small parallel machines, the resulting methods again produce substantial performance improvements over existing methods. A framework is provided for understanding the performance of these methods, and also for understanding the limitations inherent in them. Using this framework, the thesis shows that panel methods are inappropriate for large-scale parallel machines because they do not expose enough concurrency... Keywords: hierarchical-memory machines, sparse Cholesky factorization, parallel processing.
  • This paper describes the design and implementation of three core factorization routines (LU, QR, and Cholesky) included in the out-of-core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. An image of the full matrix is maintained on disk, and the factorization routines transfer sub-matrices into memory. The left-looking, column-oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic (a sketch of a left-looking out-of-core loop follows this list). The routines are implemented using a portable I/O interface and utilize high-performance ScaLAPACK factorization routines as in-core computational kernels. The authors present the details of the implementation of the out-of-core ScaLAPACK factorization routines, as well as performance and scalability results on the Intel Paragon.
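Regarding the critical-path scheduling heuristic mentioned in the first record above, the following is a generic list-scheduling sketch, not that paper's algorithm: the Dag type, the per-task cost model, and the topological task numbering are assumptions made here for illustration.

#include <algorithm>
#include <queue>
#include <vector>

struct Dag {
  std::vector<std::vector<int>> succ;  // succ[t] = tasks that depend on task t
  std::vector<double> cost;            // estimated execution cost of each task
};

// "Bottom level" of each task: longest-cost path from the task to any sink.
// Assumes tasks are numbered in a topological order (edges go low -> high).
std::vector<double> critical_path(const Dag &d) {
  const int n = static_cast<int>(d.succ.size());
  std::vector<double> cp(n, 0.0);
  for (int t = n - 1; t >= 0; --t) {
    double longest_succ = 0.0;
    for (int s : d.succ[t]) longest_succ = std::max(longest_succ, cp[s]);
    cp[t] = d.cost[t] + longest_succ;
  }
  return cp;
}

// Greedy list scheduling: always hand the ready task with the longest critical
// path to the processor that frees up first. This produces a static assignment;
// it ignores exact task finish times when marking successors ready.
std::vector<int> schedule(const Dag &d, int nproc) {
  const int n = static_cast<int>(d.succ.size());
  std::vector<double> cp = critical_path(d);
  std::vector<int> npred(n, 0), owner(n, -1);
  for (int t = 0; t < n; ++t)
    for (int s : d.succ[t]) ++npred[s];

  auto lower_priority = [&](int a, int b) { return cp[a] < cp[b]; };
  std::priority_queue<int, std::vector<int>, decltype(lower_priority)> ready(lower_priority);
  for (int t = 0; t < n; ++t)
    if (npred[t] == 0) ready.push(t);

  std::vector<double> busy_until(nproc, 0.0);  // when each processor becomes idle
  while (!ready.empty()) {
    const int t = ready.top();
    ready.pop();
    const int p = static_cast<int>(
        std::min_element(busy_until.begin(), busy_until.end()) - busy_until.begin());
    owner[t] = p;
    busy_until[p] += d.cost[t];
    for (int s : d.succ[t])
      if (--npred[s] == 0) ready.push(s);      // all predecessors scheduled: s becomes ready
  }
  return owner;  // owner[t] = processor assigned to task t
}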
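Regarding the shifted incomplete Cholesky splitting mentioned in the second record above, one common diagonal-shift formulation is sketched below; the exact shift used in that work is not given in the summary, so this form is an assumption.

% Sketch of a diagonal-shifted incomplete Cholesky splitting (assumed form).
% Shift the diagonal by a parameter \alpha, factor the shifted matrix
% incompletely, and use the incomplete factor as the splitting matrix M
% for (generalized) conjugate gradients.
\[
  \hat{A} = A + \alpha \operatorname{diag}(A), \qquad
  \hat{A} \approx L L^{T} \quad \text{(incomplete factorization, fill dropped)},
\]
\[
  A = M - N, \qquad M = L L^{T}, \qquad
  M^{-1} A x = M^{-1} b \quad \text{solved by conjugate gradients}.
\]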
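Regarding the left-looking out-of-core organization mentioned in the last record above, a minimal sketch follows. The Panel type, the read_panel/write_panel accessors, and the in-core kernels are hypothetical placeholders, not the ScaLAPACK out-of-core interface.

struct Panel { /* one dense block column held in memory (storage omitted) */ };

Panel read_panel(int j) { /* fetch block column j from the matrix image on disk */ return Panel{}; }
void write_panel(int j, const Panel &p) { /* write the factored column back to disk */ }
void apply_update(const Panel &factored, Panel &current) { /* current -= contribution of factored */ }
void factor_in_core(Panel &current) { /* dense in-core factorization of the current column */ }

void left_looking_cholesky_out_of_core(int npanels) {
  for (int j = 0; j < npanels; ++j) {
    Panel current = read_panel(j);
    // Left-looking: stream every previously factored block column through
    // memory once and apply its update, so only two panels are in core at a time.
    for (int k = 0; k < j; ++k) {
      Panel factored = read_panel(k);
      apply_update(factored, current);
    }
    factor_in_core(current);
    write_panel(j, current);  // the disk image now holds the factored column j
  }
}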