A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; Buluc, Aydin; Shao, Meiyue; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.

doi:10.1109/TPDS.2016.2630699

Journal Article · June 1, 2017 · IEEE Transactions on Parallel and Distributed Systems

DOI: https://doi.org/10.1109/TPDS.2016.2630699 · OSTI ID: 1379875

Aktulga, Hasan Metin [1]; Afibuzzaman, Md. [1]; Williams, Samuel [2]; Buluc, Aydin [2]; Shao, Meiyue [2]; Yang, Chao [2]; Ng, Esmond G. [2]; Maris, Pieter [3]; Vary, James P. [3]

[1] Michigan State Univ., East Lansing, MI (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division
[3] Iowa State Univ., Ames, IA (United States). Dept. of Physics and Astronomy

As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
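
For readers unfamiliar with the kernels named in the abstract, the following is a minimal pure-Python sketch of what SpMM, SpMM^T, and a tall-skinny inner product compute. It is not the authors' implementation: it uses the plain CSR layout (the baseline format the abstract mentions) rather than CSB, it makes no attempt at parallelism or cache blocking, and the function names `csr_spmm`, `csr_spmm_t`, and `block_inner` are hypothetical names chosen for illustration.

```python
# Illustrative sketch only; function names are hypothetical, not from the paper.
# A is stored in CSR form: row_ptr, col_idx, vals.
# X is a dense "tall-skinny" block: one row per matrix row/column, m entries per row.

def csr_spmm(row_ptr, col_idx, vals, X):
    """Return Y = A @ X for a CSR matrix A and a block of m vectors X."""
    n_rows = len(row_ptr) - 1
    m = len(X[0])
    Y = [[0.0] * m for _ in range(n_rows)]
    for i in range(n_rows):
        yi = Y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, xj = vals[k], X[col_idx[k]]
            for c in range(m):  # each nonzero is reused across all m vectors
                yi[c] += a * xj[c]
    return Y

def csr_spmm_t(n_cols, row_ptr, col_idx, vals, X):
    """Return Y = A^T @ X without forming A^T: each nonzero scatters into row col_idx[k]."""
    n_rows = len(row_ptr) - 1
    m = len(X[0])
    Y = [[0.0] * m for _ in range(n_cols)]
    for i in range(n_rows):
        xi = X[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, yj = vals[k], Y[col_idx[k]]
            for c in range(m):
                yj[c] += a * xi[c]
    return Y

def block_inner(X, Y):
    """Return the small dense matrix X^T Y for two tall-skinny blocks X and Y."""
    m, p = len(X[0]), len(Y[0])
    G = [[0.0] * p for _ in range(m)]
    for xr, yr in zip(X, Y):
        for a in range(m):
            for b in range(p):
                G[a][b] += xr[a] * yr[b]
    return G

# Small symmetric example: A = [[2, 0, 1], [0, 3, 0], [1, 0, 4]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 1.0, 4.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

AX = csr_spmm(row_ptr, col_idx, vals, X)
ATX = csr_spmm_t(3, row_ptr, col_idx, vals, X)
print(AX)                 # [[3.0, 1.0], [0.0, 3.0], [5.0, 4.0]]
print(ATX == AX)          # True, because this A is symmetric
print(block_inner(X, X))  # [[2.0, 1.0], [1.0, 2.0]]
```

The inner loop over `c` shows why multiplying by a block of vectors can be more efficient than repeated single-vector products: each matrix nonzero is loaded once and applied to all m columns. The transpose variant writes to scattered output rows instead of reading scattered input rows, which is one reason a row-oriented layout such as CSR is less natural for it.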

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); USDOE Office of Science (SC), Nuclear Physics (NP) (SC-26)

Grant/Contract Number: AC02-05CH11231; FG02-87ER40371; SC0008485

OSTI ID: 1379875

Journal Information: IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 6; ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English

Similar Records

Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations

Conference · August 14, 2014 · OSTI ID: 1407214

On the performance and energy efficiency of sparse linear algebra on GPUs

Journal Article · October 4, 2016 · International Journal of High Performance Computing Applications · OSTI ID: 1437692

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Technical Report · December 31, 2015 · OSTI ID: 1237520

Related Subjects

97 MATHEMATICS AND COMPUTING
Sparse matrix multiplication
block eigensolver
configuration interaction
extended roofline model
tall-skinny matrices
