A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations
Abstract
As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
 Authors:

 Michigan State Univ., East Lansing, MI (United States)
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division
 Iowa State Univ., Ames, IA (United States). Dept. of Physics and Astronomy
 Publication Date:
 Research Org.:
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), Nuclear Physics (NP); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
 OSTI Identifier:
 1379875
 Grant/Contract Number:
 AC02-05CH11231; SC0008485; FG02-87ER40371
 Resource Type:
 Accepted Manuscript
 Journal Name:
 IEEE Transactions on Parallel and Distributed Systems
 Additional Journal Information:
 Journal Volume: 28; Journal Issue: 6; Journal ID: ISSN 1045-9219
 Publisher:
 IEEE
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING; Sparse matrix multiplication; block eigensolver; configuration interaction; extended roofline model; tall-skinny matrices
Citation Formats
Aktulga, Hasan Metin, Afibuzzaman, Md., Williams, Samuel, Buluc, Aydin, Shao, Meiyue, Yang, Chao, Ng, Esmond G., Maris, Pieter, and Vary, James P. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations. United States: N. p., 2017.
Web. doi:10.1109/TPDS.2016.2630699.
Aktulga, Hasan Metin, Afibuzzaman, Md., Williams, Samuel, Buluc, Aydin, Shao, Meiyue, Yang, Chao, Ng, Esmond G., Maris, Pieter, & Vary, James P. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations. United States. https://doi.org/10.1109/TPDS.2016.2630699
Aktulga, Hasan Metin, Afibuzzaman, Md., Williams, Samuel, Buluc, Aydin, Shao, Meiyue, Yang, Chao, Ng, Esmond G., Maris, Pieter, and Vary, James P. 2017.
"A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations". United States. https://doi.org/10.1109/TPDS.2016.2630699. https://www.osti.gov/servlets/purl/1379875.
@article{osti_1379875,
title = {A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations},
author = {Aktulga, Hasan Metin and Afibuzzaman, Md. and Williams, Samuel and Buluc, Aydin and Shao, Meiyue and Yang, Chao and Ng, Esmond G. and Maris, Pieter and Vary, James P.},
abstractNote = {As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM$^T$ by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.},
doi = {10.1109/TPDS.2016.2630699},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 6,
volume = 28,
place = {United States},
year = {2017},
month = {6}
}