Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments
Abstract
Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So-called multilevel memories offer differing characteristics for each memory component, including variation in bandwidth, latency, and capacity. This paper investigates the performance of sparse matrix-matrix multiplication kernels on two leading high-performance computing architectures: Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploit the existence of multiple memory spaces in each hardware platform. We evaluate the performance of these methods with respect to standard algorithms that use the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse perform as well as multi-memory-aware algorithms on architectures such as KNLs, where the memory subsystems have similar latencies. However, for architectures such as GPUs, where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node, without depending on the software-managed cache mechanisms.
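The chunking idea described in the abstract can be sketched roughly as follows: partition the row space of the left operand so that each partial product fits a fast-memory budget, compute chunk by chunk, and assemble the result. This is a minimal illustrative sketch, not the report's actual algorithm; the `bytes_per_nnz` estimate and the chunk-size heuristic are assumptions made here for illustration.

```python
import numpy as np
import scipy.sparse as sp

def chunked_spgemm(A, B, fast_mem_bytes=64 * 2**20):
    """Compute the sparse product A @ B in row chunks sized so that each
    chunk's working set roughly fits a fast-memory budget.

    Illustrative only: on a real multilevel-memory node, the rows of A in
    the current chunk and the partial result would be staged in the fast
    memory (e.g. MCDRAM or GPU device memory), while B and the assembled
    output could live in the larger, slower memory.
    """
    bytes_per_nnz = 12  # assumed: ~8-byte value + 4-byte column index
    avg_row_nnz = max(1, A.nnz // A.shape[0])
    # Crude heuristic: budget ~4x the input-row footprint per chunk to
    # leave room for the partial product.
    rows_per_chunk = max(1, fast_mem_bytes // (avg_row_nnz * bytes_per_nnz * 4))
    parts = []
    for start in range(0, A.shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, A.shape[0])
        parts.append(A[start:stop] @ B)  # one chunk of the product
    return sp.vstack(parts).tocsr()
```

Each chunk is an independent sparse product, so this structure also lets problems whose full working set exceeds the fast memory proceed without relying on a software-managed cache, as the abstract notes.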
 Authors:
 Deveci, Mehmet; Hammond, Simon David; Wolf, Michael M.; Rajamanickam, Sivasankaran
 Sandia National Lab. (SNLNM), Albuquerque, NM (United States)
 Publication Date:
 April 2018
 Research Org.:
 Sandia National Lab. (SNLNM), Albuquerque, NM (United States)
 Sponsoring Org.:
 USDOE National Nuclear Security Administration (NNSA); USDOE Laboratory Directed Research and Development (LDRD) Program
 OSTI Identifier:
 1435688
 Report Number(s):
 SAND2018-3428R
662552
 DOE Contract Number:
 AC04-94AL85000; NA0003525
 Resource Type:
 Technical Report
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING
Citation Formats
Deveci, Mehmet, Hammond, Simon David, Wolf, Michael M., and Rajamanickam, Sivasankaran. Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments. United States: N. p., 2018.
Web. doi:10.2172/1435688.
Deveci, Mehmet, Hammond, Simon David, Wolf, Michael M., & Rajamanickam, Sivasankaran. Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments. United States. doi:10.2172/1435688.
Deveci, Mehmet, Hammond, Simon David, Wolf, Michael M., and Rajamanickam, Sivasankaran. "Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments". United States. doi:10.2172/1435688. https://www.osti.gov/servlets/purl/1435688.
@article{osti_1435688,
title = {Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments},
author = {Deveci, Mehmet and Hammond, Simon David and Wolf, Michael M. and Rajamanickam, Sivasankaran},
abstractNote = {Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So-called multilevel memories offer differing characteristics for each memory component, including variation in bandwidth, latency, and capacity. This paper investigates the performance of sparse matrix-matrix multiplication kernels on two leading high-performance computing architectures: Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploit the existence of multiple memory spaces in each hardware platform. We evaluate the performance of these methods with respect to standard algorithms that use the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse perform as well as multi-memory-aware algorithms on architectures such as KNLs, where the memory subsystems have similar latencies. However, for architectures such as GPUs, where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node, without depending on the software-managed cache mechanisms.},
doi = {10.2172/1435688},
place = {United States},
year = {2018},
month = {4}
}