
MetaStrider: Architectures for Scalable Memory-centric Reduction of Sparse Data Streams

Journal Article · ACM Transactions on Architecture and Code Optimization
DOI: https://doi.org/10.1145/3355396 · OSTI ID: 1697985
 [1];  [1];  [1];  [1];  [2];  [2]
  1. Georgia Inst. of Technology, Atlanta, GA (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning, and HPC applications. However, these applications exhibit low locality of reference, rendering traditional architectures and data representations inefficient. This article presents MetaStrider, a significant algorithmic and architectural enhancement to the state of the art, SuperStrider. These enhancements also enable a variety of parallel, memory-centric architectures, which we propose, and yield demonstrated performance that scales near-linearly with available memory-level parallelism.
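The definition of reduction in the abstract can be illustrated with a minimal software sketch. This is not the MetaStrider or SuperStrider algorithm, which the record does not describe; it only shows what it means to combine the values of key-value pairs that share a key. The function name `reduce_stream`, the choice of addition as the default operator, and the sample stream are all assumptions made for illustration.

```python
from operator import add


def reduce_stream(pairs, op=add):
    """Combine the values of all (key, value) pairs that share a key.

    `pairs` is any iterable of (key, value) tuples; `op` is the binary
    reduction operator (addition by default, an illustrative choice).
    """
    result = {}
    for key, value in pairs:
        if key in result:
            # Key seen before: fold the new value into the running result.
            result[key] = op(result[key], value)
        else:
            # First occurrence of this key: store the value as-is.
            result[key] = value
    return result


# Hypothetical sparse stream: few repeated keys drawn from a large key space.
stream = [(7, 1.0), (2, 0.5), (7, 2.0), (1000003, 4.0), (2, 0.25)]
reduced = reduce_stream(stream)
print(reduced)  # {7: 3.0, 2: 0.75, 1000003: 4.0}
```

Because the keys arrive in no particular order and are spread over a large key space, each lookup in `result` touches an essentially unpredictable memory location, which is the low locality of reference the abstract refers to.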

Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000; NA0003525
Report Number(s):
SAND--2019-9508J; 678475
Journal Information:
ACM Transactions on Architecture and Code Optimization, Vol. 16, Issue 4; ISSN 1544-3566
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
