MetaStrider: Architectures for Scalable Memory-centric Reduction of Sparse Data Streams

Srikanth, Sriseshan; Jain, Anirudh; Lennon, Joseph M.; Conte, Thomas M.; Debenedictis, Erik; Cook, Jeanine

doi:10.1145/3355396

Journal Article · October 1, 2019 · ACM Transactions on Architecture and Code Optimization

DOI: https://doi.org/10.1145/3355396 · OSTI ID: 1697985

Srikanth, Sriseshan [1]; Jain, Anirudh [1]; Lennon, Joseph M. [1]; Conte, Thomas M. [1]; Debenedictis, Erik [2]; Cook, Jeanine [2]

[1] Georgia Inst. of Technology, Atlanta, GA (United States)
[2] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning, and HPC applications. However, these applications exhibit low locality of reference, rendering traditional architectures and data representations inefficient. This article presents MetaStrider, a significant algorithmic and architectural enhancement to the state-of-the-art, SuperStrider. Furthermore, these enhancements enable a variety of parallel, memory-centric architectures that we propose, resulting in demonstrated performance that scales near-linearly with available memory-level parallelism.
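The definition of reduction given in the abstract can be illustrated with a minimal software sketch. This is not MetaStrider's algorithm or architecture; it is a plain hash-map accumulation, shown only to make the operation concrete. The function name `reduce_stream`, the sample `stream`, and the choice of addition as the default operator are all assumptions made for illustration.

```python
from operator import add


def reduce_stream(pairs, op=add):
    """Combine the values of all (key, value) pairs that share a key.

    `op` is the reduction operator (addition by default); it is applied
    to the running result and each new value seen for the same key.
    """
    acc = {}
    for key, value in pairs:
        if key in acc:
            acc[key] = op(acc[key], value)  # key seen before: reduce
        else:
            acc[key] = value                # first occurrence: store as-is
    return acc


# Hypothetical sparse stream: keys arrive in no particular order and
# most possible keys never appear at all.
stream = [(7, 1.0), (2, 0.5), (7, 2.0), (9, 4.0), (2, 0.25)]

reduced = reduce_stream(stream)
print(reduced)  # {7: 3.0, 2: 0.75, 9: 4.0}
```

Each incoming pair triggers a lookup at an address determined by its key, so when keys are scattered the accesses are scattered too. This is the low locality of reference the abstract describes.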

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: AC04-94AL85000; NA0003525

OSTI ID: 1697985

Report Number(s): SAND--2019-9508J; 678475

Journal Information: ACM Transactions on Architecture and Code Optimization, Vol. 16, Issue 4; ISSN 1544-3566

Publisher: Association for Computing Machinery

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
DRAM
computer systems organization
hardware
memory and dense storage
memory-centric architectures
shared memory algorithms
sparse
special purpose systems
theory of computation