ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing

Tan, Cheng; Xie, Chenhao; Geng, Tong; Marquez, Andres; Tumeo, Antonino; Barker, Kevin J.; Li, Ang

doi:10.1109/tpds.2021.3081074

Journal Article · March 19, 2021 · IEEE Transactions on Parallel and Distributed Systems

DOI: https://doi.org/10.1109/tpds.2021.3081074 · OSTI ID: 1811825

Tan, Cheng ^[1]; Xie, Chenhao ^[1]; Geng, Tong ^[1]; Marquez, Andres ^[1]; Tumeo, Antonino ^[1]; Barker, Kevin J. ^[1]; Li, Ang ^[1]

^[1] Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Next-generation HPC systems and data centers are likely to be reconfigurable and data-centric, driven by the trend toward hardware specialization and the emergence of data-driven applications. In this work, we propose ARENA, an asynchronous reconfigurable accelerator ring architecture, as a potential scenario for what future HPC systems and data centers will look like. Although we use coarse-grained reconfigurable arrays (CGRAs) as the substrate platform, our key contribution is not only the CGRA-cluster design itself but also the combination of a new architecture and programming model that enables asynchronous tasking across a cluster of reconfigurable nodes, so as to bring specialized computation to the data rather than the reverse. We presume distributed data storage without assuming any prior knowledge of the data distribution. Hardware specialization occurs at runtime, when a task finds that the majority of the data it requires is available at the present node. In other words, we dynamically generate specialized CGRA accelerators where the data reside. The asynchronous tasking that brings computation to data is achieved by circulating a task token, which describes the dataflow graphs to be executed for a task, among the CGRA cluster connected by a fast ring network. Evaluations on a set of HPC and data-driven applications across different domains show that ARENA can provide better parallel scalability with reduced data movement (53.9 percent). Compared with contemporary compute-centric parallel models, ARENA delivers a 4.37× speedup on average. The synthesized CGRAs and their task dispatchers occupy only 2.93 mm² of chip area in a 45 nm process technology and can run at 800 MHz with an average power consumption of 759.8 mW. ARENA also supports the concurrent execution of multiple applications, offering ideal architectural support for future high-performance parallel computing and data analytics systems.
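The token-circulation idea in the abstract can be illustrated with a small software toy. The sketch below is not the authors' implementation or programming model: the names `TaskToken`, `Node`, and `circulate` are hypothetical, the data items are plain strings, and the fallback used when no node holds a majority of the data is an assumption made only so that the example terminates. It shows only the decision rule stated in the abstract: a task runs at the node where the majority of its required data resides, and is otherwise forwarded along the ring.

```python
from dataclasses import dataclass, field


@dataclass
class TaskToken:
    """Toy stand-in for a task token: a name plus the data items it needs."""
    name: str
    required: frozenset
    hops: int = 0  # number of times the token has been forwarded


@dataclass
class Node:
    """Toy stand-in for one ring node and the data stored locally on it."""
    node_id: int
    local_data: set = field(default_factory=set)

    def local_share(self, token):
        """Fraction of the token's required data that resides on this node."""
        if not token.required:
            return 1.0
        return len(token.required & self.local_data) / len(token.required)


def circulate(token, ring, start=0):
    """Pass the token around the ring until a node holds a majority of its data.

    Returns the id of the node where the task would execute.
    Fallback (an assumption, not taken from the paper): if no node holds a
    majority after one full lap, choose the node with the largest local share.
    """
    n = len(ring)
    best = ring[start]
    for step in range(n):
        node = ring[(start + step) % n]
        if node.local_share(token) > 0.5:
            return node.node_id  # majority of data is local: execute here
        if node.local_share(token) > best.local_share(token):
            best = node
        token.hops += 1  # forward the token to the next node on the ring
    return best.node_id


# Usage: four nodes with disjoint local data, one task needing three items.
ring = [Node(0, {"a"}), Node(1, {"b", "c"}), Node(2, {"d", "e", "f"}), Node(3)]
token = TaskToken("demo", frozenset({"b", "d", "e"}))
print(circulate(token, ring), token.hops)
```

In this toy, the data never moves; only the small token does, which is the intuition behind the reduced data movement reported in the abstract. Everything about the real hardware (CGRA configuration, dataflow graphs, the ring network itself) is deliberately left out.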

Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC05-76RL01830; 66150

OSTI ID: 1811825

Report Number(s): PNNL-SA-152862

Journal Information: IEEE Transactions on Parallel and Distributed Systems, Vol. 32, Issue 12; ISSN 1045-9219

Publisher: IEEE

Country of Publication: United States

Language: English



Related Subjects

97 MATHEMATICS AND COMPUTING
compute-flow-architecture
runtime reconfiguration
asynchronous parallel execution
abstract machine model
