MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation

Wang, Qihan; Peng, Zhen; Ren, Bin; Chen, Jie; Edwards, Robert G.

doi:10.1145/3506705

Journal Article · March 24, 2022 · ACM Transactions on Architecture and Code Optimization

DOI: https://doi.org/10.1145/3506705 · OSTI ID: 1867362

Wang, Qihan ^[1]; Peng, Zhen ^[1]; Ren, Bin ^[1]; Chen, Jie ^[2]; Edwards, Robert G. ^[2]

^[1] College of William and Mary, Williamsburg, VA (United States)
^[2] Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA (United States)

The many-body correlation function is a fundamental computation kernel in modern physics computing applications, e.g., hadron contractions in lattice quantum chromodynamics (QCD). This kernel is both computation- and memory-intensive, involving a series of tensor contractions, and thus usually runs on accelerators such as GPUs. Existing optimizations of many-body correlation mainly focus on individual tensor contractions (e.g., cuBLAS and other libraries). In contrast, this work identifies a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes the memory management of many-body correlation by exploiting three opportunities to eliminate redundant memory allocation and communication. First, GPU memory allocation redundancy: an intermediate output frequently occurs as input in subsequent calculations. Second, CPU-GPU communication redundancy: although all tensors are allocated on both the CPU and the GPU, many of them are used (and reused) on the GPU side only, so many CPU-GPU communications (such as those in existing Unified Memory designs) are unnecessary. Third, GPU oversubscription: limited GPU memory size causes oversubscription, and existing memory management usually evicts data that will be reused in the near future, thus incurring extra CPU-GPU memory communication.
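The third point (near-reuse eviction under oversubscription) can be illustrated with a small simulation. The sketch below is not MemHC's policy, which this record does not describe; it only contrasts a plain least-recently-used (LRU) pool with an eviction rule that looks at the next use of each tensor (Belady's rule), assuming the whole access sequence is known in advance. The function names, the tensor labels, and the pool capacity are hypothetical.

```python
from collections import OrderedDict


def count_transfers_lru(accesses, capacity):
    """Count host-to-device transfers when the pool evicts the least recently used tensor."""
    resident = OrderedDict()  # keys are tensor names, ordered from oldest to newest use
    transfers = 0
    for tensor in accesses:
        if tensor in resident:
            resident.move_to_end(tensor)  # hit: refresh recency, no transfer needed
            continue
        transfers += 1  # miss: the tensor must be copied to the device
        if len(resident) >= capacity:
            resident.popitem(last=False)  # evict the least recently used tensor
        resident[tensor] = True
    return transfers


def count_transfers_farthest_reuse(accesses, capacity):
    """Count transfers when the pool evicts the tensor whose next use is farthest away."""
    resident = set()
    transfers = 0
    for i, tensor in enumerate(accesses):
        if tensor in resident:
            continue
        transfers += 1
        if len(resident) >= capacity:

            def next_use(name, start=i + 1):
                # Index of the next access to `name`, or len(accesses) if never reused.
                for j in range(start, len(accesses)):
                    if accesses[j] == name:
                        return j
                return len(accesses)

            # sorted() makes the choice deterministic when next-use distances tie.
            victim = max(sorted(resident), key=next_use)
            resident.remove(victim)
        resident.add(tensor)
    return transfers


if __name__ == "__main__":
    # Hypothetical schedule: three tensors reused cyclically, pool holds only two.
    schedule = ["A", "B", "C", "A", "B", "C"]
    print("LRU transfers:", count_transfers_lru(schedule, 2))
    print("Farthest-reuse transfers:", count_transfers_farthest_reuse(schedule, 2))
```

On this cyclic schedule LRU always evicts the tensor that is needed next, so every access triggers a transfer, while the reuse-aware rule keeps the soon-to-be-reused tensor resident and needs fewer transfers.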

Research Organization: Thomas Jefferson National Accelerator Facility, Newport News, VA (United States)

Sponsoring Organization: NSF; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Nuclear Physics (NP)

Grant/Contract Number: AC05-06OR23177

OSTI ID: 1867362

Report Number(s): DOE/OR/23177-5487; JLAB-CST-22-3602; CCF-2047516; DE-AC05-06OR23177; 17-SC-20-SC

Journal Information: ACM Transactions on Architecture and Code Optimization, Vol. 19, Issue 2; ISSN 1544-3566

Publisher: Association for Computing Machinery (ACM)

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
