MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation

Journal Article · ACM Transactions on Architecture and Code Optimization
DOI: https://doi.org/10.1145/3506705 · OSTI ID: 1867362
 [1];  [1];  [1];  [2];  [2]
  1. College of William and Mary, Williamsburg, VA (United States)
  2. Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA (United States)
The many-body correlation function is a fundamental computation kernel in modern physics computing applications, e.g., hadron contractions in lattice quantum chromodynamics (QCD). This kernel is both computation- and memory-intensive, involving a series of tensor contractions, and thus usually runs on accelerators such as GPUs. Existing optimizations of many-body correlation mainly focus on individual tensor contractions (e.g., cuBLAS libraries and others). In contrast, this work identifies a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes many-body correlation's memory management by exploiting three opportunities to eliminate redundant memory allocation and communication. First, GPU memory allocation redundancy: intermediate outputs frequently recur as inputs to subsequent calculations. Second, CPU-GPU communication redundancy: although all tensors are allocated on both the CPU and the GPU, many of them are used (and reused) on the GPU side only, so many CPU-GPU communications (such as those in existing Unified Memory designs) are unnecessary. Third, GPU oversubscription: limited GPU memory size causes oversubscription, and existing memory management usually evicts data that will be reused in the near future, incurring extra CPU-GPU memory communication.
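The third point (evicting data that is about to be reused) can be illustrated with a toy simulation. The sketch below is not MemHC's algorithm; it is a minimal model under stated assumptions: all tensors have the same size, device memory holds a fixed number of them, the contraction schedule is known in advance, and evicted tensors are assumed to remain available on the host. The names `count_transfers`, `next_use`, and `SCHEDULE` are hypothetical. It compares a least-recently-used (LRU) eviction policy with one that evicts the resident tensor whose next use lies farthest in the future, and counts host-to-device copies.

```python
from collections import OrderedDict

# Hypothetical schedule: each entry is (input_a, input_b, output).
# T1..T5 are intermediates produced on the device; A..D start on the host.
SCHEDULE = [
    ("A", "B", "T1"),
    ("C", "D", "T2"),
    ("T1", "T2", "T3"),
    ("A", "T3", "T4"),
    ("B", "T4", "T5"),
]


def next_use(schedule, name, step):
    """Index of the next contraction after `step` that reads `name` (inf if none)."""
    for i in range(step + 1, len(schedule)):
        if name in schedule[i][:2]:
            return i
    return float("inf")


def count_transfers(schedule, capacity, policy):
    """Count host-to-device copies needed to run `schedule`.

    capacity: number of equally sized tensors that fit on the device.
    policy:   "lru" or "reuse" (evict the tensor with the farthest next use).
    Assumption: an evicted tensor stays available on the host, so reading it
    again costs exactly one more host-to-device copy.
    """
    if capacity < 3:
        raise ValueError("need room for two inputs and one output")
    if policy not in ("lru", "reuse"):
        raise ValueError("unknown policy: " + policy)

    resident = OrderedDict()  # tensor name -> None, oldest access first
    transfers = 0
    for step, (a, b, out) in enumerate(schedule):
        pinned = {a, b, out}  # tensors of the current contraction cannot be evicted
        for name in (a, b, out):
            if name in resident:
                resident.move_to_end(name)  # refresh recency, no copy needed
                continue
            if len(resident) >= capacity:
                candidates = [t for t in resident if t not in pinned]
                if policy == "lru":
                    victim = candidates[0]
                else:
                    victim = max(candidates, key=lambda t: next_use(schedule, t, step))
                del resident[victim]
            resident[name] = None
            if name != out:  # outputs are produced on the device, not copied in
                transfers += 1
    return transfers


if __name__ == "__main__":
    for policy in ("lru", "reuse"):
        print(policy, count_transfers(SCHEDULE, capacity=5, policy=policy))
```

With this particular schedule and room for five tensors, the LRU policy performs 6 host-to-device copies and the reuse-aware policy performs 5, because LRU evicts `A` shortly before it is read again. With enough memory for every tensor, both policies need only the 4 initial copies.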
Research Organization:
Thomas Jefferson National Accelerator Facility, Newport News, VA (United States)
Sponsoring Organization:
NSF; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Nuclear Physics (NP)
Grant/Contract Number:
AC05-06OR23177
OSTI ID:
1867362
Report Number(s):
DOE/OR/23177-5487; JLAB-CST-22-3602; CCF-2047516; DE-AC05-06OR23177; 17-SC-20-SC
Journal Information:
Journal Name: ACM Transactions on Architecture and Code Optimization; Journal Volume: 19; Journal Issue: 2; ISSN 1544-3566
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
