U.S. Department of Energy
Office of Scientific and Technical Information

MICCO: An Enhanced Multi-GPU Scheduling Framework for Many-Body Correlation Functions

Conference · 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  1. William & Mary, Department of Computer Science, Williamsburg, VA
  2. Jefferson Lab, Newport News, VA

Calculation of many-body correlation functions is one of the critical kernels in many scientific computing areas, especially Lattice Quantum Chromodynamics (Lattice QCD). It is formalized as a sum of a large number of contraction terms, each of which can be represented by a graph whose vertices describe quarks inside a hadron node and whose edges designate quark propagations at specific time intervals. Due to its computation- and memory-intensive nature, real-world physics systems (e.g., multi-meson or multi-baryon systems) explored by Lattice QCD benefit from multiple GPUs. Unlike general graph processing, many-body correlation function calculations exhibit two distinctive features: a large number of computation-/data-intensive kernels, and frequently repeated appearances of original and intermediate data. The former results in expensive memory operations such as tensor movements and evictions; the latter offers data-reuse opportunities to mitigate the data-intensive nature of the calculations. However, existing graph-based multi-GPU schedulers cannot capture these data-centric features, resulting in sub-optimal performance for many-body correlation function calculations. To address this issue, this paper presents a multi-GPU scheduling framework, MICCO, that accelerates contractions for correlation functions by explicitly taking the data dimension (e.g., data reuse and data eviction) into account. This work first performs a comprehensive study on the interplay of data reuse and load balance, and introduces two new concepts, local reuse pattern and reuse bound, to study the opportunity of achieving the optimal trade-off between them. Based on this study, MICCO proposes a heuristic scheduling algorithm and a machine-learning-based regression model to generate the optimal setting of reuse bounds.
Specifically, MICCO is integrated into a real-world Lattice QCD system, Redstar, running for the first time on multiple GPUs. The evaluation demonstrates that MICCO outperforms other state-of-the-art works, achieving up to 2.25× speedup on synthesized datasets and 1.49× speedup on real-world correlation functions.
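The abstract describes a trade-off between placing a contraction term where its input tensors are already resident (data reuse) and keeping GPU loads balanced, governed by a reuse bound. The paper's actual heuristic and regression model are not reproduced in this record; the following is only a minimal sketch of that kind of reuse-bounded greedy placement, with all names (`schedule_terms`, the tensor-id sets, `reuse_bound` as a load-gap threshold) being illustrative assumptions rather than MICCO's real interface.

```python
def schedule_terms(terms, num_gpus, reuse_bound):
    """Greedy sketch of reuse-bounded placement (illustrative, not MICCO's algorithm).

    `terms` maps a term id to the set of input tensor ids it reads.
    A term is placed on the GPU that already caches the most of its inputs
    (maximizing reuse), unless that GPU's load exceeds the least-loaded GPU
    by `reuse_bound` or more, in which case load balance wins.
    """
    load = [0] * num_gpus                       # tasks assigned per GPU
    cached = [set() for _ in range(num_gpus)]   # tensors resident per GPU
    placement = {}
    for tid, inputs in terms.items():
        # Reuse-first choice: GPU with maximum overlap with this term's inputs.
        best = max(range(num_gpus), key=lambda g: len(cached[g] & inputs))
        # Reuse bound exceeded -> fall back to the least-loaded GPU.
        if load[best] - min(load) >= reuse_bound:
            best = min(range(num_gpus), key=lambda g: load[g])
        placement[tid] = best
        load[best] += 1
        cached[best] |= set(inputs)
    return placement

# Hypothetical example: t0, t1, t3 share tensor A and cluster on one GPU,
# while t2 is diverted to the other GPU once the load gap hits the bound.
terms = {
    "t0": frozenset({"A", "B"}),
    "t1": frozenset({"A", "C"}),
    "t2": frozenset({"D", "E"}),
    "t3": frozenset({"A", "B"}),
}
plan = schedule_terms(terms, num_gpus=2, reuse_bound=2)
```

A larger `reuse_bound` lets placements chase reuse at the cost of skew; the paper's contribution includes learning a good bound per workload rather than fixing it by hand.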

Research Organization:
Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Nuclear Physics (NP)
DOE Contract Number:
AC05-06OR23177
OSTI ID:
1886910
Report Number(s):
JLAB-CST-22-3715; DOE/OR/23177-5614; NSF award CCF-2047516; 17-SC-20-SC Exascale Computing Project
Journal Information:
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Conference: IPDPS 2022, 30 May-3 June 2022, Lyon, France
Country of Publication:
United States
Language:
English


Similar Records

Efficient Parallelization of Irregular Applications on GPU Architectures
Thesis/Dissertation · 2024 · OSTI ID:2349242

MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation
Journal Article · 2022 · ACM Transactions on Architecture and Code Optimization · OSTI ID:1867362

Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)
Technical Report · 2019 · OSTI ID:1576175
