U.S. Department of Energy
Office of Scientific and Technical Information

MICCO: An Enhanced Multi-GPU Scheduling Framework for Many-Body Correlation Functions

Conference · 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  1. William & Mary, Department of Computer Science, Williamsburg, VA
  2. Jefferson Lab, Newport News, VA

Calculation of many-body correlation functions is one of the critical kernels in many scientific computing areas, especially Lattice Quantum Chromodynamics (Lattice QCD). It is formalized as a sum of a large number of contraction terms, each of which can be represented by a graph whose vertices describe quarks inside a hadron node and whose edges designate quark propagations at specific time intervals. Due to its computation- and memory-intensive nature, real-world physics systems (e.g., multi-meson or multi-baryon systems) explored by Lattice QCD benefit from multiple GPUs. Unlike general graph processing, many-body correlation function calculations exhibit two distinctive features: a large number of computation-/data-intensive kernels, and frequently repeated appearances of original and intermediate data. The former results in expensive memory operations such as tensor movements and evictions; the latter offers data-reuse opportunities to mitigate the data-intensive nature of the calculations. However, existing graph-based multi-GPU schedulers cannot capture these data-centric features, resulting in sub-optimal performance for many-body correlation function calculations. To address this issue, this paper presents a multi-GPU scheduling framework, MICCO, that accelerates contractions for correlation functions by explicitly taking the data dimension (e.g., data reuse and data eviction) into account. This work first performs a comprehensive study on the interplay of data reuse and load balance, and introduces two new concepts, local reuse pattern and reuse bound, to study the opportunity of achieving the optimal trade-off between them. Based on this study, MICCO proposes a heuristic scheduling algorithm and a machine-learning-based regression model to generate the optimal setting of reuse bounds.
Specifically, MICCO is integrated into a real-world Lattice QCD system, Redstar, running for the first time on multiple GPUs. The evaluation demonstrates that MICCO outperforms other state-of-the-art works, achieving up to 2.25× speedup on synthesized datasets and 1.49× speedup on real-world correlation functions.
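The abstract describes a trade-off between placing a contraction term where its input tensors are already resident (data reuse) and keeping GPU loads balanced, governed by a reuse bound. The paper's actual heuristic and regression model are not reproduced in this record; the following is only a minimal sketch of that kind of reuse-bounded greedy placement, with all names (`schedule_terms`, the tensor-id sets, `reuse_bound` as a load-gap threshold) being illustrative assumptions rather than MICCO's real interface.

```python
def schedule_terms(terms, num_gpus, reuse_bound):
    """Greedy sketch of reuse-bounded placement (illustrative, not MICCO's algorithm).

    `terms` maps a term id to the set of input tensor ids it reads.
    A term is placed on the GPU that already caches the most of its inputs
    (maximizing reuse), unless that GPU's load exceeds the least-loaded GPU
    by `reuse_bound` or more, in which case load balance wins.
    """
    load = [0] * num_gpus                       # tasks assigned per GPU
    cached = [set() for _ in range(num_gpus)]   # tensors resident per GPU
    placement = {}
    for tid, inputs in terms.items():
        # Reuse-first choice: GPU with maximum overlap with this term's inputs.
        best = max(range(num_gpus), key=lambda g: len(cached[g] & inputs))
        # Reuse bound exceeded -> fall back to the least-loaded GPU.
        if load[best] - min(load) >= reuse_bound:
            best = min(range(num_gpus), key=lambda g: load[g])
        placement[tid] = best
        load[best] += 1
        cached[best] |= set(inputs)
    return placement

# Hypothetical example: t0, t1, t3 share tensor A and cluster on one GPU,
# while t2 is diverted to the other GPU once the load gap hits the bound.
terms = {
    "t0": frozenset({"A", "B"}),
    "t1": frozenset({"A", "C"}),
    "t2": frozenset({"D", "E"}),
    "t3": frozenset({"A", "B"}),
}
plan = schedule_terms(terms, num_gpus=2, reuse_bound=2)
```

A larger `reuse_bound` lets placements chase reuse at the cost of skew; the paper's contribution includes learning a good bound per workload rather than fixing it by hand.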

Research Organization:
Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Nuclear Physics (NP)
DOE Contract Number:
AC05-06OR23177
OSTI ID:
1886910
Report Number(s):
JLAB-CST-22-3715; DOE/OR/23177-5614; NSF award CCF-2047516; 17-SC-20-SC Exascale Computing Project
Journal Information:
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Conference: IPDPS 2022, 30 May-3 June 2022, Lyon, France
Country of Publication:
United States
Language:
English


Similar Records

Efficient Parallelization of Irregular Applications on GPU Architectures
Thesis/Dissertation · 2024 · OSTI ID:2349242

MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation
Journal Article · 2022 · ACM Transactions on Architecture and Code Optimization · OSTI ID:1867362

Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)
Technical Report · 2019 · OSTI ID:1576175
