DOE PAGES · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Addressing GPU memory limitations for Graph Neural Networks in High-Energy Physics applications

Journal Article · Frontiers in High Performance Computing

Introduction
Reconstructing low-level particle tracks in neutrino physics can address some of the most fundamental questions about the universe. However, processing petabytes of raw data with deep learning techniques poses a challenging problem for High Energy Physics (HEP). In the Exa.TrkX project, an illustrative HEP application, preprocessed simulation data is fed into a state-of-the-art Graph Neural Network (GNN) model accelerated by GPUs. Limited GPU memory, however, often leads to Out-of-Memory (OOM) exceptions during training because of the large model and dataset sizes. The problem is exacerbated when models are deployed on High-Performance Computing (HPC) systems designed for large-scale applications.

Methods
We observe severe workload imbalance during GNN model training, caused by the irregular sizes of the input graph samples in HEP datasets, which contributes to OOM exceptions. We aim to scale GNNs on HPC systems by prioritizing workload balance across graph inputs while maintaining model accuracy. This paper introduces diverse balancing strategies that decrease the maximum GPU memory footprint and avoid OOM exceptions across various datasets.

Results
Our experiments show memory reductions of up to 32.14% compared to the baseline. We also demonstrate that the proposed strategies avoid OOM in the target application. In addition, we build a distributed multi-GPU implementation using these samplers to demonstrate the scalability of these techniques on the HEP dataset.

Discussion
By assessing the performance of these strategies as data-loading samplers across multiple datasets, we gauge their effectiveness in both single-GPU and distributed environments. Our experiments, conducted on datasets of varying sizes and across multiple GPUs, broaden the applicability of this work to other GNN applications whose input datasets have irregular graph sizes.
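The abstract does not detail the balancing strategies themselves, but the general idea of a balance-aware batch sampler can be illustrated with a classic greedy number-partitioning heuristic (longest-processing-time first): assign each graph, largest first, to the currently lightest batch, so that the maximum per-batch node count, a rough proxy for peak GPU memory, stays small. This is a minimal sketch, not the paper's implementation; the function name and the example sizes are hypothetical:

```python
def balanced_batches(graph_sizes, num_batches):
    """Greedy LPT partitioning of graph samples into batches.

    Sorts graphs by size (largest first) and assigns each one to the
    batch with the smallest current total, reducing the maximum
    per-batch load compared to random batching.
    """
    order = sorted(range(len(graph_sizes)),
                   key=lambda i: graph_sizes[i], reverse=True)
    batches = [[] for _ in range(num_batches)]
    loads = [0] * num_batches
    for i in order:
        j = loads.index(min(loads))  # lightest batch so far
        batches[j].append(i)
        loads[j] += graph_sizes[i]
    return batches, loads

# Hypothetical irregular graph sizes (e.g., node counts per event sample)
sizes = [500, 20, 480, 30, 450, 60, 40, 420]
batches, loads = balanced_batches(sizes, num_batches=4)
print(loads)  # per-batch totals stay close to sum(sizes) / 4 = 500
```

In a PyTorch-style training loop, such a partitioner would typically back a custom batch sampler handed to the data loader, so that every batch (and, in the distributed case, every GPU rank) receives a comparable total workload.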

Sponsoring Organization:
USDOE Office of Science (SC), High Energy Physics (HEP)
Grant/Contract Number:
SC0019358; SC0021399; AC02-07CH11359; AC02-05CH11231
OSTI ID:
2446997
Journal Information:
Frontiers in High Performance Computing, Vol. 2; ISSN 2813-7337
Publisher:
Frontiers Media SA
Country of Publication:
Country unknown/Code not available
Language:
English

