RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
- University of California, Santa Barbara
- Amazon
- Pacific Northwest National Laboratory (Battelle)
- University of California, San Diego
Ensuring high-quality recommendations for newly onboarded users requires continuous retraining of Deep Learning Recommendation Models (DLRMs) with freshly generated data. To serve online DLRM retraining, existing solutions dedicate hundreds of CPU computing nodes to input preprocessing, incurring power consumption that surpasses even that of the GPU trainers. To this end, we propose RAP, an end-to-end DLRM training framework that supports Resource-aware Automated GPU sharing for DLRM input Preprocessing and Training. The core idea of RAP is to accurately capture the GPU computing resources left over during DLRM training and use them for input preprocessing, achieving superior training efficiency without requiring additional resources. Specifically, RAP utilizes a co-running cost model to efficiently assess the costs of various input preprocessing operations, and it implements a resource-aware horizontal fusion technique that adaptively merges smaller kernels according to GPU availability, avoiding interference with DLRM training. In addition, RAP leverages a heuristic search algorithm that jointly optimizes both the input preprocessing graph mapping and the co-running schedule to maximize end-to-end DLRM training throughput. A comprehensive evaluation shows that RAP achieves a 78.3× speedup on average over CPU-based DLRM input preprocessing frameworks. Moreover, the end-to-end training throughput of RAP is only 2.04% lower than the ideal case, which has no input preprocessing overhead.
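The resource-aware horizontal fusion idea in the abstract can be illustrated as a greedy grouping of small preprocessing kernels under a budget of GPU resources left over by training. The sketch below is an illustrative assumption, not RAP's actual implementation: the function name, the kernel names, and the SM-demand numbers are all hypothetical stand-ins for the paper's cost-model-driven scheduler.

```python
def horizontal_fusion(kernels, free_sms):
    """Greedily pack preprocessing kernels into fused groups.

    kernels:  list of (name, sm_demand) pairs, where sm_demand is a
              hypothetical per-kernel resource estimate (e.g. SMs needed).
    free_sms: GPU resources left over by DLRM training; each fused group
              must stay within this budget so it does not interfere
              with the co-running training kernels.
    """
    groups, current, used = [], [], 0
    # Place larger kernels first so small ones fill the remaining slack.
    for name, demand in sorted(kernels, key=lambda k: k[1], reverse=True):
        if used + demand > free_sms and current:
            # Budget exceeded: close the current fused group.
            groups.append(current)
            current, used = [], 0
        current.append(name)
        used += demand
    if current:
        groups.append(current)
    return groups


# Example with made-up preprocessing kernels and 12 leftover SMs:
ops = [("embed_hash", 10), ("fill_null", 4), ("logit_clip", 3), ("bucketize", 6)]
print(horizontal_fusion(ops, free_sms=12))
# → [['embed_hash'], ['bucketize', 'fill_null'], ['logit_clip']]
```

Each printed group would then be launched as one fused kernel, so the preprocessing work never claims more than the leftover budget at a time; RAP's actual system additionally co-optimizes this with the graph mapping and co-running schedule.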
- Research Organization:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 2446788
- Report Number(s):
- PNNL-SA-189479
- Country of Publication:
- United States
- Language:
- English
Similar Records
MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
Conference · May 2020 · OSTI ID: 1649080
RACB: Resource Aware Cache Bypass on GPUs
Conference · October 2014 · 2014 International Symposium on Computer Architecture and High Performance Computing Workshop; 22-24 Oct. 2014; Paris, France · OSTI ID: 1567596
Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training
Conference · June 2023 · OSTI ID: 1988131