RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing

Conference
[1]; [1]; [1]; [2]; [3]; [4]
  1. University of California, Santa Barbara
  2. Amazon
  3. Pacific Northwest National Laboratory
  4. University of California, San Diego
Ensuring high-quality recommendations for newly onboarded users requires continuous retraining of Deep Learning Recommendation Models (DLRMs) on freshly generated data. To serve online DLRM retraining, existing solutions dedicate hundreds of CPU computing nodes to input preprocessing, incurring power consumption that surpasses even that of the GPU trainers. To address this, we propose RAP, an end-to-end DLRM training framework that supports Resource-aware Automated GPU sharing for DLRM input Preprocessing and Training. The core idea of RAP is to accurately capture the GPU computing resources left over during DLRM training and use them for input preprocessing, achieving high training efficiency without requiring additional hardware. Specifically, RAP uses a co-running cost model to efficiently estimate the cost of each input preprocessing operation, and it implements a resource-aware horizontal fusion technique that adaptively merges small kernels according to GPU availability, avoiding interference with DLRM training. In addition, RAP employs a heuristic search algorithm that jointly optimizes the input preprocessing graph mapping and the co-running schedule to maximize end-to-end DLRM training throughput. A comprehensive evaluation shows that RAP achieves a 78.3× average speedup over CPU-based DLRM input preprocessing frameworks, and its end-to-end training throughput is only 2.04% lower than an ideal baseline with no input preprocessing overhead.
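The resource-aware horizontal fusion idea from the abstract can be illustrated with a small sketch: preprocessing kernels are packed into fused groups only while their combined estimated cost fits within the GPU capacity left over by training. The sketch below is a hypothetical, CPU-only toy, not RAP's implementation; the operation names, cost values, and greedy packing policy are all assumptions made for illustration.

# Toy illustration of resource-aware horizontal fusion: pack small
# preprocessing kernels into fused groups whose combined estimated
# footprint stays within the GPU capacity left over by DLRM training.
# All names, cost numbers, and the greedy policy are hypothetical.
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    cost: float  # estimated fraction of GPU compute the kernel occupies

def fuse_within_budget(kernels: list[Kernel], leftover: float) -> list[list[Kernel]]:
    """Greedily pack kernels into fused groups whose total estimated
    cost stays within the leftover GPU budget."""
    groups: list[list[Kernel]] = []
    current: list[Kernel] = []
    used = 0.0
    for k in sorted(kernels, key=lambda k: k.cost):
        if current and used + k.cost > leftover:
            groups.append(current)  # close the group: budget exhausted
            current, used = [], 0.0
        current.append(k)
        used += k.cost
    if current:
        groups.append(current)
    return groups

if __name__ == "__main__":
    # Hypothetical per-operation costs from a co-running cost model.
    ops = [Kernel("fill_missing", 0.05), Kernel("hash_bucketize", 0.10),
           Kernel("log_transform", 0.04), Kernel("clip", 0.02),
           Kernel("feature_cross", 0.12)]
    for group in fuse_within_budget(ops, leftover=0.15):
        print([k.name for k in group])

With a leftover budget of 0.15, the sketch emits three fused groups; a real system would also have to weigh launch overhead and interference when choosing group boundaries, which is what the paper's cost model and schedule search address.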
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
2446788
Report Number(s):
PNNL-SA-189479
Country of Publication:
United States
Language:
English

Similar Records

MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
Conference · May 2020 · OSTI ID: 1649080

RACB: Resource Aware Cache Bypass on GPUs
Conference · 2014 International Symposium on Computer Architecture and High Performance Computing Workshop, 22–24 Oct. 2014, Paris, France · OSTI ID: 1567596

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training
Conference · June 2023 · OSTI ID: 1988131
