U.S. Department of Energy
Office of Scientific and Technical Information

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

Conference ·
 [1];  [2];  [1];  [1];  [3];  [2];  [4];  [5];  [1];  [3]
  1. Boston University
  2. Meta
  3. University of Rochester
  4. Indiana University-Bloomington
  5. BATTELLE (PACIFIC NW LAB)
Deep Learning Recommendation Models (DLRMs) are critical applications in various domains and have become one of the largest machine learning applications. With trillions of parameters, DLRMs exceed the on-chip memory capacity of GPUs. Distributed DLRM inference and training therefore require large-scale multi-node systems, which suffer from an all-to-all communication bottleneck that is the main limit on the scalability of ever-growing DLRMs. In recent years, SmartNICs have evolved to couple computation and communication capabilities, creating an opportunity to use them as a powerful heterogeneous device in the system. However, no existing distributed system fully leverages the abundant SmartNIC resources to resolve the scalability issue of DLRMs. In this work, we propose a software-hardware co-design of a heterogeneous SmartNIC system that resolves the communication bottleneck of distributed DLRMs, mitigates memory bandwidth pressure, and improves computation efficiency. We provide a set of SmartNIC designs: cache systems (a local cache and a remote cache) and SmartNIC computation kernels, which reduce data movement, relieve memory lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches, which raises data reuse and thereby optimizes overall system performance. Our evaluation shows that our system achieves a 2.1x latency speedup for inference and a 1.6x throughput speedup for training.
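The link the abstract draws between query locality within a batch and data reuse in a cache can be illustrated with a small sketch. This record does not describe the paper's graph algorithm or cache design, so everything below is an assumption made for illustration: the cache is a plain LRU keyed by embedding row id, and `reorder_greedy` is a simple greedy overlap heuristic standing in for a locality-improving reordering, not the authors' algorithm. All names (`LRUCache`, `count_misses`, `reorder_greedy`) are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache keyed by embedding row id (hypothetical stand-in
    for a cache placed in front of embedding table lookups)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, row_id):
        if row_id in self.rows:
            # Hit: mark the row as most recently used.
            self.rows.move_to_end(row_id)
            self.hits += 1
        else:
            # Miss: insert the row and evict the least recently used one
            # if the cache is over capacity.
            self.misses += 1
            self.rows[row_id] = True
            if len(self.rows) > self.capacity:
                self.rows.popitem(last=False)


def count_misses(queries, capacity):
    """Replay all row ids of all queries, in order, and count cache misses."""
    cache = LRUCache(capacity)
    for query in queries:
        for row_id in query:
            cache.lookup(row_id)
    return cache.misses


def reorder_greedy(queries):
    """Assumed heuristic: start from the first query, then repeatedly pick
    the remaining query that shares the most row ids with the last one."""
    remaining = list(queries)
    ordered = [remaining.pop(0)]
    while remaining:
        last = set(ordered[-1])
        best = max(range(len(remaining)),
                   key=lambda i: len(last & set(remaining[i])))
        ordered.append(remaining.pop(best))
    return ordered


# Four queries, each a list of embedding row ids; two pairs are identical.
queries = [[1, 2], [5, 6], [1, 2], [5, 6]]
original_misses = count_misses(queries, capacity=2)
reordered = reorder_greedy(queries)
reordered_misses = count_misses(reordered, capacity=2)
print(original_misses, reordered_misses)
```

In this toy batch the interleaved order evicts every row before it is reused, while grouping queries that share rows lets the second query of each pair hit the cache. The numbers say nothing about the speedups reported in the abstract.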
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1988131
Report Number(s):
PNNL-SA-181666
Country of Publication:
United States
Language:
English

Similar Records

OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
Conference · Wed Jul 10 00:00:00 EDT 2024 · OSTI ID:2439115

A Framework for Neural Network Inference on FPGA-Centric SmartNICs
Conference · Fri Sep 30 00:00:00 EDT 2022 · OSTI ID:1964158

RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
Conference · Sat Apr 27 00:00:00 EDT 2024 · OSTI ID:2446788
