Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training
- Boston University
- Meta
- University of Rochester
- Indiana University-Bloomington
- Battelle (Pacific Northwest National Laboratory)
Deep Learning Recommendation Models (DLRMs) are critical applications in many domains and have grown into one of the largest classes of machine learning workloads. With trillions of parameters, DLRMs exceed the on-chip memory capacity of GPUs, so distributed inference and training require large-scale multi-node systems, which suffer from an all-to-all communication bottleneck that limits the scalability of ever-growing DLRMs. In recent years, SmartNICs have evolved to couple computation with communication, offering a powerful heterogeneous device within the system. However, no existing distributed system fully leverages these abundant SmartNIC resources to resolve the scalability issues of DLRMs. In this work, we propose a software-hardware co-design of a heterogeneous SmartNIC system that resolves the communication bottleneck of distributed DLRMs, mitigates memory bandwidth pressure, and improves computation efficiency. We provide a set of SmartNIC cache designs (including a local cache and a remote cache) and SmartNIC computation kernels that reduce data movement, relieve memory-lookup intensity, and improve the GPUs' computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches, optimizing overall system performance through higher data reuse. Our evaluation shows that the system achieves a 2.1x latency speedup for inference and a 1.6x throughput speedup for training.
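To make the local/remote cache idea in the abstract concrete, below is a minimal Python sketch of a two-tier embedding cache: a small local (on-NIC) LRU cache of hot embedding rows that falls back to a remote tier on a miss. This is a hypothetical illustration under assumed names (`TwoTierEmbeddingCache`, the capacity, the Zipfian workload), not the paper's implementation; in a real SmartNIC the remote fetch would be an RDMA read rather than an in-process array access.

```python
# Hypothetical sketch of the local/remote embedding cache concept.
# Not the paper's design: names, sizes, and the workload are assumptions.
from collections import OrderedDict
import numpy as np

EMBED_DIM = 16  # illustrative embedding width


class TwoTierEmbeddingCache:
    def __init__(self, local_capacity, remote_table):
        self.local = OrderedDict()           # LRU cache of hot rows (local tier)
        self.local_capacity = local_capacity
        self.remote_table = remote_table     # stand-in for the remote tier
        self.local_hits = 0
        self.remote_fetches = 0

    def lookup(self, row_id):
        if row_id in self.local:
            self.local.move_to_end(row_id)   # refresh LRU position on a hit
            self.local_hits += 1
            return self.local[row_id]
        # Miss: fetch from the remote tier (an RDMA read in practice)
        self.remote_fetches += 1
        row = self.remote_table[row_id]
        self.local[row_id] = row
        if len(self.local) > self.local_capacity:
            self.local.popitem(last=False)   # evict the least recently used row
        return row


# Usage: a skewed (Zipfian) access pattern, typical of DLRM embedding lookups,
# lets a small local cache absorb most of the traffic.
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, EMBED_DIM)).astype(np.float32)
cache = TwoTierEmbeddingCache(local_capacity=64, remote_table=table)
ids = rng.zipf(1.5, size=10_000) % 1000      # a few hot rows dominate
for i in ids:
    cache.lookup(int(i))
print(f"local hit rate: {cache.local_hits / len(ids):.2%}")
```

The abstract's graph algorithm would complement such a cache by reordering queries within a batch so that lookups sharing embedding rows land close together, raising the local hit rate further.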
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1988131
- Report Number(s): PNNL-SA-181666
- Country of Publication: United States
- Language: English
Similar Records
- OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model (Conference · July 10, 2024 · OSTI ID: 2439115)
- A Framework for Neural Network Inference on FPGA-Centric SmartNICs (Conference · September 30, 2022 · OSTI ID: 1964158)
- RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing (Conference · April 27, 2024 · OSTI ID: 2446788)