Distributed deep learning training using silicon photonic switched architectures
- Columbia University, New York, NY (United States)
The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to address these challenges and accelerate distributed deep learning training. In addition, the proposed architecture uses a highly integrated, operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate the feasibility of our proposal, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show improvements of up to 3.6× over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree, and a further 11% improvement when higher-layer bandwidth steering is employed. Together, these results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.
- Research Organization:
- Columbia University, New York, NY (United States)
- Sponsoring Organization:
- National Security Agency (NSA); USDOE Advanced Research Projects Agency - Energy (ARPA-E); USDOE Office of Science (SC), Office of SBIR/STTR Programs (SBIR/STTR)
- Grant/Contract Number:
- AR0000843
- OSTI ID:
- 1978979
- Journal Information:
- APL Photonics, Vol. 7, Issue 3; ISSN 2378-0967
- Publisher:
- American Institute of Physics (AIP)
- Country of Publication:
- United States
- Language:
- English
Similar Records
Flexible silicon photonic architecture for accelerating distributed deep learning
Journal Article · Mon Jan 08 19:00:00 EST 2024 · Journal of Optical Communications and Networking · OSTI ID:2280467
Performance trade-offs in reconfigurable networks for HPC
Journal Article · Tue May 10 20:00:00 EDT 2022 · Journal of Optical Communications and Networking · OSTI ID:1874993
Optics Enabled Networks and Architectures for Data Center Cost and Power Efficiency
Journal Article · Thu Oct 14 20:00:00 EDT 2021 · Journal of Optical Communications and Networking · OSTI ID:1828354