Flexible silicon photonic architecture for accelerating distributed deep learning
The increasing size and complexity of deep learning (DL) models have led to the wide adoption of distributed training methods in datacenters (DCs) and high-performance computing (HPC) systems. However, communication among distributed computing units (CUs) has emerged as a major bottleneck in the training process. In this study, we propose Flex-SiPAC, a flexible silicon photonic accelerated compute cluster designed to accelerate multi-tenant distributed DL training workloads. Flex-SiPAC takes a co-design approach that combines a silicon photonic hardware platform with a tailored collective algorithm, optimized to leverage the unique physical properties of the architecture. The hardware platform integrates a novel wavelength-reconfigurable transceiver design and a micro-resonator-based wavelength-reconfigurable switch, enabling the system to achieve flexible bandwidth steering in the wavelength domain. The collective algorithm is designed to support reconfigurable topologies, enabling efficient all-reduce communications that are commonly used in DL training. The feasibility of the Flex-SiPAC architecture is demonstrated through two testbed experiments. First, an optical testbed experiment demonstrates the flexible routing of wavelengths by shuffling an array of input wavelengths using a custom-designed spatial-wavelength selective switch. Second, a four-GPU testbed running two DL workloads shows a 23% improvement in job completion time compared to a similarly sized leaf-spine topology. We further evaluate Flex-SiPAC using large-scale simulations, which show that Flex-SiPAC is able to reduce the communication time by 26% to 29% compared to state-of-the-art compute clusters under representative collective operations.
- Sponsoring Organization: USDOE Advanced Research Projects Agency - Energy (ARPA-E)
- OSTI ID: 2280467
- Journal Information: Journal of Optical Communications and Networking, Vol. 16, Issue 2; ISSN JOCNBB; ISSN 1943-0620
- Publisher: Optical Society of America
- Country of Publication: United States
- Language: English
Similar Records
LEED: A Lightwave Energy-Efficient Datacenter
New trends in photonic switching and optical networking architectures for data centers and computing systems [Invited]