U.S. Department of Energy
Office of Scientific and Technical Information

Flexible silicon photonic architecture for accelerating distributed deep learning

Journal Article · Journal of Optical Communications and Networking
DOI: https://doi.org/10.1364/JOCN.497372 · OSTI ID: 2280467

The increasing size and complexity of deep learning (DL) models have led to the wide adoption of distributed training methods in datacenters (DCs) and high-performance computing (HPC) systems. However, communication among distributed computing units (CUs) has emerged as a major bottleneck in the training process. In this study, we propose Flex-SiPAC, a flexible silicon photonic accelerated compute cluster designed to accelerate multi-tenant distributed DL training workloads. Flex-SiPAC takes a co-design approach that combines a silicon photonic hardware platform with a tailored collective algorithm, optimized to leverage the unique physical properties of the architecture. The hardware platform integrates a novel wavelength-reconfigurable transceiver design and a micro-resonator-based wavelength-reconfigurable switch, enabling the system to achieve flexible bandwidth steering in the wavelength domain. The collective algorithm is designed to support reconfigurable topologies, enabling efficient all-reduce communications that are commonly used in DL training. The feasibility of the Flex-SiPAC architecture is demonstrated through two testbed experiments. First, an optical testbed experiment demonstrates the flexible routing of wavelengths by shuffling an array of input wavelengths using a custom-designed spatial-wavelength selective switch. Second, a four-GPU testbed running two DL workloads shows a 23% improvement in job completion time compared to a similarly sized leaf-spine topology. We further evaluate Flex-SiPAC using large-scale simulations, which show that Flex-SiPAC is able to reduce the communication time by 26% to 29% compared to state-of-the-art compute clusters under representative collective operations.
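The collective algorithm described above targets all-reduce, the operation that dominates gradient synchronization in data-parallel DL training. As background only (this is not the paper's algorithm, and all names are illustrative), a minimal plain-Python simulation of the classic ring all-reduce pattern: each of n nodes splits its gradient into n chunks, a reduce-scatter phase leaves each node holding one fully summed chunk, and an all-gather phase circulates the completed chunks.

```python
# Background sketch, not from the paper: ring all-reduce simulated
# sequentially in plain Python. All names here are illustrative.

def ring_all_reduce(data):
    """data[i][c] = node i's value for chunk c; reduced in place."""
    n = len(data)
    # Phase 1: reduce-scatter. At step s, node i sends chunk (i - s) mod n
    # to node (i + 1) mod n, which accumulates it. After n - 1 steps each
    # node holds exactly one fully summed chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] += data[i][c]
    # Phase 2: all-gather. At step s, node i forwards the now-complete
    # chunk (i + 1 - s) mod n to node (i + 1) mod n, which overwrites its
    # stale copy. After n - 1 steps every node holds every summed chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]
    return data

# Example: 4 nodes, where chunk c on node i starts as i * 10 + c.
nodes = [[i * 10 + c for c in range(4)] for i in range(4)]
ring_all_reduce(nodes)
# Every node now holds the elementwise sum across nodes for every chunk.
```

Each node sends 2(n - 1)/n of its gradient in total, which is why ring all-reduce is bandwidth-optimal and why the inter-node links it traverses become the bottleneck the paper's wavelength steering targets.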

Sponsoring Organization:
USDOE Advanced Research Projects Agency - Energy (ARPA-E)
OSTI ID:
2280467
Journal Information:
Journal of Optical Communications and Networking, Vol. 16, Issue 2; ISSN 1943-0620
Publisher:
Optical Society of America
Country of Publication:
United States
Language:
English


Similar Records

Distributed deep learning training using silicon photonic switched architectures
Journal Article · 2022 · APL Photonics · OSTI ID: 1978979

LEED: A Lightwave Energy-Efficient Datacenter
Technical Report · 2024 · OSTI ID: 2565965

New trends in photonic switching and optical networking architectures for data centers and computing systems [Invited]
Journal Article · 2022 · Journal of Optical Communications and Networking · OSTI ID: 2421350
