U.S. Department of Energy
Office of Scientific and Technical Information

TorchBraid: High-Performance Layer-Parallel Training of Deep Neural Networks with MPI and GPU Acceleration

Journal Article · ACM Transactions on Mathematical Software
DOI: https://doi.org/10.1145/3759244 · OSTI ID: 3005462
TorchBraid is a high-performance implementation of layer-parallel training for deep neural networks (DNNs) that supports MPI-based parallelism and GPU acceleration. Layer-parallel training was developed to overcome the serialization inherent in forward and backward propagation through a DNN, which limits the utilization of computational resources in the strong scaling limit. To achieve this, TorchBraid integrates the PyTorch neural network framework with the state-of-the-art XBraid time-parallel library. This article presents the use and performance of TorchBraid, along with solutions to the algorithmic challenges inherent in combining automatic differentiation with layer-parallel training. Results are presented with and without GPU acceleration for the Tiny ImageNet and MNIST image classification data sets, as well as for recurrent neural networks. Overall, TorchBraid enables fast training of DNNs in both strong and weak scaling contexts. In addition to the TorchBraid software, several new advances in applying layer-parallel algorithms are detailed. The integration of layer-parallel with data-parallel algorithms is presented for the first time, showing the computational advantages of the combination. Standard deep learning techniques, such as batch normalization, are developed for layer-parallel training. Finally, a new approach that combines layer-parallel training with spatial coarsening to accelerate training for 3D image classification shows roughly a 10× speedup over serial execution.
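The serialization that the abstract refers to can be illustrated with a small sketch. The code below is not TorchBraid or XBraid code and does not use their APIs; it is a minimal NumPy illustration of Parareal, a closely related two-level parallel-in-time scheme, applied to forward propagation through a toy residual network. All names (`make_layers`, `layer_step`, `fine`, `coarse`, `parareal_forward`), the tanh residual layer, and the choice of coarse propagator are assumptions made for illustration only.

```python
import numpy as np

def make_layers(num_layers, width, seed=0):
    """Random (W, b) pairs for a toy residual network (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return [(0.5 * rng.standard_normal((width, width)),
             0.1 * rng.standard_normal(width)) for _ in range(num_layers)]

def layer_step(u, layer, h):
    """One residual layer, read as a forward Euler step of size h."""
    W, b = layer
    return u + h * np.tanh(W @ u + b)

def serial_forward(u0, layers, h):
    """Standard forward propagation: each layer waits for the previous one."""
    u = u0
    for layer in layers:
        u = layer_step(u, layer, h)
    return u

def fine(u, block, h):
    """Fine propagator: apply every layer of one block in sequence."""
    return serial_forward(u, block, h)

def coarse(u, block, h):
    """Coarse propagator (an assumed choice): one large step across the
    block, reusing the weights of the block's first layer."""
    return layer_step(u, block[0], h * len(block))

def parareal_forward(u0, layers, h, num_blocks, num_iters):
    """Parareal over blocks of layers; assumes len(layers) % num_blocks == 0."""
    m = len(layers) // num_blocks
    blocks = [layers[j * m:(j + 1) * m] for j in range(num_blocks)]
    # Initial guess at the block boundaries from a cheap serial coarse sweep.
    U = [u0]
    for blk in blocks:
        U.append(coarse(U[-1], blk, h))
    for _ in range(num_iters):
        # These evaluations are independent across blocks, so each block
        # could be assigned to its own process or device.
        F_old = [fine(U[j], blocks[j], h) for j in range(num_blocks)]
        G_old = [coarse(U[j], blocks[j], h) for j in range(num_blocks)]
        # Serial but cheap correction sweep over the block boundaries.
        U_new = [u0]
        for j in range(num_blocks):
            U_new.append(coarse(U_new[j], blocks[j], h) + F_old[j] - G_old[j])
        U = U_new
    return U[-1]

layers = make_layers(num_layers=16, width=4)
u0 = np.ones(4)
h = 1.0 / 16
reference = serial_forward(u0, layers, h)
for k in range(5):
    approx = parareal_forward(u0, layers, h, num_blocks=4, num_iters=k)
    print(k, np.linalg.norm(approx - reference))
```

In this sketch the expensive fine propagation runs independently on each block of layers, and only the cheap coarse sweep remains serial. After as many iterations as there are blocks, the result matches serial propagation up to rounding; a practical speedup depends on reaching acceptable accuracy in fewer iterations than that.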
Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
Deutsche Forschungsgemeinschaft (DFG); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
NA0003525
OSTI ID:
3005462
Report Number(s):
LA-UR--24-20385; SAND--2025-14725J; 1784019
Journal Information:
ACM Transactions on Mathematical Software, Vol. 51, Issue 3; ISSN 0098-3500; ISSN 1557-7295
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
