TorchBraid: High-Performance Layer-Parallel Training of Deep Neural Networks with MPI and GPU Acceleration

Cyr, Eric Christopher; Hahne, Jens; Moore, Nicholas S.; Schroder, Jacob B.; Southworth, Ben S.; Vargas, David Alan

doi:10.1145/3759244

Journal Article · September 29, 2025 · ACM Transactions on Mathematical Software

DOI: https://doi.org/10.1145/3759244 · OSTI ID: 3005462

Cyr, Eric Christopher [1]; Hahne, Jens [2]; Moore, Nicholas S. [3]; Schroder, Jacob B. [4]; Southworth, Ben S. [5]; Vargas, David Alan [1]

[1] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
[2] Univ. of Wuppertal (Germany)
[3] West Texas A&M University, Canyon, TX (United States)
[4] Univ. of New Mexico, Albuquerque, NM (United States)
[5] Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

TorchBraid is a high-performance implementation of layer-parallel training for deep neural networks (DNNs) that supports MPI-based parallelism and GPU acceleration. Layer-parallel training has been developed to overcome the serialization inherent in forward and backward propagation of DNNs, which limits utilization of computational resources in the strong scaling limit. To achieve this, TorchBraid integrates the PyTorch neural network framework with the state-of-the-art XBraid time-parallel library. Furthermore, this article presents the use and performance of TorchBraid, in addition to solutions for overcoming the algorithmic challenges inherent in combining automatic differentiation with layer-parallel training. Results are presented with and without GPU acceleration for the Tiny ImageNet and MNIST image classification data sets, as well as for recurrent neural networks. Overall, TorchBraid enables fast training of DNNs in both strong and weak scaling contexts. In addition to the TorchBraid software, several new advances in applying layer-parallel algorithms are detailed. Integration of layer-parallel with data-parallel algorithms is presented for the first time, showing the computational advantages of the combination. Standard deep learning techniques, such as batch normalization, are developed for layer-parallel training. Finally, a new approach combining layer-parallel training with spatial coarsening to accelerate training for 3D image classification shows roughly a 10× speedup over serial execution.
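
The serialization that the abstract refers to, and the way a time-parallel iteration relaxes it, can be illustrated with a small sketch. The code below is not TorchBraid or XBraid code and uses none of their APIs. It is a pure-Python toy under stated assumptions: a scalar residual network whose layers are read as forward Euler steps, and a two-level parareal-style iteration, which is a simple relative of the multigrid-reduction-in-time approach named in the record's subject terms. All function names, the tanh layer, and the weights are hypothetical choices made for illustration.

```python
import math

def layer(u, w, b, h):
    """One residual layer u + h*tanh(w*u + b), i.e. a forward Euler step."""
    return u + h * math.tanh(w * u + b)

def serial_forward(u0, weights, biases, h):
    """Ordinary forward propagation: every layer waits for the previous one."""
    u = u0
    for w, b in zip(weights, biases):
        u = layer(u, w, b, h)
    return u

def fine_block(u, weights, biases, h, start, m):
    """Exact propagation through the m layers start .. start+m-1."""
    for i in range(start, start + m):
        u = layer(u, weights[i], biases[i], h)
    return u

def coarse_block(u, weights, biases, h, start, m):
    """Cheap surrogate for the same m layers: one layer with step size m*h."""
    return layer(u, weights[start], biases[start], m * h)

def layer_parallel_forward(u0, weights, biases, h, m, iterations):
    """Two-level parareal-style iteration over blocks of m layers."""
    n_blocks = len(weights) // m
    # Initial guess at the block boundaries from a serial coarse sweep.
    U = [u0]
    for j in range(n_blocks):
        U.append(coarse_block(U[j], weights, biases, h, j * m, m))
    for _ in range(iterations):
        # The fine block solves are independent of one another; in a real
        # layer-parallel code each block would run on its own rank or device.
        F = [fine_block(U[j], weights, biases, h, j * m, m) for j in range(n_blocks)]
        G_old = [coarse_block(U[j], weights, biases, h, j * m, m) for j in range(n_blocks)]
        # Only this cheap coarse correction sweep remains sequential.
        new_U = [u0]
        for j in range(n_blocks):
            G_new = coarse_block(new_U[j], weights, biases, h, j * m, m)
            new_U.append(G_new + F[j] - G_old[j])
        U = new_U
    return U[-1]

# Hypothetical 16-layer network with fixed, deterministic parameters.
n_layers, h = 16, 0.25
weights = [math.sin(0.3 * i) for i in range(n_layers)]
biases = [0.1 * math.cos(i) for i in range(n_layers)]

u_serial = serial_forward(1.0, weights, biases, h)
u_parallel = layer_parallel_forward(1.0, weights, biases, h, m=4, iterations=4)
print(abs(u_serial - u_parallel))
```

With four blocks, four iterations reproduce the serial result up to rounding, because each iteration makes one more block boundary exact. The practical benefit of such schemes comes from stopping after fewer iterations than there are blocks, trading a small error in the propagated state for concurrency across layers.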

Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: Deutsche Forschungsgemeinschaft (DFG); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: NA0003525

OSTI ID: 3005462

Report Number(s): LA-UR--24-20385; SAND--2025-14725J; 1784019

Journal Information: ACM Transactions on Mathematical Software, Vol. 51, Issue 3; ISSN 0098-3500; ISSN 1557-7295

Publisher: Association for Computing Machinery

Country of Publication: United States

Language: English

Related Subjects

Layer-parallel
deep neural networks
distributed machine learning
multigrid-reduction-in-time
parallel-in-time