TorchBraid: High-Performance Layer-Parallel Training of Deep Neural Networks with MPI and GPU Acceleration
Journal Article · ACM Transactions on Mathematical Software
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Univ. of Wuppertal (Germany)
- West Texas A&M University, Canyon, TX (United States)
- Univ. of New Mexico, Albuquerque, NM (United States)
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
TorchBraid is a high-performance implementation of layer-parallel training for deep neural networks (DNNs) that supports MPI-based parallelism and GPU acceleration. Layer-parallel training was developed to overcome the serialization inherent in the forward and backward propagation of DNNs, which limits utilization of computational resources in the strong scaling limit. To achieve this, TorchBraid integrates the PyTorch neural network framework with the state-of-the-art XBraid time-parallel library. This article presents the use and performance of TorchBraid, along with solutions to the algorithmic challenges inherent in combining automatic differentiation with layer-parallel training. Results are presented with and without GPU acceleration for the Tiny ImageNet and MNIST image classification data sets, as well as for recurrent neural networks. Overall, TorchBraid enables fast training of DNNs in both strong and weak scaling contexts. In addition to the TorchBraid software, several new advances in applying layer-parallel algorithms are detailed. Integration of layer-parallel with data-parallel algorithms is presented for the first time, showing the computational advantages of the combination. Standard deep learning techniques, such as batch normalization, are adapted to layer-parallel training. Finally, a new approach that combines layer-parallel training with spatial coarsening to accelerate training for 3D image classification achieves roughly a 10× speedup over serial execution.
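To make the layer-parallel setting concrete, the sketch below (plain PyTorch with mpi4py, not the TorchBraid API) illustrates the ODE view of a residual network that underlies the method: each layer is one forward-Euler step x ← x + Δt·f(x), the steps are partitioned across MPI ranks, and a rank-to-rank forward sweep makes the serialization bottleneck explicit. Layer-parallel training replaces this sequential sweep with the multigrid-in-time (MGRIT) iteration provided by XBraid. The `StepLayer` class and the even partitioning of steps are illustrative assumptions, not the article's implementation.

```python
# Conceptual sketch only: serial-in-layer forward sweep that layer-parallel training removes.
from mpi4py import MPI
import torch
import torch.nn as nn

class StepLayer(nn.Module):
    """One 'time step' x <- x + dt * f(x) of the network ODE (hypothetical layer)."""
    def __init__(self, width, dt):
        super().__init__()
        self.dt = dt
        self.f = nn.Sequential(nn.Linear(width, width), nn.Tanh())

    def forward(self, x):
        return x + self.dt * self.f(x)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

num_steps, width, Tf = 8, 16, 1.0    # global layer count, feature width, final "time"
dt = Tf / num_steps
local_steps = num_steps // size      # assumes num_steps is divisible by the rank count
local_net = nn.Sequential(*[StepLayer(width, dt) for _ in range(local_steps)])

# Sequential forward sweep: each rank applies its block of layers, then hands the
# activation to the next rank.  This serialization is what the MGRIT-based
# layer-parallel iteration replaces with a parallel-in-layer solve.
x = torch.randn(4, width) if rank == 0 else comm.recv(source=rank - 1)
with torch.no_grad():
    y = local_net(x)
if rank < size - 1:
    comm.send(y, dest=rank + 1)
else:
    print("final activation norm:", y.norm().item())
```

Under such a partitioning, the same MPI communicator could additionally be split (for example with `comm.Split`) into groups that each process a different portion of the mini-batch, which is the kind of layer-parallel/data-parallel combination studied in the article.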
- Research Organization:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- Deutsche Forschungsgemeinschaft (DFG); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- NA0003525
- OSTI ID:
- 3005462
- Report Number(s):
- LA-UR-24-20385; SAND2025-14725J; 1784019
- Journal Information:
- ACM Transactions on Mathematical Software, Vol. 51, Issue 3; ISSN 0098-3500; ISSN 1557-7295
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English
Similar Records
- TorchBraid · Software · June 8, 2020 · OSTI ID: code-47108
- On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks · Technical Report · December 3, 2019 · OSTI ID: 1525811
- Scaling deep learning on GPU and knights landing clusters · Journal Article · International Conference for High Performance Computing, Networking, Storage and Analysis · December 31, 2016 · OSTI ID: 1439212