The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
Journal Article · IEEE Transactions on Parallel and Distributed Systems
- Author Affiliations:
- Tokyo Institute of Technology (Japan); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Eidgenoessische Technische Hochschule (ETH), Zurich (Switzerland); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- University of Oregon, Eugene, OR (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- RIKEN Center for Computational Science, Hyogo (Japan); Tokyo Institute of Technology (Japan)
- Abstract:
Here, we present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks. Emerging deep learning-based scientific workflows often require model training with large, high-dimensional samples, which can make training much more costly, or even infeasible, due to excessive memory usage. We address these challenges by applying hybrid parallelism throughout the end-to-end training pipeline, including both computation and I/O. Our hybrid-parallel algorithm extends standard data parallelism with spatial parallelism, which partitions a single sample in the spatial domain, realizing strong scaling beyond the mini-batch dimension with a larger aggregate memory capacity. We evaluate the proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net. Our comprehensive performance studies show that good weak and strong scaling can be achieved for both networks using up to 2K GPUs. More importantly, we enable training of CosmoFlow with much larger samples than previously possible, realizing an order-of-magnitude improvement in prediction accuracy.
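The spatial-parallelism idea described in the abstract, partitioning a single 3D sample across processes so that each convolution only needs a thin halo of neighboring voxels, can be illustrated with a minimal single-process sketch. The sketch below is an assumption-laden toy (a plain 3x3x3 mean filter, a split along one spatial axis, NumPy in place of MPI and cuDNN), not the paper's LBANN implementation; the helper names `box3` and `spatial_partition_conv` are hypothetical. It only demonstrates why local sub-volume convolutions plus one-voxel halos reproduce the full-volume result.

```python
# Toy illustration of spatial parallelism for a 3D convolution (assumed names,
# not the paper's implementation): split one sample along z, give each "rank"
# a one-voxel halo from its neighbors, convolve locally, and stitch the results.
import numpy as np


def box3(vol):
    """3x3x3 mean filter, zero-padded, 'same' output size."""
    z, y, x = vol.shape
    p = np.pad(vol, 1)
    out = np.zeros_like(vol, dtype=float)
    for dz in range(3):
        for dy in range(3):
            for dx in range(3):
                out += p[dz:dz + z, dy:dy + y, dx:dx + x]
    return out / 27.0


def spatial_partition_conv(vol, num_ranks):
    """Emulate spatial parallelism: partition `vol` along z, add halos, filter locally."""
    z = vol.shape[0]
    bounds = np.linspace(0, z, num_ranks + 1, dtype=int)
    pieces = []
    for r in range(num_ranks):
        lo, hi = bounds[r], bounds[r + 1]
        # Halo slabs: one z-slice from each neighbor, zeros at the global boundary.
        # In a distributed run this is the data exchanged between adjacent ranks.
        top = vol[lo - 1:lo] if lo > 0 else np.zeros((1,) + vol.shape[1:], vol.dtype)
        bot = vol[hi:hi + 1] if hi < z else np.zeros((1,) + vol.shape[1:], vol.dtype)
        local = np.concatenate([top, vol[lo:hi], bot], axis=0)
        # Local convolution; drop the halo rows of the output before stitching.
        pieces.append(box3(local)[1:-1])
    return np.concatenate(pieces, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal((64, 32, 32))  # one large 3D sample
    assert np.allclose(box3(sample), spatial_partition_conv(sample, num_ranks=4))
    print("spatially partitioned convolution matches the full-volume result")
```

In an actual distributed run, the halo slabs would be exchanged between neighboring ranks (for example with MPI point-to-point messages) before each convolutional layer, and data parallelism would still operate across independent groups of such spatially partitioned ranks, which is the hybrid scheme the abstract refers to.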
- Research Organization:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- Exascale Computing Project; Japan Society for the Promotion of Science (JSPS); USDOE National Nuclear Security Administration (NNSA)
- Grant/Contract Number:
- AC02-05CH11231; AC52-07NA27344
- OSTI ID:
- 1959404
- Report Number(s):
- LLNL-JRNL-812691; 1019825
- Journal Information:
- IEEE Transactions on Parallel and Distributed Systems; Journal Issue: N/A; ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
References
- CosmoFlow: Using Deep Learning to Learn the Universe at Scale | conference | November 2018
- One weird trick for parallelizing convolutional neural networks | preprint | January 2014
- V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation | preprint | January 2016
- Infrastructure for Machine Learning: Ideas from Industry and Research | audiovisual | January 2019
- Mastering the game of Go with deep neural networks and tree search | journal | January 2016
- V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation | conference | October 2016
- Predicting statistics of asynchronous SGD parameters for a large-scale distributed deep learning system on GPU supercomputers | conference | December 2016
- ooc_cuDNN: Accommodating convolutional neural networks over GPU memory capacity | conference | December 2017
- Accelerating Deep Learning Frameworks with Micro-Batches | conference | September 2018
- Parallelizing Training of Deep Generative Models on Massive Scientific Datasets | conference | September 2019
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset | conference | July 2017
- Data sieving and collective I/O in ROMIO | conference | January 1999
- Towards Scalable Deep Learning via I/O Analysis and Optimization | conference | December 2017
- Learning Spatiotemporal Features with 3D Convolutional Networks | conference | December 2015
- Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism | conference | May 2019
- Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems | conference | September 2018
- vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | conference | October 2016
- Characterizing Deep-Learning I/O Workloads in TensorFlow | conference | November 2018
- Exascale Deep Learning for Climate Analytics | conference | November 2018
- Parallel netCDF: A High-Performance Scientific I/O Interface | conference | January 2003
- LBANN: livermore big artificial neural network HPC toolkit | conference | January 2015
- Superneurons | conference | February 2018
- Integrated Model, Batch, and Domain Parallelism in Training Neural Networks | conference | July 2018
- Channel and filter parallelism for large-scale CNN training | conference | November 2019
- Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis | journal | August 2019
- I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning | conference | August 2019
- PipeDream: generalized pipeline parallelism for DNN training | conference | October 2019
- Optimization of Collective Communication Operations in MPICH | journal | February 2005
- New Approaches in Turbulence and Transition Modeling Using Data-driven Techniques | conference | January 2015