OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scaling Deep Learning on GPU and Knights Landing clusters

Journal Article · 2017 · International Conference for High Performance Computing, Networking, Storage and Analysis (Online)
Authors: [1]; [2]; [1]
  1. Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)

The speed of training deep neural networks has become a major bottleneck in deep learning research and development. For example, training GoogLeNet on the ImageNet dataset takes 21 days on one Nvidia K20 GPU. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs, so to handle large datasets they must fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are designed mainly for cloud systems; these methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud environments. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updates, ordered by machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all our comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm codesign techniques to scale up the algorithms. By reducing the fraction of time spent on communication from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We obtain 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
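For readers unfamiliar with the method, the elastic averaging update underlying EASGD (introduced by Zhang et al., 2015) couples each worker's local parameters to a shared center variable: every worker takes a gradient step plus an elastic pull toward the center, and the center moves toward the workers. The sketch below is a minimal single-process simulation of that synchronous update on a toy quadratic objective; the worker count, step size, and elastic coefficient are illustrative assumptions and do not reproduce this paper's Sync EASGD implementation or its communication optimizations.

    import numpy as np

    # Minimal single-process simulation of the elastic averaging (EASGD-style) update.
    # Each simulated worker minimizes a toy quadratic f(x) = 0.5 * ||x - target||^2.
    # n_workers, lr (step size), and rho (elastic coefficient) are illustrative choices.
    n_workers, dim = 4, 8
    rng = np.random.default_rng(0)
    target = rng.standard_normal(dim)

    workers = [rng.standard_normal(dim) for _ in range(n_workers)]  # local parameters x_i
    center = np.zeros(dim)                                          # shared center variable

    lr, rho = 0.1, 0.1
    for step in range(500):
        diffs = []
        for i in range(n_workers):
            grad = workers[i] - target          # gradient of the toy objective at x_i
            diff = workers[i] - center          # elastic term pulling x_i toward the center
            diffs.append(diff)
            workers[i] = workers[i] - lr * (grad + rho * diff)
        # Center update: pulled toward the workers using the pre-update differences.
        center = center + lr * rho * np.sum(diffs, axis=0)

    print("center-to-optimum distance:", np.linalg.norm(center - target))

In a distributed implementation, the sum over the per-worker differences would become a single collective reduction per iteration, which is one reason a synchronous variant can map well onto HPC interconnects.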

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1398518
Journal Information:
International Conference for High Performance Computing, Networking, Storage and Analysis (Online), Vol. 2017; Conference: International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), Denver, CO (United States), 12-17 Nov 2017; ISSN 2167-4337
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 39 works (citation information provided by Web of Science)

References (12)

ImageNet: A large-scale hierarchical image database conference June 2009 · https://doi.org/10.1109/CVPR.2009.5206848
Deep Residual Learning for Image Recognition conference June 2016
Efficient mini-batch training for stochastic optimization conference January 2014
Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs preprint January 2016
Gradient-based learning applied to document recognition journal January 1998
Going Deeper with Convolutions preprint January 2014
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters conference June 2016
CA-SVM: Communication-Avoiding Support Vector Machines on Distributed Systems conference May 2015
Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs conference November 2016
Going deeper with convolutions conference June 2015
Efficiency Optimization of Trainable Feature Extractors for a Consumer Platform book January 2011
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs conference September 2014

Cited By (6)

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent preprint January 2018
A Survey on Distributed Machine Learning journal March 2021
Efficient MPI‐AllReduce for large‐scale deep learning on GPU‐clusters journal December 2019
StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory text January 2019
Reducing Data Motion to Accelerate the Training of Deep Neural Networks preprint January 2020
StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory conference April 2019 · https://doi.org/10.1145/3297858.3304031

Similar Records

Scaling deep learning on GPU and knights landing clusters
Journal Article · 2017 · International Conference for High Performance Computing, Networking, Storage and Analysis · OSTI ID:1398518

Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Journal Article · May 2018 · Future Generations Computer Systems · OSTI ID:1398518

Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Conference · Thu Aug 24 00:00:00 EDT 2017 · OSTI ID:1398518