Scaling deep learning on GPU and Knights Landing clusters
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Training neural networks has become a major bottleneck; for example, training on the ImageNet dataset with a single Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators, but these accelerators have limited on-chip memory compared with CPUs. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. On the algorithmic side, we focus on Elastic Averaging SGD (EASGD), which scales poorly on clusters, and redesign four efficient variants for HPC systems. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons. Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We achieve 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
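For context, the core EASGD update that the abstract builds on pulls each worker's parameters elastically toward a shared center variable, while the center drifts toward the workers' average. A minimal NumPy sketch of one synchronous step is shown below; the function name, default hyperparameters, and single-process loop are illustrative assumptions, not the paper's HPC implementation.

```python
import numpy as np

def easgd_step(workers, center, grads, lr=0.01, alpha=0.001):
    """One synchronous EASGD update (illustrative sketch only).

    workers : list of per-worker parameter vectors x_i
    center  : shared center variable x_tilde
    grads   : list of stochastic gradients, one per worker
    alpha   : elastic coefficient (learning rate * rho in the EASGD paper)
    """
    new_workers = []
    elastic_sum = np.zeros_like(center)
    for x, g in zip(workers, grads):
        diff = x - center
        # Worker update: plain SGD step plus an elastic pull toward the center.
        new_workers.append(x - lr * g - alpha * diff)
        elastic_sum += diff
    # Center update: the center moves toward the average of the workers.
    new_center = center + alpha * elastic_sum
    return new_workers, new_center

# Toy usage: 4 workers on a 3-parameter model (random data for illustration).
workers = [np.random.randn(3) for _ in range(4)]
center = np.zeros(3)
grads = [np.random.randn(3) for _ in workers]
workers, center = easgd_step(workers, center, grads)
```

In the distributed variants described in the abstract, the per-worker loop corresponds to parallel workers on KNL or GPU nodes, and the center update is carried out through cluster communication (reductions or parameter-server exchanges) rather than a local loop.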
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1439212
- Journal Information:
- International Conference for High Performance Computing, Networking, Storage and Analysis; Journal Issue: 9; Vol. 2017; ISSN 2167-4329
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Cited By:
- StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory | conference | April 2019
- Reducing Data Motion to Accelerate the Training of Deep Neural Networks | preprint | January 2020
- GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent | preprint | January 2018
- Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters | journal | December 2019
- StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory | text | January 2019
- A Survey on Distributed Machine Learning | journal | March 2021
Similar Records
Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing