OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scaling Deep Learning on GPU and Knights Landing clusters

Journal Article · 2017 · International Conference for High Performance Computing, Networking, Storage and Analysis (Online)
Authors: [1]; [2]; [1]
  1. Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)

The speed of training deep neural networks has become a major bottleneck in deep learning research and development. For example, training GoogLeNet on the ImageNet dataset takes 21 days on one Nvidia K20 GPU. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs, so to handle large datasets they must fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are designed mainly for cloud systems; these methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud environments. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updates, ordered by machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all our comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm codesign techniques to scale up the algorithms. By reducing the fraction of time spent on communication from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We obtain 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
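For readers unfamiliar with the method, the elastic averaging update underlying EASGD (introduced by Zhang et al., 2015) couples each worker's local parameters to a shared center variable: every worker takes a gradient step plus an elastic pull toward the center, and the center moves toward the workers. The sketch below is a minimal single-process simulation of that synchronous update on a toy quadratic objective; the worker count, step size, and elastic coefficient are illustrative assumptions and do not reproduce this paper's Sync EASGD implementation or its communication optimizations.

    import numpy as np

    # Minimal single-process simulation of the elastic averaging (EASGD-style) update.
    # Each simulated worker minimizes a toy quadratic f(x) = 0.5 * ||x - target||^2.
    # n_workers, lr (step size), and rho (elastic coefficient) are illustrative choices.
    n_workers, dim = 4, 8
    rng = np.random.default_rng(0)
    target = rng.standard_normal(dim)

    workers = [rng.standard_normal(dim) for _ in range(n_workers)]  # local parameters x_i
    center = np.zeros(dim)                                          # shared center variable

    lr, rho = 0.1, 0.1
    for step in range(500):
        diffs = []
        for i in range(n_workers):
            grad = workers[i] - target          # gradient of the toy objective at x_i
            diff = workers[i] - center          # elastic term pulling x_i toward the center
            diffs.append(diff)
            workers[i] = workers[i] - lr * (grad + rho * diff)
        # Center update: pulled toward the workers using the pre-update differences.
        center = center + lr * rho * np.sum(diffs, axis=0)

    print("center-to-optimum distance:", np.linalg.norm(center - target))

In a distributed implementation, the sum over the per-worker differences would become a single collective reduction per iteration, which is one reason a synchronous variant can map well onto HPC interconnects.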

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1398518
Journal Information:
International Conference for High Performance Computing, Networking, Storage and Analysis (Online), Vol. 2017; Conference: International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), Denver, CO (United States), 12-17 Nov 2017; ISSN 2167-4337
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 39 works (citation information provided by Web of Science)

References (12)

ImageNet: A large-scale hierarchical image database conference June 2009 · https://doi.org/10.1109/CVPR.2009.5206848
Deep Residual Learning for Image Recognition conference June 2016
Efficient mini-batch training for stochastic optimization conference January 2014
Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs preprint January 2016
Gradient-based learning applied to document recognition journal January 1998
Going Deeper with Convolutions preprint January 2014
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters conference June 2016
CA-SVM: Communication-Avoiding Support Vector Machines on Distributed Systems conference May 2015
Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs conference November 2016
Going deeper with convolutions conference June 2015
Efficiency Optimization of Trainable Feature Extractors for a Consumer Platform book January 2011
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs conference September 2014

Cited By (6)

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent preprint January 2018
A Survey on Distributed Machine Learning journal March 2021
Efficient MPI‐AllReduce for large‐scale deep learning on GPU‐clusters journal December 2019
StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory text January 2019
Reducing Data Motion to Accelerate the Training of Deep Neural Networks preprint January 2020
StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory conference April 2019 · https://doi.org/10.1145/3297858.3304031

Similar Records

Scaling deep learning on GPU and knights landing clusters
Journal Article · 2017 · International Conference for High Performance Computing, Networking, Storage and Analysis · OSTI ID:1398518

Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Journal Article · May 2018 · Future Generations Computer Systems · OSTI ID:1398518

Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Conference · Thu Aug 24 00:00:00 EDT 2017 · OSTI ID:1398518