Scaling deep learning on GPU and Knights Landing clusters
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Training neural networks has become a major bottleneck; for example, training on the ImageNet dataset with a single Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators, but these accelerators have limited on-chip memory compared with CPUs. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. On the algorithmic side, we focus on Elastic Averaging SGD (EASGD), which scales poorly on clusters, and redesign four efficient variants for HPC systems. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons. Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We achieve 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
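For context, the core EASGD update that the abstract builds on pulls each worker's parameters elastically toward a shared center variable, while the center drifts toward the workers' average. A minimal NumPy sketch of one synchronous step is shown below; the function name, default hyperparameters, and single-process loop are illustrative assumptions, not the paper's HPC implementation.

```python
import numpy as np

def easgd_step(workers, center, grads, lr=0.01, alpha=0.001):
    """One synchronous EASGD update (illustrative sketch only).

    workers : list of per-worker parameter vectors x_i
    center  : shared center variable x_tilde
    grads   : list of stochastic gradients, one per worker
    alpha   : elastic coefficient (learning rate * rho in the EASGD paper)
    """
    new_workers = []
    elastic_sum = np.zeros_like(center)
    for x, g in zip(workers, grads):
        diff = x - center
        # Worker update: plain SGD step plus an elastic pull toward the center.
        new_workers.append(x - lr * g - alpha * diff)
        elastic_sum += diff
    # Center update: the center moves toward the average of the workers.
    new_center = center + alpha * elastic_sum
    return new_workers, new_center

# Toy usage: 4 workers on a 3-parameter model (random data for illustration).
workers = [np.random.randn(3) for _ in range(4)]
center = np.zeros(3)
grads = [np.random.randn(3) for _ in workers]
workers, center = easgd_step(workers, center, grads)
```

In the distributed variants described in the abstract, the per-worker loop corresponds to parallel workers on KNL or GPU nodes, and the center update is carried out through cluster communication (reductions or parameter-server exchanges) rather than a local loop.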
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1439212
- Journal Information:
- International Conference for High Performance Computing, Networking, Storage and Analysis; Journal Issue: 9; Vol. 2017; ISSN 2167-4329
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Cited By:
- StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory | conference | April 2019
- Reducing Data Motion to Accelerate the Training of Deep Neural Networks | preprint | January 2020
- GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent | preprint | January 2018
- Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters | journal | December 2019
- StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory | text | January 2019
- A Survey on Distributed Machine Learning | journal | March 2021
Similar Records
Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing