OSTI.GOV — U.S. Department of Energy, Office of Scientific and Technical Information

Title: Scaling Deep Learning on GPU and Knights Landing clusters

Abstract

The speed of deep neural network training has become a major bottleneck in deep learning research and development. For example, training GoogLeNet on the ImageNet dataset with a single NVIDIA K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs, so to handle large datasets they must fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems; these methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud environments. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updating, in which communication is ordered by machine rank ID; this is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all of our comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm co-design techniques to scale up the algorithms. By reducing the communication share of runtime from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We obtain 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than that of the state-of-the-art implementation.
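For readers unfamiliar with the method, the sketch below illustrates one synchronous EASGD step in which the rank-ordered round-robin exchange described above is replaced by a single collective reduction. This is a minimal illustration only, assuming mpi4py and NumPy; the function name, hyperparameters, and gradient callback are placeholders and do not reflect the authors' actual implementation or co-design techniques.

```python
# Minimal sketch of one synchronous EASGD step (illustrative only, not the paper's code).
# Standard EASGD update rule (Zhang et al., 2015):
#   worker:  x_i <- x_i - eta * (grad_i + rho * (x_i - center))
#   center:  center <- center + eta * rho * sum_i (x_i - center)
import numpy as np
from mpi4py import MPI

def sync_easgd_step(x_local, center, grad_fn, eta=0.01, rho=0.1, comm=MPI.COMM_WORLD):
    """One synchronous EASGD iteration; the center is replicated on every rank."""
    # Local SGD step with an elastic penalty pulling x_local toward the center.
    grad = grad_fn(x_local)
    diff = x_local - center          # pre-update difference, used for the center update
    x_local = x_local - eta * (grad + rho * diff)

    # Sum the pre-update differences across all workers with one allreduce,
    # instead of rank-ordered round-robin communication.
    total_diff = np.empty_like(diff)
    comm.Allreduce(diff, total_diff, op=MPI.SUM)

    # Every rank applies the identical center update, keeping the center consistent.
    center = center + eta * rho * total_diff
    return x_local, center
```

Because every rank applies the same center update after the reduction, the step is deterministic, which is consistent with the abstract's description of Sync EASGD.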

Authors:
 You, Yang [1]; Buluc, Aydin [2]; Demmel, James [1]
  1. Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
Publication Date:
September 26, 2017
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1398518
Grant/Contract Number:
AC02-05CH11231
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
International Conference for High Performance Computing, Networking, Storage and Analysis (Online)
Additional Journal Information:
Journal Name: International Conference for High Performance Computing, Networking, Storage and Analysis (Online); Journal Volume: 2017; Conference: International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), Denver, CO (United States), 12-17 Nov 2017; Journal ID: ISSN 2167-4337
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; 97 MATHEMATICS AND COMPUTING; Distributed Deep Learning; Knights Landing; Scalable Algorithm

Citation Formats

You, Yang, Buluc, Aydin, and Demmel, James. Scaling Deep Learning on GPU and Knights Landing clusters. United States: N. p., 2017. Web. doi:10.1145/3126908.3126912.
You, Yang, Buluc, Aydin, & Demmel, James. Scaling Deep Learning on GPU and Knights Landing clusters. United States. doi:10.1145/3126908.3126912.
You, Yang, Buluc, Aydin, and Demmel, James. 2017. "Scaling Deep Learning on GPU and Knights Landing clusters". United States. doi:10.1145/3126908.3126912.
@article{osti_1398518,
title = {Scaling Deep Learning on GPU and Knights Landing clusters},
author = {You, Yang and Buluc, Aydin and Demmel, James},
abstractNote = {The speed of deep neural network training has become a major bottleneck in deep learning research and development. For example, training GoogLeNet on the ImageNet dataset with a single NVIDIA K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs, so to handle large datasets they must fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems; these methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud environments. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a round-robin method for communication and updating, in which communication is ordered by machine rank ID; this is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all of our comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm co-design techniques to scale up the algorithms. By reducing the communication share of runtime from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We obtain 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than that of the state-of-the-art implementation.},
doi = {10.1145/3126908.3126912},
journal = {International Conference for High Performance Computing, Networking, Storage and Analysis (Online)},
volume = {2017},
place = {United States},
year = {2017},
month = {sep}
}
