Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Abstract
Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.
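The abstract's distinction between strong scaling (fixed problem size, more devices) and weak scaling (problem size grows with device count) can be made concrete with the standard efficiency formulas. The sketch below is purely illustrative; the throughput and timing numbers are invented for the example and are not results from the paper.

```python
def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Speedup on n devices relative to ideal linear speedup.

    t1: time to solution on 1 device; tn: time on n devices.
    """
    return (t1 / tn) / n

def weak_scaling_efficiency(r1: float, rn: float) -> float:
    """Per-device throughput on n devices relative to a single device,
    with the problem size (e.g., global minibatch) growing with n.

    r1: images/sec per device at n=1; rn: images/sec per device at n devices.
    """
    return rn / r1

# Hypothetical example: an epoch shrinks from 100 s to 15 s on 8 GPUs
# (strong scaling), while per-GPU throughput drops from 500 to 450
# images/sec when each GPU keeps its local batch size (weak scaling).
print(strong_scaling_efficiency(100.0, 15.0, 8))  # ~0.83
print(weak_scaling_efficiency(500.0, 450.0))      # 0.9
```

This matches the abstract's observation qualitatively: weak scaling keeps per-device work constant, so inter-device communication (e.g., over NVLink) is a smaller fraction of each step, and interconnect bandwidth matters less than in the strong-scaling case.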
- Authors:
- Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; Tallent, Nathan R.; Vishnu, Abhinav
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Publication Date:
- May 5, 2018
- Research Org.:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1617450
- Alternate Identifier(s):
- OSTI ID: 1778383
- Report Number(s):
- PNNL-SA-134513
Journal ID: ISSN 0167-739X
- Grant/Contract Number:
- AC05-76RL01830
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Future Generation Computer Systems
- Additional Journal Information:
- Journal Volume: 108; Journal Issue: C; Journal ID: ISSN 0167-739X
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; NVIDIA DGX-1; Intel Knights Landing; Caffe; MaTEx; Deep learning; Convolutional neural networks
Citation Formats
Gawande, Nitin A., Daily, Jeff A., Siegel, Charles, Tallent, Nathan R., and Vishnu, Abhinav. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing. United States: N. p., 2018.
Web. doi:10.1016/j.future.2018.04.073.
Gawande, Nitin A., Daily, Jeff A., Siegel, Charles, Tallent, Nathan R., & Vishnu, Abhinav. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing. United States. https://doi.org/10.1016/j.future.2018.04.073
Gawande, Nitin A., Daily, Jeff A., Siegel, Charles, Tallent, Nathan R., and Vishnu, Abhinav. 2018.
"Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing". United States. https://doi.org/10.1016/j.future.2018.04.073. https://www.osti.gov/servlets/purl/1617450.
@article{osti_1617450,
title = {Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing},
author = {Gawande, Nitin A. and Daily, Jeff A. and Siegel, Charles and Tallent, Nathan R. and Vishnu, Abhinav},
abstractNote = {Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.},
doi = {10.1016/j.future.2018.04.073},
journal = {Future Generation Computer Systems},
number = {C},
volume = {108},
place = {United States},
year = {2018},
month = {5}
}
Works referenced in this record:
Going deeper with convolutions
conference, June 2015
- Szegedy, Christian
- 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Searching for exotic particles in high-energy physics with deep learning
journal, July 2014
- Baldi, P.; Sadowski, P.; Whiteson, D.
- Nature Communications, Vol. 5, Issue 1
Caffe: Convolutional Architecture for Fast Feature Embedding
conference, January 2014
- Jia, Yangqing; Shelhamer, Evan; Donahue, Jeff
- Proceedings of the ACM International Conference on Multimedia - MM '14
Theano: A CPU and GPU Math Compiler in Python
conference, January 2010
- Bergstra, James; Breuleux, Olivier; Bastien, Frédéric
- Proceedings of the Python in Science Conference
Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
conference, May 2017
- Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeff A.
- 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Deep Residual Learning for Image Recognition
conference, June 2016
- He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Knights Landing: Second-Generation Intel Xeon Phi Product
journal, March 2016
- Sodani, Avinash; Gramunt, Roger; Corbal, Jesus
- IEEE Micro, Vol. 36, Issue 2
80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition
journal, November 2008
- Torralba, A.; Fergus, R.; Freeman, W. T.
- IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, Issue 11
ImageNet Large Scale Visual Recognition Challenge
journal, April 2015
- Russakovsky, Olga; Deng, Jia; Su, Hao
- International Journal of Computer Vision, Vol. 115, Issue 3
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters
conference, June 2016
- Iandola, Forrest N.; Moskewicz, Matthew W.; Ashraf, Khalid
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
RAPL: memory power estimation and capping
conference, January 2010
- David, Howard; Gorbatov, Eugene; Hanebutte, Ulf R.
- Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design - ISLPED '10
Benchmarking State-of-the-Art Deep Learning Software Tools
conference, November 2016
- Shi, Shaohuai; Wang, Qiang; Xu, Pengfei
- 2016 7th International Conference on Cloud Computing and Big Data (CCBD)
Large Minibatch Training on Supercomputers with Improved Accuracy and Reduced Time to Train
conference, November 2018
- Codreanu, Valeriu; Podareanu, Damian; Saletore, Vikram
- 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
Works referencing / citing this record:
Applications of Artificial Intelligence Methodologies to Behavioral and Social Sciences
journal, December 2019
- Robila, Mihaela; Robila, Stefan A.
- Journal of Child and Family Studies, Vol. 29, Issue 10
A Framework for Memory Oversubscription Management in Graphics Processing Units
conference, April 2019
- Li, Chen; Ausavarungnirun, Rachata; Rossbach, Christopher J.
- ASPLOS '19: Architectural Support for Programming Languages and Operating Systems, Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems