OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Evaluating and optimizing the NERSC workload on Knights Landing

Abstract

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.
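
The architecture differences the abstract refers to are, chiefly, KNL's much larger core count, its wider AVX-512 vector units, and its on-package MCDRAM. As a minimal illustration (ours, not code from the paper), the C sketch below shows the two on-node parallelism levels such optimization work typically targets; the kernel and compiler flags are illustrative assumptions.

/* Thread- plus vector-level parallelism of the kind KNL rewards.
 * Illustrative build (Intel compiler): icc -qopenmp -xMIC-AVX512 saxpy.c */
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)

static double x[N], y[N];

int main(void)
{
    const double a = 2.0;

    /* "parallel for" spreads iterations over the manycore chip;
     * "simd" asks the compiler to fill the 512-bit vector lanes. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("ran with up to %d OpenMP threads\n", omp_get_max_threads());
    return 0;
}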

Authors:
Barnes, T; Cook, B; Deslippe, J; Doerfler, D; Friesen, B; He, Y; Kurth, T; Koskela, T; Lobet, M; Malas, T; Oliker, L; Ovsyannikov, A; Sarje, A; Vay, JL; Vincenti, H; Williams, S; Carrier, P; Wichmann, N; Wagner, M; Kent, P; Kerr, C; Dennis, J
Publication Date:
January 30, 2017
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1398462
DOE Contract Number:
AC02-05CH11231
Resource Type:
Conference
Resource Relation:
Conference: PMBS 2016: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT (United States), 13-18 Nov 2016
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Barnes, T, Cook, B, Deslippe, J, Doerfler, D, Friesen, B, He, Y, Kurth, T, Koskela, T, Lobet, M, Malas, T, Oliker, L, Ovsyannikov, A, Sarje, A, Vay, JL, Vincenti, H, Williams, S, Carrier, P, Wichmann, N, Wagner, M, Kent, P, Kerr, C, and Dennis, J. Evaluating and optimizing the NERSC workload on Knights Landing. United States: N. p., 2017. Web. doi:10.1109/PMBS.2016.010.
Barnes, T, Cook, B, Deslippe, J, Doerfler, D, Friesen, B, He, Y, Kurth, T, Koskela, T, Lobet, M, Malas, T, Oliker, L, Ovsyannikov, A, Sarje, A, Vay, JL, Vincenti, H, Williams, S, Carrier, P, Wichmann, N, Wagner, M, Kent, P, Kerr, C, & Dennis, J. Evaluating and optimizing the NERSC workload on Knights Landing. United States. doi:10.1109/PMBS.2016.010.
Barnes, T, Cook, B, Deslippe, J, Doerfler, D, Friesen, B, He, Y, Kurth, T, Koskela, T, Lobet, M, Malas, T, Oliker, L, Ovsyannikov, A, Sarje, A, Vay, JL, Vincenti, H, Williams, S, Carrier, P, Wichmann, N, Wagner, M, Kent, P, Kerr, C, and Dennis, J. 2017. "Evaluating and optimizing the NERSC workload on Knights Landing". United States. doi:10.1109/PMBS.2016.010. https://www.osti.gov/servlets/purl/1398462.
@article{osti_1398462,
title = {Evaluating and optimizing the NERSC workload on Knights Landing},
author = {Barnes, T and Cook, B and Deslippe, J and Doerfler, D and Friesen, B and He, Y and Kurth, T and Koskela, T and Lobet, M and Malas, T and Oliker, L and Ovsyannikov, A and Sarje, A and Vay, JL and Vincenti, H and Williams, S and Carrier, P and Wichmann, N and Wagner, M and Kent, P and Kerr, C and Dennis, J},
abstractNote = {NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.},
doi = {10.1109/PMBS.2016.010},
place = {United States},
year = {2017},
month = {jan}
}


Similar Records
  • There are many potential issues associated with deploying the Intel Xeon Phi™ (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi™ core is a fraction of that of a Xeon® core. In this paper, we examine the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus dedicating cores to computation, e.g., the trade-off between MPI and OpenMP (a hedged MPI+OpenMP sketch of this trade-off appears after this list). In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.
  • We profile and optimize calculations performed with the BerkeleyGW code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels and on BLAS and FFT libraries. We describe the optimization process and the performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread, and node-level parallelism. We discuss locality changes (including the consequences of the lack of an L3 cache) and effective use of the on-package high-bandwidth memory (see the MCDRAM sketch after this list). We show preliminary results on Knights Landing, including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well suited to many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band pairs, and frequencies.
  • Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet, and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall raw performance, the gap can close for some convolutional networks, and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.
  • Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks, and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.
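
The first record above weighs dedicating KNL cores to MPI communication against dedicating them to OpenMP compute. As a hedged sketch of that hybrid pattern (ours; the paper's measured rank/thread sweet spots are not given here), the program below launches a few MPI ranks per node and lets each rank fan out into OpenMP threads.

/* Hybrid MPI+OpenMP skeleton: fewer ranks per node concentrates network
 * traffic; more threads per rank keeps the remaining cores on compute.
 * Illustrative build/run: mpicc -qopenmp hybrid.c && srun -n 4 -c 17 ./a.out
 * (e.g., 4 ranks x 17 threads on a 68-core KNL node; counts are assumptions). */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread makes MPI calls, a common hybrid choice. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        #pragma omp single   /* print once per rank, from inside the thread team */
        printf("rank %d/%d computing with %d OpenMP threads\n",
               rank, nranks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}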
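The BerkeleyGW record above highlights effective use of KNL's on-package high-bandwidth memory. In flat mode that memory can be targeted explicitly; the sketch below uses the memkind library's hbw_* API. The buffer size and DDR fallback are our assumptions, not details from the paper.

/* Explicit MCDRAM placement on a flat-mode KNL via memkind (link with -lmemkind). */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void)
{
    size_t n = 1u << 24;    /* 16 Mi doubles: a bandwidth-critical working set */
    int have_hbw = (hbw_check_available() == 0);   /* 0 means MCDRAM is present */
    double *buf = have_hbw ? hbw_malloc(n * sizeof *buf)
                           : malloc(n * sizeof *buf);   /* fall back to DDR */
    if (!buf) return 1;

    for (size_t i = 0; i < n; i++)   /* streaming writes: where MCDRAM pays off */
        buf[i] = (double)i;
    printf("last element %g, placed in %s\n",
           buf[n - 1], have_hbw ? "MCDRAM" : "DDR");

    if (have_hbw) hbw_free(buf);
    else          free(buf);
    return 0;
}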