Title: Fusion PIC code performance analysis on the Cori KNL system

Abstract

We study the attainable performance of Particle-In-Cell (PIC) codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1. We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance, and we focus optimization efforts there. Particle push kernels operate at high arithmetic intensity (AI) and are not likely to be memory bandwidth or even cache bandwidth bound on KNL. Therefore, we see only minor benefits from the high bandwidth memory available on KNL, and achieving good vectorization is shown to be the most beneficial optimization path, with a theoretical yield of up to 8x speedup on KNL. In practice we are able to obtain up to a 4x gain from vectorization due to limitations set by the data layout and memory latency.
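
The bandwidth argument in the abstract can be illustrated with a minimal roofline sketch. It is not taken from the paper: the peak and bandwidth figures below are hypothetical placeholders rather than measured Cori KNL values, and the function names are invented for this illustration. The sketch shows that, under a simple roofline bound, a kernel with high AI reaches the same ceiling with slow or fast memory, and that a 512-bit vector register holds 8 double-precision values, which matches the 8x theoretical vectorization yield quoted above.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute peak and AI * memory bandwidth.

    ai is the arithmetic intensity in FLOPs per byte moved.
    """
    return min(peak_gflops, ai * bandwidth_gbs)


def vector_lanes(register_bits=512, element_bits=64):
    """Number of elements processed per vector instruction."""
    return register_bits // element_bits


# Hypothetical placeholder values, NOT measurements from the paper.
PEAK_GFLOPS = 2000.0  # assumed vectorized compute peak
DDR_BW_GBS = 90.0     # assumed bandwidth of the slower memory
HBM_BW_GBS = 400.0    # assumed bandwidth of the high bandwidth memory

for ai in (0.5, 50.0):
    ddr = attainable_gflops(ai, PEAK_GFLOPS, DDR_BW_GBS)
    hbm = attainable_gflops(ai, PEAK_GFLOPS, HBM_BW_GBS)
    print(f"AI = {ai:5.1f} FLOP/byte: DDR bound {ddr:7.1f}, HBM bound {hbm:7.1f} GFLOP/s")

print("double-precision lanes per 512-bit register:", vector_lanes())
```

With these placeholder numbers the low-AI case is limited by bandwidth and gains from the faster memory, while the high-AI case hits the compute ceiling either way, so only better use of the vector units raises its bound.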

Authors:
 Koskela, Tuomas S. [1]; Deslippe, Jack [1]; Friesen, Brian [1]; Raman, Karthic [2]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
  2. INTEL Corp. (United States)
Publication Date:
2017-05-25
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1412519
DOE Contract Number:
AC02-05CH11231
Resource Type:
Conference
Resource Relation:
Conference: Cray User Group Conference 2017, Redmond, WA (United States), 9-11 May 2017
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Koskela, Tuomas S., Deslippe, Jack, Friesen, Brian, and Raman, Karthic. Fusion PIC code performance analysis on the Cori KNL system. United States: N. p., 2017. Web.
Koskela, Tuomas S., Deslippe, Jack, Friesen, Brian, & Raman, Karthic. Fusion PIC code performance analysis on the Cori KNL system. United States.
Koskela, Tuomas S., Deslippe, Jack, Friesen, Brian, and Raman, Karthic. 2017. "Fusion PIC code performance analysis on the Cori KNL system". United States. https://www.osti.gov/servlets/purl/1412519.
@article{osti_1412519,
title = {Fusion PIC code performance analysis on the Cori KNL system},
author = {Koskela, Tuomas S. and Deslippe, Jack and Friesen, Brian and Raman, Karthic},
abstractNote = {We study the attainable performance of Particle-In-Cell codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1. We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance and focus optimization efforts there. Particle push kernels operate at high AI and are not likely to be memory bandwidth or even cache bandwidth bound on KNL. Therefore, we see only minor benefits from the high bandwidth memory available on KNL, and achieving good vectorization is shown to be the most beneficial optimization path with theoretical yield of up to 8x speedup on KNL. In practice we are able to obtain up to a 4x gain from vectorization due to limitations set by the data layout and memory latency.},
place = {United States},
year = {2017},
month = {5}
}

  • In this paper we present the results of optimizing the performance of the gyrokinetic full-f fusion PIC code XGC1 on the Cori Phase Two Knights Landing system. The code has undergone substantial development to enable the use of vector instructions in its most expensive kernels within the NERSC Exascale Science Applications Program. We study the single-node performance of the code on an absolute scale using the roofline methodology to guide optimization efforts. We have obtained 2x speedups in single-node performance due to enabling vectorization and performing memory layout optimizations. On multiple nodes, the code is shown to scale well up to 4000 nodes, nearly half the size of the machine. We discuss some communication bottlenecks that were identified and resolved during the work.
  • The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004 node Haswell partition and a 9,688 node KNL partition, which ranked as the 5th most powerful supercomputer on the November 2016 Top 500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important, impacting not only NERSC/LBNL users and other national labs, but also the relevant hardware vendors and software developers. In this paper, we have comprehensively analyzed single-core and single-node IO performance on the Haswell and KNL partitions, and have discovered the major bottlenecks, which include CPU frequencies and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost difference caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.
  • Westinghouse Energy Systems Business Unit (ESBU) has worked with electric utility personnel to analyze the thermal performance of essential cooling water systems at nuclear generating stations. The primary goal of these analyses has been to demonstrate the operability of the cooling water systems during postulated limiting post-accident operation. In previous cooling water system thermal analyses, peak containment operating conditions were generally used as input assuming steady-state conditions. This approach is conservative as it does not take into account the improvement in containment conditions and cooling water system temperatures over time. This approach can also lead to an inconsistent set of assumptions between the two distinct analyses, which may result in overly conservative calculated system operating conditions. These conditions inevitably impose unnecessary restrictions on cooling water system operation. Over the last few years, Westinghouse ESBU has coupled both the containment integrity and the cooling water system thermal calculations into an integrated analysis. This allows the use of a consistent set of input parameters and assumptions in the calculation of limiting cooling water system operating conditions. This approach has been successfully used to increase system operating margins. This paper provides an overview of this coupled thermal analysis along with examples of where increased operating margins can be applied.
  • A new computer code, Analysis of Radionuclide Source-Term with Chemical Transport (AREST-CT), is described in this paper. The code is being designed to support performance assessment analyses of engineered systems for subsurface isolation of hazardous and radioactive wastes. Radionuclide releases from an engineered system are modeled by solving governing equations describing conservation of water mass, air mass, thermal energy, and chemical species mass. As such, the AREST-CT code will be capable of simulating radionuclide release and transport in a non-isothermal, unsaturated-saturated setting. Constitutive equations are implemented that describe corrosion of iron-based container materials, glass, and spent fuel waste forms. The governing equations are solved in a two-dimensional domain using an integrated finite-volume method. A third-order total variation diminishing (TVD) numerical scheme is evaluated to minimize numerical oscillations and dissipation of steep concentration gradients in advection-dominated transport problems.
  • We review our work done to optimize the staggered conjugate gradient (CG) algorithm in the MILC code for use with the Intel Knights Landing (KNL) architecture. KNL is the second generation Intel Xeon Phi processor. It is capable of massive thread parallelism, data parallelism, and high on-board memory bandwidth and is being adopted in supercomputing centers for scientific research. The CG solver consumes the majority of time in production running, so we have spent most of our effort on it. We compare performance of an MPI+OpenMP baseline version of the MILC code with a version incorporating the QPhiX staggered CG solver, for both one-node and multi-node runs.