An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
Abstract
In current computer architectures, data movement (from die to network) is by far the most energy-consuming part of an algorithm (≈20 pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce the energy consumed by data movement, future exascale computers tend to use many-core processors on each compute node, running at a reduced clock speed to allow for efficient cooling. To compensate for the frequency decrease, machine vendors are making use of long SIMD instruction registers that can process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to take full advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines, which are, along with the field gathering routines, among the most time-consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions, which can significantly degrade vectorization performance on current CPUs. The new algorithm was successfully implemented …
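The idea of replacing scattered grid writes with a cell-contiguous layout can be illustrated with a minimal sketch in C. It is a simplified 1D, order-1 (linear) charge deposition written for illustration only, not the authors' implementation: the function name `deposit_cellwise`, the temporary array `rho_cells`, and the 1D setting are all assumptions. Each particle adds its vertex contributions to a small contiguous block belonging to its cell, and a separate pass reduces those blocks onto the nodal grid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NVERT 2  /* vertices per cell in 1D; a 3D order-1 scheme would have 8 */

/* Hypothetical sketch: deposit charges q[ip] at positions x[ip] onto a
 * nodal grid rho[0..nx-1] with spacing dx. Positions must satisfy
 * 0 <= x < (nx-1)*dx. Returns 0 on success, -1 if allocation fails. */
int deposit_cellwise(const double *x, const double *q, size_t np,
                     double dx, size_t nx, double *rho)
{
    size_t ncells = nx - 1;
    /* Assumed layout: the NVERT contributions of one cell are contiguous. */
    double *rho_cells = calloc(NVERT * ncells, sizeof *rho_cells);
    if (rho_cells == NULL)
        return -1;

    for (size_t ip = 0; ip < np; ++ip) {
        double s = x[ip] / dx;          /* position in grid units */
        size_t ic = (size_t)s;          /* index of the cell holding the particle */
        assert(ic < ncells);
        double w = s - (double)ic;      /* fractional offset inside the cell */
        double wts[NVERT] = { 1.0 - w, w };
        double *cell = &rho_cells[NVERT * ic];
        /* Unit-stride update of one cell block: no scattered writes to rho.
         * With 8 vertices per cell in 3D this loop is the vectorizable unit. */
        for (int v = 0; v < NVERT; ++v)
            cell[v] += q[ip] * wts[v];
    }

    /* Reduction pass: fold per-cell blocks onto the shared nodal grid. */
    for (size_t ic = 0; ic < ncells; ++ic) {
        rho[ic]     += rho_cells[NVERT * ic];
        rho[ic + 1] += rho_cells[NVERT * ic + 1];
    }

    free(rho_cells);
    return 0;
}
```

The trade-off in this sketch is extra memory and one extra reduction pass in exchange for contiguous per-particle writes; the reduction cost scales with the number of cells rather than the number of particles.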
 Authors:
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Alternative Energies and Atomic Energy Commission (CEA), Gif-sur-Yvette (France). Lasers Interactions and Dynamics Laboratory (LIDyL)
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
 Intel Corporation, OR (United States)
 Publication Date:
 Research Org.:
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), High Energy Physics (HEP) (SC25)
 OSTI Identifier:
 1393602
 Alternate Identifier(s):
 OSTI ID: 1396479
 Grant/Contract Number:
 AC02-05CH11231; 624543
 Resource Type:
 Journal Article: Accepted Manuscript
 Journal Name:
 Computer Physics Communications
 Additional Journal Information:
 Journal Volume: 210; Journal Issue: C; Journal ID: ISSN 0010-4655
 Publisher:
 Elsevier
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING; Particle-In-Cell method; OpenMP; SIMD Vectorization; AVX2; AVX512; Tiling; Cache reuse; Many-core architectures
Citation Formats
Vincenti, H., Lobet, M., Lehe, R., Sasanka, R., and Vay, J. L. An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes. United States: N. p., 2016. Web. doi:10.1016/j.cpc.2016.08.023.
Vincenti, H., Lobet, M., Lehe, R., Sasanka, R., & Vay, J. L. An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes. United States. doi:10.1016/j.cpc.2016.08.023.
Vincenti, H., Lobet, M., Lehe, R., Sasanka, R., and Vay, J. L. 2016. "An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes". United States. doi:10.1016/j.cpc.2016.08.023. https://www.osti.gov/servlets/purl/1393602.
@article{osti_1393602,
title = {An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes},
author = {Vincenti, H. and Lobet, M. and Lehe, R. and Sasanka, R. and Vay, J. L.},
abstractNote = {In current computer architectures, data movement (from die to network) is by far the most energy-consuming part of an algorithm (≈20 pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce the energy consumed by data movement, future exascale computers tend to use many-core processors on each compute node, running at a reduced clock speed to allow for efficient cooling. To compensate for the frequency decrease, machine vendors are making use of long SIMD instruction registers that can process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to take full advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines, which are, along with the field gathering routines, among the most time-consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions, which can significantly degrade vectorization performance on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2, 256-bit wide data registers). Results show a speedup of ×2 to ×2.5 in double precision for particle shape factors of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX512 instruction sets with 512-bit registers (8 doubles/16 singles).
Program summary. Program Title: vec_deposition; Program Files doi: http://dx.doi.org/10.17632/nh77fv9k8c.1; Licensing provisions: BSD 3-Clause; Programming language: Fortran 90; External routines/libraries: OpenMP > 4.0. Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing a single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years, and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition, for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm for current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives, which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performance out of the vector routines you have to satisfy the following two requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to maximize cache reuse and reduce memory accesses that can hinder vector performance. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g. Intel, GNU or Cray) and provide proper alignment flags and compiler alignment directives (more details in the README file).},
doi = {10.1016/j.cpc.2016.08.023},
journal = {Computer Physics Communications},
number = C,
volume = 210,
place = {United States},
year = 2016,
month = 9
}

A general concurrent algorithm for plasma particle-in-cell simulation codes. [JPL Mark III Hypercube parallel computer]
We have developed a new algorithm for implementing plasma particle-in-cell (PIC) simulation codes on concurrent processors with distributed memory. This algorithm, named the general concurrent PIC algorithm (GCPIC), has been used to implement an electrostatic PIC code on the 33-node JPL Mark III Hypercube parallel computer. To decompose a PIC code using the GCPIC algorithm, the physical domain of the particle simulation is divided into subdomains, equal in number to the number of processors, such that all subdomains have roughly equal numbers of particles. For problems with nonuniform particle densities, these subdomains will be of unequal physical size. Each processor …
An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm
Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear …
Energy-Charge-conserving, Implicit, Electrostatic Particle-in-Cell Algorithm
This paper discusses a novel fully implicit formulation for a one-dimensional electrostatic particle-in-cell (PIC) plasma simulation approach. Unlike earlier implicit electrostatic PIC approaches (which are based on a linearized Vlasov-Poisson formulation), ours is based on a nonlinearly converged Vlasov-Ampere (VA) model. By iterating particles and fields to a tight nonlinear convergence tolerance, the approach features superior stability and accuracy properties, avoiding most of the accuracy pitfalls in earlier implicit PIC implementations. In particular, the formulation is stable against temporal (Courant-Friedrichs-Lewy) and spatial (aliasing) instabilities. It is charge- and energy-conserving to numerical round-off for arbitrary implicit time steps (unlike the earlier …