Deploy threading in Nalu solver stack
Abstract
The goal of the ExaWind project is to enable predictive simulations of wind farms composed of many MW-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary code in the ExaWind project is Nalu, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, in which mass continuity is maintained through pressure projection. The model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs, as reinitialization of matrices and recomputation of preconditioners are required at every time step. In this Milestone, we examine the effect of threading on the solver stack performance against flat-MPI results obtained from previous milestones using Haswell performance data from full-turbine simulations. Whereas the momentum equations are solved only with the Trilinos solvers, we investigate ...
 Authors:

 Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
 National Renewable Energy Lab. (NREL), Golden, CO (United States)
 Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
 Publication Date:
 Research Org.:
 Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC)
 OSTI Identifier:
 1481562
 Report Number(s):
 SAND-2018-12209R
669126
 DOE Contract Number:
 AC04-94AL85000
 Resource Type:
 Technical Report
 Country of Publication:
 United States
 Language:
 English
 Subject:
 17 WIND ENERGY
Citation Formats
Prokopenko, Andrey, Thomas, Stephen, Swirydowicz, Kasia, Ananthan, Shreyas, Hu, Jonathan J., Williams, Alan B., and Sprague, Michael. Deploy threading in Nalu solver stack. United States: N. p., 2018.
Web. doi:10.2172/1481562.
Prokopenko, Andrey, Thomas, Stephen, Swirydowicz, Kasia, Ananthan, Shreyas, Hu, Jonathan J., Williams, Alan B., & Sprague, Michael. Deploy threading in Nalu solver stack. United States. https://doi.org/10.2172/1481562
Prokopenko, Andrey, Thomas, Stephen, Swirydowicz, Kasia, Ananthan, Shreyas, Hu, Jonathan J., Williams, Alan B., and Sprague, Michael. 2018.
"Deploy threading in Nalu solver stack". United States. https://doi.org/10.2172/1481562. https://www.osti.gov/servlets/purl/1481562.
@article{osti_1481562,
title = {Deploy threading in Nalu solver stack},
author = {Prokopenko, Andrey and Thomas, Stephen and Swirydowicz, Kasia and Ananthan, Shreyas and Hu, Jonathan J. and Williams, Alan B. and Sprague, Michael},
abstractNote = {The goal of the ExaWind project is to enable predictive simulations of wind farms composed of many MW-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary code in the ExaWind project is Nalu, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, in which mass continuity is maintained through pressure projection. The model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs, as reinitialization of matrices and recomputation of preconditioners are required at every time step. In this Milestone, we examine the effect of threading on the solver stack performance against flat-MPI results obtained from previous milestones using Haswell performance data from full-turbine simulations. Whereas the momentum equations are solved only with the Trilinos solvers, we investigate two algebraic-multigrid preconditioners for the continuity equations: Trilinos/MueLu and HYPRE/BoomerAMG. These two packages embody smoothed-aggregation and classical Ruge-St{\"u}ben AMG methods, respectively. In our FY18 Q2 report, we described our efforts to improve setup and solve of the continuity equations under flat-MPI parallelism. While significant improvement was demonstrated in the solve phase, setup times remained larger than expected. 
Starting with the optimized settings described in the Q2 report, we explore here simulation performance where OpenMP threading is employed in the solver stack. For Trilinos, threading is achieved through the Kokkos abstraction layer, whereas HYPRE/BoomerAMG employs straight OpenMP. We examined results for our mid-resolution baseline turbine simulation configuration (229M DOF). Simulations on 2048 Haswell cores explored the effect of decreasing the number of MPI ranks while increasing the number of threads. Both HYPRE and Trilinos exhibited similar overall solution times, and both showed dramatic increases in simulation time in the shift from MPI ranks to OpenMP threads. This increase is attributed to the large amount of work per MPI rank starting at the single-thread configuration. Decreasing MPI ranks while increasing threads may be increasing simulation time due to thread synchronization and startup overhead contributing to the latency and serial time in the model. These results showed that an MPI+OpenMP parallel decomposition will be more effective as the amount of computation per MPI rank decreases and the communication latency increases. This idea was demonstrated in a strong scaling study of our low-resolution baseline model (29M DOF) with the Trilinos-HYPRE configuration. While MPI-only results showed scaling improvement out to about 1536 cores, engaging threading carried scaling improvements out to 4128 cores, roughly 7000 DOF per core. This is an important result, as improved strong scaling is needed for simulations to be executed over sufficiently long simulated durations (i.e., for many timesteps). In addition to the threading work described above, the team examined solver-performance improvements by exploring communication overhead in the HYPRE-GMRES implementation through a communication-optimal GMRES algorithm (CO-GMRES), and by offloading compute-intensive solver actions to GPUs. 
To those ends, a HYPRE mini-app was developed to allow us to easily test different solver approaches and HYPRE parameter settings without running the entire Nalu code. With GPU acceleration on the Summitdev supercomputer, a 20x speedup was achieved for the overall preconditioner and solver execution time for the mini-app. A study on Haswell processors showed that CO-GMRES provides benefits as one increases MPI ranks.},
doi = {10.2172/1481562},
url = {https://www.osti.gov/biblio/1481562},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {10}
}