Deploy Nalu/Kokkos algorithmic infrastructure with performance benchmarking.

Domino, Stefan P.; Ananthan, Shreyas; Knaus, Robert C.; Williams, Alan B.

doi:10.2172/1398334


Technical Report · September 29, 2017

DOI: https://doi.org/10.2172/1398334 · OSTI ID: 1398334

Domino, Stefan P. ^[1]; Ananthan, Shreyas ^[1]; Knaus, Robert C. ^[1]; Williams, Alan B. ^[1]

Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

The former Nalu interior heterogeneous algorithm design, which was originally designed to manage matrix assembly operations over all elemental topology types, has been modified to operate over homogeneous collections of mesh entities. This newly templated kernel design allows for removal of workset variable resize operations that were formerly required at each loop over a Sierra ToolKit (STK) bucket (nominally, 512 entities in size). Extensive usage of the Standard Template Library (STL) std::vector has been removed in favor of intrinsic Kokkos memory views. In this milestone effort, the transition to Kokkos as the underlying infrastructure to support performance and portability on many-core architectures has been deployed for key matrix algorithmic kernels. A unit-test driven design effort has developed a homogeneous entity algorithm that employs a team-based thread parallelism construct. The STK Single Instruction Multiple Data (SIMD) infrastructure is used to interleave data for improved vectorization. The collective algorithm design, which allows for concurrent threading and SIMD management, has been deployed for the core low-Mach element-based algorithm. Several tests to ascertain SIMD performance on Intel KNL and Haswell architectures have been carried out. The performance test matrix includes evaluation of both low- and higher-order methods. The higher-order low-Mach methodology builds on polynomial promotion of the core low-order control volume finite element method (CVFEM). Performance testing of the Kokkos-view/SIMD design indicates low-order matrix assembly kernel speed-up ranging between two and four times depending on mesh loading and node count. Better speedups are observed for higher-order meshes (currently only P=2 has been tested), especially on KNL. The increased workload per element on higher-order meshes benefits from the wide SIMD width on KNL machines. Combining multiple threads with SIMD on KNL achieves a 4.6x speedup over the baseline, with assembly timings faster than those observed on the Haswell architecture. The computational workload of higher-order meshes, therefore, seems ideally suited for the many-core architecture and justifies further exploration of higher-order on NGP platforms. A Trilinos/Tpetra-based multi-threaded GMRES preconditioned by symmetric Gauss-Seidel (SGS) represents the core solver infrastructure for the low-Mach advection/diffusion implicit solves. The threaded solver stack has been tested on small problems on NREL's Peregrine system using the newly developed and deployed Kokkos-view/SIMD kernels. Efforts are underway to deploy the Tpetra-based solver stack on the NERSC Cori system to benchmark its performance at scale on KNL machines.
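The two design ideas named in the abstract, kernels templated on a single element topology and SIMD interleaving of element data, can be illustrated with a minimal sketch in plain C++. This is not Nalu, STK, or Kokkos code: the names `kSimdWidth`, `Pack`, `ElemGroupScratch`, `interleave`, `nodalAverage`, and `averageAll` are hypothetical, the SIMD width of 4 is an assumption, and `std::array` stands in for a Kokkos view. The nodal-average kernel is a toy placeholder for a real assembly kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed SIMD width for illustration only; a real build would take this
// from the target hardware.
constexpr std::size_t kSimdWidth = 4;

// One pack holds the same scalar quantity for kSimdWidth different elements.
using Pack = std::array<double, kSimdWidth>;

// Scratch data for a group of elements that all share one topology.
// NodesPerElem is a compile-time constant, so the scratch has a fixed size
// and never needs to be resized inside the element loop.
template <std::size_t NodesPerElem>
struct ElemGroupScratch {
  std::array<Pack, NodesPerElem> nodalValue;  // interleaved layout: [node][lane]
};

// Copy up to kSimdWidth elements, starting at firstElem, into the
// interleaved layout. elemValues[e][n] is the value at node n of element e.
template <std::size_t NodesPerElem>
ElemGroupScratch<NodesPerElem> interleave(
    const std::vector<std::array<double, NodesPerElem>>& elemValues,
    std::size_t firstElem) {
  assert(firstElem < elemValues.size());
  ElemGroupScratch<NodesPerElem> scratch{};  // unused (padded) lanes stay zero
  for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
    const std::size_t e = firstElem + lane;
    if (e >= elemValues.size()) break;
    for (std::size_t n = 0; n < NodesPerElem; ++n) {
      scratch.nodalValue[n][lane] = elemValues[e][n];
    }
  }
  return scratch;
}

// Toy kernel: nodal average per element, evaluated for all lanes together.
// The innermost loops run over contiguous lanes, which a compiler can vectorize.
template <std::size_t NodesPerElem>
Pack nodalAverage(const ElemGroupScratch<NodesPerElem>& scratch) {
  Pack sum{};
  for (std::size_t n = 0; n < NodesPerElem; ++n) {
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
      sum[lane] += scratch.nodalValue[n][lane];
    }
  }
  for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
    sum[lane] /= static_cast<double>(NodesPerElem);
  }
  return sum;
}

// Driver over a homogeneous collection of elements. The loop over groups is
// serial here; it is the level at which thread parallelism could be applied.
template <std::size_t NodesPerElem>
std::vector<double> averageAll(
    const std::vector<std::array<double, NodesPerElem>>& elemValues) {
  std::vector<double> result(elemValues.size());
  for (std::size_t first = 0; first < elemValues.size(); first += kSimdWidth) {
    const ElemGroupScratch<NodesPerElem> scratch = interleave(elemValues, first);
    const Pack avg = nodalAverage(scratch);
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
      if (first + lane < elemValues.size()) result[first + lane] = avg[lane];
    }
  }
  return result;
}
```

Calling `averageAll<4>(quads)` on a `std::vector<std::array<double, 4>>` of four-node elements returns one average per element. The last group is padded with zero lanes when the element count is not a multiple of the assumed SIMD width.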


Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

DOE Contract Number: AC04-94AL85000

OSTI ID: 1398334

Report Number(s): SAND-2017-10549R; 657410

Country of Publication: United States

Language: English

Similar Records

Deploy production sliding mesh capability with linear solver benchmarking (ECP Milestone Report, Ver. 1.0)

Technical Report · December 22, 2017 · OSTI ID:1398334

Domino, Stefan P.; Thomas, Stephen; Barone, Matthew F.; +6 more

Deploy production sliding mesh capability with linear solver benchmarking.

Technical Report · February 1, 2018 · OSTI ID:1398334

Domino, Stefan P.; Thomas, Stephen; Barone, Matthew F.; +6 more

Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

Conference · May 23, 2017 · OSTI ID:1398334

Koskela, Tuomas S.; Lobet, Mathieu; Deslippe, Jack; +1 more

Related Subjects

97 MATHEMATICS AND COMPUTING
