skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

Abstract

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.

Authors:
; ; ; ;
Publication Date:
Research Org.:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Environmental Molecular Sciences Laboratory (EMSL)
Sponsoring Org.:
USDOE
OSTI Identifier:
1411925
Report Number(s):
PNNL-SA-130899
49691; KC0302060
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: International Conference on High Performance Computing, June 18-22, 2017, Frankfurt, Germany. Lecture Notes in Computer Science, 10524:404-418
Country of Publication:
United States
Language:
English
Subject:
Environmental Molecular Sciences Laboratory

Citation Formats

Bylaska, Eric J., Jacquelin, Mathias, De Jong, Wibe A., Hammond, Jeff R., and Klemm, Michael. Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor. United States: N. p., 2017. Web. doi:10.1007/978-3-319-67630-2_30.
Bylaska, Eric J., Jacquelin, Mathias, De Jong, Wibe A., Hammond, Jeff R., & Klemm, Michael. Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor. United States. doi:10.1007/978-3-319-67630-2_30.
Bylaska, Eric J., Jacquelin, Mathias, De Jong, Wibe A., Hammond, Jeff R., and Klemm, Michael. 2017. "Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor". United States. doi:10.1007/978-3-319-67630-2_30.
@article{osti_1411925,
title = {Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor},
author = {Bylaska, Eric J. and Jacquelin, Mathias and De Jong, Wibe A. and Hammond, Jeff R. and Klemm, Michael},
abstractNote = {Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.},
doi = {10.1007/978-3-319-67630-2_30},
journal = {},
number = ,
volume = ,
place = {United States},
year = 2017,
month =
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Save / Share:
  • The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schr¨odinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, wemore » focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.« less
  • In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is amore » powerful tool for analyzing the performance of applications with respect to the theoretical peak achievable on a given computer architecture. It allows one to graphically represent the performance of an application in terms of operational intensity, i.e. the ratio of flops performed and bytes moved from memory in order to guide optimization efforts. Given the scale and complexity of modern science applications, it can often be a tedious task for the user to perform the analysis on the level of functions or loops to identify where performance gains can be made. With new Intel tools, it is now possible to automate this task, as well as base the estimates of peak performance on measurements rather than vendor specifications. The goal of this session is to demonstrate how the roofline feature of Intel Advisor can be used to balance memory vs. computation related optimization efforts and effectively identify performance bottlenecks. A series of typical optimization techniques: cache blocking, structure refactoring, data alignment, and vectorization illustrated by the kernel cases will be addressed. # Description of the codes ## XGC1 The XGC1 code [3] is a magnetic fusion Particle-In-Cell code that uses an unstructured mesh for its Poisson solver that allows it to accurately resolve the edge plasma of a magnetic fusion device. After recent optimizations to its collision kernel [4], most of the computing time is spent in the electron push (pushe) kernel, where these optimization efforts have been focused. The kernel code scaled well with MPI+OpenMP but had almost no automatic compiler vectorization, in part due to indirect memory addresses and in part due to low trip counts of low-level loops that would be candidates for vectorization. Particle blocking and sorting have been implemented to increase trip counts of low-level loops and improve memory locality, and OpenMP directives have been added to vectorize compute-intensive loops that were identified by Advisor. The optimizations have improved the performance of the pushe kernel 2x on Haswell processors and 1.7x on KNL. The KNL node-for-node performance has been brought to within 30% of a NERSC Cori phase I Haswell node and we expect to bridge this gap by reducing the memory footprint of compute intensive routines to improve cache reuse. ## PICSAR is a Fortran/Python high-performance Particle-In-Cell library targeting at MIC architectures first designed to be coupled with the PIC code WARP for the simulation of laser-matter interaction and particle accelerators. PICSAR also contains a FORTRAN stand-alone kernel for performance studies and benchmarks. A MPI domain decomposition is used between NUMA domains and a tile decomposition (cache-blocking) handled by OpenMP has been added for shared-memory parallelism and better cache management. The so-called current deposition and field gathering steps that compose the PIC time loop constitute major hotspots that have been rewritten to enable more efficient vectorization. Particle communications between tiles and MPI domain has been merged and parallelized. All considered, these improvements provide speedups of 3.1 for order 1 and 4.6 for order 3 interpolation shape factors on KNL configured in SNC4 quadrant flat mode. Performance is similar between a node of cori phase 1 and KNL at order 1 and better on KNL by a factor 1.6 at order 3 with the considered test case (homogeneous thermal plasma).« less
  • Accelerators are used in about 13% of the current Top500 List. Supercomputers leveraging accelerators grew by a factor of 2.2x in 2012 and are expected to completely dominate the Top500 by 2015. Though most of these deployments use NVIDIA GPGPU accelerators, Intel’s Xeon Phi architecture will likely grow in popularity in the coming years. Unfortunately, there are few studies analyzing the performance and energy efficiency of systems leveraging the Intel Xeon Phi. We extend our systemic measurement methodology to isolate system power by component including accelerators. We use this methodology to present a detailed study of the performance-energy tradeoffs ofmore » the Xeon Phi architecture.« less
  • An overview of the parallel algorithms for ab initio molecular dynamics (AIMD) used in the NWChem program package is presented, including recent developments for computing exact exchange. These algorithms make use of a two-dimensional processor geometry proposed by Gygi et al for use in AIMD algorithms. Using this strategy, a highly scalable algorithm for exact exchange has been developed and incorporated it into AIMD. This new algorithm for exact exchange employs an incomplete butterfly to overcome the bottleneck associated with exact exchange term, and it makes judicious use of data replication. Initial testing has shown that this algorithm can scalemore » to over 20,000 CPUs even for modest size simulation.« less