OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

Abstract

This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313–328), in which a low-communication algorithm was shown to attain a high degree of scalability on the Cray XE6 architecture by overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movement is minimized by performing virtually all of the intensive scalar-field computations, in the form of combined compact finite difference (CCD) operations, on the GPUs. A memory layout departing from usual practice is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution, enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs, improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On Titan, the 27-petaflops supercomputer at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field, computed with 8192 XK7 nodes.

Authors:
Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.
Publication Date:
July 2018
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1565649
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Journal Article
Journal Name:
Computer Physics Communications
Additional Journal Information:
Journal Volume: 228; Journal Issue: C; Journal ID: ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
Computer Science; Physics

Citation Formats

Clay, M. P., Buaria, D., Yeung, P. K., and Gotoh, T. GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5. United States: N. p., 2018. Web. doi:10.1016/j.cpc.2018.02.020.
Clay, M. P., Buaria, D., Yeung, P. K., & Gotoh, T. GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5. United States. https://doi.org/10.1016/j.cpc.2018.02.020
Clay, M. P., Buaria, D., Yeung, P. K., and Gotoh, T. 2018. "GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5". United States. https://doi.org/10.1016/j.cpc.2018.02.020.
@article{osti_1565649,
title = {GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5},
author = {Clay, M. P. and Buaria, D. and Yeung, P. K. and Gotoh, T.},
abstractNote = {This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313–328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field computed with 8192 XK7 nodes.},
doi = {10.1016/j.cpc.2018.02.020},
url = {https://www.osti.gov/biblio/1565649},
journal = {Computer Physics Communications},
issn = {0010-4655},
number = {C},
volume = {228},
place = {United States},
year = {2018},
month = {7}
}