Efficient GPU Implementation of Automatic Differentiation for Computational Fluid Dynamics
- Old Dominion Univ., Norfolk, VA (United States)
- NASA Langley Research Center, Hampton, VA (United States)
- National Institute of Aerospace, Hampton, VA (United States)
- Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
- Northwestern Univ., Evanston, IL (United States)
- Univ. of Maryland, College Park, MD (United States)
Many scientific and engineering applications require repeated calculations of derivatives of output functions with respect to input parameters. Automatic Differentiation (AD) is a method that automates derivative calculations and can significantly speed up code development. In Computational Fluid Dynamics (CFD), derivatives of flux functions with respect to state variables (the Jacobian) are needed for efficient solutions of the nonlinear governing equations. AD of flux functions on graphics processing units (GPUs) is challenging because flux computations involve many intermediate variables that create high register pressure, and storing the derivatives requires significant memory traffic. This paper presents a forward-mode AD method based on multivariate dual numbers that addresses these challenges and simultaneously reduces the floating-point operation count. The dimension of the multivariate dual numbers is optimized for performance. The flux computations are restructured to minimize the number of temporary variables and reduce register pressure. For effective utilization of memory bandwidth, shared memory is used to store the local flux Jacobian. This AD implementation is compared with several other Jacobian implementations on an NVIDIA V100 GPU (V100). For three-dimensional perfect-gas compressible-flow equations implemented in a practical CFD code, the AD implementation of a flux Jacobian based on multivariate dual numbers of dimension 5 outperforms all other GPU AD implementations on the V100. Its performance is comparable to the optimized hand-differentiated version. Finally, the implementation achieves 75% of the peak floating-point throughput and 61% of the peak global device memory bandwidth usage.
- Research Organization:
- Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA (United States); Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), High Energy Physics (HEP); National Institute of Aerospace
- Grant/Contract Number:
- AC02-07CH11359; AC05-00OR22725
- OSTI ID:
- 1993463
- Report Number(s):
- FERMILAB-CONF--23-342-CSAID; oai:inspirehep.net:2679722
- Journal Information:
- Proceedings ... International Conference on High Performance Computing (Online), Vol. 2023; ISSN 2640-0316
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English