
Title: Optimizing the hypre solver for manycore and GPU architectures

Journal Article · Journal of Computational Science
Authors: [1]; [2]; [2]; [1]
  1. Univ. of Utah, Salt Lake City, UT (United States). SCI Inst.
  2. Univ. of California, Irvine, CA (United States). EECS

The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach-number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but using OpenMP within Hypre leads to a slowdown due to OpenMP overheads. The proposed solution uses MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead, performs as fast as or faster (up to 1.44x) than Hypre's MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. Profiling the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and further optimized by using CUDA-aware MPI, resulting in an overall speedup of 1.16x–1.44x compared to the baseline GPU implementation. These optimization strategies were published at the International Conference on Computational Science 2020 [1]. This work extends that previously published research by carrying out a second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. These optimizations include an efficient non-blocking inter-thread communication scheme, a communication-reducing patch assignment, and the expression of logical communication parallelism to a new version of the MPICH library that exploits the underlying network parallelism [2]. Together they avoid the communication bottlenecks previously observed during strong scaling and improve performance by up to 2x on 256 nodes of the Intel Knights Landing processor.
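The "each team of threads acts as a different MPI rank" idea can be sketched with standard MPI constructs. The listing below is an illustrative approximation, not code from Hypre or Uintah: it gives every OpenMP thread team its own duplicated communicator so that a multi-VCI MPI library (such as the MPICH work cited as [2]) can route each team's traffic through an independent network context. The team count and the toy neighbor exchange are assumptions made for the example.

// Sketch: emulating "MPI Endpoints" with one communicator per thread team.
// Illustrative only; not the actual Hypre/Uintah implementation.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int teams = 4;                              // hypothetical number of thread teams per rank
    std::vector<MPI_Comm> team_comm(teams);
    for (int t = 0; t < teams; ++t)
        MPI_Comm_dup(MPI_COMM_WORLD, &team_comm[t]);  // one "endpoint" per team

    #pragma omp parallel num_threads(teams)
    {
        int t = omp_get_thread_num();
        int peer = (rank + 1) % nranks;               // toy neighbor exchange
        double sbuf = rank * 100.0 + t, rbuf = -1.0;
        // Each team posts its own non-blocking exchange on its own communicator,
        // so threads do not serialize on a shared communication context.
        MPI_Request reqs[2];
        MPI_Irecv(&rbuf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, t, team_comm[t], &reqs[0]);
        MPI_Isend(&sbuf, 1, MPI_DOUBLE, peer, t, team_comm[t], &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        std::printf("rank %d team %d received %.1f\n", rank, t, rbuf);
    }

    for (int t = 0; t < teams; ++t) MPI_Comm_free(&team_comm[t]);
    MPI_Finalize();
    return 0;
}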
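The kernel-fusion optimization can be made concrete with a short CUDA sketch. It contrasts launching one tiny kernel per patch, where launch overhead dominates, with a single fused launch in which each thread block processes one patch descriptor. The Patch struct and the axpy-style work are hypothetical stand-ins, not Hypre's actual micro-kernels.

// Sketch: replacing many per-patch "micro-kernel" launches with one fused launch.
#include <cuda_runtime.h>

struct Patch {              // hypothetical per-patch work descriptor
    double *x, *y;
    double alpha;
    int n;
};

// Unfused version: one tiny kernel launch per patch, so launch overhead dominates.
__global__ void axpy_patch(double *y, const double *x, double alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}

// Fused version: a single launch; each block reads one patch descriptor and does
// the same small amount of work, amortizing one launch over all patches.
__global__ void axpy_fused(const Patch *patches, int npatches) {
    const Patch p = patches[blockIdx.x];
    for (int i = threadIdx.x; i < p.n; i += blockDim.x)
        p.y[i] += p.alpha * p.x[i];
}

void run_unfused(const Patch *h_patches, int npatches, cudaStream_t s) {
    for (int p = 0; p < npatches; ++p) {              // thousands of launches
        const Patch &pp = h_patches[p];
        int blocks = (pp.n + 255) / 256;
        axpy_patch<<<blocks, 256, 0, s>>>(pp.y, pp.x, pp.alpha, pp.n);
    }
}

void run_fused(const Patch *d_patches, int npatches, cudaStream_t s) {
    axpy_fused<<<npatches, 256, 0, s>>>(d_patches, npatches);  // one launch total
}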

Research Organization:
Univ. of Utah, Salt Lake City, UT (United States); Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
NA0002375; AC02-06CH11357
OSTI ID:
1850315
Alternate ID(s):
OSTI ID: 1780319
Journal Information:
Journal of Computational Science, Vol. 49, Issue C; ISSN 1877-7503
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

References (12)

Large Scale Parallel Solution of Incompressible Flow Problems Using Uintah and Hypre conference May 2013
  • Schmidt, J.; Berzins, M.; Thornock, J.
  • 2013 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) https://doi.org/10.1109/CCGrid.2013.10
Demonstrating GPU code portability and scalability for radiative heat transfer computations journal July 2018
Extending the Uintah Framework through the Petascale Modeling of Detonation in Arrays of High Explosive Devices journal January 2016
Enabling MPI interoperability through flexible communication endpoints conference January 2013
Pursuing scalability for hypre's conceptual interfaces journal September 2005
Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs conference September 2019
Scaling Hypre’s Multigrid Solvers to 100,000 Cores book January 2012
Enabling communication concurrency through flexible MPI endpoints journal September 2014
An Evaluation of An Asynchronous Task Based Dataflow Approach For Uintah conference July 2019
Scalable Communication Endpoints for MPI+Threads Applications conference December 2018
Communication Avoiding Multigrid Preconditioned Conjugate Gradient Method for Extreme Scale Multiphase CFD Simulations conference November 2018
Modeling the Performance of an Algebraic Multigrid Cycle Using Hybrid MPI/OpenMP conference September 2012
