Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional "MPI everywhere" approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe slower performance with MPI+threads, primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI's ordering constraints, to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard's semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics, providing many opportunities to express logical communication parallelism. In this article, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing standard and of user-visible endpoints (the upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
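As a rough illustration of the communication-independence expression the abstract refers to, the sketch below (not taken from the paper; the exchange pattern and variable names are assumptions) shows an MPI+threads program that duplicates one communicator per OpenMP thread and attaches the MPI-4.0 assertion info hints, so that each thread's operations are logically independent and the MPI library may drive them over separate network resources.

```c
/* A minimal sketch (not from the paper): one way an MPI+threads application
 * might expose logically parallel communication under MPI-4.0. Each thread
 * drives its own duplicated communicator, and standard assertion info hints
 * relax wildcard/ordering semantics; the exchange pattern is illustrative. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nthreads;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    nthreads = omp_get_max_threads();

    /* MPI-4.0 assertions: no wildcard receives, message overtaking allowed. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");

    /* One communicator per thread: operations issued on different
     * communicators carry no mutual ordering constraint, so the library is
     * free to map them to independent network resources.
     * (Assumes every rank spawns the same number of threads.) */
    MPI_Comm *comms = malloc(nthreads * sizeof(MPI_Comm));
    for (int t = 0; t < nthreads; t++)
        MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comms[t]);

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int peer = rank ^ 1;  /* pairwise exchange; assumes an even number of ranks */
        int sendbuf = rank, recvbuf = -1;
        MPI_Request req[2];
        MPI_Irecv(&recvbuf, 1, MPI_INT, peer, t, comms[t], &req[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, peer, t, comms[t], &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    for (int t = 0; t < nthreads; t++) MPI_Comm_free(&comms[t]);
    MPI_Info_free(&info);
    free(comms);
    MPI_Finalize();
    return 0;
}
```

MPI_Comm_dup_with_info and the mpi_assert_* info keys are standard MPI calls; whether a given library actually exploits them for network-level parallelism is implementation dependent.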
Zambre, Rohit, et al. "Logically Parallel Communication for Fast MPI+Threads Applications." IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 12, Apr. 2021. https://doi.org/10.1109/tpds.2021.3075157
Zambre, Rohit, Damodar Sahasrabudhe, Hui Zhou, et al., "Logically Parallel Communication for Fast MPI+Threads Applications," IEEE Transactions on Parallel and Distributed Systems 32, no. 12 (2021), https://doi.org/10.1109/tpds.2021.3075157
@article{osti_1846741,
author = {Zambre, Rohit and Sahasrabudhe, Damodar and Zhou, Hui and Berzins, Martin and Chandramowlishwaran, Aparna and Balaji, Pavan},
title = {Logically Parallel Communication for Fast MPI+Threads Applications},
doi = {10.1109/tpds.2021.3075157},
url = {https://www.osti.gov/biblio/1846741},
journal = {IEEE Transactions on Parallel and Distributed Systems},
issn = {1045-9219},
number = {12},
volume = {32},
place = {United States},
publisher = {IEEE},
year = {2021},
month = {04}}
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science; National Science Foundation (NSF); University of Utah
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1846741
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 32, Issue 12; ISSN 1045-9219