Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional "MPI everywhere" approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe slower performance with MPI+threads, primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI's ordering constraints, to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard's semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics, providing many opportunities to express logical communication parallelism. In this article, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing standard and of user-visible endpoints (the upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
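As a rough illustration of the communication-independence expression the abstract refers to, the sketch below (not taken from the paper; the exchange pattern and variable names are assumptions) shows an MPI+threads program that duplicates one communicator per OpenMP thread and attaches the MPI-4.0 assertion info hints, so that each thread's operations are logically independent and the MPI library may drive them over separate network resources.

```c
/* A minimal sketch (not from the paper): one way an MPI+threads application
 * might expose logically parallel communication under MPI-4.0. Each thread
 * drives its own duplicated communicator, and standard assertion info hints
 * relax wildcard/ordering semantics; the exchange pattern is illustrative. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nthreads;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    nthreads = omp_get_max_threads();

    /* MPI-4.0 assertions: no wildcard receives, message overtaking allowed. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");

    /* One communicator per thread: operations issued on different
     * communicators carry no mutual ordering constraint, so the library is
     * free to map them to independent network resources.
     * (Assumes every rank spawns the same number of threads.) */
    MPI_Comm *comms = malloc(nthreads * sizeof(MPI_Comm));
    for (int t = 0; t < nthreads; t++)
        MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comms[t]);

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int peer = rank ^ 1;  /* pairwise exchange; assumes an even number of ranks */
        int sendbuf = rank, recvbuf = -1;
        MPI_Request req[2];
        MPI_Irecv(&recvbuf, 1, MPI_INT, peer, t, comms[t], &req[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, peer, t, comms[t], &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    for (int t = 0; t < nthreads; t++) MPI_Comm_free(&comms[t]);
    MPI_Info_free(&info);
    free(comms);
    MPI_Finalize();
    return 0;
}
```

MPI_Comm_dup_with_info and the mpi_assert_* info keys are standard MPI calls; whether a given library actually exploits them for network-level parallelism is implementation dependent.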
Zambre, Rohit, et al. "Logically Parallel Communication for Fast MPI+Threads Applications." IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 12, Apr. 2021. https://doi.org/10.1109/tpds.2021.3075157
Zambre, Rohit, Damodar Sahasrabudhe, Hui Zhou, et al., "Logically Parallel Communication for Fast MPI+Threads Applications," IEEE Transactions on Parallel and Distributed Systems 32, no. 12 (2021), https://doi.org/10.1109/tpds.2021.3075157
@article{osti_1846741,
author = {Zambre, Rohit and Sahasrabudhe, Damodar and Zhou, Hui and Berzins, Martin and Chandramowlishwaran, Aparna and Balaji, Pavan},
title = {Logically Parallel Communication for Fast MPI+Threads Applications},
doi = {10.1109/tpds.2021.3075157},
url = {https://www.osti.gov/biblio/1846741},
journal = {IEEE Transactions on Parallel and Distributed Systems},
issn = {1045-9219},
number = {12},
volume = {32},
place = {United States},
publisher = {IEEE},
year = {2021},
month = {04}}
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science; National Science Foundation (NSF); University of Utah
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
1846741
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 32, Issue 12; ISSN 1045-9219