U.S. Department of Energy
Office of Scientific and Technical Information

Improved MPI collectives for MPI processes in shared address spaces

Journal Article · Cluster Computing
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and for long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves a geometric-mean speedup of 2.3X and 2.1X, respectively, over the best of these MPI implementations. Our techniques enable efficient implementations of collective operations on future multi- and manycore systems.
Research Organization:
Argonne National Laboratory (ANL)
Sponsoring Organization:
USDOE Office of Science
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1392899
Journal Information:
Cluster Computing, Vol. 17, Issue 4; ISSN 1386-7857
Country of Publication:
United States
Language:
English


Similar Records

Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation
Conference · 2012 · OSTI ID: 1095156

Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem.
Technical Report · 2005 · OSTI ID: 881588

Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective
Conference · 2007 · OSTI ID: 908380