Improved MPI collectives for MPI processes in shared address spaces
Journal Article · Cluster Computing
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X, respectively, over the best of these MPI implementations. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
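The record contains no code; as a rough illustration of the shared-address-space mechanism this line of work builds on, the sketch below uses standard MPI-3 shared-memory windows (MPI_Comm_split_type with MPI_COMM_TYPE_SHARED and MPI_Win_allocate_shared) so that ranks on the same node can combine each other's buffers with direct loads and stores. This is a naive single-leader on-node reduction for illustration only, not the paper's optimized algorithms; the vector length, input data, and fence-based synchronization are assumptions.

```c
/* Minimal sketch (not the paper's implementation): intranode reduction
 * through an MPI-3 shared-memory window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that share an address space (same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank contributes N doubles to a window that every
     * on-node rank can address directly. */
    const int N = 1024;                    /* assumed vector length */
    double *my_seg;
    MPI_Win win;
    MPI_Win_allocate_shared(N * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_seg, &win);

    for (int i = 0; i < N; i++)
        my_seg[i] = (double)node_rank;     /* dummy input data */

    MPI_Win_fence(0, win);                 /* make all inputs visible */
    if (node_rank == 0) {
        /* The node leader sums every rank's segment with plain loads/stores. */
        for (int r = 1; r < node_size; r++) {
            MPI_Aint seg_size;
            int disp_unit;
            double *seg;
            MPI_Win_shared_query(win, r, &seg_size, &disp_unit, &seg);
            for (int i = 0; i < N; i++)
                my_seg[i] += seg[i];
        }
        printf("node sum of element 0: %f\n", my_seg[0]);
    }
    MPI_Win_fence(0, win);                 /* publish the result */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The single-leader loop is the simplest possible scheme; the paper's point is that the best algorithm differs between short and long vectors, so a tuned implementation would switch strategies (e.g., parallelizing the reduction across on-node ranks for long vectors).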
- Research Organization: Argonne National Laboratory (ANL)
- Sponsoring Organization: USDOE Office of Science
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1392899
- Journal Information: Cluster Computing, Vol. 17, Issue 4; ISSN 1386-7857
- Country of Publication: United States
- Language: English
Similar Records
- Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation · Conference · January 2013 · OSTI ID: 1095156
- Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem. · Technical Report · December 2005 · OSTI ID: 881588
- Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective · Conference · March 2007 · OSTI ID: 908380