OSTI.GOV | U.S. Department of Energy
Office of Scientific and Technical Information

Title: A CASE FOR NON-BLOCKING COLLECTIVES IN THE MPI STANDARD

Authors:
SHIPMAN, GALEN M. [1]; HOEFLER, TORSTEN [2]; KAMBADUR, PRABHANJAN [2]; GRAHAM, RICHARD L. [2]; LUMSDAINE, ANDREW [2]
  1. Los Alamos National Laboratory
  2. Non-LANL
Publication Date:
May 11, 2007
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1337128
Report Number(s):
LA-UR-07-3159
DOE Contract Number:
AC52-06NA25396
Resource Type:
Conference
Resource Relation:
Conference: 14th European PVM/MPI Users' Group Meeting, Paris (France), September 2007
Country of Publication:
United States
Language:
English

Citation Formats

SHIPMAN, GALEN M., HOEFLER, TORSTEN, KAMBADUR, PRABHANJAN, GRAHAM, RICHARD L., and LUMSDAINE, ANDREW. A CASE FOR NON-BLOCKING COLLECTIVES IN THE MPI STANDARD. United States: N. p., 2007. Web.
SHIPMAN, GALEN M., HOEFLER, TORSTEN, KAMBADUR, PRABHANJAN, GRAHAM, RICHARD L., & LUMSDAINE, ANDREW. A CASE FOR NON-BLOCKING COLLECTIVES IN THE MPI STANDARD. United States.
SHIPMAN, GALEN M., HOEFLER, TORSTEN, KAMBADUR, PRABHANJAN, GRAHAM, RICHARD L., and LUMSDAINE, ANDREW. 2007. "A CASE FOR NON-BLOCKING COLLECTIVES IN THE MPI STANDARD". United States. https://www.osti.gov/servlets/purl/1337128.
@article{osti_1337128,
  title  = {A CASE FOR NON-BLOCKING COLLECTIVES IN THE MPI STANDARD},
  author = {SHIPMAN, GALEN M. and HOEFLER, TORSTEN and KAMBADUR, PRABHANJAN and GRAHAM, RICHARD L. and LUMSDAINE, ANDREW},
  place  = {United States},
  year   = {2007},
  month  = {5}
}

Other availability:
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • In this paper we make the case for adding standard non-blocking collective operations to the MPI standard. The non-blocking point-to-point and blocking collective operations currently defined by MPI provide important performance and abstraction benefits. To allow these benefits to be simultaneously realized, we present an application programming interface for non-blocking collective operations in MPI (a usage sketch in this style follows this list). Microbenchmark and application-based performance results demonstrate that non-blocking collective operations offer not only improved convenience but also improved performance when compared to manual use of threads with blocking collectives.
  • With local core counts on the rise, taking advantage of shared memory to optimize collective operations can improve performance. We study several on-host shared-memory-optimized algorithms for MPI_Bcast, MPI_Reduce, and MPI_Allreduce, using tree-based and reduce-scatter algorithms. For small-data operations with relatively large synchronization costs, fan-in/fan-out algorithms generally perform best. For large messages, data manipulation constitutes the largest cost, and reduce-scatter algorithms are best for reductions (a generic reduce-scatter allreduce is sketched after this list). These optimizations improve performance by up to a factor of three. Memory and cache sharing effects require deliberate process layout and careful radix selection for tree-based methods.
  • As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X, respectively, over the best MPI implementation. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
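
The interface argued for in the first abstract was later standardized in MPI-3 as nonblocking collective calls (MPI_Ibcast, MPI_Iallreduce, and related routines). The following minimal C sketch is illustrative only, not code from the paper: it starts an allreduce without blocking, overlaps it with independent local work, and completes it with MPI_Wait when the result is needed.

/* Illustrative sketch, not from the paper: overlap an allreduce with local
 * work using the MPI-3 nonblocking collective interface, which follows the
 * spirit of the API the paper argues for.
 * Build with any MPI-3 implementation, e.g.: mpicc -o iallreduce iallreduce.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = (double)rank;   /* placeholder per-rank contribution */
    double global_sum = 0.0;
    MPI_Request req;

    /* Start the collective without blocking ... */
    MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... overlap it with computation that does not depend on the result ... */
    double overlap_work = 0.0;
    for (int i = 0; i < 1000000; i++)
        overlap_work += 1e-6 * (double)i;

    /* ... and complete it only when the reduced value is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("global sum = %f (overlapped work = %f)\n", global_sum, overlap_work);

    MPI_Finalize();
    return 0;
}

The same pattern applies to the other nonblocking collectives (MPI_Ibcast, MPI_Igather, and so on), which is the convenience-plus-overlap benefit the abstract describes.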
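
For the large-message reductions discussed in the second and third abstracts, reduce-scatter based algorithms are reported to perform best. The sketch below is a generic illustration of that algorithmic pattern, not code from either paper: it composes an allreduce from MPI_Reduce_scatter_block followed by MPI_Allgather and assumes the element count divides evenly among the ranks.

/* Generic reduce-scatter + allgather allreduce for large buffers.
 * Assumption (for brevity): count is divisible by the number of ranks. */
#include <mpi.h>
#include <stdlib.h>

/* Sum-allreduce of `count` doubles, callable in place of MPI_Allreduce. */
static void allreduce_reduce_scatter(double *sendbuf, double *recvbuf,
                                     int count, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int block = count / nprocs;                 /* each rank owns one block */
    double *partial = malloc((size_t)block * sizeof(double));

    /* Phase 1: after this call, each rank holds the fully reduced values
     * for its own block only (bandwidth-efficient for large counts). */
    MPI_Reduce_scatter_block(sendbuf, partial, block, MPI_DOUBLE,
                             MPI_SUM, comm);

    /* Phase 2: gather all reduced blocks onto every rank. */
    MPI_Allgather(partial, block, MPI_DOUBLE,
                  recvbuf, block, MPI_DOUBLE, comm);

    free(partial);
}

For small messages, where synchronization rather than data movement dominates, the abstracts above report that fan-in/fan-out and tree-based schemes win instead.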