
Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

Conference · 2013 IEEE International Conference on Cluster Computing (CLUSTER)
Many scientific simulations that use the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions for performing mathematical operations over all processes that take part in the simulation. In this work, we propose a hierarchical design for implementing reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to progress each level of the hierarchy independently. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate them on multiple architectures, including InfiniBand clusters and a Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as the Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%. An evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up the total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
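To illustrate the hierarchical idea described above, the following is a minimal sketch in C using only standard MPI-3 calls: reduce within each node, allreduce across one leader per node, then broadcast the result back within the node. It is not the Cheetah implementation; the helper name hierarchical_allreduce and the fixed MPI_SUM/MPI_DOUBLE choices are assumptions made for illustration, and the sketch omits Cheetah's configurable hierarchy depth, per-level independent progress, and the nonblocking variants (e.g., MPI_Iallreduce).

/* Two-level allreduce sketch (illustrative only, not Cheetah code). */
#include <mpi.h>
#include <stdio.h>

static void hierarchical_allreduce(const double *sendbuf, double *recvbuf,
                                   int count, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank, world_rank;

    /* Level split: one communicator per shared-memory node (MPI-3). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node_rank == 0) per node forms the inter-node level. */
    MPI_Comm_rank(comm, &world_rank);
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, world_rank,
                   &leader_comm);

    /* Level 1: intra-node reduction to the node leader. */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Level 2: inter-node allreduce among leaders only. */
    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      leader_comm);

    /* Level 3: intra-node broadcast of the final result. */
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;
    hierarchical_allreduce(&in, &out, 1, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %g\n", out);
    MPI_Finalize();
    return 0;
}

The layered structure is what makes the intra-node step free to use shared-memory mechanisms while the inter-node step uses the network, which is the separation the paper's design exploits.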
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science; USDOE
OSTI ID:
1567567
Conference Information:
Conference Name: 2013 IEEE International Conference on Cluster Computing (CLUSTER)
Country of Publication:
United States
Language:
English

Similar Records

Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation
Conference · 2013 · OSTI ID: 1095156

Cheetah: A Framework for Scalable Hierarchical Collective Operations
Conference · 2011 · OSTI ID: 1035530

Design and Implementation of Broadcast Algorithms for Extreme-Scale Systems
Conference · 2011 · OSTI ID: 1042820
