skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation

Conference ·
OSTI ID:1095156

Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI Allreduce and MPI Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for various communication mechanisms in the system 2) providing the ability to configure the depth of hierarchy to match the system architecture, and 3) providing the ability to independently progress each of this hierarchy. Using this design, we implement MPI Allreduce and MPI Reduce operations (and its nonblocking variants MPI Iallreduce and MPI Ireduce) for all message sizes, and evaluate on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, which is a framework for implementing hierarchical collective operations to implement these reductions. The experimental results show that the Cheetah reduction operations outperform the production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating its efficiency, flexibility and portability. On Infini- Band systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieves a speedup of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations also show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x, compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms the Cray MPI by 145%. The evaluation with an application kernel, Conjugate Gradient solver, shows that the Cheetah reductions speeds up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
DE-AC05-00OR22725
OSTI ID:
1095156
Resource Relation:
Conference: IEEE Cluster 2013, Indianapolis, IN, USA, 20130923, 20130927
Country of Publication:
United States
Language:
English

Similar Records

Related Subjects