Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation

Gorentla Venkata, Manjunath; Shamis, Pavel; Graham, Richard L; Ladd, Joshua S; Sampath, Rahul S

Conference · Tue Jan 01 00:00:00 EST 2013

OSTI ID:1095156

Gorentla Venkata, Manjunath ^[1]; Shamis, Pavel ^[1]; Graham, Richard L ^[1]; Ladd, Joshua S ^[1]; Sampath, Rahul S ^[1]

^[1] ORNL

Many scientific simulations that use the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions for performing mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to progress each level of the hierarchy independently. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate them on multiple architectures including InfiniBand and Cray XT5. To implement these reductions, we leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating Cheetah's efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%. The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
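
The hierarchical idea described in the abstract can be illustrated with a minimal sketch. This is not Cheetah's implementation and does not use MPI: it is a pure-Python simulation, assuming a two-level hierarchy in which ranks are packed onto nodes in contiguous blocks. The function name `hierarchical_allreduce` and the parameter `ranks_per_node` are hypothetical and chosen only for illustration.

```python
from functools import reduce
import operator


def hierarchical_allreduce(values, ranks_per_node, op=operator.add):
    """Simulate a two-level allreduce over per-rank contributions.

    values[i] is the contribution of rank i. Ranks are assumed to be
    placed on nodes in contiguous blocks of size ranks_per_node.
    Returns the list of results, one per rank (all identical).
    """
    if ranks_per_node < 1:
        raise ValueError("ranks_per_node must be at least 1")
    if not values:
        return []

    # Level 1 (intra-node): each node combines the values of its local
    # ranks, standing in for a shared-memory reduction.
    node_groups = [
        values[start:start + ranks_per_node]
        for start in range(0, len(values), ranks_per_node)
    ]
    node_partials = [reduce(op, group) for group in node_groups]

    # Level 2 (inter-node): one leader per node combines the partial
    # results, standing in for a network-level reduction.
    total = reduce(op, node_partials)

    # Back down the hierarchy: every rank receives the final result.
    return [total] * len(values)


# Eight ranks on two four-core nodes, each contributing its rank + 1.
result = hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], ranks_per_node=4)
```

For an associative operation such as integer addition or `max`, the result matches a flat reduction over all ranks; the hierarchy only changes where the combining work happens, so each level can use the communication mechanism best suited to it.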

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1095156

Resource Relation: Conference: IEEE Cluster 2013, Indianapolis, IN, USA, September 23-27, 2013

Country of Publication: United States

Language: English

Similar Records

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

Conference · Sun Sep 01 00:00:00 EDT 2013 · 2013 IEEE International Conference on Cluster Computing (CLUSTER) · OSTI ID:1095156

Venkata, Manjunath Gorentla; Shamis, Pavel; Sampath, Rahul; +2 more

Cheetah: A Framework for Scalable Hierarchical Collective Operations

Conference · Sat Jan 01 00:00:00 EST 2011 · OSTI ID:1095156

Graham, Richard L; Gorentla Venkata, Manjunath; Ladd, Joshua S; +4 more

Design and Implementation of Broadcast Algorithms for Extreme-Scale Systems

Conference · Sat Jan 01 00:00:00 EST 2011 · OSTI ID:1095156

Shamis, Pavel; Graham, Richard L; Gorentla Venkata, Manjunath; +1 more