Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather

Kandalla, Krishna; Subramoni, Hari; Vishnu, Abhinav; Panda, Dhabaleswar K

doi:10.1109/IPDPSW.2010.5470853

Abstract

Modern high-performance computing systems are increasingly deployed in a hierarchical fashion, with multi-core computing platforms forming the base of the hierarchy. These systems usually comprise multiple racks, each rack consisting of a finite number of chassis, and each chassis holding multiple compute nodes or blades based on multi-core architectures. The networks are also hierarchical, with multiple levels of switches. Message exchange operations between processes that belong to different racks involve multiple hops across different switches, and this directly affects the performance of collective operations. In this paper, we take on the challenges involved in detecting the topology of large scale InfiniBand clusters and leveraging this knowledge to design efficient topology-aware algorithms for collective operations. We also propose a communication model to analyze the communication costs involved in collective operations on large scale supercomputing systems. We have analyzed the performance characteristics of two collectives, MPI_Gather and MPI_Scatter, on such systems and have proposed topology-aware algorithms for these operations. Our experimental results show that the proposed algorithms can improve the performance of these collective operations by almost 54% at the micro-benchmark level.
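The idea behind a topology-aware gather can be illustrated with a small simulation. The sketch below is not the algorithm from the paper; it is a minimal two-level scheme under stated assumptions: each process is mapped to a rack by a hypothetical `rack_of` list, one leader per rack collects its rack's blocks, and only leaders send across racks to the root. The functions `flat_gather` and `topo_gather` are illustrative names, and the only quantity counted is the number of messages that cross rack boundaries.

```python
from collections import defaultdict


def flat_gather(rack_of, data, root=0):
    """Every rank sends its block straight to the root.

    Returns (gathered blocks in rank order, inter-rack message count).
    """
    inter_rack = sum(
        1
        for rank in range(len(data))
        if rank != root and rack_of[rank] != rack_of[root]
    )
    return list(data), inter_rack


def topo_gather(rack_of, data, root=0):
    """Two-level gather: rack members -> rack leader -> root.

    Assumption: the leader is the lowest rank in each rack, except in
    the root's rack, where the root itself acts as leader.
    """
    racks = defaultdict(list)
    for rank, rack in enumerate(rack_of):
        racks[rack].append(rank)

    result = [None] * len(data)
    inter_rack = 0
    for rack, members in racks.items():
        leader = root if rack == rack_of[root] else min(members)
        # Phase 1: intra-rack gather, the leader bundles its rack's blocks.
        bundle = {rank: data[rank] for rank in members}
        # Phase 2: one aggregated message per remote rack crosses to the root.
        if leader != root:
            inter_rack += 1
        for rank, block in bundle.items():
            result[rank] = block
    return result, inter_rack


# Hypothetical layout: 12 processes spread over 3 racks of 4.
rack_of = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
data = [f"block-{rank}" for rank in range(len(rack_of))]

flat_result, flat_msgs = flat_gather(rack_of, data)
topo_result, topo_msgs = topo_gather(rack_of, data)
print(flat_msgs, topo_msgs)  # prints "8 2"
```

Both variants deliver the same blocks to the root. The two-level version moves the same total volume across racks but in fewer, larger messages, which is the kind of trade-off a communication cost model would need to quantify.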

Authors: Kandalla, Krishna; Subramoni, Hari; Vishnu, Abhinav; Panda, Dhabaleswar K

Publication Date: 2010-04-01

Research Org.: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Sponsoring Org.: USDOE

OSTI Identifier: 986268

Report Number(s): PNNL-SA-71043
KJ0402000; TRN: US201017%%42

DOE Contract Number: AC05-76RL01830

Resource Type: Conference

Resource Relation: Conference: IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW 2010)

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICAL METHODS AND COMPUTING; ALGORITHMS; DESIGN; PERFORMANCE; TOPOLOGY; COMPUTER NETWORKS; COMPUTER ARCHITECTURE; DATA TRANSMISSION; SUPERCOMPUTERS

Citation Formats


Kandalla, Krishna, Subramoni, Hari, Vishnu, Abhinav, and Panda, Dhabaleswar K. Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather. United States: N. p., 2010. Web. doi:10.1109/IPDPSW.2010.5470853.


Kandalla, Krishna, Subramoni, Hari, Vishnu, Abhinav, & Panda, Dhabaleswar K. Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather. United States. https://doi.org/10.1109/IPDPSW.2010.5470853


Kandalla, Krishna, Subramoni, Hari, Vishnu, Abhinav, and Panda, Dhabaleswar K. 2010. "Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather". United States. https://doi.org/10.1109/IPDPSW.2010.5470853.


                    
@inproceedings{osti_986268,
  title        = {Designing Topology-Aware Collective Communication Algorithms for Large Scale InfiniBand Clusters: Case Studies with Scatter and Gather},
  author       = {Kandalla, Krishna and Subramoni, Hari and Vishnu, Abhinav and Panda, Dhabaleswar K},
  abstractNote = {Modern high-performance computing systems are increasingly deployed in a hierarchical fashion, with multi-core computing platforms forming the base of the hierarchy. These systems usually comprise multiple racks, each rack consisting of a finite number of chassis, and each chassis holding multiple compute nodes or blades based on multi-core architectures. The networks are also hierarchical, with multiple levels of switches. Message exchange operations between processes that belong to different racks involve multiple hops across different switches, and this directly affects the performance of collective operations. In this paper, we take on the challenges involved in detecting the topology of large scale InfiniBand clusters and leveraging this knowledge to design efficient topology-aware algorithms for collective operations. We also propose a communication model to analyze the communication costs involved in collective operations on large scale supercomputing systems. We have analyzed the performance characteristics of two collectives, MPI_Gather and MPI_Scatter, on such systems and have proposed topology-aware algorithms for these operations. Our experimental results show that the proposed algorithms can improve the performance of these collective operations by almost 54\% at the micro-benchmark level.},
  booktitle    = {IEEE International Symposium on Parallel \& Distributed Processing, Workshops and Phd Forum (IPDPSW 2010)},
  doi          = {10.1109/IPDPSW.2010.5470853},
  url          = {https://www.osti.gov/biblio/986268},
  place        = {United States},
  year         = {2010},
  month        = apr
}

Conference:

https://doi.org/10.1109/IPDPSW.2010.5470853

Similar records in OSTI.GOV collections:

Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective

Conference Vishnu, A; Koop, M; Moody, A; ...

Large scale InfiniBand clusters are becoming increasingly popular, as reflected by the TOP 500 Supercomputer rankings. At the same time, fat tree has become a popular interconnection topology for these clusters, since it makes multiple paths available between a pair of nodes. However, even with fat tree, hot-spots may occur in the network depending upon the route configuration between end nodes and the communication pattern(s) in the application. To make matters worse, the deterministic routing nature of InfiniBand prevents the application from transparently making effective use of multiple paths to avoid hot-spots in the network. Simulation based studies ...
https://doi.org/10.1109/CCGRID.2007.60

Dynamic Time-Variant Connection Management for PGAS Models on InfiniBand

Conference Vishnu, Abhinav; Krishnan, Manoj; Balaji, Pavan

InfiniBand (IB) has established itself as a promising network infrastructure for high-end cluster computing systems, as evidenced by its usage in the Top500 supercomputers today. While the IB standard describes multiple communication models (including reliable connection (RC) and unreliable datagram (UD)), most of its promising features, such as remote direct memory access (RDMA), hardware atomics, and network fault tolerance, are only available for the RC model, which requires connections between communicating process pairs. In the past, several researchers have proposed on-demand connection management techniques that establish connections when there is a need to communicate, and not before. While such techniques work ...

Future SDN-HPON Control Plane Architecture and Protocol for On-Demand Terabit End-to-End Extreme-Scale Science Applications

Technical Report Yoo, S.J.

The UC Davis team conducted architecture, algorithm and experimental studies for next generation ultra-high-bandwidth optical networks in support of extreme-scale science applications. In particular, over the course of this 3-year project, the UC Davis team achieved the following main accomplishments: designed a network architecture for automated multi-domain service provisioning; designed information exchange schemes between autonomous systems (domains); designed application programming interfaces to orchestrate resource reservation in data centers, high performance computation facilities, and communication networks belonging to different autonomous systems; experimentally assessed the multi-domain network architecture in a distributed field trial set-up connecting premises in three continents; designed a broker-based ...
https://doi.org/10.2172/1579715

X-SRQ - Improving Scalability and Performance of Multi-Core InfiniBand Clusters

Conference Shipman, Galen; Poole, Stephen

To improve the scalability of InfiniBand on large scale clusters, Open MPI introduced a protocol known as B-SRQ [2]. This protocol was shown to provide much better memory utilization of send and receive buffers for a wide variety of benchmarks and real-world applications. Unfortunately, B-SRQ increases the number of connections between communicating peers. While addressing one scalability problem of InfiniBand, the protocol introduced another. To alleviate the connection scalability problem of the B-SRQ protocol, a small enhancement to the reliable connection transport was requested which would allow multiple shared receive queues to be attached to a single reliable connection. This ...

Efficient On-demand Connection Management Mechanisms with PGAS Models on InfiniBand

Conference Vishnu, Abhinav; Krishnan, Manoj

In the last decade or so, clusters have seen a tremendous rise in popularity due to their excellent price to performance ratio. A variety of interconnects have been proposed during this period, with InfiniBand leading the way due to its high performance and open standard. At the same time, multiple programming models have emerged to meet the requirements of various applications. To support the requirements of multiple programming models, InfiniBand provides multiple transport semantics, ranging from unreliable connectionless to reliable connected characteristics. Among them, the reliable connection (RC) semantics is being widely used due to its ...
https://doi.org/10.1109/CCGRID.2010.58