Title: Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective

Abstract

Large scale InfiniBand clusters are becoming increasingly popular, as reflected by the TOP 500 Supercomputer rankings. At the same time, fat tree has become a popular interconnection topology for these clusters, since it allows multiple paths to be available in between a pair of nodes. However, even with fat tree, hot-spots may occur in the network depending upon the route configuration between end nodes and communication pattern(s) in the application. To make matters worse, the deterministic routing nature of InfiniBand limits the application from effective use of multiple paths transparently and avoid the hot-spots in the network. Simulation based studies for switches and adapters to implement congestion control have been proposed in the literature. However, these studies have focused on providing congestion control for the communication path, and not on utilizing multiple paths in the network for hot-spot avoidance. In this paper, we design an MPI functionality, which provides hot-spot avoidance for different communications, without a priori knowledge of the pattern. We leverage LMC (LID Mask Count) mechanism of InfiniBand to create multiple paths in the network and present the design issues (scheduling policies, selecting number of paths, scalability aspects) of our design. We implement our design and evaluate it with Pallas collective communication and MPI applications. On an InfiniBand cluster with 48 processes, collective operations like MPI All-to-all Personalized and MPI Reduce Scatter show an improvement of 27% and 19% respectively. Our evaluation with MPI applications like NAS Parallel Benchmarks and PSTSWM on 64 processes shows significant improvement in execution time with this functionality.
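
The LMC idea mentioned in the abstract can be illustrated with a minimal sketch. In InfiniBand, a port configured with LMC = n answers to 2^n consecutive LIDs starting at an aligned base LID, and the subnet manager may route each of those LIDs differently, which is what makes multiple paths available. The abstract does not describe the paper's scheduling policies, so the round-robin policy below is only an assumed stand-in, and the function names `lids_for_port` and `round_robin_schedule` are hypothetical.

```python
from itertools import cycle


def lids_for_port(base_lid, lmc):
    """Return the 2**lmc consecutive LIDs a port answers to for a given LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit field, so it must be in 0..7")
    count = 1 << lmc
    if base_lid % count != 0:
        raise ValueError("base LID must be aligned to 2**lmc")
    return [base_lid + i for i in range(count)]


def round_robin_schedule(messages, dest_lids):
    """Assign each message to the next destination LID in turn.

    Assumed policy for illustration only; not taken from the paper.
    """
    paths = cycle(dest_lids)
    return [(msg, next(paths)) for msg in messages]


# Example: LMC = 2 gives four LIDs, hence up to four distinct routes.
lids = lids_for_port(base_lid=16, lmc=2)
plan = round_robin_schedule(["m0", "m1", "m2", "m3", "m4"], lids)
```

With these example values, `lids` holds four addresses and the fifth message wraps around to the first one. Whether the four LIDs actually map to disjoint physical routes depends on how the subnet manager programs the switch forwarding tables.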

Authors:
Vishnu, A; Koop, M; Moody, A; Mamidala, A R; Narravula, S; Panda, D K
Publication Date:
2007-03-06
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
908380
Report Number(s):
UCRL-CONF-228725
TRN: US200722%%617
DOE Contract Number:
W-7405-ENG-48
Resource Type:
Conference
Resource Relation:
Conference: Presented at: CCGrid 07 - Seventh IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, May 14 - May 17, 2007
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; AVOIDANCE; BENCHMARKS; COMMUNICATIONS; CONFIGURATION; DESIGN; EVALUATION; HOT SPOTS; ROUTING; SIMULATION; SUPERCOMPUTERS; SWITCHES; TOPOLOGY

Citation Formats

Vishnu, A, Koop, M, Moody, A, Mamidala, A R, Narravula, S, and Panda, D K. Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective. United States: N. p., 2007. Web.
Vishnu, A, Koop, M, Moody, A, Mamidala, A R, Narravula, S, & Panda, D K. Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective. United States.
Vishnu, A, Koop, M, Moody, A, Mamidala, A R, Narravula, S, and Panda, D K. 2007. "Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective". United States. https://www.osti.gov/servlets/purl/908380.
@article{osti_908380,
title = {Hot-Spot Avoidance With Multi-Pathing Over Infiniband: An MPI Perspective},
author = {Vishnu, A and Koop, M and Moody, A and Mamidala, A R and Narravula, S and Panda, D K},
abstractNote = {Large scale InfiniBand clusters are becoming increasingly popular, as reflected by the TOP 500 Supercomputer rankings. At the same time, fat tree has become a popular interconnection topology for these clusters, since it allows multiple paths to be available in between a pair of nodes. However, even with fat tree, hot-spots may occur in the network depending upon the route configuration between end nodes and communication pattern(s) in the application. To make matters worse, the deterministic routing nature of InfiniBand limits the application from effective use of multiple paths transparently and avoid the hot-spots in the network. Simulation based studies for switches and adapters to implement congestion control have been proposed in the literature. However, these studies have focused on providing congestion control for the communication path, and not on utilizing multiple paths in the network for hot-spot avoidance. In this paper, we design an MPI functionality, which provides hot-spot avoidance for different communications, without a priori knowledge of the pattern. We leverage LMC (LID Mask Count) mechanism of InfiniBand to create multiple paths in the network and present the design issues (scheduling policies, selecting number of paths, scalability aspects) of our design. We implement our design and evaluate it with Pallas collective communication and MPI applications. On an InfiniBand cluster with 48 processes, collective operations like MPI All-to-all Personalized and MPI Reduce Scatter show an improvement of 27% and 19% respectively. Our evaluation with MPI applications like NAS Parallel Benchmarks and PSTSWM on 64 processes shows significant improvement in execution time with this functionality.},
url = {https://www.osti.gov/servlets/purl/908380},
place = {United States},
year = {2007},
month = mar
}

  • InfiniBand has become a very popular interconnect, due to its advanced features and open standard. Large scale InfiniBand clusters are becoming very popular, as reflected by the TOP 500 supercomputer rankings. However, even with popular topologies like constant bi-section bandwidth Fat Tree, hot-spots may occur with InfiniBand, due to inappropriate configuration of network paths, presence of other jobs in the network and unavailability of adaptive routing. In this paper, we present a hot-spot avoidance layer (HSAL) for InfiniBand, which provides hot-spot avoidance using path bandwidth estimation and multi-pathing using the LMC mechanism, without taking the network topology into account. We propose an adaptive striping policy with batch based striping and sorting approach, for efficient utilization of disjoint network paths. Integration of HSAL with MPI, the de facto programming model of clusters, shows promising results with collective communication primitives and MPI applications.
  • InfiniBand (IB) is a popular network technology for modern high-performance computing systems. MPI implementations traditionally support IB using a reliable, connection-oriented (RC) transport. However, per-process resource usage that grows linearly with the number of processes makes this approach prohibitive for large-scale systems. IB provides an alternative in the form of a connectionless unreliable datagram transport (UD), which allows for near-constant resource usage and initialization overhead as the process count increases. This paper describes a UD-based implementation for IB in Open MPI as a scalable alternative to existing RC-based schemes. We use the software reliability capabilities of Open MPI to provide the guaranteed delivery semantics required by MPI. Results show that UD not only requires fewer resources at scale, but also allows for shorter MPI startup times. A connectionless model also improves performance for applications that tend to send small messages to many different processes.
  • No abstract prepared.
  • Harness is an adaptable, plug-in-based middleware framework that supports distributed parallel computing. It is currently based on the Ethernet protocol, which cannot guarantee high throughput or real-time (deterministic) performance. In recent years, both research and industry have developed new network architectures (InfiniBand, Myrinet, iWARP, etc.) to overcome those limits. This paper concerns the integration of Harness with InfiniBand, focusing on two solutions: IP over InfiniBand (IPoIB) and Sockets Direct Protocol (SDP). They allow the Harness middleware to take advantage of the enhanced features provided by the InfiniBand Architecture.
  • Checkpoint-Restart is one of the most used software approaches to achieve fault-tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable transparent system-level solution to address fault-tolerance for applications based on global address space (GAS) programming models on Infiniband clusters. In addition to handling communication, the solution addresses transparent checkpoint of user-generated files. We exploit the support for the Infiniband network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library capable of suspending and resuming applications. We present efficient and scalable mechanisms to distribute checkpoint requests and to back up virtual machine memory images and file systems. We tested our approach in the context of NWChem, a popular computational chemistry suite. We demonstrated that NWChem can be executed, without any modification to the source code, on a virtualized 8-node cluster with very little overhead (below 3%). We observe that the total checkpoint time is limited by disk I/O. Finally, we measured system-size dependent components of the checkpoint time on up to 1024 cores (128 nodes), demonstrating the scalability of our approach in medium/large-scale systems.