OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Cheetah: A Framework for Scalable Hierarchical Collective Operations

Abstract

Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous, with increasing node and core-per-node counts. In addition, a growing number of data-access mechanisms with varying characteristics are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. The framework is flexible, supporting run-time hierarchy specification and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy, reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5 and on a small InfiniBand-based cluster. At 49,152 processes, our barrier implementation outperforms the optimized native implementation by 75%. Our 32-byte and one-megabyte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.
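To make the hierarchical idea in the abstract concrete, the sketch below composes a two-level barrier from standard MPI-3 calls: ranks first synchronize within their shared-memory node, one leader per node then synchronizes across nodes, and the release is fanned back out inside each node. This is only an illustrative sketch of the general technique; the two-level split, the helper name hierarchical_barrier, and the use of MPI_Comm_split_type are assumptions for this example and do not reflect Cheetah's actual primitives, hierarchy discovery, or buffer management.

/* Minimal sketch of a two-level hierarchical barrier (illustration only,
 * not the Cheetah implementation): intra-node fan-in, inter-node exchange
 * among node leaders, then intra-node release. */
#include <mpi.h>

static void hierarchical_barrier(MPI_Comm comm)
{
    int world_rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &world_rank);

    /* Level 1: ranks that share a node (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Level 2: one leader (node_rank == 0) per node; others get MPI_COMM_NULL. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    MPI_Barrier(node_comm);              /* intra-node fan-in */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Barrier(leader_comm);        /* inter-node exchange among leaders */
        MPI_Comm_free(&leader_comm);
    }
    MPI_Barrier(node_comm);              /* intra-node release */

    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    hierarchical_barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

In practice the per-level subcommunicators would be built once and reused across calls, since communicator construction costs far more than the barrier itself.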

Authors:
 Graham, Richard L [1]; Gorentla Venkata, Manjunath [1]; Ladd, Joshua S [1]; Shamis, Pavel [1]; Rabinovitz, Ishai [2]; Filipov, Vasily [2]; Shainer, Gilad [2]
  1. ORNL
  2. Mellanox Technologies, Inc.
Publication Date:
2011
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1035530
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: International Symposium on Cluster, Cloud and Grid Computing, Newport Beach, CA, USA, May 23-26, 2011
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ALGORITHMS; BUFFERS; CLOUDS; COMMUNICATIONS; COMPUTERS; IMPLEMENTATION; MANAGEMENT; PERFORMANCE

Citation Formats

Graham, Richard L, Gorentla Venkata, Manjunath, Ladd, Joshua S, Shamis, Pavel, Rabinovitz, Ishai, Filipov, Vasily, and Shainer, Gilad. Cheetah: A Framework for Scalable Hierarchical Collective Operations. United States: N. p., 2011. Web.
Graham, Richard L, Gorentla Venkata, Manjunath, Ladd, Joshua S, Shamis, Pavel, Rabinovitz, Ishai, Filipov, Vasily, & Shainer, Gilad. Cheetah: A Framework for Scalable Hierarchical Collective Operations. United States.
Graham, Richard L, Gorentla Venkata, Manjunath, Ladd, Joshua S, Shamis, Pavel, Rabinovitz, Ishai, Filipov, Vasily, and Shainer, Gilad. 2011. "Cheetah: A Framework for Scalable Hierarchical Collective Operations". United States.
@article{osti_1035530,
title = {Cheetah: A Framework for Scalable Hierarchical Collective Operations},
author = {Graham, Richard L and Gorentla Venkata, Manjunath and Ladd, Joshua S and Shamis, Pavel and Rabinovitz, Ishai and Filipov, Vasily and Shainer, Gilad},
abstractNote = {Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous, with increasing node and core-per-node counts. In addition, a growing number of data-access mechanisms with varying characteristics are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. The framework is flexible, supporting run-time hierarchy specification and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy, reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5 and on a small InfiniBand-based cluster. At 49,152 processes, our barrier implementation outperforms the optimized native implementation by 75%. Our 32-byte and one-megabyte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.},
place = {United States},
year = {2011},
month = {1}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
