OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Reaching bandwidth saturation using transparent injection parallelization

Abstract

Although logically available, applications may not exploit enough instantaneous communication concurrency to maximize network utilization on HPC systems. This is exacerbated in hybrid programming models that combine single program multiple data with OpenMP or CUDA. We present the design of a “multi-threaded” runtime able to transparently increase the instantaneous network concurrency and to provide near saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime offloads communication requests from application level tasks to multiple communication servers. The servers use system specific performance models to attain network saturation. Our techniques alleviate the need for spatial and temporal application level message concurrency optimizations. Experimental results show improved message throughput and bandwidth by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into 23% speedup on 12,288 cores for a NAS FT implemented using FFTW. We observe as much as 76% speedup on 1500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism. For the geometric multigrid GPU implementation, we observe as much as 44% speedup on 512 GPUs.

Authors:
 Chaimov, Nicholas [1]; Ibrahim, Khaled Z. [2]; Williams, Samuel [2]; Iancu, Costin [2]
  1. Univ. of Oregon, Eugene, OR (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Univ. of California, Oakland, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1565625
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 31; Journal Issue: 5; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
Computer Science

Citation Formats

Chaimov, Nicholas, Ibrahim, Khaled Z., Williams, Samuel, and Iancu, Costin. Reaching bandwidth saturation using transparent injection parallelization. United States: N. p., 2016. Web. doi:10.1177/1094342016672720.
Chaimov, Nicholas, Ibrahim, Khaled Z., Williams, Samuel, & Iancu, Costin. Reaching bandwidth saturation using transparent injection parallelization. United States. doi:10.1177/1094342016672720.
Chaimov, Nicholas, Ibrahim, Khaled Z., Williams, Samuel, and Iancu, Costin. Wed . "Reaching bandwidth saturation using transparent injection parallelization". United States. doi:10.1177/1094342016672720.
@article{osti_1565625,
title = {Reaching bandwidth saturation using transparent injection parallelization},
author = {Chaimov, Nicholas and Ibrahim, Khaled Z. and Williams, Samuel and Iancu, Costin},
abstractNote = {Although logically available, applications may not exploit enough instantaneous communication concurrency to maximize network utilization on HPC systems. This is exacerbated in hybrid programming models that combine single program multiple data with OpenMP or CUDA. We present the design of a “multi-threaded” runtime able to transparently increase the instantaneous network concurrency and to provide near saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime offloads communication requests from application level tasks to multiple communication servers. The servers use system specific performance models to attain network saturation. Our techniques alleviate the need for spatial and temporal application level message concurrency optimizations. Experimental results show improved message throughput and bandwidth by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into 23% speedup on 12,288 cores for a NAS FT implemented using FFTW. We observe as much as 76% speedup on 1500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism. For the geometric multigrid GPU implementation, we observe as much as 44% speedup on 512 GPUs.},
doi = {10.1177/1094342016672720},
journal = {International Journal of High Performance Computing Applications},
issn = {1094-3420},
number = 5,
volume = 31,
place = {United States},
year = {2016},
month = {10}
}
