U.S. Department of Energy
Office of Scientific and Technical Information

Reaching bandwidth saturation using transparent injection parallelization

Journal Article · International Journal of High Performance Computing Applications
 [1];  [2];  [2];  [2]
  1. Univ. of Oregon, Eugene, OR (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Applications may not exploit enough instantaneous communication concurrency to maximize network utilization on HPC systems, even when that concurrency is logically available. This is exacerbated in hybrid programming models that combine single program, multiple data (SPMD) execution with OpenMP or CUDA. We present the design of a “multi-threaded” runtime able to transparently increase the instantaneous network concurrency and to provide near-saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime offloads communication requests from application-level tasks to multiple communication servers. The servers use system-specific performance models to attain network saturation. Our techniques alleviate the need for spatial and temporal application-level message concurrency optimizations. Experimental results show message throughput and bandwidth improved by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into 23% speedup on 12,288 cores for NAS FT implemented using FFTW. We observe as much as 76% speedup on 1500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism. For the geometric multigrid GPU implementation, we observe as much as 44% speedup on 512 GPUs.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Univ. of California, Oakland, CA (United States)
Sponsoring Organization:
USDOE Office of Science
DOE Contract Number:
AC02-05CH11231
OSTI ID:
1565625
Journal Information:
International Journal of High Performance Computing Applications, Vol. 31, Issue 5; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
