Reaching bandwidth saturation using transparent injection parallelization

Chaimov, Nicholas; Ibrahim, Khaled Z.; Williams, Samuel; Iancu, Costin

doi:10.1177/1094342016672720

Journal Article · Wed Oct 05 04:00:00 EDT 2016 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342016672720 · OSTI ID: 1565625

Chaimov, Nicholas [1]; Ibrahim, Khaled Z. [2]; Williams, Samuel [2]; Iancu, Costin [2]

[1] Univ. of Oregon, Eugene, OR (United States)
[2] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Applications may not exploit enough instantaneous communication concurrency to maximize network utilization on HPC systems, even when that concurrency is logically available. The problem is exacerbated in hybrid programming models that combine single program multiple data (SPMD) with OpenMP or CUDA. We present the design of a “multi-threaded” runtime that transparently increases instantaneous network concurrency and provides near-saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime offloads communication requests from application-level tasks to multiple communication servers, which use system-specific performance models to attain network saturation. Our techniques alleviate the need for spatial and temporal application-level message concurrency optimizations. Experimental results show message throughput and bandwidth improvements of as much as 150% for 4 KB messages on InfiniBand and as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup, which translates into a 23% speedup on 12,288 cores for a NAS FT implemented using FFTW. We observe as much as 76% speedup on 1500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism, and as much as 44% speedup on 512 GPUs for the geometric multigrid GPU implementation.
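
The offload pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' runtime: the names `OffloadRuntime`, `put`, `fake_inject`, and `run_demo` are hypothetical, the "network" is an in-memory dictionary, and the number of servers is a fixed parameter standing in for whatever a system-specific performance model would select. It only shows application tasks handing requests to a shared queue that several communication server threads drain concurrently.

```python
import queue
import threading


class OffloadRuntime:
    """Toy stand-in: callers enqueue requests, server threads inject them."""

    def __init__(self, num_servers, inject):
        self._requests = queue.Queue()
        self._inject = inject  # hypothetical callable that performs the transfer
        self._servers = [
            threading.Thread(target=self._serve, daemon=True)
            for _ in range(num_servers)
        ]
        for server in self._servers:
            server.start()

    def put(self, dest, payload):
        """Queue a transfer and return an Event that is set on completion."""
        done = threading.Event()
        self._requests.put((dest, payload, done))
        return done

    def _serve(self):
        # Each server drains the shared queue until it sees a None sentinel.
        while True:
            item = self._requests.get()
            if item is None:
                break
            dest, payload, done = item
            self._inject(dest, payload)
            done.set()

    def shutdown(self):
        # One sentinel per server, then wait for all of them to exit.
        for _ in self._servers:
            self._requests.put(None)
        for server in self._servers:
            server.join()


def run_demo(num_servers=4, num_messages=32, size=4096):
    """Send num_messages payloads of `size` bytes to 8 simulated destinations."""
    received = {}
    lock = threading.Lock()

    def fake_inject(dest, payload):
        # Assumption: a real runtime would issue a network operation here.
        with lock:
            received.setdefault(dest, []).append(len(payload))

    runtime = OffloadRuntime(num_servers, fake_inject)
    handles = [runtime.put(i % 8, bytes(size)) for i in range(num_messages)]
    for handle in handles:
        handle.wait()
    runtime.shutdown()
    return received
```

Calling `run_demo()` returns a dictionary mapping each simulated destination to the sizes of the messages delivered to it. The caller never talks to the "network" directly, which is the sense in which the injection concurrency is transparent to the application in this sketch.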

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Univ. of California, Oakland, CA (United States)

Sponsoring Organization: USDOE Office of Science

DOE Contract Number: AC02-05CH11231

OSTI ID: 1565625

Journal Information: International Journal of High Performance Computing Applications, Vol. 31, Issue 5; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Similar Records

Reaching bandwidth saturation using transparent injection parallelization

Journal Article · Tue Nov 08 19:00:00 EST 2016 · International Journal of High Performance Computing Applications · OSTI ID:1437694

Exploiting communication concurrency on high performance computing systems

Conference · Wed Dec 31 23:00:00 EST 2014 · OSTI ID:1407278

GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

Conference · Thu Nov 30 23:00:00 EST 2017 · OSTI ID:1427708

Related Subjects

Computer Science
