Extracting SIMD Parallelism from Recursive Task-Parallel Programs
- BATTELLE (PACIFIC NW LAB)
- Purdue University
- Washington University in St. Louis
The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
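As an illustration (not drawn from the paper itself), the class of programs the abstract describes — divide-and-conquer, recursive, task-parallel computations — is typified by a Cilk-style recursive kernel. The sketch below is sequential C++; the comments mark where a task-parallel runtime would spawn the independent recursive calls, which is exactly the latent parallelism the paper's transformations regroup for SIMD execution.

```cpp
#include <cassert>

// Illustrative sketch only: a canonical divide-and-conquer, task-parallel
// computation. In a Cilk-style program the two recursive calls below would
// run as independent tasks; the paper's transformations expose that
// independence as data parallelism, so that vector lanes can execute
// sibling recursive calls together instead of one at a time.
long fib(int n) {
    if (n < 2) return n;     // base case: recursion bottoms out
    long x = fib(n - 1);     // would be `cilk_spawn fib(n - 1)`
    long y = fib(n - 2);     // independent sibling call, same work shape
    return x + y;            // would be preceded by `cilk_sync`
}
```

Because the two recursive calls share no data and execute the same code, many such calls arising at the same recursion depth can, in principle, be batched across SIMD lanes — the intuition behind the transformations the abstract summarizes.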
- Research Organization:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1592696
- Report Number(s):
- PNNL-SA-111445
- Journal Information:
- ACM Transactions on Parallel Computing, Vol. 6, Issue 4
- Country of Publication:
- United States
- Language:
- English
Similar Records
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
High Performance Computing Based Parallel Hierarchical Modal Association Clustering (HPAR HMAC)