U.S. Department of Energy
Office of Scientific and Technical Information

Extracting SIMD Parallelism from Recursive Task-Parallel Programs

Journal Article · ACM Transactions on Parallel Computing
DOI: https://doi.org/10.1145/3365663 · OSTI ID: 1592696

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
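
The general idea of exposing data parallelism in a recursive computation can be illustrated with a small hand-written sketch. It is not the paper's transformation or scheduler: the functions `fib_recursive` and `fib_blocked` and the parameter `max_block` are hypothetical names chosen for this illustration, and Python itself does not vectorize the loops. The sketch only shows the shape of the restructuring, assuming independent subtasks can be grouped into blocks that all perform the same step, with a bounded block size standing in for a space-limiting policy.

```python
def fib_recursive(n):
    """Divide-and-conquer form: each call spawns two independent subtasks."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_blocked(n, max_block=1024):
    """Same computation, restructured so that tasks are processed in blocks.

    Illustrative assumption: a "task" is just its argument k, and the result
    is the sum of the values returned by the base-case tasks.
    """
    total = 0
    stack = [[n]]  # stack of task blocks; each block is a list of arguments
    while stack:
        block = stack.pop()
        # Every task in the block runs the same step, so these two loops are
        # uniform over the block: the loop shape a vectorizer can work with.
        total += sum(k for k in block if k < 2)  # base cases
        children = [c for k in block if k >= 2 for c in (k - 1, k - 2)]
        # Cap the block size so that live blocks stay bounded (hypothetical
        # stand-in for a scheduling policy that limits space usage).
        for i in range(0, len(children), max_block):
            stack.append(children[i:i + max_block])
    return total
```

Calling `fib_blocked(k)` returns the same value as `fib_recursive(k)` for any non-negative `k`; a smaller `max_block` trades block width (and therefore potential vector utilization) for lower memory use.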

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1592696
Report Number(s):
PNNL-SA-111445
Journal Information:
ACM Transactions on Parallel Computing, Vol. 6, Issue 4
Country of Publication:
United States
Language:
English

