Extracting SIMD Parallelism from Recursive Task-Parallel Programs
- BATTELLE (PACIFIC NW LAB)
- Purdue University
- Washington University in St. Louis
The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
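As an illustration (not drawn from the paper itself), the class of programs the abstract describes — divide-and-conquer, recursive, task-parallel computations — is typified by a Cilk-style recursive kernel. The sketch below is sequential C++; the comments mark where a task-parallel runtime would spawn the independent recursive calls, which is exactly the latent parallelism the paper's transformations regroup for SIMD execution.

```cpp
#include <cassert>

// Illustrative sketch only: a canonical divide-and-conquer, task-parallel
// computation. In a Cilk-style program the two recursive calls below would
// run as independent tasks; the paper's transformations expose that
// independence as data parallelism, so that vector lanes can execute
// sibling recursive calls together instead of one at a time.
long fib(int n) {
    if (n < 2) return n;     // base case: recursion bottoms out
    long x = fib(n - 1);     // would be `cilk_spawn fib(n - 1)`
    long y = fib(n - 2);     // independent sibling call, same work shape
    return x + y;            // would be preceded by `cilk_sync`
}
```

Because the two recursive calls share no data and execute the same code, many such calls arising at the same recursion depth can, in principle, be batched across SIMD lanes — the intuition behind the transformations the abstract summarizes.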
- Research Organization:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 1592696
- Report Number(s):
- PNNL-SA-111445
- Journal Information:
- ACM Transactions on Parallel Computing, Vol. 6, Issue 4
- Country of Publication:
- United States
- Language:
- English
Similar Records
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
High Performance Computing Based Parallel Hierarchical Modal Association Clustering (HPAR HMAC)