Efficient Execution of Recursive Programs on Commodity Vector Hardware
The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations; in the absence of explicit data parallelism, such algorithms appear poorly suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units as well as accelerators using Intel's AVX512 units.
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1194297
- Report Number(s): PNNL-SA-107984; KJ0402000
- Country of Publication: United States
- Language: English
Similar Records
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining