OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: Extracting SIMD Parallelism from Recursive Task-Parallel Programs

Abstract

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
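To make the abstract's core idea concrete, the following is a minimal C++ sketch of the general blocking/breadth-first approach it alludes to: re-expressing a recursive task-parallel computation as repeated uniform steps over a flat block (frontier) of pending task instances, so that each step becomes a data-parallel loop a compiler can vectorize. This is only an illustration of the general technique, not the paper's actual transformations; the function names and the toy call-counting problem are hypothetical.

    // Hedged sketch: blocking a recursive computation into a breadth-first
    // frontier so each level is a data-parallel loop over a flat array.
    // Illustrative only; not the paper's actual transformation.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Plain recursive, task-parallel style: one task per call.
    // Counts the total number of calls in the recursion tree.
    uint64_t count_recursive(uint64_t n) {
        if (n < 2) return 1;
        return count_recursive(n - 1) + count_recursive(n - 2) + 1;
    }

    // Blocked form: keep a frontier of pending subproblems and expand the
    // whole block each iteration. The inner loop over `frontier` is a simple
    // loop over a contiguous array, amenable to auto-vectorization (e.g.,
    // across SSE4.2 or AVX-512 lanes), unlike the original pointer-chasing,
    // call-stack-driven recursion.
    uint64_t count_blocked(uint64_t n) {
        std::vector<uint64_t> frontier{n}, next;
        uint64_t calls = 0;
        while (!frontier.empty()) {
            next.clear();
            // Uniform step applied across the whole block of task instances.
            for (uint64_t v : frontier) {
                ++calls;                  // every instance performs this step
                if (v >= 2) {             // instances with more work re-enqueue
                    next.push_back(v - 1);
                    next.push_back(v - 2);
                }
            }
            frontier.swap(next);
        }
        return calls;
    }

    int main() {
        printf("recursive: %llu\n", (unsigned long long)count_recursive(20));
        printf("blocked:   %llu\n", (unsigned long long)count_blocked(20));
    }

Note that a purely breadth-first frontier can grow very large on deep recursions, which is presumably why the abstract pairs the transformations with scheduling policies that keep vector lanes full while limiting space usage.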

Authors:
 Ren, Bin [1]; Balakrishna, Shruthi [2]; Jo, Youngjoon [2]; Krishnamoorthy, Sriram [1]; Agrawal, Kunal [3]; Kulkarni, Milind [2]
  1. Battelle (Pacific Northwest National Laboratory)
  2. Purdue University
  3. Washington University in St. Louis
Publication Date:
December 2019
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1592696
Report Number(s):
PNNL-SA-111445
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Journal Article
Journal Name:
ACM Transactions on Parallel Computing
Additional Journal Information:
Journal Volume: 6; Journal Issue: 4
Country of Publication:
United States
Language:
English
Subject:
recursive programs, task parallelism, vectorization

Citation Formats

Ren, Bin, Balakrishna, Shruthi, Jo, Youngjoon, Krishnamoorthy, Sriram, Agrawal, Kunal, and Kulkarni, Milind. Extracting SIMD Parallelism from Recursive Task-Parallel Programs. United States: N. p., 2019. Web. doi:10.1145/3365663.
Ren, Bin, Balakrishna, Shruthi, Jo, Youngjoon, Krishnamoorthy, Sriram, Agrawal, Kunal, & Kulkarni, Milind. Extracting SIMD Parallelism from Recursive Task-Parallel Programs. United States. doi:10.1145/3365663.
Ren, Bin, Balakrishna, Shruthi, Jo, Youngjoon, Krishnamoorthy, Sriram, Agrawal, Kunal, and Kulkarni, Milind. 2019. "Extracting SIMD Parallelism from Recursive Task-Parallel Programs". United States. doi:10.1145/3365663.
@article{osti_1592696,
title = {Extracting SIMD Parallelism from Recursive Task-Parallel Programs},
author = {Ren, Bin and Balakrishna, Shruthi and Jo, Youngjoon and Krishnamoorthy, Sriram and Agrawal, Kunal and Kulkarni, Milind},
abstractNote = {The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.},
doi = {10.1145/3365663},
journal = {ACM Transactions on Parallel Computing},
number = 4,
volume = 6,
place = {United States},
year = {2019},
month = {12}
}
