OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

Abstract

Modern hardware contains parallel execution resources that are well-suited for data parallelism (vector units) and task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.
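The task-block abstraction described above can be pictured concretely. The following C++ is a minimal sketch invented for illustration, not the paper's implementation: the names, the splitting policy, and the threshold are all assumptions. A TaskBlock gathers independent work items; a scheduler either runs the whole block as one flat, vectorizable loop or splits it across cores once it is large enough.

#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of a "task block": a batch of independent work items
// that can be run either as one data-parallel loop or split across cores.
struct TaskBlock {
    std::vector<int> items;  // one pending iteration/task per element
};

// Run the block as a single flat loop; independent iterations like this
// are what a compiler's vectorizer can map onto vector units.
void run_vectorized(const TaskBlock& b, const std::function<void(int)>& body) {
    for (int x : b.items) body(x);
}

// Split the block and run the halves on two cores (task parallelism).
void run_multicore(const TaskBlock& b, const std::function<void(int)>& body) {
    std::size_t mid = b.items.size() / 2;
    TaskBlock lo{std::vector<int>(b.items.begin(), b.items.begin() + mid)};
    TaskBlock hi{std::vector<int>(b.items.begin() + mid, b.items.end())};
    std::thread t([&] { run_vectorized(lo, body); });
    run_vectorized(hi, body);
    t.join();
}

// Toy dynamic scheduler: small blocks go to the vector units, large blocks
// are split across cores first. The threshold is arbitrary, not from the paper.
void schedule(const TaskBlock& b, const std::function<void(int)>& body) {
    const std::size_t kSplitThreshold = 1024;  // assumption
    if (b.items.size() < kSplitThreshold) run_vectorized(b, body);
    else run_multicore(b, body);
}

The point the abstract makes is captured by schedule(): because the same block can flow to either resource, the decision between vector and multicore execution can be made dynamically rather than fixed at compile time.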

Authors:
Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal; Kulkarni, Milind
Publication Date:
2017-01-26
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1349171
Report Number(s):
PNNL-SA-124366
KJ0402000
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017), February 4-8, 2017, Austin, Texas, pp. 117-130
Country of Publication:
United States
Language:
English

Citation Formats

Ren, Bin, Krishnamoorthy, Sriram, Agrawal, Kunal, and Kulkarni, Milind. Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs. United States: N. p., 2017. Web. doi:10.1145/3018743.3018763.
Ren, Bin, Krishnamoorthy, Sriram, Agrawal, Kunal, & Kulkarni, Milind. Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs. United States. doi:10.1145/3018743.3018763.
Ren, Bin, Krishnamoorthy, Sriram, Agrawal, Kunal, and Kulkarni, Milind. 2017. "Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs". United States. doi:10.1145/3018743.3018763.
@inproceedings{osti_1349171,
title = {Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs},
author = {Ren, Bin and Krishnamoorthy, Sriram and Agrawal, Kunal and Kulkarni, Milind},
abstractNote = {Modern hardware contains parallel execution resources that are well-suited for data parallelism (vector units) and task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.},
booktitle = {Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017)},
pages = {117-130},
doi = {10.1145/3018743.3018763},
place = {United States},
year = {2017},
month = {jan}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. The tradeoffs between task and data parallelism are complex and depend on the characteristics of the program to be executed, most significantly the memory and communication requirements, and the performance parameters of the target parallel machine. In this paper the authors present a framework to isolate and examine the specific characteristics of programs that determine the performance for different mappings. Their focus is on applications that process a stream of input, and whose computation structure is fairly static and predictable. They describe three such applications that were developed with the compiler: fast Fourier transforms, narrowband tracking radar and multibaseline stereo. They examine the tradeoffs between various mappings for them and show how the framework is used to obtain efficient mappings.
  • The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly-wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations; in the absence of data parallelism, it seems that such algorithms are not well-suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units as well as accelerators using Intel's AVX512 units. (A rough sketch of this style of transformation appears after this list.)
  • In this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage in scientific applications. Our main insight is that noise is intrinsic to large-scale parallel executions and it appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate any program regions amenable to DVFS. When compared to previous energy optimizations that make per-core decisions using predictions of the running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best-case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on MPI two-sided and offline analysis, capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost.
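As a rough illustration of the recursive-to-data-parallel transformation described in the second record above, the following C++ sketch (invented for this page, not taken from that paper) re-expresses a recursive tree computation breadth-first, so that each level's work sits in a flat array whose inner loop a compiler can vectorize. All names are hypothetical.

#include <cstddef>
#include <cstdint>
#include <vector>

// Recursive, task-parallel style: each call spawns two children, which on
// its own is a poor fit for vector units.
std::uint64_t count_leaves_rec(std::uint64_t node, std::uint64_t depth) {
    if (depth == 0) return 1;  // leaf
    return count_leaves_rec(2 * node, depth - 1)
         + count_leaves_rec(2 * node + 1, depth - 1);
}

// Blocked, breadth-first re-expression: the nodes of each level live in a
// flat vector, so the inner loop is a simple data-parallel loop amenable to
// vectorization. This mirrors the shape of the transformation, not the
// paper's actual algorithm.
std::uint64_t count_leaves_blocked(std::uint64_t depth) {
    std::vector<std::uint64_t> frontier{1};  // root node
    for (std::uint64_t d = 0; d < depth; ++d) {
        std::vector<std::uint64_t> next(2 * frontier.size());
        for (std::size_t i = 0; i < frontier.size(); ++i) {  // vectorizable
            next[2 * i] = 2 * frontier[i];          // left child
            next[2 * i + 1] = 2 * frontier[i] + 1;  // right child
        }
        frontier.swap(next);
    }
    return frontier.size();  // one entry per leaf
}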