OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Auto-tuning Stencil Computations on Multicore and Accelerators

Abstract

The recent transformation from an environment where gains in computational performance came from increasing clock frequency and other hardware engineering innovations, to an environment where gains are realized through the deployment of ever-increasing numbers of modest-performance cores, has profoundly changed the landscape of scientific application programming. This exponential increase in core count represents both an opportunity and a challenge: access to petascale simulation capabilities and beyond will require that this concurrency be efficiently exploited.

Authors:
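Datta, Kaushik [1]; Williams, Samuel [2]; Volkov, Vasily [1]; Carter, Jonathan [2]; Oliker, Leonid [2]; Shalf, John [2]; Yelick, Katherine [2]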
  1. Univ. of California, Berkeley, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1407093
DOE Contract Number:
AC02-05CH11231
Resource Type:
Book
Resource Relation:
Journal Volume: 20102756; Related Information: Book Title: Scientific Computing with Multicore and Accelerators
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 43 PARTICLE ACCELERATORS

Citation Formats

Datta, Kaushik, Williams, Samuel, Volkov, Vasily, Carter, Jonathan, Oliker, Leonid, Shalf, John, and Yelick, Katherine. Auto-tuning Stencil Computations on Multicore and Accelerators. United States: N. p., 2010. Web. doi:10.1201/b10376-18.
Datta, Kaushik, Williams, Samuel, Volkov, Vasily, Carter, Jonathan, Oliker, Leonid, Shalf, John, & Yelick, Katherine. Auto-tuning Stencil Computations on Multicore and Accelerators. United States. doi:10.1201/b10376-18.
Datta, Kaushik, Williams, Samuel, Volkov, Vasily, Carter, Jonathan, Oliker, Leonid, Shalf, John, and Yelick, Katherine. 2010. "Auto-tuning Stencil Computations on Multicore and Accelerators". United States. doi:10.1201/b10376-18. https://www.osti.gov/servlets/purl/1407093.
@article{osti_1407093,
title = {Auto-tuning Stencil Computations on Multicore and Accelerators},
author = {Datta, Kaushik and Williams, Samuel and Volkov, Vasily and Carter, Jonathan and Oliker, Leonid and Shalf, John and Yelick, Katherine},
abstractNote = {The recent transformation from an environment where gains in computational performance came from increasing clock frequency and other hardware engineering innovations, to an environment where gains are realized through the deployment of ever-increasing numbers of modest-performance cores, has profoundly changed the landscape of scientific application programming. This exponential increase in core count represents both an opportunity and a challenge: access to petascale simulation capabilities and beyond will require that this concurrency be efficiently exploited.},
doi = {10.1201/b10376-18},
journal = {},
number = {},
volume = 20102756,
place = {United States},
year = 2010,
month = {}
}

Book:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this book.

Similar records:
  • Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.
  • In this chapter, we discuss the optimization of three memory-intensive computational kernels: sparse matrix-vector multiplication, the Laplacian differential operator applied to structured grids, and the collision() operator with the lattice Boltzmann magnetohydrodynamics (LBMHD) application. They are all implemented using a single-process, (POSIX) threaded, SPMD model. Unlike their computationally intense dense linear algebra cousins, performance is ultimately limited by DRAM bandwidth and the volume of data that must be transferred. To provide performance portability across current and future multicore architectures, we utilize automatic performance tuning, or auto-tuning.
  • This work introduces a generalized framework for automatically tuning stencil computations to achieve superior performance on a broad range of multicore architectures. Stencil (nearest-neighbor) based kernels constitute the core of many important scientific applications involving block-structured grids. Auto-tuning systems search over optimization strategies to find the combination of tunable parameters that maximizes computational efficiency for a given algorithmic kernel. Although the auto-tuning strategy has been successfully applied to libraries, generalized stencil kernels are not amenable to packaging as libraries. Studied kernels in this work include both memory-bound kernels as well as a computation-bound bilateral filtering kernel. We introduce a generalized stencil auto-tuning framework that takes a straightforward Fortran expression of a stencil kernel and automatically generates tuned implementations of the kernel in C or Fortran to achieve performance portability across diverse computer architectures.
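
The records above describe auto-tuning as a search over optimizations and their parameters to minimize the runtime of a stencil kernel. The following is a minimal illustrative sketch of that idea, not the authors' framework: it applies a 7-point Laplacian to a small 3D grid with loop blocking, then times a handful of block sizes and keeps the fastest. The function names (`make_grid`, `laplacian_blocked`, `autotune`), the block-size parameters `bj` and `bk`, and the candidate list are all assumptions made for illustration. In pure Python, interpreter overhead will likely dominate any cache effect, so the sketch shows the structure of the search rather than realistic speedups.

```python
import itertools
import time


def make_grid(n):
    """Build an n*n*n grid as nested lists with a simple deterministic fill."""
    return [[[float((i * 31 + j * 17 + k * 7) % 11) for k in range(n)]
             for j in range(n)] for i in range(n)]


def laplacian_blocked(src, n, bj, bk):
    """Apply a 7-point Laplacian to interior points, tiling the j and k loops.

    bj and bk are hypothetical tunable block sizes; the result does not
    depend on them, only the traversal order does.
    """
    dst = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for jj in range(1, n - 1, bj):
        for kk in range(1, n - 1, bk):
            for i in range(1, n - 1):
                for j in range(jj, min(jj + bj, n - 1)):
                    for k in range(kk, min(kk + bk, n - 1)):
                        dst[i][j][k] = (
                            src[i - 1][j][k] + src[i + 1][j][k]
                            + src[i][j - 1][k] + src[i][j + 1][k]
                            + src[i][j][k - 1] + src[i][j][k + 1]
                            - 6.0 * src[i][j][k]
                        )
    return dst


def autotune(src, n, candidates, trials=2):
    """Time every (bj, bk) pair and return the fastest pair with its runtime."""
    best = None
    for bj, bk in itertools.product(candidates, repeat=2):
        elapsed = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            laplacian_blocked(src, n, bj, bk)
            # Keep the best of several trials to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if best is None or elapsed < best[1]:
            best = ((bj, bk), elapsed)
    return best


if __name__ == "__main__":
    n = 16
    grid = make_grid(n)
    (bj, bk), seconds = autotune(grid, n, candidates=[2, 4, 8, 14])
    print(f"fastest block: {bj}x{bk} in {seconds:.6f}s")
```

The exhaustive search here is only practical because the parameter space is tiny; a real auto-tuner over many optimizations would probably need to prune or sample the space.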