OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications

Conference

Auto-tuning has emerged as an important practical method for creating highly optimized implementations of key computational kernels and applications. However, the growing complexity of architectures and applications is creating new challenges for auto-tuning. Complex applications can involve a prohibitively large search space that precludes empirical auto-tuning. Similarly, architectures are becoming increasingly complicated, making performance hard to model. In this paper, we focus on the challenge to auto-tuning presented by applications with a large number of kernels and kernel instantiations. While these kernels may share a broadly similar pattern, they differ considerably in problem sizes and in the exact computation performed. We propose and evaluate a new approach to auto-tuning, which we refer to as parameterized micro-benchmarking. It is an alternative to the two existing classes of auto-tuning approaches: analytical model-based and empirical search-based. In particular, we argue that the former may not capture all the architectural features that impact performance, whereas the latter can be too expensive for an application with many different kernels. In our approach, the different expressions in the application, the possible implementations of each expression, and the key architectural features are used to derive a simple micro-benchmark and a small parameter space. This allows us to learn the most significant features of the architecture that affect the choice of implementation for each kernel. We have evaluated our approach in the context of GPU implementations of tensor contraction expressions encountered in excited-state calculations in quantum chemistry. We focused on two aspects of GPUs that affect tensor contraction execution: memory access patterns and kernel consolidation. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2× over a version that used default optimizations but no auto-tuning. We demonstrate that observations made from micro-benchmarks match the behavior seen from real expressions. In the process, we make important observations about the memory hierarchy of two of the most recent NVIDIA GPUs, which can be used in other optimization frameworks as well.
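The abstract describes the approach only at a high level. As a rough, hypothetical illustration of the idea (not code from the paper), the sketch below times a simple contraction-like CUDA kernel under two candidate memory-access layouts across a small grid of problem shapes and records which variant is faster for each shape. The kernel, function names, shapes, and the two layout variants are illustrative assumptions standing in for the paper's expressions, implementations, and architectural features.

```cuda
// Minimal parameterized micro-benchmark sketch (hypothetical, for illustration).
#include <cstdio>
#include <cuda_runtime.h>

// Variant 0: consecutive threads read consecutive elements of A (coalesced).
// Variant 1: consecutive threads read elements of A that are k apart (strided).
__global__ void contractLike(const float* A, const float* B, float* C,
                             int n, int k, int variant)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < k; ++j) {
        float a = (variant == 0) ? A[j * n + i] : A[i * k + j];
        acc += a * B[j];
    }
    C[i] = acc;
}

// Time one implementation variant for one problem shape.
static float timeVariant(const float* dA, const float* dB, float* dC,
                         int n, int k, int variant)
{
    dim3 block(256), grid((n + 255) / 256);
    // Untimed warm-up launch so one-time overheads do not bias the measurement.
    contractLike<<<grid, block>>>(dA, dB, dC, n, k, variant);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    contractLike<<<grid, block>>>(dA, dB, dC, n, k, variant);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    // Small parameter space: a few (n, k) shapes standing in for the different
    // tensor-contraction instances an application might contain.
    int ns[] = {1 << 14, 1 << 16};
    int ks[] = {64, 512};
    for (int n : ns) {
        for (int k : ks) {
            float *dA, *dB, *dC;
            cudaMalloc(&dA, (size_t)n * k * sizeof(float));
            cudaMalloc(&dB, (size_t)k * sizeof(float));
            cudaMalloc(&dC, (size_t)n * sizeof(float));
            float t0 = timeVariant(dA, dB, dC, n, k, 0);
            float t1 = timeVariant(dA, dB, dC, n, k, 1);
            printf("n=%d k=%d  coalesced=%.3f ms  strided=%.3f ms  -> variant %d\n",
                   n, k, t0, t1, (t0 <= t1) ? 0 : 1);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
        }
    }
    return 0;
}
```

In the spirit of the paper, the measured preferences over this small parameter space would then guide the choice of implementation for each real kernel instantiation, rather than empirically searching every kernel in the application.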

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1239508
Report Number(s):
PNNL-SA-86263
Resource Relation:
Conference: CF 2012: Proceedings of the 9th Conference on Computing Frontiers, May 15-17, 2012, Cagliari, Italy, 213-222
Country of Publication:
United States
Language:
English

Similar Records

FPGA-based HPC accelerators: An evaluation on performance and energy efficiency
Journal Article · August 22, 2021 · Concurrency and Computation: Practice and Experience

Closeout Report for DE-SC0018121
Technical Report · April 28, 2023

A Generalized Framework for Auto-tuning Stencil Computations
Conference · August 24, 2009

Related Subjects