OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications

Conference

Auto-tuning has emerged as an important practical method for creating highly optimized implementations of key computational kernels and applications. However, the growing complexity of architectures and applications is creating new challenges for auto-tuning. Complex applications can involve a prohibitively large search space that precludes empirical auto-tuning. Similarly, architectures are becoming increasingly complicated, making performance hard to model. In this paper, we focus on the challenge to auto-tuning presented by applications with a large number of kernels and kernel instantiations. While these kernels may share a broadly similar pattern, they differ considerably in problem sizes and in the exact computation performed. We propose and evaluate a new approach to auto-tuning, which we refer to as parameterized micro-benchmarking. It is an alternative to the two existing classes of auto-tuning approaches: analytical model-based and empirical search-based. In particular, we argue that the former may not capture all the architectural features that impact performance, whereas the latter can be too expensive for an application with many different kernels. In our approach, the different expressions in the application, the possible implementations of each expression, and the key architectural features are used to derive a simple micro-benchmark and a small parameter space. This allows us to learn the most significant features of the architecture that affect the choice of implementation for each kernel. We have evaluated our approach in the context of GPU implementations of tensor contraction expressions encountered in excited-state calculations in quantum chemistry. We focused on two aspects of GPUs that affect tensor contraction execution: memory access patterns and kernel consolidation. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2× over a version that used default optimizations but no auto-tuning. We demonstrate that observations made from micro-benchmarks match the behavior seen from real expressions. In the process, we make important observations about the memory hierarchy of two of the most recent NVIDIA GPUs, which can be used in other optimization frameworks as well.
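The abstract describes the approach only at a high level. As a rough, hypothetical illustration of the idea (not code from the paper), the sketch below times a simple contraction-like CUDA kernel under two candidate memory-access layouts across a small grid of problem shapes and records which variant is faster for each shape. The kernel, function names, shapes, and the two layout variants are illustrative assumptions standing in for the paper's expressions, implementations, and architectural features.

```cuda
// Minimal parameterized micro-benchmark sketch (hypothetical, for illustration).
#include <cstdio>
#include <cuda_runtime.h>

// Variant 0: consecutive threads read consecutive elements of A (coalesced).
// Variant 1: consecutive threads read elements of A that are k apart (strided).
__global__ void contractLike(const float* A, const float* B, float* C,
                             int n, int k, int variant)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < k; ++j) {
        float a = (variant == 0) ? A[j * n + i] : A[i * k + j];
        acc += a * B[j];
    }
    C[i] = acc;
}

// Time one implementation variant for one problem shape.
static float timeVariant(const float* dA, const float* dB, float* dC,
                         int n, int k, int variant)
{
    dim3 block(256), grid((n + 255) / 256);
    // Untimed warm-up launch so one-time overheads do not bias the measurement.
    contractLike<<<grid, block>>>(dA, dB, dC, n, k, variant);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    contractLike<<<grid, block>>>(dA, dB, dC, n, k, variant);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    // Small parameter space: a few (n, k) shapes standing in for the different
    // tensor-contraction instances an application might contain.
    int ns[] = {1 << 14, 1 << 16};
    int ks[] = {64, 512};
    for (int n : ns) {
        for (int k : ks) {
            float *dA, *dB, *dC;
            cudaMalloc(&dA, (size_t)n * k * sizeof(float));
            cudaMalloc(&dB, (size_t)k * sizeof(float));
            cudaMalloc(&dC, (size_t)n * sizeof(float));
            float t0 = timeVariant(dA, dB, dC, n, k, 0);
            float t1 = timeVariant(dA, dB, dC, n, k, 1);
            printf("n=%d k=%d  coalesced=%.3f ms  strided=%.3f ms  -> variant %d\n",
                   n, k, t0, t1, (t0 <= t1) ? 0 : 1);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
        }
    }
    return 0;
}
```

In the spirit of the paper, the measured preferences over this small parameter space would then guide the choice of implementation for each real kernel instantiation, rather than empirically searching every kernel in the application.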

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1239508
Report Number(s):
PNNL-SA-86263
Resource Relation:
Conference: CF 2012: Proceedings of the 9th Conference on Computing Frontiers, May 15-17, 2012, Cagliari, Italy, 213-222
Country of Publication:
United States
Language:
English

Similar Records

FPGA-based HPC accelerators: An evaluation on performance and energy efficiency
Journal Article · August 22, 2021 · Concurrency and Computation: Practice and Experience

Closeout Report for DE-SC0018121
Technical Report · April 28, 2023

A Generalized Framework for Auto-tuning Stencil Computations
Conference · August 24, 2009

Related Subjects