Instruction Roofline: An insightful visual performance model for GPUs
The Roofline performance model provides an intuitive approach to identifying performance bottlenecks and guiding performance optimization. However, the classic FLOP-centric approach is inappropriate for emerging applications that perform more integer than floating-point operations. In this article, we reintroduce our Instruction Roofline Model for NVIDIA GPUs and expand our evaluation of it. The Instruction Roofline incorporates instructions and memory transactions across all levels of the memory hierarchy, and it provides performance insights beyond those of the FLOP-oriented Roofline Model: instruction throughput, strided memory access patterns, bank conflicts, and thread predication. We use our Instruction Roofline methodology to analyze eight proxy applications: HPGMG from AMReX, Matrix Transpose benchmarks, ADEPT from MetaHipMer's sequence-alignment phase, EXTENSION from MetaHipMer's local-assembly phase, CUSP, cuSPARSE, cudaTensorCoreGemm, and cuBLAS. We demonstrate the ability of our methodology to explain various aspects of performance and performance bottlenecks on NVIDIA GPUs and to motivate code optimizations.
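The core of any Roofline-style model is a min of a compute ceiling and a bandwidth ceiling scaled by intensity; in the Instruction Roofline, throughput is measured in warp-level GIPS and intensity in warp instructions per memory transaction rather than FLOP/s and FLOP/byte. A minimal sketch of that bound follows; the V100-like peak numbers are illustrative assumptions, not figures from this record.

```python
def instruction_roofline(peak_gips, peak_gtxn_per_s, intensity):
    """Attainable warp-instruction throughput (GIPS).

    peak_gips       -- compute ceiling, warp-level giga-instructions/s
    peak_gtxn_per_s -- memory ceiling, giga-transactions/s for one
                       level of the memory hierarchy
    intensity       -- instruction intensity: warp instructions issued
                       per memory transaction at that level
    """
    # The kernel is bound either by instruction issue or by how fast
    # the memory level can serve transactions at this intensity.
    return min(peak_gips, intensity * peak_gtxn_per_s)


# Illustrative (assumed) ceilings loosely modeled on a V100-class GPU:
# ~489.6 warp GIPS compute peak, ~28.1 HBM giga-transactions/s.
PEAK_GIPS = 489.6
PEAK_HBM_GTXN = 28.1

low = instruction_roofline(PEAK_GIPS, PEAK_HBM_GTXN, 1.0)    # memory-bound
high = instruction_roofline(PEAK_GIPS, PEAK_HBM_GTXN, 100.0) # compute-bound
```

Plotting this bound against measured (intensity, GIPS) points per memory level is what places a kernel under the appropriate ceiling and exposes, for example, strided access (low transaction efficiency) or bank conflicts (inflated shared-memory transactions).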
- Research Organization:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1844927
- Resource Relation:
- Conference: Concurrency and Computation: Practice and Experience
- Country of Publication:
- United States
- Language:
- English
Similar Records
GPU-acceleration of the ELPA2 distributed eigensolver for dense symmetric and hermitian eigenproblems
An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability