OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

Abstract

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating-point computations.

Authors:
Williams, Samuel; Waterman, Andrew; Patterson, David
Publication Date:
2009-02-01
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
Computational Research Division
OSTI Identifier:
963540
Report Number(s):
LBNL-2141E
TRN: US200918%%382
DOE Contract Number:  
DE-AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
Communications of the Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; ARCHITECTS; PERFORMANCE; PARALLEL PROCESSING

Citation Formats

Williams, Samuel, Waterman, Andrew, and Patterson, David. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. United States: N. p., 2009. Web. doi:10.1145/1498765.1498785.
Williams, Samuel, Waterman, Andrew, & Patterson, David (2009). Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. United States. https://doi.org/10.1145/1498765.1498785
Williams, Samuel, Waterman, Andrew, and Patterson, David. 2009. "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures". United States. https://doi.org/10.1145/1498765.1498785. https://www.osti.gov/servlets/purl/963540.
@article{osti_963540,
title = {Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures},
author = {Williams, Samuel and Waterman, Andrew and Patterson, David},
abstractNote = {We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.},
doi = {10.1145/1498765.1498785},
url = {https://www.osti.gov/biblio/963540},
journal = {Communications of the Association for Computing Machinery},
place = {United States},
year = {2009},
month = {2}
}

Works referenced in this record:

Validity of the single processor approach to achieving large scale computing capabilities
conference, January 1967


A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1
conference, January 1994


Estimating interlock and improving balance for pipelined architectures
journal, August 1988


Improving the ratio of memory operations to floating-point operations in loops
journal, November 1994


Self-Adapting Linear Algebra Algorithms and Software
journal, February 2005


Performance of Synchronized Iterative Processes in Multiprocessor Systems
journal, July 1982


The Design and Implementation of FFTW3
journal, February 2005


Mapping computational concepts to GPUs
conference, January 2005


Evaluating associativity in CPU caches
journal, January 1989


A Proof for the Queuing Formula: L = λ W
journal, June 1961


Latency lags bandwith
journal, October 2004


Analytic Queueing Network Models for Parallel Processing of Task Systems
journal, December 1986


A genetic algorithms approach to modeling the performance of memory-bound computations
conference, January 2007


Lattice Boltzmann simulation optimization on leading multicore platforms
conference, April 2008

  • Williams, Samuel; Carter, Jonathan; Oliker, Leonid
  • 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS)
  • https://doi.org/10.1109/IPDPS.2008.4536295

Optimization of sparse matrix-vector multiplication on emerging multicore platforms
conference, January 2007


The SPLASH-2 programs: characterization and methodological considerations
conference, January 1995


Works referencing / citing this record:

Evaluating automatically parallelized versions of the support vector machine
journal, October 2014


Towards generating efficient flow solvers with the ExaStencils approach
journal, May 2017


Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
journal, March 2017

  • Calore, Enrico; Gabbana, Alessandro; Schifano, Sebastiano Fabio
  • Concurrency and Computation: Practice and Experience, Vol. 29, Issue 12
  • https://doi.org/10.1002/cpe.4143

An efficient low-rank Kalman filter for modern SIMD architectures
journal, April 2018


AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
journal, July 2018

  • Coronado-Barrientos, E.; Indalecio, G.; García-Loureiro, A.
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 1
  • https://doi.org/10.1002/cpe.4864

Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs
journal, August 2018

  • Carrijo Nasciutti, Thiago; Panetta, Jairo; Pais Lopes, Pedro
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 18
  • https://doi.org/10.1002/cpe.4929

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution
journal, August 2018


Roofline analysis with Cray performance analysis tools (CrayPat) and roofline‐based performance projections for a future architecture
journal, September 2018


High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor
journal, November 2018


Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system
journal, November 2019


Use of model-based architecture attributes to construct a component-level trade space
journal, February 2019


LRnLA Algorithm ConeFold with Non-local Vectorization for LBM Implementation
book, December 2018


Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography
book, December 2018


DSL-Based Acceleration of Automotive Environment Perception and Mapping Algorithms for Embedded CPUs, GPUs, and FPGAs
book, January 2019


GPU Implementation of ConeTorre Algorithm for Fluid Dynamics Simulation
book, July 2019

  • Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
  • Parallel Computing Technologies: 15th International Conference, PaCT 2019, Almaty, Kazakhstan, August 19–23, 2019, Proceedings, p. 199-213
  • https://doi.org/10.1007/978-3-030-25636-4_16

LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU
book, August 2019

  • Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
  • Parallel Computational Technologies: 13th International Conference, PCT 2019, Kaliningrad, Russia, April 2–4, 2019, Revised Selected Papers, p. 139-151
  • https://doi.org/10.1007/978-3-030-28163-2_10

Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL
book, October 2016


Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
book, May 2017


A High-Throughput Kalman Filter for Modern SIMD Architectures
book, January 2018


Approximate FPGA-Based LSTMs Under Computation Time Constraints
book, January 2018


On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors
book, January 2018


Software Design Space Exploration for Exascale Combustion Co-design
book, January 2013


How Many Threads will be too Many? On the Scalability of OpenMP Implementations
book, January 2015


Measuring energy consumption using EML (energy measurement library)
journal, July 2014


Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures
journal, November 2016


GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
journal, October 2016


Type-Driven Automated Program Transformations and Cost Modelling for Optimising Streaming Programs on FPGAs
journal, April 2018


3DyRM: a dynamic roofline model including memory latency information
journal, March 2014


Optimization of parallel iterated local search algorithms on graphics processing unit
journal, May 2016


The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes
journal, June 2018


Efficient scheduling of streams on GPGPUs
journal, February 2020


High performance FDTD algorithm for GPGPU supercomputers
journal, October 2016


Ultrafast analysis of individual grain behavior during grain growth by parallel computing
journal, August 2015


A real-time, all-sky, high time resolution, direct imager for the long wavelength array
journal, May 2019


Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
journal, October 2019


Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation
journal, May 2018


Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
conference, January 2015


Optimizing Sparse Matrix—Matrix Multiplication for the GPU
journal, October 2015


Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
conference, January 2015


Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model
conference, January 2015


Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
conference, January 2015

  • Hoefler, Torsten; Belli, Roberto
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
  • https://doi.org/10.1145/2807591.2807644

Harnessing energy efficiency of heterogeneous-ISA platforms
conference, January 2015


Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance
conference, January 2015


Variation Among Processors Under Turbo Boost in HPC Systems
conference, January 2016


Parallel Memory-Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells
journal, January 2017


Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks
conference, November 2016

  • Zhang, Chen; Fang, Zhenman; Zhou, Peipei
  • Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD '16)
  • https://doi.org/10.1145/2966986.2967011

Resource Conscious Reuse-Driven Tiling for GPUs
conference, January 2016

  • Rawat, Prashant Singh; Hong, Changwan; Ravishankar, Mahesh
  • Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16
  • https://doi.org/10.1145/2967938.2967967

Data-Centric Computing Frontiers: A Survey On Processing-In-Memory
conference, October 2016

  • Siegl, Patrick; Buchty, Rainer; Berekovic, Mladen
  • Proceedings of the Second International Symposium on Memory Systems (MEMSYS '16)
  • https://doi.org/10.1145/2989081.2989087

Sparse Matrix-Vector Multiplication on GPGPUs
journal, January 2017


FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
conference, January 2017

  • Umuroglu, Yaman; Fraser, Nicholas J.; Gambardella, Giulio
  • Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17
  • https://doi.org/10.1145/3020078.3021744

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
conference, June 2017

  • Xiao, Qingcheng; Liang, Yun; Lu, Liqiang
  • Proceedings of the 54th Annual Design Automation Conference 2017 (DAC '17)
  • https://doi.org/10.1145/3061639.3062244

A Survey of Power and Energy Predictive Models in HPC Systems and Applications
journal, October 2017


In-Datacenter Performance Analysis of a Tensor Processing Unit
conference, January 2017


In-Datacenter Performance Analysis of a Tensor Processing Unit
journal, June 2017


Design of a High-Performance GEMM-like Tensor–Tensor Multiplication
journal, April 2018


Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
journal, July 2018


A Survey on Compiler Autotuning using Machine Learning
journal, January 2019


Efficient sparse-matrix multi-vector product on GPUs
conference, January 2018

  • Hong, Changwan; Sadayappan, P.; Sukumaran-Rajam, Aravind
  • Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
  • https://doi.org/10.1145/3208040.3208062

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
journal, December 2018

  • Blott, Michaela; Preußer, Thomas B.; Fraser, Nicholas J.
  • ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, Issue 3
  • https://doi.org/10.1145/3242897

In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms
journal, April 2019


Metric Selection for GPU Kernel Classification
journal, January 2019

  • Shekofteh, S. -Kazem; Noori, Hamid; Naghibzadeh, Mahmoud
  • ACM Transactions on Architecture and Code Optimization, Vol. 15, Issue 4
  • https://doi.org/10.1145/3295690

Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators
journal, August 2019


On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
conference, January 2020

  • Helm, Christian; Taura, Kenjiro
  • Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia2020)
  • https://doi.org/10.1145/3368474.3368476

Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC
journal, March 2019


ExaSAT: An exascale co-design tool for performance modeling
journal, April 2014


Modeling high-throughput applications for in situ analytics
journal, May 2019


Analytic performance modeling and analysis of detailed neuron simulations
journal, April 2020


Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
journal, November 2012


Data Management in Machine Learning Systems
journal, February 2019


Lagrange-Flux Schemes: Reformulating Second-Order Accurate Lagrange-Remap Schemes for Better Node-Based HPC Performance
journal, November 2016


Compression Challenges in Large Scale Partial Differential Equation Solvers
journal, September 2019


DiamondTorre Algorithm for High-Performance Wave Modeling
journal, August 2016


An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution
journal, March 2019


Developing Efficient Discrete Simulations on Multicore and GPU Architectures
journal, January 2020


Fog vs. Cloud Computing: Should I Stay or Should I Go?
journal, February 2019


A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture
journal, December 2018


CPMIP: measurements of real computational performance of Earth system models in CMIP6
journal, January 2017


Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0
journal, January 2018


Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation
text, January 2021


Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
text, January 2020


Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
journal, January 2019


Harnessing Energy Efficiency of Heterogeneous-ISA Platforms
journal, January 2016


Ultrafast analysis of individual grain behavior during grain growth by parallel computing
text, January 2015


FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
text, January 2016


A Survey on Compiler Autotuning using Machine Learning
text, January 2018


Performance optimization and modeling of fine-grained irregular communication in UPC
text, January 2019


In situ and in-transit analysis of cosmological simulations
journal, August 2016


Characterizing Task-Based OpenMP Programs
journal, April 2015