OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

Journal Article · Communications of the Association for Computing Machinery

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
Computational Research Division
DOE Contract Number:
DE-AC02-05CH11231
OSTI ID:
963540
Report Number(s):
LBNL-2141E; TRN: US200918%%382
Journal Information:
Journal Name: Communications of the Association for Computing Machinery
Country of Publication:
United States
Language:
English

References (17)

Validity of the single processor approach to achieving large scale computing capabilities conference January 1967
A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1 conference January 1994
Estimating interlock and improving balance for pipelined architectures journal August 1988
Improving the ratio of memory operations to floating-point operations in loops journal November 1994
Self-Adapting Linear Algebra Algorithms and Software journal February 2005
Performance of Synchronized Iterative Processes in Multiprocessor Systems journal July 1982
The Design and Implementation of FFTW3 journal February 2005
Mapping computational concepts to GPUs conference January 2005
Amdahl's Law in the Multicore Era journal July 2008
Evaluating associativity in CPU caches journal January 1989
A Proof for the Queuing Formula: L = λ W journal June 1961
Latency lags bandwith journal October 2004
Analytic Queueing Network Models for Parallel Processing of Task Systems journal December 1986
A genetic algorithms approach to modeling the performance of memory-bound computations conference January 2007
Lattice Boltzmann simulation optimization on leading multicore platforms
  • Williams, Samuel; Carter, Jonathan; Oliker, Leonid
  • 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS) https://doi.org/10.1109/IPDPS.2008.4536295
conference April 2008
Optimization of sparse matrix-vector multiplication on emerging multicore platforms conference January 2007
The SPLASH-2 programs: characterization and methodological considerations conference January 1995

Cited By (98)

Evaluating automatically parallelized versions of the support vector machine journal October 2014
Towards generating efficient flow solvers with the ExaStencils approach journal May 2017
Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
  • Calore, Enrico; Gabbana, Alessandro; Schifano, Sebastiano Fabio
  • Concurrency and Computation: Practice and Experience, Vol. 29, Issue 12 https://doi.org/10.1002/cpe.4143
journal March 2017
An efficient low-rank Kalman filter for modern SIMD architectures journal April 2018
AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
  • Coronado-Barrientos, E.; Indalecio, G.; García-Loureiro, A.
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 1 https://doi.org/10.1002/cpe.4864
journal July 2018
Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs
  • Carrijo Nasciutti, Thiago; Panetta, Jairo; Pais Lopes, Pedro
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 18 https://doi.org/10.1002/cpe.4929
journal August 2018
Bulk execution of the dynamic programming for the optimal polygon triangulation problem on the GPU journal September 2018
Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution journal August 2018
Roofline analysis with Cray performance analysis tools (CrayPat) and roofline‐based performance projections for a future architecture journal September 2018
High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor journal November 2018
Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system journal November 2019
Use of model-based architecture attributes to construct a component-level trade space journal February 2019
LRnLA Algorithm ConeFold with Non-local Vectorization for LBM Implementation book December 2018
Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography book December 2018
DSL-Based Acceleration of Automotive Environment Perception and Mapping Algorithms for Embedded CPUs, GPUs, and FPGAs book January 2019
GPU Implementation of ConeTorre Algorithm for Fluid Dynamics Simulation
  • Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
  • Parallel Computing Technologies: 15th International Conference, PaCT 2019, Almaty, Kazakhstan, August 19–23, 2019, Proceedings, p. 199-213 https://doi.org/10.1007/978-3-030-25636-4_16
book July 2019
LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU
  • Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
  • Parallel Computational Technologies: 13th International Conference, PCT 2019, Kaliningrad, Russia, April 2–4, 2019, Revised Selected Papers, p. 139-151 https://doi.org/10.1007/978-3-030-28163-2_10
book August 2019
Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL book October 2016
Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels book May 2017
A High-Throughput Kalman Filter for Modern SIMD Architectures book January 2018
Approximate FPGA-Based LSTMs Under Computation Time Constraints book January 2018
On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors book January 2018
Software Design Space Exploration for Exascale Combustion Co-design book January 2013
How Many Threads will be too Many? On the Scalability of OpenMP Implementations book January 2015
Measuring energy consumption using EML (energy measurement library) journal July 2014
Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures journal November 2016
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems journal October 2016
Type-Driven Automated Program Transformations and Cost Modelling for Optimising Streaming Programs on FPGAs journal April 2018
3DyRM: a dynamic roofline model including memory latency information journal March 2014
Optimization of parallel iterated local search algorithms on graphics processing unit journal May 2016
The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes journal June 2018
Efficient scheduling of streams on GPGPUs journal February 2020
Development of a Parallel Explicit Finite-Volume Euler Equation Solver using the Immersed Boundary Method with Hybrid MPI-CUDA Paradigm journal October 2019
High performance FDTD algorithm for GPGPU supercomputers journal October 2016
Ultrafast analysis of individual grain behavior during grain growth by parallel computing journal August 2015
A real-time, all-sky, high time resolution, direct imager for the long wavelength array journal May 2019
Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields journal October 2019
Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation journal May 2018
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks conference January 2015
Optimizing Sparse Matrix—Matrix Multiplication for the GPU journal October 2015
Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications conference January 2015
Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model conference January 2015
Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
  • Hoefler, Torsten; Belli, Roberto
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15 https://doi.org/10.1145/2807591.2807644
conference January 2015
Harnessing energy efficiency of heterogeneous-ISA platforms conference January 2015
Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance conference January 2015
Variation Among Processors Under Turbo Boost in HPC Systems conference January 2016
Parallel Memory-Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells journal January 2017
Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks
  • Zhang, Chen; Fang, Zhenman; Zhou, Peipei
  • ICCAD '16: IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN, Proceedings of the 35th International Conference on Computer-Aided Design https://doi.org/10.1145/2966986.2967011
conference November 2016
Resource Conscious Reuse-Driven Tiling for GPUs
  • Rawat, Prashant Singh; Hong, Changwan; Ravishankar, Mahesh
  • Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16 https://doi.org/10.1145/2967938.2967967
conference January 2016
Data-Centric Computing Frontiers: A Survey On Processing-In-Memory
  • Siegl, Patrick; Buchty, Rainer; Berekovic, Mladen
  • MEMSYS '16: The Second International Symposium on Memory Systems, Proceedings of the Second International Symposium on Memory Systems https://doi.org/10.1145/2989081.2989087
conference October 2016
Sparse Matrix-Vector Multiplication on GPGPUs journal January 2017
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
  • Umuroglu, Yaman; Fraser, Nicholas J.; Gambardella, Giulio
  • Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17 https://doi.org/10.1145/3020078.3021744
conference January 2017
Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
  • Xiao, Qingcheng; Liang, Yun; Lu, Liqiang
  • DAC '17: The 54th Annual Design Automation Conference 2017, Proceedings of the 54th Annual Design Automation Conference 2017 https://doi.org/10.1145/3061639.3062244
conference June 2017
A Survey of Power and Energy Predictive Models in HPC Systems and Applications journal October 2017
In-Datacenter Performance Analysis of a Tensor Processing Unit conference January 2017
In-Datacenter Performance Analysis of a Tensor Processing Unit journal June 2017
Design of a High-Performance GEMM-like Tensor–Tensor Multiplication journal April 2018
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions journal July 2018
A Survey on Compiler Autotuning using Machine Learning journal January 2019
Efficient sparse-matrix multi-vector product on GPUs
  • Hong, Changwan; Sadayappan, P.; Sukumaran-Rajam, Aravind
  • Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18 https://doi.org/10.1145/3208040.3208062
conference January 2018
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
  • Blott, Michaela; Preußer, Thomas B.; Fraser, Nicholas J.
  • ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, Issue 3 https://doi.org/10.1145/3242897
journal December 2018
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms journal April 2019
Metric Selection for GPU Kernel Classification
  • Shekofteh, S. -Kazem; Noori, Hamid; Naghibzadeh, Mahmoud
  • ACM Transactions on Architecture and Code Optimization, Vol. 15, Issue 4 https://doi.org/10.1145/3295690
journal January 2019
Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators journal August 2019
On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
  • Helm, Christian; Taura, Kenjiro
  • HPCAsia2020: International Conference on High Performance Computing in Asia-Pacific Region, Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region https://doi.org/10.1145/3368474.3368476
conference January 2020
Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC journal March 2019
ExaSAT: An exascale co-design tool for performance modeling journal April 2014
Modeling high-throughput applications for in situ analytics journal May 2019
Analytic performance modeling and analysis of detailed neuron simulations journal April 2020
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU) journal November 2012
Data Management in Machine Learning Systems journal February 2019
Lagrange-Flux Schemes: Reformulating Second-Order Accurate Lagrange-Remap Schemes for Better Node-Based HPC Performance journal November 2016
Compression Challenges in Large Scale Partial Differential Equation Solvers journal September 2019
DiamondTorre Algorithm for High-Performance Wave Modeling journal August 2016
An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution journal March 2019
Developing Efficient Discrete Simulations on Multicore and GPU Architectures journal January 2020
Fog vs. Cloud Computing: Should I Stay or Should I Go? journal February 2019
A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture journal December 2018
CPMIP: measurements of real computational performance of Earth system models in CMIP6 journal January 2017
Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0 journal January 2018
Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0) journal January 2018
Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration posted_content January 2018
Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation text January 2021
Co-design of a Particle-in-Cell Plasma Simulation Code for Intel Xeon Phi: a First Look at Knights Landing text January 2016
Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields text January 2020
Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration journal January 2019
Harnessing Energy Efficiency of Heterogeneous-ISA Platforms journal January 2016
Ultrafast analysis of individual grain behavior during grain growth by parallel computing text January 2015
Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model text January 2014
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems text January 2015
Co-design of a particle-in-cell plasma simulation code for Intel Xeon Phi: a first look at Knights Landing preprint January 2016
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference text January 2016
A Survey on Compiler Autotuning using Machine Learning text January 2018
Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration text January 2018
A Real-Time, All-Sky, High Time Resolution, Direct Imager for the Long Wavelength Array text January 2019
Performance optimization and modeling of fine-grained irregular communication in UPC text January 2019
In situ and in-transit analysis of cosmological simulations journal August 2016
Characterizing Task-Based OpenMP Programs journal April 2015