Roofline: an insightful visual performance model for multicore architectures

Williams, Samuel; Waterman, Andrew; Patterson, David

doi:10.1145/1498765.1498785

Journal Article · Sat Apr 04 00:00:00 EDT 2009 · Communications of the ACM

DOI: https://doi.org/10.1145/1498765.1498785 · OSTI ID: 1407073

Williams, Samuel [1]; Waterman, Andrew [1]; Patterson, David [1]

[1] Univ. of California, Berkeley, CA (United States). Parallel Computing Lab.

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.
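
For readers unfamiliar with the model named in the abstract, here is a minimal sketch of the bound a Roofline chart plots: attainable floating-point performance is the smaller of the machine's peak floating-point rate and its peak memory bandwidth multiplied by the kernel's operational intensity (flops per byte of DRAM traffic). The function names and the machine figures below are illustrative assumptions, not values taken from this record.

```python
def attainable_gflops(peak_gflops: float,
                      peak_bandwidth_gbs: float,
                      operational_intensity: float) -> float:
    """Roofline bound in GFLOP/s.

    peak_gflops            -- compute roof (GFLOP/s)
    peak_bandwidth_gbs     -- memory roof slope (GB/s)
    operational_intensity  -- flops per byte of DRAM traffic
    """
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def ridge_point(peak_gflops: float, peak_bandwidth_gbs: float) -> float:
    """Operational intensity at which the memory roof meets the compute roof."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 100.0
PEAK_BANDWIDTH_GBS = 20.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    for oi in (0.25, 1.0, 5.0, 16.0):
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, oi)
        regime = "memory-bound" if oi < ridge else "compute-bound"
        print(f"OI={oi:5.2f} flops/byte  bound={bound:6.1f} GFLOP/s  ({regime})")
```

Kernels whose operational intensity lies to the left of the ridge point are limited by memory bandwidth; those to the right are limited by peak compute.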

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC02-05CH11231

OSTI ID: 1407073

Journal Information: Communications of the ACM, Vol. 52, Issue 4; ISSN 0001-0782

Publisher: Association for Computing Machinery

Country of Publication: United States

Language: English

Citation Metrics: cited by 1138 works (citation information provided by Web of Science)

Related Subjects

97 MATHEMATICS AND COMPUTING
