Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

Williams, Samuel; Waterman, Andrew; Patterson, David

doi:10.1145/1498765.1498785

Journal Article · February 1, 2009 · Communications of the Association for Computing Machinery

DOI: https://doi.org/10.1145/1498765.1498785 · OSTI ID: 963540

We propose an easy-to-understand, visual performance model that gives programmers and architects insight into improving parallel software and hardware for floating-point computations.
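
The bound at the heart of the model can be sketched in a few lines. This is a minimal illustration, not code from the article: the function names and the machine numbers in the usage note are hypothetical. It assumes the usual statement of the Roofline bound, in which attainable floating-point performance is the smaller of the peak floating-point rate and the product of peak memory bandwidth and operational intensity (FLOPs per byte of memory traffic).

```python
def attainable_gflops(peak_gflops: float,
                      peak_bandwidth_gbs: float,
                      operational_intensity: float) -> float:
    """Upper bound on GFLOP/s for a kernel under the Roofline model.

    peak_gflops           -- peak floating-point rate of the machine (GFLOP/s)
    peak_bandwidth_gbs    -- peak memory bandwidth (GB/s)
    operational_intensity -- FLOPs performed per byte of memory traffic
    """
    if operational_intensity < 0:
        raise ValueError("operational intensity must be non-negative")
    # The flat part of the roof is the compute limit; the slanted part
    # is bandwidth multiplied by operational intensity.
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def ridge_point(peak_gflops: float, peak_bandwidth_gbs: float) -> float:
    """Operational intensity at which the two limits meet (FLOPs/byte)."""
    return peak_gflops / peak_bandwidth_gbs
```

For a hypothetical machine with a 100 GFLOP/s peak and 20 GB/s of memory bandwidth, a kernel at 0.5 FLOPs/byte is bounded by memory bandwidth, while one at 8 FLOPs/byte is bounded by the compute peak. The ridge point sits at 5 FLOPs/byte.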

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: Computational Research Division

DOE Contract Number: DE-AC02-05CH11231

OSTI ID: 963540

Report Number(s): LBNL-2141E; TRN: US200918%%382

Journal Information: Communications of the Association for Computing Machinery

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
ARCHITECTS
PERFORMANCE
PARALLEL PROCESSING