DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Roofline: an insightful visual performance model for multicore architectures

Abstract

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.

Authors:
 Williams, Samuel [1]; Waterman, Andrew [1]; Patterson, David [1]
  1. Univ. of California, Berkeley, CA (United States). Parallel Computing Lab.
Publication Date:
2009-04-04
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1407073
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Communications of the ACM
Additional Journal Information:
Journal Volume: 52; Journal Issue: 4; Journal ID: ISSN 0001-0782
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Williams, Samuel, Waterman, Andrew, and Patterson, David. Roofline: an insightful visual performance model for multicore architectures. United States: N. p., 2009. Web. doi:10.1145/1498765.1498785.
Williams, Samuel, Waterman, Andrew, & Patterson, David. (2009). Roofline: an insightful visual performance model for multicore architectures. United States. https://doi.org/10.1145/1498765.1498785
Williams, Samuel, Waterman, Andrew, and Patterson, David. 2009. "Roofline: an insightful visual performance model for multicore architectures". United States. https://doi.org/10.1145/1498765.1498785. https://www.osti.gov/servlets/purl/1407073.
@article{osti_1407073,
title = {Roofline: an insightful visual performance model for multicore architectures},
author = {Williams, Samuel and Waterman, Andrew and Patterson, David},
abstractNote = {We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.},
doi = {10.1145/1498765.1498785},
journal = {Communications of the ACM},
number = 4,
volume = 52,
place = {United States},
year = {2009},
month = {4}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 1138 works
Citation information provided by
Web of Science

Works referenced in this record:

Validity of the single processor approach to achieving large scale computing capabilities
conference, January 1967

  • Amdahl, Gene M.
  • Proceedings of the April 18-20, 1967, spring joint computer conference on - AFIPS '67 (Spring)
  • DOI: 10.1145/1465482.1465560

A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1
conference, January 1994

  • Boyd, E. L.; Azeem, W.; Lee, Hsien-Hsin
  • 1994 International Conference on Parallel Processing Vol. 3
  • DOI: 10.1109/ICPP.1994.30

Estimating interlock and improving balance for pipelined architectures
journal, August 1988

  • Callahan, David; Cocke, John; Kennedy, Ken
  • Journal of Parallel and Distributed Computing, Vol. 5, Issue 4
  • DOI: 10.1016/0743-7315(88)90002-0

Improving the ratio of memory operations to floating-point operations in loops
journal, November 1994

  • Carr, Steve; Kennedy, Ken
  • ACM Transactions on Programming Languages and Systems, Vol. 16, Issue 6
  • DOI: 10.1145/197320.197366

Self-Adapting Linear Algebra Algorithms and Software
journal, February 2005


Performance of Synchronized Iterative Processes in Multiprocessor Systems
journal, July 1982

  • Dubois, M.; Briggs, F. A.
  • IEEE Transactions on Software Engineering, Vol. SE-8, Issue 4
  • DOI: 10.1109/TSE.1982.235576

The Design and Implementation of FFTW3
journal, February 2005


Mapping computational concepts to GPUs
conference, January 2005


Amdahl's Law in the Multicore Era
journal, July 2008


Evaluating associativity in CPU caches
journal, January 1989

  • Hill, M. D.; Smith, A. J.
  • IEEE Transactions on Computers, Vol. 38, Issue 12
  • DOI: 10.1109/12.40842

A Proof for the Queuing Formula: L = λ W
journal, June 1961


Latency lags bandwith
journal, October 2004


Analytic Queueing Network Models for Parallel Processing of Task Systems
journal, December 1986


A genetic algorithms approach to modeling the performance of memory-bound computations
conference, January 2007

  • Tikir, Mustafa M.; Carrington, Laura; Strohmaier, Erich
  • Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
  • DOI: 10.1145/1362622.1362686

Lattice Boltzmann simulation optimization on leading multicore platforms
conference, April 2008

  • Williams, Samuel; Carter, Jonathan; Oliker, Leonid
  • 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2008.4536295

Optimization of sparse matrix-vector multiplication on emerging multicore platforms
conference, January 2007

  • Williams, Samuel; Oliker, Leonid; Vuduc, Richard
  • Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
  • DOI: 10.1145/1362622.1362674

The SPLASH-2 programs: characterization and methodological considerations
conference, January 1995

  • Woo, Steven Cameron; Ohara, Moriyoshi; Torrie, Evan
  • Proceedings of the 22nd annual international symposium on Computer architecture - ISCA '95
  • DOI: 10.1145/223982.223990

Works referencing / citing this record:

Evaluating automatically parallelized versions of the support vector machine
journal, October 2014

  • Codreanu, Valeriu; Dröge, Bob; Williams, David
  • Concurrency and Computation: Practice and Experience, Vol. 28, Issue 7
  • DOI: 10.1002/cpe.3413

Towards generating efficient flow solvers with the ExaStencils approach
journal, May 2017

  • Kuckuk, Sebastian; Haase, Gundolf; Vasco, Diego A.
  • Concurrency and Computation: Practice and Experience, Vol. 29, Issue 17
  • DOI: 10.1002/cpe.4062

Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
journal, March 2017

  • Calore, Enrico; Gabbana, Alessandro; Schifano, Sebastiano Fabio
  • Concurrency and Computation: Practice and Experience, Vol. 29, Issue 12
  • DOI: 10.1002/cpe.4143

An efficient low-rank Kalman filter for modern SIMD architectures
journal, April 2018

  • Cámpora Pérez, Daniel Hugo; Awile, Omar
  • Concurrency and Computation: Practice and Experience, Vol. 30, Issue 23
  • DOI: 10.1002/cpe.4483

AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
journal, July 2018

  • Coronado-Barrientos, E.; Indalecio, G.; García-Loureiro, A.
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 1
  • DOI: 10.1002/cpe.4864

Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs
journal, August 2018

  • Carrijo Nasciutti, Thiago; Panetta, Jairo; Pais Lopes, Pedro
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 18
  • DOI: 10.1002/cpe.4929

Bulk execution of the dynamic programming for the optimal polygon triangulation problem on the GPU
journal, September 2018

  • Yamashita, Kohei; Ito, Yasuaki; Nakano, Koji
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 19
  • DOI: 10.1002/cpe.4947

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution
journal, August 2018

  • Reddy Manumachu, Ravi; Lastovetsky, Alexey L.
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 4
  • DOI: 10.1002/cpe.4958

Roofline analysis with Cray performance analysis tools (CrayPat) and roofline‐based performance projections for a future architecture
journal, September 2018

  • Kwack, JaeHyuk; Arnold, Galen; Mendes, Celso
  • Concurrency and Computation: Practice and Experience
  • DOI: 10.1002/cpe.4963

High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor
journal, November 2018

  • Robertsén, Fredrik; Mattila, Keijo; Westerholm, Jan
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 13
  • DOI: 10.1002/cpe.5072

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system
journal, November 2019

  • Yang, Charlene; Kurth, Thorsten; Williams, Samuel
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 20
  • DOI: 10.1002/cpe.5547

Use of model-based architecture attributes to construct a component-level trade space
journal, February 2019

  • McKean, David; Moreland, James D.; Doskey, Steven
  • Systems Engineering, Vol. 22, Issue 2
  • DOI: 10.1002/sys.21478

LRnLA Algorithm ConeFold with Non-local Vectorization for LBM Implementation
book, December 2018


Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography
book, December 2018


DSL-Based Acceleration of Automotive Environment Perception and Mapping Algorithms for Embedded CPUs, GPUs, and FPGAs
book, January 2019


GPU Implementation of ConeTorre Algorithm for Fluid Dynamics Simulation
book, July 2019

  • Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
  • Parallel Computing Technologies: 15th International Conference, PaCT 2019, Almaty, Kazakhstan, August 19–23, 2019, Proceedings, p. 199-213
  • DOI: 10.1007/978-3-030-25636-4_16

LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU
book, August 2019

  • Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
  • Parallel Computational Technologies: 13th International Conference, PCT 2019, Kaliningrad, Russia, April 2–4, 2019, Revised Selected Papers, p. 139-151
  • DOI: 10.1007/978-3-030-28163-2_10

Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL
book, October 2016


Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
book, May 2017


A High-Throughput Kalman Filter for Modern SIMD Architectures
book, January 2018

  • Cámpora Pérez, Daniel Hugo; Awile, Omar; Potterat, Cédric
  • Euro-Par 2017: Parallel Processing Workshops
  • DOI: 10.1007/978-3-319-75178-8_31

Approximate FPGA-Based LSTMs Under Computation Time Constraints
book, January 2018

  • Rizakis, Michalis; Venieris, Stylianos I.; Kouris, Alexandros
  • Applied Reconfigurable Computing. Architectures, Tools, and Applications
  • DOI: 10.1007/978-3-319-78890-6_1

On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors
book, January 2018


Software Design Space Exploration for Exascale Combustion Co-design
book, January 2013


How Many Threads will be too Many? On the Scalability of OpenMP Implementations
book, January 2015


Measuring energy consumption using EML (energy measurement library)
journal, July 2014

  • Cabrera, Alberto; Almeida, Francisco; Arteaga, Javier
  • Computer Science - Research and Development, Vol. 30, Issue 2
  • DOI: 10.1007/s00450-014-0269-5

Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures
journal, November 2016


GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
journal, October 2016

  • Kreutzer, Moritz; Thies, Jonas; Röhrig-Zöllner, Melven
  • International Journal of Parallel Programming, Vol. 45, Issue 5
  • DOI: 10.1007/s10766-016-0464-z

Type-Driven Automated Program Transformations and Cost Modelling for Optimising Streaming Programs on FPGAs
journal, April 2018

  • Vanderbauwhede, Wim; Nabi, Syed Waqar; Urlea, Cristian
  • International Journal of Parallel Programming, Vol. 47, Issue 1
  • DOI: 10.1007/s10766-018-0572-z

3DyRM: a dynamic roofline model including memory latency information
journal, March 2014

  • Lorenzo, O. G.; Pena, T. F.; Cabaleiro, J. C.
  • The Journal of Supercomputing, Vol. 70, Issue 2
  • DOI: 10.1007/s11227-014-1163-4

Optimization of parallel iterated local search algorithms on graphics processing unit
journal, May 2016


The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes
journal, June 2018

  • Perepelkina, Anastasia; Levchenko, Vadim; Khilkov, Sergey
  • The Journal of Supercomputing, Vol. 75, Issue 12
  • DOI: 10.1007/s11227-018-2461-z

Efficient scheduling of streams on GPGPUs
journal, February 2020

  • Beheshti Roui, Mohamad; Shekofteh, S. Kazem; Noori, Hamid
  • The Journal of Supercomputing, Vol. 76, Issue 11
  • DOI: 10.1007/s11227-020-03209-x

Development of a Parallel Explicit Finite-Volume Euler Equation Solver using the Immersed Boundary Method with Hybrid MPI-CUDA Paradigm
journal, October 2019

  • Kuo, F. A.; Chiang, C. H.; Lo, M. C.
  • Journal of Mechanics, Vol. 36, Issue 1
  • DOI: 10.1017/jmech.2019.9

High performance FDTD algorithm for GPGPU supercomputers
journal, October 2016


Ultrafast analysis of individual grain behavior during grain growth by parallel computing
journal, August 2015

  • Kühbach, M.; Barrales-Mora, L. A.; Mießen, C.
  • IOP Conference Series: Materials Science and Engineering, Vol. 89
  • DOI: 10.1088/1757-899x/89/1/012031

A real-time, all-sky, high time resolution, direct imager for the long wavelength array
journal, May 2019

  • Kent, James; Dowell, Jayce; Beardsley, Adam
  • Monthly Notices of the Royal Astronomical Society, Vol. 486, Issue 4
  • DOI: 10.1093/mnras/stz1206

Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
journal, October 2019

  • Kent, James; Beardsley, Adam P.; Bester, Landman
  • Monthly Notices of the Royal Astronomical Society, Vol. 491, Issue 1
  • DOI: 10.1093/mnras/stz3028

Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation
journal, May 2018


Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
conference, January 2015

  • Zhang, Chen; Li, Peng; Sun, Guangyu
  • Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15
  • DOI: 10.1145/2684746.2689060

Optimizing Sparse Matrix—Matrix Multiplication for the GPU
journal, October 2015

  • Dalton, Steven; Olson, Luke; Bell, Nathan
  • ACM Transactions on Mathematical Software, Vol. 41, Issue 4
  • DOI: 10.1145/2699470

Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
conference, January 2015

  • Wahib, Mohamed; Maruyama, Naoya
  • Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '15
  • DOI: 10.1145/2749246.2749255

Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model
conference, January 2015

  • Stengel, Holger; Treibig, Jan; Hager, Georg
  • Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
  • DOI: 10.1145/2751205.2751240

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
conference, January 2015

  • Hoefler, Torsten; Belli, Roberto
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
  • DOI: 10.1145/2807591.2807644

Harnessing energy efficiency of heterogeneous-ISA platforms
conference, January 2015

  • Bhat, Sharath K.; Saya, Ajithchandra; Rawat, Hemedra K.
  • Proceedings of the Workshop on Power-Aware Computing and Systems - HotPower '15
  • DOI: 10.1145/2818613.2818747

Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance
conference, January 2015

  • Ardalani, Newsha; Lestourgeon, Clint; Sankaralingam, Karthikeyan
  • Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48
  • DOI: 10.1145/2830772.2830780

Variation Among Processors Under Turbo Boost in HPC Systems
conference, January 2016

  • Acun, Bilge; Miller, Phil; Kale, Laxmikant V.
  • Proceedings of the 2016 International Conference on Supercomputing - ICS '16
  • DOI: 10.1145/2925426.2926289

Parallel Memory-Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells
journal, January 2017

  • Meister, Oliver; Rahnema, Kaveh; Bader, Michael
  • ACM Transactions on Mathematical Software, Vol. 43, Issue 3
  • DOI: 10.1145/2947668

Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks
conference, November 2016

  • Zhang, Chen; Fang, Zhenman; Zhou, Peipei
  • Proceedings of the 35th International Conference on Computer-Aided Design - ICCAD '16
  • DOI: 10.1145/2966986.2967011

Resource Conscious Reuse-Driven Tiling for GPUs
conference, January 2016

  • Rawat, Prashant Singh; Hong, Changwan; Ravishankar, Mahesh
  • Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16
  • DOI: 10.1145/2967938.2967967

Data-Centric Computing Frontiers: A Survey On Processing-In-Memory
conference, October 2016

  • Siegl, Patrick; Buchty, Rainer; Berekovic, Mladen
  • Proceedings of the Second International Symposium on Memory Systems - MEMSYS '16
  • DOI: 10.1145/2989081.2989087

Sparse Matrix-Vector Multiplication on GPGPUs
journal, January 2017

  • Filippone, Salvatore; Cardellini, Valeria; Barbieri, Davide
  • ACM Transactions on Mathematical Software, Vol. 43, Issue 4
  • DOI: 10.1145/3017994

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
conference, January 2017

  • Umuroglu, Yaman; Fraser, Nicholas J.; Gambardella, Giulio
  • Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17
  • DOI: 10.1145/3020078.3021744

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
conference, June 2017

  • Xiao, Qingcheng; Liang, Yun; Lu, Liqiang
  • Proceedings of the 54th Annual Design Automation Conference 2017 - DAC '17
  • DOI: 10.1145/3061639.3062244

A Survey of Power and Energy Predictive Models in HPC Systems and Applications
journal, October 2017

  • O’Brien, Kenneth; Pietri, Ilia; Reddy, Ravi
  • ACM Computing Surveys, Vol. 50, Issue 3
  • DOI: 10.1145/3078811

In-Datacenter Performance Analysis of a Tensor Processing Unit
conference, January 2017

  • Jouppi, Norman P.; Borchers, Al; Boyle, Rick
  • Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA '17
  • DOI: 10.1145/3079856.3080246

In-Datacenter Performance Analysis of a Tensor Processing Unit
journal, June 2017

  • Jouppi, Norman P.; Borchers, Al; Boyle, Rick
  • ACM SIGARCH Computer Architecture News, Vol. 45, Issue 2
  • DOI: 10.1145/3140659.3080246

Design of a High-Performance GEMM-like Tensor–Tensor Multiplication
journal, April 2018

  • Springer, Paul; Bientinesi, Paolo
  • ACM Transactions on Mathematical Software, Vol. 44, Issue 3
  • DOI: 10.1145/3157733

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
journal, July 2018

  • Venieris, Stylianos I.; Kouris, Alexandros; Bouganis, Christos-Savvas
  • ACM Computing Surveys, Vol. 51, Issue 3
  • DOI: 10.1145/3186332

A Survey on Compiler Autotuning using Machine Learning
journal, January 2019

  • Ashouri, Amir H.; Killian, William; Cavazos, John
  • ACM Computing Surveys, Vol. 51, Issue 5
  • DOI: 10.1145/3197978

Efficient sparse-matrix multi-vector product on GPUs
conference, January 2018

  • Hong, Changwan; Sadayappan, P.; Sukumaran-Rajam, Aravind
  • Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
  • DOI: 10.1145/3208040.3208062

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
journal, December 2018

  • Blott, Michaela; Preußer, Thomas B.; Fraser, Nicholas J.
  • ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, Issue 3
  • DOI: 10.1145/3242897

In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms
journal, April 2019

  • Choi, Young-Kyu; Cong, Jason; Fang, Zhenman
  • ACM Transactions on Reconfigurable Technology and Systems, Vol. 12, Issue 1
  • DOI: 10.1145/3294054

Metric Selection for GPU Kernel Classification
journal, January 2019

  • Shekofteh, S.-Kazem; Noori, Hamid; Naghibzadeh, Mahmoud
  • ACM Transactions on Architecture and Code Optimization, Vol. 15, Issue 4
  • DOI: 10.1145/3295690

Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators
journal, August 2019

  • Kronbichler, Martin; Kormann, Katharina
  • ACM Transactions on Mathematical Software, Vol. 45, Issue 3
  • DOI: 10.1145/3325864

On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
conference, January 2020

  • Helm, Christian; Taura, Kenjiro
  • Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region - HPCAsia2020
  • DOI: 10.1145/3368474.3368476

Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC
journal, March 2019

  • Lagravière, Jérémie; Langguth, Johannes; Prugger, Martina
  • Scientific Programming, Vol. 2019
  • DOI: 10.1155/2019/6825728

ExaSAT: An exascale co-design tool for performance modeling
journal, April 2014

  • Unat, Didem; Chan, Cy; Zhang, Weiqun
  • The International Journal of High Performance Computing Applications, Vol. 29, Issue 2
  • DOI: 10.1177/1094342014568690

Modeling high-throughput applications for in situ analytics
journal, May 2019

  • Aupy, Guillaume; Goglin, Brice; Honoré, Valentin
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 6
  • DOI: 10.1177/1094342019847263

Analytic performance modeling and analysis of detailed neuron simulations
journal, April 2020

  • Cremonesi, Francesco; Hager, Georg; Wellein, Gerhard
  • The International Journal of High Performance Computing Applications, Vol. 34, Issue 4
  • DOI: 10.1177/1094342020912528

Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
journal, November 2012


Data Management in Machine Learning Systems
journal, February 2019


Lagrange-Flux Schemes: Reformulating Second-Order Accurate Lagrange-Remap Schemes for Better Node-Based HPC Performance
journal, November 2016

  • De Vuyst, Florian; Gasc, Thibault; Motte, Renaud
  • Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles, Vol. 71, Issue 6
  • DOI: 10.2516/ogst/2016019

Compression Challenges in Large Scale Partial Differential Equation Solvers
journal, September 2019

  • Götschel, Sebastian; Weiser, Martin
  • Algorithms, Vol. 12, Issue 9
  • DOI: 10.3390/a12090197

DiamondTorre Algorithm for High-Performance Wave Modeling
journal, August 2016


An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution
journal, March 2019


Developing Efficient Discrete Simulations on Multicore and GPU Architectures
journal, January 2020

  • Cagigas-Muñiz, Daniel; Diaz-del-Rio, Fernando; López-Torres, Manuel Ramón
  • Electronics, Vol. 9, Issue 1
  • DOI: 10.3390/electronics9010189

Fog vs. Cloud Computing: Should I Stay or Should I Go?
journal, February 2019

  • Pisani, Flávia; Martins do Rosario, Vanderson; Borin, Edson
  • Future Internet, Vol. 11, Issue 2
  • DOI: 10.3390/fi11020034

A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture
journal, December 2018

  • Wan, Bo; Yang, Lin; Zhou, Shunping
  • ISPRS International Journal of Geo-Information, Vol. 7, Issue 12
  • DOI: 10.3390/ijgi7120472

CPMIP: measurements of real computational performance of Earth system models in CMIP6
journal, January 2017

  • Balaji, Venkatramani; Maisonnave, Eric; Zadeh, Niki
  • Geoscientific Model Development, Vol. 10, Issue 1
  • DOI: 10.5194/gmd-10-19-2017

Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0
journal, January 2018

  • Fuhrer, Oliver; Chadha, Tarun; Hoefler, Torsten
  • Geoscientific Model Development, Vol. 11, Issue 4
  • DOI: 10.5194/gmd-11-1665-2018

Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0)
journal, January 2018

  • Porter, Andrew R.; Appleyard, Jeremy; Ashworth, Mike
  • Geoscientific Model Development, Vol. 11, Issue 8
  • DOI: 10.5194/gmd-11-3447-2018

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
posted_content, January 2018

  • Louboutin, Mathias; Lange, Michael; Luporini, Fabio
  • Geoscientific Model Development Discussions
  • DOI: 10.5194/gmd-2018-189

Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation
text, January 2021


Co-design of a Particle-in-Cell Plasma Simulation Code for Intel Xeon Phi: a First Look at Knights Landing
text, January 2016


Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
text, January 2020

  • Kent, James; Beardsley, Ap; Bester, L.
  • Apollo - University of Cambridge Repository
  • DOI: 10.17863/cam.48304

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
journal, January 2019

  • Louboutin, Mathias; Lange, Michael; Luporini, Fabio
  • Geoscientific Model Development, Vol. 12, Issue 3
  • DOI: 10.5194/gmd-12-1165-2019

Harnessing Energy Efficiency of Heterogeneous-ISA Platforms
journal, January 2016

  • Bhat, Sharath K.; Saya, Ajithchandra; Rawat, Hemedra K.
  • ACM SIGOPS Operating Systems Review, Vol. 49, Issue 2
  • DOI: 10.1145/2883591.2883605

Ultrafast analysis of individual grain behavior during grain growth by parallel computing
text, January 2015


GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
text, January 2015


FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
text, January 2016


A Survey on Compiler Autotuning using Machine Learning
text, January 2018


Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
text, January 2018


A Real-Time, All-Sky, High Time Resolution, Direct Imager for the Long Wavelength Array
text, January 2019


Performance optimization and modeling of fine-grained irregular communication in UPC
text, January 2019


In situ and in-transit analysis of cosmological simulations
journal, August 2016

  • Friesen, Brian; Almgren, Ann; Lukić, Zarija
  • Computational Astrophysics and Cosmology, Vol. 3, Issue 1
  • DOI: 10.1186/s40668-016-0017-2

Characterizing Task-Based OpenMP Programs
journal, April 2015