Roofline: an insightful visual performance model for multicore architectures

Williams, Samuel; Waterman, Andrew; Patterson, David

doi:10.1145/1498765.1498785

Abstract

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.
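
For readers unfamiliar with the model, the bound that Roofline plots is commonly stated as attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth × operational intensity). This record does not reproduce that formula, so the Python sketch below is only an illustration of it. The function names and the machine figures (100 GFLOP/s peak, 20 GB/s bandwidth) are hypothetical and are not taken from the paper.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, operational_intensity):
    """Roofline upper bound on kernel performance, in GFLOP/s.

    operational_intensity is in FLOPs per byte of memory traffic.
    """
    if min(peak_gflops, peak_bandwidth_gbs, operational_intensity) < 0:
        raise ValueError("all inputs must be non-negative")
    # The flat part of the roof is the compute peak; the slanted part is
    # memory bandwidth multiplied by operational intensity.
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Operational intensity (FLOPs/byte) where the two roofs meet."""
    return peak_gflops / peak_bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine, chosen only to make the arithmetic easy.
    PEAK_GFLOPS = 100.0
    PEAK_BANDWIDTH_GBS = 20.0

    for oi in (0.5, 5.0, 16.0):
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, oi)
        print(f"intensity {oi:5.1f} FLOPs/byte -> at most {bound:6.1f} GFLOP/s")
```

Under these assumed numbers, a kernel whose operational intensity lies below the ridge point is limited by memory bandwidth, and one above it is limited by peak compute.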

Authors:

Williams, Samuel ^[1]; Waterman, Andrew ^[1]; Patterson, David ^[1]

Univ. of California, Berkeley, CA (United States). Parallel Computing Lab.

Publication Date: 2009-04-04

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1407073

Grant/Contract Number: AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: Communications of the ACM

Additional Journal Information: Journal Volume: 52; Journal Issue: 4; Journal ID: ISSN 0001-0782

Publisher: Association for Computing Machinery

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats

Williams, Samuel, Waterman, Andrew, and Patterson, David. Roofline: an insightful visual performance model for multicore architectures. United States: N. p., 2009. Web. doi:10.1145/1498765.1498785.

Williams, Samuel, Waterman, Andrew, & Patterson, David. Roofline: an insightful visual performance model for multicore architectures. United States. https://doi.org/10.1145/1498765.1498785

Williams, Samuel, Waterman, Andrew, and Patterson, David. 2009. "Roofline: an insightful visual performance model for multicore architectures". United States. https://doi.org/10.1145/1498765.1498785. https://www.osti.gov/servlets/purl/1407073.

@article{osti_1407073,
  title        = {Roofline: an insightful visual performance model for multicore architectures},
  author       = {Williams, Samuel and Waterman, Andrew and Patterson, David},
  abstractNote = {We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.},
  doi          = {10.1145/1498765.1498785},
  journal      = {Communications of the ACM},
  number       = 4,
  volume       = 52,
  place        = {United States},
  year         = {2009},
  month        = {4}
}

Publisher's Version of Record

https://doi.org/10.1145/1498765.1498785

Citation Metrics:

Cited by: 1138 works

Citation information provided by Web of Science

Works referenced in this record:

Validity of the single processor approach to achieving large scale computing capabilities
conference, January 1967

Amdahl, Gene M.
Proceedings of the April 18-20, 1967, spring joint computer conference on - AFIPS '67 (Spring)
DOI: 10.1145/1465482.1465560

A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1
conference, January 1994

Boyd, E. L.; Azeem, W.; Lee, Hsien-Hsin
1994 International Conference on Parallel Processing Vol. 3
DOI: 10.1109/ICPP.1994.30

Estimating interlock and improving balance for pipelined architectures
journal, August 1988

Callahan, David; Cocke, John; Kennedy, Ken
Journal of Parallel and Distributed Computing, Vol. 5, Issue 4
DOI: 10.1016/0743-7315(88)90002-0

Improving the ratio of memory operations to floating-point operations in loops
journal, November 1994

Carr, Steve; Kennedy, Ken
ACM Transactions on Programming Languages and Systems, Vol. 16, Issue 6
DOI: 10.1145/197320.197366

Self-Adapting Linear Algebra Algorithms and Software
journal, February 2005

Demmel, J.; Dongarra, J.; Eijkhout, V.
Proceedings of the IEEE, Vol. 93, Issue 2
DOI: 10.1109/JPROC.2004.840848

Performance of Synchronized Iterative Processes in Multiprocessor Systems
journal, July 1982

Dubois, M.; Briggs, F. A.
IEEE Transactions on Software Engineering, Vol. SE-8, Issue 4
DOI: 10.1109/TSE.1982.235576

The Design and Implementation of FFTW3
journal, February 2005

Frigo, M.; Johnson, S. G.
Proceedings of the IEEE, Vol. 93, Issue 2
DOI: 10.1109/JPROC.2004.840301

Mapping computational concepts to GPUs
conference, January 2005

Harris, Mark
ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05
DOI: 10.1145/1198555.1198768

Amdahl's Law in the Multicore Era
journal, July 2008

Hill, Mark D.; Marty, Michael R.
Computer, Vol. 41, Issue 7
DOI: 10.1109/MC.2008.209

Evaluating associativity in CPU caches
journal, January 1989

Hill, M. D.; Smith, A. J.
IEEE Transactions on Computers, Vol. 38, Issue 12
DOI: 10.1109/12.40842

A Proof for the Queuing Formula: L = λ W
journal, June 1961

Little, John D. C.
Operations Research, Vol. 9, Issue 3
DOI: 10.1287/opre.9.3.383

Latency lags bandwith
journal, October 2004

Patterson, David A.
Communications of the ACM, Vol. 47, Issue 10
DOI: 10.1145/1022594.1022596

Analytic Queueing Network Models for Parallel Processing of Task Systems
journal, December 1986

Thomasian, A.
IEEE Transactions on Computers, Vol. C-35, Issue 12, p. 1045-1054
DOI: 10.1109/TC.1986.1676712

A genetic algorithms approach to modeling the performance of memory-bound computations
conference, January 2007

Tikir, Mustafa M.; Carrington, Laura; Strohmaier, Erich
Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
DOI: 10.1145/1362622.1362686

Lattice Boltzmann simulation optimization on leading multicore platforms
conference, April 2008

Williams, Samuel; Carter, Jonathan; Oliker, Leonid
2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS)
DOI: 10.1109/IPDPS.2008.4536295

Optimization of sparse matrix-vector multiplication on emerging multicore platforms
conference, January 2007

Williams, Samuel; Oliker, Leonid; Vuduc, Richard
Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
DOI: 10.1145/1362622.1362674

The SPLASH-2 programs: characterization and methodological considerations
conference, January 1995

Woo, Steven Cameron; Ohara, Moriyoshi; Torrie, Evan
Proceedings of the 22nd annual international symposium on Computer architecture - ISCA '95
DOI: 10.1145/223982.223990

Works referencing / citing this record:

Evaluating automatically parallelized versions of the support vector machine
journal, October 2014

Codreanu, Valeriu; Dröge, Bob; Williams, David
Concurrency and Computation: Practice and Experience, Vol. 28, Issue 7
DOI: 10.1002/cpe.3413

Towards generating efficient flow solvers with the ExaStencils approach
journal, May 2017

Kuckuk, Sebastian; Haase, Gundolf; Vasco, Diego A.
Concurrency and Computation: Practice and Experience, Vol. 29, Issue 17
DOI: 10.1002/cpe.4062

Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
journal, March 2017

Calore, Enrico; Gabbana, Alessandro; Schifano, Sebastiano Fabio
Concurrency and Computation: Practice and Experience, Vol. 29, Issue 12
DOI: 10.1002/cpe.4143

An efficient low-rank Kalman filter for modern SIMD architectures
journal, April 2018

Cámpora Pérez, Daniel Hugo; Awile, Omar
Concurrency and Computation: Practice and Experience, Vol. 30, Issue 23
DOI: 10.1002/cpe.4483

AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
journal, July 2018

Coronado-Barrientos, E.; Indalecio, G.; García-Loureiro, A.
Concurrency and Computation: Practice and Experience, Vol. 31, Issue 1
DOI: 10.1002/cpe.4864

Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs
journal, August 2018

Carrijo Nasciutti, Thiago; Panetta, Jairo; Pais Lopes, Pedro
Concurrency and Computation: Practice and Experience, Vol. 31, Issue 18
DOI: 10.1002/cpe.4929

Bulk execution of the dynamic programming for the optimal polygon triangulation problem on the GPU
journal, September 2018

Yamashita, Kohei; Ito, Yasuaki; Nakano, Koji
Concurrency and Computation: Practice and Experience, Vol. 31, Issue 19
DOI: 10.1002/cpe.4947

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution
journal, August 2018

Reddy Manumachu, Ravi; Lastovetsky, Alexey L.
Concurrency and Computation: Practice and Experience, Vol. 31, Issue 4
DOI: 10.1002/cpe.4958

Roofline analysis with Cray performance analysis tools (CrayPat) and roofline‐based performance projections for a future architecture
journal, September 2018

Kwack, JaeHyuk; Arnold, Galen; Mendes, Celso
Concurrency and Computation: Practice and Experience
DOI: 10.1002/cpe.4963

High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor
journal, November 2018

Robertsén, Fredrik; Mattila, Keijo; Westerholm, Jan
Concurrency and Computation: Practice and Experience, Vol. 31, Issue 13
DOI: 10.1002/cpe.5072

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system
journal, November 2019

Yang, Charlene; Kurth, Thorsten; Williams, Samuel
Concurrency and Computation: Practice and Experience, Vol. 32, Issue 20
DOI: 10.1002/cpe.5547

Use of model-based architecture attributes to construct a component-level trade space
journal, February 2019

McKean, David; Moreland, James D.; Doskey, Steven
Systems Engineering, Vol. 22, Issue 2
DOI: 10.1002/sys.21478

LRnLA Algorithm ConeFold with Non-local Vectorization for LBM Implementation
book, December 2018

Perepelkina, Anastasia; Levchenko, Vadim
Communications in Computer and Information Science
DOI: 10.1007/978-3-030-05807-4_9

Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography
book, December 2018

Schrödter, Tobias; Pallasch, David; Wienke, Sandra
Lecture Notes in Computer Science
DOI: 10.1007/978-3-030-10549-5_33

DSL-Based Acceleration of Automotive Environment Perception and Mapping Algorithms for Embedded CPUs, GPUs, and FPGAs
book, January 2019

Fickenscher, Jörg; Hannig, Frank; Teich, Jürgen
Architecture of Computing Systems – ARCS 2019
DOI: 10.1007/978-3-030-18656-2_6

GPU Implementation of ConeTorre Algorithm for Fluid Dynamics Simulation
book, July 2019

Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
Parallel Computing Technologies: 15th International Conference, PaCT 2019, Almaty, Kazakhstan, August 19–23, 2019, Proceedings, p. 199-213
DOI: 10.1007/978-3-030-25636-4_16

LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU
book, August 2019

Levchenko, Vadim; Zakirov, Andrey; Perepelkina, Anastasia
Parallel Computational Technologies: 13th International Conference, PCT 2019, Kaliningrad, Russia, April 2–4, 2019, Revised Selected Papers, p. 139-151
DOI: 10.1007/978-3-030-28163-2_10

Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL
book, October 2016

Joó, Bálint; Kalamkar, Dhiraj D.; Kurth, Thorsten
Lecture Notes in Computer Science
DOI: 10.1007/978-3-319-46079-6_30

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
book, May 2017

Hammer, Julian; Eitzinger, Jan; Hager, Georg
Tools for High Performance Computing 2016
DOI: 10.1007/978-3-319-56702-0_1

A High-Throughput Kalman Filter for Modern SIMD Architectures
book, January 2018

Cámpora Pérez, Daniel Hugo; Awile, Omar; Potterat, Cédric
Euro-Par 2017: Parallel Processing Workshops
DOI: 10.1007/978-3-319-75178-8_31

Approximate FPGA-Based LSTMs Under Computation Time Constraints
book, January 2018

Rizakis, Michalis; Venieris, Stylianos I.; Kouris, Alexandros
Applied Reconfigurable Computing. Architectures, Tools, and Applications
DOI: 10.1007/978-3-319-78890-6_1

On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors
book, January 2018

Hofmann, Johannes; Hager, Georg; Fey, Dietmar
Lecture Notes in Computer Science
DOI: 10.1007/978-3-319-92040-5_2

Software Design Space Exploration for Exascale Combustion Co-design
book, January 2013

Chan, Cy; Unat, Didem; Lijewski, Michael
Lecture Notes in Computer Science
DOI: 10.1007/978-3-642-38750-0_15

How Many Threads will be too Many? On the Scalability of OpenMP Implementations
book, January 2015

Iwainsky, Christian; Shudler, Sergei; Calotoiu, Alexandru
Lecture Notes in Computer Science
DOI: 10.1007/978-3-662-48096-0_35

Measuring energy consumption using EML (energy measurement library)
journal, July 2014

Cabrera, Alberto; Almeida, Francisco; Arteaga, Javier
Computer Science - Research and Development, Vol. 30, Issue 2
DOI: 10.1007/s00450-014-0269-5

Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures
journal, November 2016

Ciznicki, Milosz; Kurowski, Krzysztof; Weglarz, Jan
Cluster Computing, Vol. 20, Issue 3
DOI: 10.1007/s10586-016-0686-2

GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
journal, October 2016

Kreutzer, Moritz; Thies, Jonas; Röhrig-Zöllner, Melven
International Journal of Parallel Programming, Vol. 45, Issue 5
DOI: 10.1007/s10766-016-0464-z

Type-Driven Automated Program Transformations and Cost Modelling for Optimising Streaming Programs on FPGAs
journal, April 2018

Vanderbauwhede, Wim; Nabi, Syed Waqar; Urlea, Cristian
International Journal of Parallel Programming, Vol. 47, Issue 1
DOI: 10.1007/s10766-018-0572-z

3DyRM: a dynamic roofline model including memory latency information
journal, March 2014

Lorenzo, O. G.; Pena, T. F.; Cabaleiro, J. C.
The Journal of Supercomputing, Vol. 70, Issue 2
DOI: 10.1007/s11227-014-1163-4

Optimization of parallel iterated local search algorithms on graphics processing unit
journal, May 2016

Zhou, Yi; He, Fazhi; Qiu, Yimin
The Journal of Supercomputing, Vol. 72, Issue 6
DOI: 10.1007/s11227-016-1738-3

The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes
journal, June 2018

Perepelkina, Anastasia; Levchenko, Vadim; Khilkov, Sergey
The Journal of Supercomputing, Vol. 75, Issue 12
DOI: 10.1007/s11227-018-2461-z

Efficient scheduling of streams on GPGPUs
journal, February 2020

Beheshti Roui, Mohamad; Shekofteh, S. Kazem; Noori, Hamid
The Journal of Supercomputing, Vol. 76, Issue 11
DOI: 10.1007/s11227-020-03209-x

Development of a Parallel Explicit Finite-Volume Euler Equation Solver using the Immersed Boundary Method with Hybrid MPI-CUDA Paradigm
journal, October 2019

Kuo, F. A.; Chiang, C. H.; Lo, M. C.
Journal of Mechanics, Vol. 36, Issue 1
DOI: 10.1017/jmech.2019.9

High performance FDTD algorithm for GPGPU supercomputers
journal, October 2016

Zakirov, Andrey; Levchenko, Vadim; Perepelkina, Anastasia
Journal of Physics: Conference Series, Vol. 759
DOI: 10.1088/1742-6596/759/1/012100

Ultrafast analysis of individual grain behavior during grain growth by parallel computing
journal, August 2015

Kühbach, M.; Barrales-Mora, L. A.; Mießen, C.
IOP Conference Series: Materials Science and Engineering, Vol. 89
DOI: 10.1088/1757-899x/89/1/012031

A real-time, all-sky, high time resolution, direct imager for the long wavelength array
journal, May 2019

Kent, James; Dowell, Jayce; Beardsley, Adam
Monthly Notices of the Royal Astronomical Society, Vol. 486, Issue 4
DOI: 10.1093/mnras/stz1206

Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
journal, October 2019

Kent, James; Beardsley, Adam P.; Bester, Landman
Monthly Notices of the Royal Astronomical Society, Vol. 491, Issue 1
DOI: 10.1093/mnras/stz3028

Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation
journal, May 2018

Levchenko, V. D.; Perepelkina, A. Y.
Lobachevskii Journal of Mathematics, Vol. 39, Issue 4
DOI: 10.1134/s1995080218040108

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
conference, January 2015

Zhang, Chen; Li, Peng; Sun, Guangyu
Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15
DOI: 10.1145/2684746.2689060

Optimizing Sparse Matrix—Matrix Multiplication for the GPU
journal, October 2015

Dalton, Steven; Olson, Luke; Bell, Nathan
ACM Transactions on Mathematical Software, Vol. 41, Issue 4
DOI: 10.1145/2699470

Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
conference, January 2015

Wahib, Mohamed; Maruyama, Naoya
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '15
DOI: 10.1145/2749246.2749255

Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model
conference, January 2015

Stengel, Holger; Treibig, Jan; Hager, Georg
Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
DOI: 10.1145/2751205.2751240

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
conference, January 2015

Hoefler, Torsten; Belli, Roberto
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
DOI: 10.1145/2807591.2807644

Harnessing energy efficiency of heterogeneous-ISA platforms
conference, January 2015

Bhat, Sharath K.; Saya, Ajithchandra; Rawat, Hemedra K.
Proceedings of the Workshop on Power-Aware Computing and Systems - HotPower '15
DOI: 10.1145/2818613.2818747

Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance
conference, January 2015

Ardalani, Newsha; Lestourgeon, Clint; Sankaralingam, Karthikeyan
Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48
DOI: 10.1145/2830772.2830780

Variation Among Processors Under Turbo Boost in HPC Systems
conference, January 2016

Acun, Bilge; Miller, Phil; Kale, Laxmikant V.
Proceedings of the 2016 International Conference on Supercomputing - ICS '16
DOI: 10.1145/2925426.2926289

Parallel Memory-Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells
journal, January 2017

Meister, Oliver; Rahnema, Kaveh; Bader, Michael
ACM Transactions on Mathematical Software, Vol. 43, Issue 3
DOI: 10.1145/2947668

Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks
conference, November 2016

Zhang, Chen; Fang, Zhenman; Zhou, Peipei
Proceedings of the 35th International Conference on Computer-Aided Design - ICCAD '16
DOI: 10.1145/2966986.2967011

Resource Conscious Reuse-Driven Tiling for GPUs
conference, January 2016

Rawat, Prashant Singh; Hong, Changwan; Ravishankar, Mahesh
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16
DOI: 10.1145/2967938.2967967

Data-Centric Computing Frontiers: A Survey On Processing-In-Memory
conference, October 2016

Siegl, Patrick; Buchty, Rainer; Berekovic, Mladen
Proceedings of the Second International Symposium on Memory Systems - MEMSYS '16
DOI: 10.1145/2989081.2989087

Sparse Matrix-Vector Multiplication on GPGPUs
journal, January 2017

Filippone, Salvatore; Cardellini, Valeria; Barbieri, Davide
ACM Transactions on Mathematical Software, Vol. 43, Issue 4
DOI: 10.1145/3017994

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
conference, January 2017

Umuroglu, Yaman; Fraser, Nicholas J.; Gambardella, Giulio
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17
DOI: 10.1145/3020078.3021744

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
conference, June 2017

Xiao, Qingcheng; Liang, Yun; Lu, Liqiang
Proceedings of the 54th Annual Design Automation Conference 2017 - DAC '17
DOI: 10.1145/3061639.3062244

A Survey of Power and Energy Predictive Models in HPC Systems and Applications
journal, October 2017

O’Brien, Kenneth; Pietri, Ilia; Reddy, Ravi
ACM Computing Surveys, Vol. 50, Issue 3
DOI: 10.1145/3078811

In-Datacenter Performance Analysis of a Tensor Processing Unit
conference, January 2017

Jouppi, Norman P.; Borchers, Al; Boyle, Rick
Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA '17
DOI: 10.1145/3079856.3080246

In-Datacenter Performance Analysis of a Tensor Processing Unit
journal, June 2017

Jouppi, Norman P.; Borchers, Al; Boyle, Rick
ACM SIGARCH Computer Architecture News, Vol. 45, Issue 2
DOI: 10.1145/3140659.3080246

Design of a High-Performance GEMM-like Tensor–Tensor Multiplication
journal, April 2018

Springer, Paul; Bientinesi, Paolo
ACM Transactions on Mathematical Software, Vol. 44, Issue 3
DOI: 10.1145/3157733

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
journal, July 2018

Venieris, Stylianos I.; Kouris, Alexandros; Bouganis, Christos-Savvas
ACM Computing Surveys, Vol. 51, Issue 3
DOI: 10.1145/3186332

A Survey on Compiler Autotuning using Machine Learning
journal, January 2019

Ashouri, Amir H.; Killian, William; Cavazos, John
ACM Computing Surveys, Vol. 51, Issue 5
DOI: 10.1145/3197978

Efficient sparse-matrix multi-vector product on GPUs
conference, January 2018

Hong, Changwan; Sadayappan, P.; Sukumaran-Rajam, Aravind
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
DOI: 10.1145/3208040.3208062

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
journal, December 2018

Blott, Michaela; Preußer, Thomas B.; Fraser, Nicholas J.
ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, Issue 3
DOI: 10.1145/3242897

In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms
journal, April 2019

Choi, Young-Kyu; Cong, Jason; Fang, Zhenman
ACM Transactions on Reconfigurable Technology and Systems, Vol. 12, Issue 1
DOI: 10.1145/3294054

Metric Selection for GPU Kernel Classification
journal, January 2019

Shekofteh, S.-Kazem; Noori, Hamid; Naghibzadeh, Mahmoud
ACM Transactions on Architecture and Code Optimization, Vol. 15, Issue 4
DOI: 10.1145/3295690

Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators
journal, August 2019

Kronbichler, Martin; Kormann, Katharina
ACM Transactions on Mathematical Software, Vol. 45, Issue 3
DOI: 10.1145/3325864

On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
conference, January 2020

Helm, Christian; Taura, Kenjiro
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region - HPCAsia2020
DOI: 10.1145/3368474.3368476

Performance Optimization and Modeling of Fine-Grained Irregular Communication in UPC
journal, March 2019

Lagravière, Jérémie; Langguth, Johannes; Prugger, Martina
Scientific Programming, Vol. 2019
DOI: 10.1155/2019/6825728

ExaSAT: An exascale co-design tool for performance modeling
journal, April 2014

Unat, Didem; Chan, Cy; Zhang, Weiqun
The International Journal of High Performance Computing Applications, Vol. 29, Issue 2
DOI: 10.1177/1094342014568690

Modeling high-throughput applications for in situ analytics
journal, May 2019

Aupy, Guillaume; Goglin, Brice; Honoré, Valentin
The International Journal of High Performance Computing Applications, Vol. 33, Issue 6
DOI: 10.1177/1094342019847263

Analytic performance modeling and analysis of detailed neuron simulations
journal, April 2020

Cremonesi, Francesco; Hager, Georg; Wellein, Gerhard
The International Journal of High Performance Computing Applications, Vol. 34, Issue 4
DOI: 10.1177/1094342020912528

Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
journal, November 2012

Kim, Hyesoon; Vuduc, Richard; Baghsorkhi, Sara
Synthesis Lectures on Computer Architecture, Vol. 7, Issue 2
DOI: 10.2200/s00451ed1v01y201209cac020

Data Management in Machine Learning Systems
journal, February 2019

Boehm, Matthias; Kumar, Arun; Yang, Jun
Synthesis Lectures on Data Management, Vol. 14, Issue 1
DOI: 10.2200/s00895ed1v01y201901dtm057

Lagrange-Flux Schemes: Reformulating Second-Order Accurate Lagrange-Remap Schemes for Better Node-Based HPC Performance
journal, November 2016

De Vuyst, Florian; Gasc, Thibault; Motte, Renaud
Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles, Vol. 71, Issue 6
DOI: 10.2516/ogst/2016019

Compression Challenges in Large Scale Partial Differential Equation Solvers
journal, September 2019

Götschel, Sebastian; Weiser, Martin
Algorithms, Vol. 12, Issue 9
DOI: 10.3390/a12090197

DiamondTorre Algorithm for High-Performance Wave Modeling
journal, August 2016

Levchenko, Vadim; Perepelkina, Anastasia; Zakirov, Andrey
Computation, Vol. 4, Issue 3
DOI: 10.3390/computation4030029

An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution
journal, March 2019

Liu, Bing; Zou, Danyin; Feng, Lei
Electronics, Vol. 8, Issue 3
DOI: 10.3390/electronics8030281

Developing Efficient Discrete Simulations on Multicore and GPU Architectures
journal, January 2020

Cagigas-Muñiz, Daniel; Diaz-del-Rio, Fernando; López-Torres, Manuel Ramón
Electronics, Vol. 9, Issue 1
DOI: 10.3390/electronics9010189

Fog vs. Cloud Computing: Should I Stay or Should I Go?
journal, February 2019

Pisani, Flávia; Martins do Rosario, Vanderson; Borin, Edson
Future Internet, Vol. 11, Issue 2
DOI: 10.3390/fi11020034

A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture
journal, December 2018

Wan, Bo; Yang, Lin; Zhou, Shunping
ISPRS International Journal of Geo-Information, Vol. 7, Issue 12
DOI: 10.3390/ijgi7120472

CPMIP: measurements of real computational performance of Earth system models in CMIP6
journal, January 2017

Balaji, Venkatramani; Maisonnave, Eric; Zadeh, Niki
Geoscientific Model Development, Vol. 10, Issue 1
DOI: 10.5194/gmd-10-19-2017

Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0
journal, January 2018

Fuhrer, Oliver; Chadha, Tarun; Hoefler, Torsten
Geoscientific Model Development, Vol. 11, Issue 4
DOI: 10.5194/gmd-11-1665-2018

Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0)
journal, January 2018

Porter, Andrew R.; Appleyard, Jeremy; Ashworth, Mike
Geoscientific Model Development, Vol. 11, Issue 8
DOI: 10.5194/gmd-11-3447-2018

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
posted_content, January 2018

Louboutin, Mathias; Lange, Michael; Luporini, Fabio
Geoscientific Model Development Discussions
DOI: 10.5194/gmd-2018-189

Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation
text, January 2021

Platzer, Michael; Puschner, Peter
Schloss Dagstuhl - Leibniz-Zentrum für Informatik
DOI: 10.4230/lipics.ecrts.2021.1

Co-design of a Particle-in-Cell Plasma Simulation Code for Intel Xeon Phi: a First Look at Knights Landing
text, January 2016

Bastrakov, Sergey; Meyerov, Iosif; Gonoskov, Arkady
Unpublished
DOI: 10.13140/rg.2.2.11832.96006

Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
text, January 2020

Kent, James; Beardsley, Ap; Bester, L.
Apollo - University of Cambridge Repository
DOI: 10.17863/cam.48304

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
journal, January 2019

Louboutin, Mathias; Lange, Michael; Luporini, Fabio
Geoscientific Model Development, Vol. 12, Issue 3
DOI: 10.5194/gmd-12-1165-2019

Harnessing Energy Efficiency of Heterogeneous-ISA Platforms
journal, January 2016

Bhat, Sharath K.; Saya, Ajithchandra; Rawat, Hemedra K.
ACM SIGOPS Operating Systems Review, Vol. 49, Issue 2
DOI: 10.1145/2883591.2883605

Ultrafast analysis of individual grain behavior during grain growth by parallel computing
text, January 2015

Kühbach, M.; Barrales-Mora, L. A.; Mießen, C.
RWTH Aachen University
DOI: 10.18154/rwth-2015-04763

Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model
text, January 2014

Stengel, Holger; Treibig, Jan; Hager, Georg
arXiv
DOI: 10.48550/arxiv.1410.5010

GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
text, January 2015

Kreutzer, Moritz; Thies, Jonas; Röhrig-Zöllner, Melven
arXiv
DOI: 10.48550/arxiv.1507.08101

Co-design of a particle-in-cell plasma simulation code for Intel Xeon Phi: a first look at Knights Landing
preprint, January 2016

Surmin, Igor; Bastrakov, Sergey; Matveev, Zakhar
arXiv
DOI: 10.48550/arxiv.1608.01009

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
text, January 2016

Umuroglu, Yaman; Fraser, Nicholas J.; Gambardella, Giulio
arXiv
DOI: 10.48550/arxiv.1612.07119

A Survey on Compiler Autotuning using Machine Learning
text, January 2018

Ashouri, Amir H.; Killian, William; Cavazos, John
arXiv
DOI: 10.48550/arxiv.1801.04405

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration
text, January 2018

Louboutin, Mathias; Lange, Michael; Luporini, Fabio
arXiv
DOI: 10.48550/arxiv.1808.01995

A Real-Time, All-Sky, High Time Resolution, Direct Imager for the Long Wavelength Array
text, January 2019

Kent, James; Dowell, Jayce; Beardsley, Adam
arXiv
DOI: 10.48550/arxiv.1904.11422

Performance optimization and modeling of fine-grained irregular communication in UPC
text, January 2019

Lagravière, Jérémie; Langguth, Johannes; Prugger, Martina
arXiv
DOI: 10.48550/arxiv.1912.12701

In situ and in-transit analysis of cosmological simulations
journal, August 2016

Friesen, Brian; Almgren, Ann; Lukić, Zarija
Computational Astrophysics and Cosmology, Vol. 3, Issue 1
DOI: 10.1186/s40668-016-0017-2

Characterizing Task-Based OpenMP Programs
journal, April 2015

Muddukrishna, Ananya; Jonsson, Peter A.; Brorsson, Mats
PLOS ONE, Vol. 10, Issue 4
DOI: 10.1371/journal.pone.0123545

Similar Records in DOE PAGES and OSTI.GOV collections:

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

Technical Report: Williams, Samuel; Waterman, Andrew; Patterson, David

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.
https://doi.org/10.2172/1407078

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

Journal Article: Williams, Samuel; Waterman, Andrew; Patterson, David - Communications of the Association for Computing Machinery

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.
https://doi.org/10.1145/1498765.1498785

Full Text Available
Instruction Roofline: An insightful visual performance model for GPUs

Conference Ding, N; Awan, M; Williams, S

The Roofline performance model provides an intuitive approach to identify performance bottlenecks and guide performance optimization. However, the classic FLOP-centric approach is inappropriate for the emerging applications that perform more integer operations than floating point operations. In this article, we reintroduce our Instruction Roofline Model on NVIDIA GPUs and expand our evaluation of it. The Instruction Roofline incorporates instructions and memory transactions across all memory hierarchies together, and provides more performance insights than the FLOP-oriented Roofline Model, that is, instruction throughput, stride memory access patterns, bank conflicts, and thread predication. We use our Instruction Roofline methodology to analyze eight proxy ...
https://doi.org/10.1002/cpe.6591

Full Text Available
Automated Cache Performance Analysis And Optimization

Technical Report Mohror, Kathryn

While there is no lack of performance counter tools for coarse-grained measurement of cache activity, there is a critical lack of tools for relating data layout to cache behavior to application performance. Generally, any nontrivial optimizations are either not done at all, or are done "by hand" requiring significant time and expertise. To the best of our knowledge no tool available to users measures the latency of memory reference instructions for particular addresses and makes this information available to users in an easy-to-use and intuitive way. In this project, we worked to enable the Open|SpeedShop performance analysis tool to gather ...
https://doi.org/10.2172/1113233

Full Text Available
A Testbed of Parallel Kernels for Computer Science Research

Technical Report Bailey, David; Demmel, James; Ibrahim, Khaled; ...

For several decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models for optimal performance, efficiency, and productivity. Unfortunately, this guidance is most often taken from the existing software/hardware ecosystem. Architects attempt to provide micro-architectural solutions to improve performance on fixed binaries. Researchers tweak compilers to improve code generation for existing architectures and implementations, and they may invent new programming models for fixed processor and memory architectures and computational algorithms. In today's rapidly evolving world of on-chip parallelism, these isolated and iterative improvements to performance may miss superior solutions in the same way gradient descent ...
https://doi.org/10.2172/983273

Full Text Available

Validity of the single processor approach to achieving large scale computing capabilities
conference, January 1967

A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1
conference, January 1994

Estimating interlock and improving balance for pipelined architectures
journal, August 1988

Improving the ratio of memory operations to floating-point operations in loops
journal, November 1994

Self-Adapting Linear Algebra Algorithms and Software
journal, February 2005

Performance of Synchronized Iterative Processes in Multiprocessor Systems
journal, July 1982

The Design and Implementation of FFTW3
journal, February 2005

Mapping computational concepts to GPUs
conference, January 2005

Amdahl's Law in the Multicore Era
journal, July 2008

Evaluating associativity in CPU caches
journal, January 1989

A Proof for the Queuing Formula: L = λ W
journal, June 1961

Latency lags bandwith
journal, October 2004

Analytic Queueing Network Models for Parallel Processing of Task Systems
journal, December 1986

A genetic algorithms approach to modeling the performance of memory-bound computations
conference, January 2007

Lattice Boltzmann simulation optimization on leading multicore platforms
conference, April 2008

Optimization of sparse matrix-vector multiplication on emerging multicore platforms
conference, January 2007

The SPLASH-2 programs: characterization and methodological considerations
conference, January 1995

Evaluating automatically parallelized versions of the support vector machine
journal, October 2014

Towards generating efficient flow solvers with the ExaStencils approach
journal, May 2017

Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
journal, March 2017

An efficient low-rank Kalman filter for modern SIMD architectures
journal, April 2018

AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL
journal, July 2018

Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs
journal, August 2018

Bulk execution of the dynamic programming for the optimal polygon triangulation problem on the GPU
journal, September 2018

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution
journal, August 2018

Roofline analysis with Cray performance analysis tools (CrayPat) and roofline‐based performance projections for a future architecture
journal, September 2018

High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor
journal, November 2018

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system
journal, November 2019

Use of model-based architecture attributes to construct a component-level trade space
journal, February 2019

LRnLA Algorithm ConeFold with Non-local Vectorization for LBM Implementation
book, December 2018

Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography
book, December 2018

DSL-Based Acceleration of Automotive Environment Perception and Mapping Algorithms for Embedded CPUs, GPUs, and FPGAs
book, January 2019

GPU Implementation of ConeTorre Algorithm for Fluid Dynamics Simulation
book, July 2019

LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU
book, August 2019

Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL
book, October 2016

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
book, May 2017

A High-Throughput Kalman Filter for Modern SIMD Architectures
book, January 2018

Approximate FPGA-Based LSTMs Under Computation Time Constraints
book, January 2018

On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors
book, January 2018

Software Design Space Exploration for Exascale Combustion Co-design
book, January 2013

How Many Threads will be too Many? On the Scalability of OpenMP Implementations
book, January 2015

Measuring energy consumption using EML (energy measurement library)
journal, July 2014

Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures
journal, November 2016

GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
journal, October 2016

Type-Driven Automated Program Transformations and Cost Modelling for Optimising Streaming Programs on FPGAs
journal, April 2018

3DyRM: a dynamic roofline model including memory latency information
journal, March 2014

Optimization of parallel iterated local search algorithms on graphics processing unit
journal, May 2016

The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes
journal, June 2018

Efficient scheduling of streams on GPGPUs
journal, February 2020

Development of a Parallel Explicit Finite-Volume Euler Equation Solver using the Immersed Boundary Method with Hybrid MPI-CUDA Paradigm
journal, October 2019

High performance FDTD algorithm for GPGPU supercomputers
journal, October 2016

Ultrafast analysis of individual grain behavior during grain growth by parallel computing
journal, August 2015

A real-time, all-sky, high time resolution, direct imager for the long wavelength array
journal, May 2019

Direct wide-field radio imaging in real-time at high time resolution using antenna electric fields
journal, October 2019

Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation
journal, May 2018

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
conference, January 2015

Optimizing Sparse Matrix—Matrix Multiplication for the GPU
journal, October 2015

Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
conference, January 2015

Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model
conference, January 2015

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
conference, January 2015

Harnessing energy efficiency of heterogeneous-ISA platforms
conference, January 2015

Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance
conference, January 2015

Variation Among Processors Under Turbo Boost in HPC Systems
conference, January 2016

Parallel Memory-Efficient Adaptive Mesh Refinement on Structured Triangular Meshes with Billions of Grid Cells
journal, January 2017

Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks
conference, November 2016

Resource Conscious Reuse-Driven Tiling for GPUs
conference, January 2016

Data-Centric Computing Frontiers: A Survey On Processing-In-Memory
conference, October 2016

Sparse Matrix-Vector Multiplication on GPGPUs
journal, January 2017

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
conference, January 2017

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs
conference, June 2017

A Survey of Power and Energy Predictive Models in HPC Systems and Applications
journal, October 2017

In-Datacenter Performance Analysis of a Tensor Processing Unit
conference, January 2017

In-Datacenter Performance Analysis of a Tensor Processing Unit
journal, June 2017

Design of a High-Performance GEMM-like Tensor–Tensor Multiplication
journal, April 2018

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
journal, July 2018

A Survey on Compiler Autotuning using Machine Learning
journal, January 2019

Efficient sparse-matrix multi-vector product on GPUs
conference, January 2018

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
journal, December 2018

In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms
journal, April 2019

Metric Selection for GPU Kernel Classification
journal, January 2019