Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system

Yang, Charlene; Kurth, Thorsten; Williams, Samuel

doi:10.1002/cpe.5547

Abstract

The Roofline performance model provides an intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In preparation for the next‐generation supercomputer Perlmutter at NERSC, this paper presents a methodology to construct a hierarchical Roofline on NVIDIA GPUs and extends it to support reduced precision and Tensor Cores. The hierarchical Roofline incorporates L1, L2, device memory, and system memory bandwidths into one single figure, and it offers more profound insights into performance analysis than the traditional DRAM‐only Roofline. We use our Roofline methodology to analyze three proxy applications: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow. In doing so, we demonstrate the ability of our methodology to readily understand various aspects of performance and performance bottlenecks on NVIDIA GPUs and motivate code optimizations.
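As a rough illustration of the idea in the abstract, the sketch below applies the standard Roofline bound (attainable performance is the lesser of the compute peak and arithmetic intensity times bandwidth) once per memory level. It is not the authors' tooling: the function names, the three levels chosen, and every number are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch of a hierarchical Roofline bound.
# ASSUMPTION: all names and numbers here are illustrative placeholders,
# not values or code from the paper.

def roofline_bound(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s = min(compute peak, AI * bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def hierarchical_bounds(peak_gflops, bandwidths_gbs, flops, bytes_moved):
    """Return one Roofline bound per memory level.

    Each level uses its own arithmetic intensity: the kernel's FLOPs
    divided by the bytes moved at that level.
    """
    bounds = {}
    for level, bandwidth in bandwidths_gbs.items():
        intensity = flops / bytes_moved[level]  # FLOPs per byte
        bounds[level] = roofline_bound(peak_gflops, bandwidth, intensity)
    return bounds


# Hypothetical machine ceilings (GFLOP/s and GB/s).
PEAK = 7000.0
BANDWIDTHS = {"L1": 14000.0, "L2": 3000.0, "HBM": 800.0}

# Hypothetical kernel: FLOP count and bytes moved at each level.
FLOPS = 1.0e9
BYTES = {"L1": 4.0e9, "L2": 1.0e9, "HBM": 0.25e9}

bounds = hierarchical_bounds(PEAK, BANDWIDTHS, FLOPS, BYTES)
# The level with the lowest bound is the likely bottleneck.
binding_level = min(bounds, key=bounds.get)
print(bounds, binding_level)
```

With these placeholder inputs the lowest bound comes from the L2 level, which is the kind of observation a DRAM-only Roofline would not expose.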

Authors:

Yang, Charlene ^[1]; Kurth, Thorsten ^[1]; Williams, Samuel ^[2]

1. National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory, Berkeley, California
2. Computational Research Division (CRD), Lawrence Berkeley National Laboratory, Berkeley, California

Publication Date: 2019-11-12

Sponsoring Org.: USDOE

OSTI Identifier: 1574050

Grant/Contract Number: AC02-05CH11231

Resource Type: Publisher's Accepted Manuscript

Journal Name: Concurrency and Computation. Practice and Experience

Additional Journal Information: Journal Name: Concurrency and Computation. Practice and Experience; Journal Volume: 32; Journal Issue: 20; Journal ID: ISSN 1532-0626

Publisher: Wiley Blackwell (John Wiley & Sons)

Country of Publication: United Kingdom

Language: English

Citation Formats


Yang, Charlene, Kurth, Thorsten, and Williams, Samuel. Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system. United Kingdom: N. p., 2019. Web. doi:10.1002/cpe.5547.

Yang, Charlene, Kurth, Thorsten, & Williams, Samuel. Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system. United Kingdom. https://doi.org/10.1002/cpe.5547

Yang, Charlene, Kurth, Thorsten, and Williams, Samuel. 2019. "Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system". United Kingdom. https://doi.org/10.1002/cpe.5547.

                    
@article{osti_1574050,
  title        = {Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system},
  author       = {Yang, Charlene and Kurth, Thorsten and Williams, Samuel},
  abstractNote = {The Roofline performance model provides an intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In preparation for the next‐generation supercomputer Perlmutter at NERSC, this paper presents a methodology to construct a hierarchical Roofline on NVIDIA GPUs and extends it to support reduced precision and Tensor Cores. The hierarchical Roofline incorporates L1, L2, device memory, and system memory bandwidths into one single figure, and it offers more profound insights into performance analysis than the traditional DRAM‐only Roofline. We use our Roofline methodology to analyze three proxy applications: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow. In doing so, we demonstrate the ability of our methodology to readily understand various aspects of performance and performance bottlenecks on NVIDIA GPUs and motivate code optimizations.},
  doi          = {10.1002/cpe.5547},
  journal      = {Concurrency and Computation. Practice and Experience},
  number       = 20,
  volume       = 32,
  place        = {United Kingdom},
  year         = {2019},
  month        = {11}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (Publisher)

Publisher's Version of Record
https://doi.org/10.1002/cpe.5547

Citation Metrics:

Cited by: 29 works

Citation information provided by Web of Science

Works referenced in this record:

An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability
conference, November 2018

Yang, Charlene; Gayatri, Rahulkumar; Kurth, Thorsten
2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)
DOI: 10.1109/P3HPC.2018.00005

Deep Residual Learning for Image Recognition
conference, June 2016

He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/CVPR.2016.90

Evaluating and Optimizing the NERSC Workload on Knights Landing
conference, November 2016

Barnes, Taylor; Cook, Brandon; Deslippe, Jack
2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS.2016.010

Roofline: an insightful visual performance model for multicore architectures
journal, April 2009

Williams, Samuel; Waterman, Andrew; Patterson, David
Communications of the ACM, Vol. 52, Issue 4
DOI: 10.1145/1498765.1498785

Electron self-energy calculation using a general multi-pole approximation
journal, April 2003

Soininen, J. A.; Rehr, J. J.; Shirley, Eric L.
Journal of Physics: Condensed Matter, Vol. 15, Issue 17
DOI: 10.1088/0953-8984/15/17/312

Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis
journal, August 2019

Ben-Nun, Tal; Hoefler, Torsten
ACM Computing Surveys, Vol. 52, Issue 4
DOI: 10.1145/3320060

Similar Records in DOE PAGES and OSTI.GOV collections:

Instruction Roofline: An insightful visual performance model for GPUs

Conference Ding, N ; Awan, M ; Williams, S

The Roofline performance model provides an intuitive approach to identify performance bottlenecks and guide performance optimization. However, the classic FLOP-centric approach is inappropriate for the emerging applications that perform more integer operations than floating point operations. In this article, we reintroduce our Instruction Roofline Model on NVIDIA GPUs and expand our evaluation of it. The Instruction Roofline incorporates instructions and memory transactions across all memory hierarchies together, and provides more performance insights than the FLOP-oriented Roofline Model, that is, instruction throughput, stride memory access patterns, bank conflicts, and thread predication. We use our Instruction Roofline methodology to analyze eight proxy …
https://doi.org/10.1002/cpe.6591

Full Text Available

Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

Conference Koskela, Tuomas S. ; Lobet, Mathieu ; Deslippe, Jack ; ...

In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is a …

Full Text Available

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

Conference Wang, Linnan ; Ye, Jinmian ; Zhao, Yiyang ; ...

Going deeper and wider in neural architectures improves their accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures, or nontrivially dissect a network across multiGPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable the network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, Liveness Analysis, Unified Tensor Pool, and Cost-Aware Recomputation; together they effectively reduce the network-wide peak memory usage down …
https://doi.org/10.1145/3178487.3178491

RACB: Resource Aware Cache Bypass on GPUs

Conference Dai, Hongwen ; Kartsaklis, Christos ; Li, Chao ; ... - 2014 International Symposium on Computer Architecture and High Performance Computing Workshop; 22-24 Oct. 2014; Paris, France

Caches are universally used in computing systems to hide long off-chip memory access latencies. Unlike CPUs, massive threads running simultaneously on GPUs bring a tremendous pressure on memory hierarchy. As a result, the limitation of cache resources becomes a bottleneck for a GPU to exploit thread-level parallelism (TLP) and memory-level parallelism (MLP) and achieve high performance. In this paper, we propose a mechanism to bypass L1D and L2 cache based on the availability of cache resources. Our proposed mechanism is based on the observation that a huge number of stalls coming from limited cache resources prohibit GPUs from providing a …
https://doi.org/10.1109/SBAC-PADW.2014.14

Distributed memory, GPU accelerated Fock construction for hybrid, Gaussian basis density functional theory

Journal Article Williams-Young, David B. ; Asadchev, Andrey ; Popovici, Doru Thom ; ... - Journal of Chemical Physics

With the growing reliance of modern supercomputers on accelerator-based architecture such as graphics processing units (GPUs), the development and optimization of electronic structure methods to exploit these massively parallel resources has become a recent priority. While significant strides have been made in the development of GPU accelerated, distributed memory algorithms for many modern electronic structure methods, the primary focus of GPU development for Gaussian basis atomic orbital methods has been for shared memory systems with only a handful of examples pursuing massive parallelism. In the present work, we present a set of distributed memory algorithms for the evaluation of the Coulomb …
https://doi.org/10.1063/5.0151070
