Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel

doi:10.1016/j.jpdc.2014.07.003

Abstract

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
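To make the idea of a polymorphic memory access pattern concrete, the following is a minimal sketch in plain C++. It does not use the Kokkos API; the names `RowMajor`, `ColMajor`, `Array2D`, and `fill` are hypothetical and introduced only for illustration. The kernel is written once against `a(i, j)`, and the layout policy chosen at compile time decides which elements end up adjacent in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies (illustrative names, not the Kokkos API).
// Each one maps a 2-D index (i, j) to an offset in flat storage.
struct RowMajor {
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*rows*/, std::size_t cols) {
        return i * cols + j;  // elements of one row are contiguous
    }
};

struct ColMajor {
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t rows, std::size_t /*cols*/) {
        return i + j * rows;  // elements of one column are contiguous
    }
};

// A 2-D array whose memory layout is a compile-time template parameter.
template <class Layout>
class Array2D {
public:
    Array2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[Layout::offset(i, j, rows_, cols_)];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    const std::vector<double>& storage() const { return data_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};

// The "kernel" is written once and never mentions the layout.
template <class Layout>
void fill(Array2D<Layout>& a) {
    for (std::size_t i = 0; i < a.rows(); ++i) {
        for (std::size_t j = 0; j < a.cols(); ++j) {
            a(i, j) = 10.0 * static_cast<double>(i) + static_cast<double>(j);
        }
    }
}
```

Under this sketch, `Array2D<RowMajor>` and `Array2D<ColMajor>` return the same value for every `(i, j)` but place different elements next to each other in `storage()`. That is the kind of choice the abstract refers to: a layout that suits one device's access constraints can be swapped for another without touching the kernel code.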

Authors:

Carter Edwards, H. [1]; Trott, Christian R. [1]; Sunderland, Daniel [1]

[1] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Publication Date: July 22, 2014

Research Org.: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Org.: USDOE National Nuclear Security Administration (NNSA)

OSTI Identifier: 1106586

Alternate Identifier(s): OSTI ID: 1556442

Report Number(s): SAND-2013-5603J; Journal ID: ISSN 0743-7315; PII: S0743731514001257

Grant/Contract Number: AC04-94AL85000

Resource Type: Accepted Manuscript

Journal Name: Journal of Parallel and Distributed Computing

Additional Journal Information: Journal Volume: 74; Journal Issue: 12; Journal ID: ISSN 0743-7315

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; parallel computing; thread parallelism; manycore; GPU; performance portability; multidimensional array; mini-application

Citation Formats


Carter Edwards, H., Trott, Christian R., and Sunderland, Daniel. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. United States: N. p., 2014. Web. doi:10.1016/j.jpdc.2014.07.003.

Carter Edwards, H., Trott, Christian R., & Sunderland, Daniel. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. United States. https://doi.org/10.1016/j.jpdc.2014.07.003

Carter Edwards, H., Trott, Christian R., and Sunderland, Daniel. 2014. "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns". United States. https://doi.org/10.1016/j.jpdc.2014.07.003. https://www.osti.gov/servlets/purl/1106586.
                    
@article{osti_1106586,

  title        = {Kokkos: Enabling manycore performance portability through polymorphic memory access patterns},

  author       = {Carter Edwards, H. and Trott, Christian R. and Sunderland, Daniel},

  abstractNote = {The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.},

  doi          = {10.1016/j.jpdc.2014.07.003},

  journal      = {Journal of Parallel and Distributed Computing},

  number       = 12,

  volume       = 74,

  place        = {United States},

  year         = {2014},

  month        = {jul}

}

Publisher's Version of Record

https://doi.org/10.1016/j.jpdc.2014.07.003

Citation Metrics:

Cited by: 449 works

Citation information provided by Web of Science

Works referenced in this record:

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
journal, November 2010

Augonnet, Cédric; Thibault, Samuel; Namyst, Raymond
Concurrency and Computation: Practice and Experience, Vol. 23, Issue 2
DOI: 10.1002/cpe.1631

hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications
conference, February 2010

Broquedis, François; Clet-Ortega, Jerome; Moreaud, Stephanie
18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010)
DOI: 10.1109/PDP.2010.67

A class of parallel tiled linear algebra algorithms for multicore architectures
journal, January 2009

Buttari, Alfredo; Langou, Julien; Kurzak, Jakub
Parallel Computing, Vol. 35, Issue 1
DOI: 10.1016/j.parco.2008.10.002

OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES
journal, June 2011

Duran, Alejandro; Ayguadé, Eduard; Badia, Rosa M.
Parallel Processing Letters, Vol. 21, Issue 02
DOI: 10.1142/S0129626411000151

Kokkos Array performance-portable manycore programming model
conference, January 2012

Edwards, H. Carter; Sunderland, Daniel
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '12
DOI: 10.1145/2141702.2141703

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
conference, May 2013

Gautier, Thierry; Lima, Joao V. F.; Maillard, Nicolas
2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
DOI: 10.1109/IPDPS.2013.66

High Performance RDMA-Based MPI Implementation over InfiniBand
journal, June 2004

Liu, Jiuxing; Wu, Jiesheng; Panda, Dhabaleswar K.
International Journal of Parallel Programming, Vol. 32, Issue 3
DOI: 10.1023/B:IJPP.0000029272.69895.c1

Loci: a rule-based framework for parallel multi-disciplinary simulation synthesis
journal, May 2005

Luke, Edward A.; George, Thomas
Journal of Functional Programming, Vol. 15, Issue 3
DOI: 10.1017/S0956796805005514

Hierarchical Task-Based Programming With StarSs
journal, June 2009

Planas, Judit; Badia, Rosa M.; Ayguadé, Eduard
The International Journal of High Performance Computing Applications, Vol. 23, Issue 3
DOI: 10.1177/1094342009106195

Fast Parallel Algorithms for Short-Range Molecular Dynamics
journal, March 1995

Plimpton, Steve
Journal of Computational Physics, Vol. 117, Issue 1
DOI: 10.1006/jcph.1995.1039

Works referencing / citing this record:

Thrust2D: A new design abstraction framework for structured grid class of algorithms
journal, July 2018

Sarkar, Santonu; George, Ajai V.; Manoj, Sankar
Concurrency and Computation: Practice and Experience, Vol. 30, Issue 19
DOI: 10.1002/cpe.4740

Classical molecular dynamics on graphics processing unit architectures
journal, August 2019

Jász, Ádám; Rák, Ádám; Ladjánszki, István
WIREs Computational Molecular Science, Vol. 10, Issue 2
DOI: 10.1002/wcms.1444

High Order Anchoring and Reinitialization of Level Set Function for Simulating Interface Motion
journal, November 2019

Ramanuj, Vimal; Sankaran, Ramanan
Journal of Scientific Computing, Vol. 81, Issue 3
DOI: 10.1007/s10915-019-01076-0

Direct simulation Monte Carlo on petaflop supercomputers and beyond
journal, August 2019

Plimpton, S. J.; Moore, S. G.; Borner, A.
Physics of Fluids, Vol. 31, Issue 8
DOI: 10.1063/1.5108534

Large Eddy Simulation of a Supercritical Fuel Jet in Cross Flow using GPU-Acceleration
conference, January 2016

Gottiparthi, Kalyana C.; Sankaran, Ramanan; Ruiz, Anthony M.
54th AIAA Aerospace Sciences Meeting
DOI: 10.2514/6.2016-1939

Evaluating Support for OpenMP Offload Features
conference, January 2018

Diaz, Jose Monsalve; Pophale, Swaroop; Friedline, Kyle
Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18
DOI: 10.1145/3229710.3229717

Compiler Optimizations for Parallel Programs
book, November 2019

Doerfert, Johannes; Finkel, Hal; Hall, Mary
Languages and Compilers for Parallel Computing: 31st International Workshop, LCPC 2018, Salt Lake City, UT, USA, October 9–11, 2018, Revised Selected Papers, p. 112-119
DOI: 10.1007/978-3-030-34627-0_9

Modeling of Dynamic Rock–Fluid Interaction Using Coupled 3-D Discrete Element and Lattice Boltzmann Methods
journal, May 2019

Gardner, Michael; Sitar, Nicholas
Rock Mechanics and Rock Engineering, Vol. 52, Issue 12
DOI: 10.1007/s00603-019-01857-x

A large-scale study of MPI usage in open-source HPC applications
conference, November 2019

Laguna, Ignacio; Marshall, Ryan; Mohror, Kathryn
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/3295500.3356176

A High-performance and Portable All-Mach Regime Flow Solver Code with Well-balanced Gravity. Application to Compressible Convection
journal, April 2019

Padioleau, Thomas; Tremblin, Pascal; Audit, Edouard
The Astrophysical Journal, Vol. 875, Issue 2
DOI: 10.3847/1538-4357/ab0f2c

Preparing sparse solvers for exascale computing
journal, January 2020

Anzt, Hartwig; Boman, Erik; Falgout, Rob
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166
DOI: 10.1098/rsta.2019.0053

Performance Portability of a Multiphysics Finite Element Code
conference, June 2018

Tanis, Craig; Sreenivas, Kidambi; Newman, James C.
2018 Aviation Technology, Integration, and Operations Conference
DOI: 10.2514/6.2018-2890

A Study on the Performance Portability of the Finite Element Assembly Process Within the Albany Land Ice Solver
book, February 2020

Watkins, Jerry; Tezaur, Irina; Demeshko, Irina
Numerical Methods for Flows: FEF 2017 Selected Contributions, p. 177-188
DOI: 10.1007/978-3-030-30705-9_16

Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems
journal, January 2020

Czarnul, Paweł; Proficz, Jerzy; Drypczewski, Krzysztof
Scientific Programming, Vol. 2020
DOI: 10.1155/2020/4176794

HOMMEXX 1.0: a performance-portable atmospheric dynamical core for the Energy Exascale Earth System Model
journal, January 2019

Bertagna, Luca; Deakin, Michael; Guba, Oksana
Geoscientific Model Development, Vol. 12, Issue 4
DOI: 10.5194/gmd-12-1423-2019

Assessing the performance portability of modern parallel programming models using TeaLeaf
journal, March 2017

Martineau, Matthew; McIntosh-Smith, Simon; Gaudin, Wayne
Concurrency and Computation: Practice and Experience, Vol. 29, Issue 15
DOI: 10.1002/cpe.4117

Status and future perspectives for lattice gauge theory calculations to the exascale and beyond
journal, November 2019

Joó, Bálint; Jung, Chulwoo; Christ, Norman H.
The European Physical Journal A, Vol. 55, Issue 11
DOI: 10.1140/epja/i2019-12919-7

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication
journal, January 2019

Liu, Junhong; He, Xin; Liu, Weifeng
International Journal of Parallel Programming, Vol. 47, Issue 3
DOI: 10.1007/s10766-018-0604-8

InKS: a programming model to decouple algorithm from optimization in HPC codes
journal, July 2019

Ejjaaouani, Ksander; Aumage, Olivier; Bigot, Julien
The Journal of Supercomputing, Vol. 76, Issue 6
DOI: 10.1007/s11227-019-02950-2

Evaluation of performance portability frameworks for the implementation of a particle‐in‐cell code
journal, December 2019

Artigues, Victor; Kormann, Katharina; Rampp, Markus
Concurrency and Computation: Practice and Experience, Vol. 32, Issue 11
DOI: 10.1002/cpe.5640

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
book, October 2016

Zenker, Erik; Widera, René; Huebl, Axel
High Performance Computing
DOI: 10.1007/978-3-319-46079-6_21

Tiling-Based Programming Model for Structured Grids on GPU Clusters
conference, January 2020

Bastem, Burak; Unat, Didem
HPCAsia2020: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
DOI: 10.1145/3368474.3368485

MPAS-Albany Land Ice (MALI): a variable-resolution ice sheet model for Earth system modeling using Voronoi grids
journal, January 2018

Hoffman, Matthew J.; Perego, Mauro; Price, Stephen F.
Geoscientific Model Development, Vol. 11, Issue 9
DOI: 10.5194/gmd-11-3747-2018

Performance of preconditioned iterative solvers in MFiX–Trilinos for fluidized beds
journal, May 2018

Kotteda, V. M. Krushnarao; Kumar, Vinod; Spotz, William
The Journal of Supercomputing, Vol. 74, Issue 8
DOI: 10.1007/s11227-018-2415-5

Highly scalable discrete-particle simulations with novel coarse-graining: accessing the microscale
text, January 2018

Mattox, Timothy I.; Larentzos, James P.; Moore, Stan G.
Taylor & Francis
DOI: 10.6084/m9.figshare.6265274

STEEL-RT: combining single task–single executor model and expanded scheduling to ease heterogeneity exploitation
journal, August 2019

Rey, Antón; Igual, Francisco D.; Prieto-Matías, Manuel
The Journal of Supercomputing, Vol. 76, Issue 6
DOI: 10.1007/s11227-019-02955-x

Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0)
journal, January 2018

Porter, Andrew R.; Appleyard, Jeremy; Ashworth, Mike
Geoscientific Model Development, Vol. 11, Issue 8
DOI: 10.5194/gmd-11-3447-2018

Early Performance Evaluation of the Hybrid Cluster with Torus Interconnect Aimed at Molecular-Dynamics Simulations
book, January 2018

Stegailov, Vladimir; Agarkov, Alexander; Biryukov, Sergey
Parallel Processing and Applied Mathematics
DOI: 10.1007/978-3-319-78024-5_29

Highly scalable discrete-particle simulations with novel coarse-graining: accessing the microscale
journal, May 2018

Mattox, Timothy I.; Larentzos, James P.; Moore, Stan G.
Molecular Physics, Vol. 116, Issue 15-16
DOI: 10.1080/00268976.2018.1471532

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
text, January 2016

Zenker, Erik; Widera, René; Huebl, Axel
Deutsches Elektronen-Synchrotron, DESY, Hamburg
DOI: 10.3204/pubdb-2017-03016

Highly scalable discrete-particle simulations with novel coarse-graining: accessing the microscale
text, January 2018

Mattox, Timothy I.; Larentzos, James P.; Moore, Stan G.
Taylor & Francis
DOI: 10.6084/m9.figshare.6265274.v1

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
text, January 2016

Zenker, Erik; Widera, René; Huebl, Axel
arXiv
DOI: 10.48550/arxiv.1606.02862

Similar Records in DOE PAGES and OSTI.GOV collections:

Manycore Performance-Portability: Kokkos Multidimensional Array Library
Journal Article: Edwards, H. Carter; Sunderland, Daniel; Porter, Vicki; ... - Scientific Programming
Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel kernels and (3) multidimensional arrays. Kernel execution ...
Cited by 29
https://doi.org/10.1155/2012/917630

Case Study of Using Kokkos and SYCL as Performance-Portable Frameworks for Milc-Dslash Benchmark on NVIDIA, AMD and Intel GPUs
Conference: Dufek, Amanda S; Gayatri, Rahulkumar; Mehta, Neil A; ...
Six of the top ten supercomputers in the TOP500 list from June 2021 rely on NVIDIA GPUs to achieve their peak compute bandwidth. With the announcement of Aurora, Frontier, and El Capitan, Intel and AMD have also entered the domain of providing GPUs for scientific computing. A consequence of the increased diversity in the GPU landscape is the emergence of portable programming models such as Kokkos, SYCL, OpenCL, and OpenMP, which allow application developers to maintain a single-source code across a diverse range of hardware architectures. While the portable frameworks try to optimize the compute resource usage on a given ...
https://doi.org/10.1109/P3HPC54578.2021.00009

ASC-ATDM Performance Portability Requirements for 2015-2019
Technical Report: Edwards, Harold C.; Trott, Christian Robert
This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a., Kokkos) project for 2015-2019. The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread-parallel programming model and standard C++ library-based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel patterns including parallel-for, parallel ...
https://doi.org/10.2172/1177389
Full Text Available

The Kokkos Ecosystem [Brief]
Technical Report: Trott, Christian Robert
In 2016/2017, the field of High-Performance Computing (HPC) entered a new era driven by fundamental physics challenges to produce ever more energy and cost-efficient processors. Since the convergence on the Message-Passing Interface (MPI) standard in the mid-1990s, application developers enjoyed a seemingly static view of the underlying machine: that of a distributed collection of homogeneous nodes executing in collaboration. However, after almost two decades of dominance, the sole use of MPI to derive parallelism acted as a limiter to improved future performance. While MPI is widely expected to continue to function as the basic mechanism for communication between compute ...
https://doi.org/10.2172/1656942
Full Text Available

Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado
Journal Article: Phipps, Eric T.; Pawlowski, Roger P.; Trott, Christian Robert - ACM Transactions on Mathematical Software
Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that ...
https://doi.org/10.1145/3560262
Full Text Available
