DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

Abstract

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.

Authors:
Carter Edwards, H. [1]; Trott, Christian R. [1]; Sunderland, Daniel [1]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
2014-07-22
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1106586
Alternate Identifier(s):
OSTI ID: 1556442
Report Number(s):
SAND-2013-5603J
Journal ID: ISSN 0743-7315; PII: S0743731514001257
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Parallel and Distributed Computing
Additional Journal Information:
Journal Volume: 74; Journal Issue: 12; Journal ID: ISSN 0743-7315
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; parallel computing; thread parallelism; manycore; GPU; performance portability; multidimensional array; mini-application

Citation Formats

Carter Edwards, H., Trott, Christian R., and Sunderland, Daniel. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. United States: N. p., 2014. Web. doi:10.1016/j.jpdc.2014.07.003.
Carter Edwards, H., Trott, Christian R., & Sunderland, Daniel. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. United States. https://doi.org/10.1016/j.jpdc.2014.07.003
Carter Edwards, H., Trott, Christian R., and Sunderland, Daniel. 2014. "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns". United States. https://doi.org/10.1016/j.jpdc.2014.07.003. https://www.osti.gov/servlets/purl/1106586.
@article{osti_1106586,
title = {Kokkos: Enabling manycore performance portability through polymorphic memory access patterns},
author = {Carter Edwards, H. and Trott, Christian R. and Sunderland, Daniel},
abstractNote = {The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.},
doi = {10.1016/j.jpdc.2014.07.003},
journal = {Journal of Parallel and Distributed Computing},
number = 12,
volume = 74,
place = {United States},
year = {2014},
month = jul
}

Citation Metrics:
Cited by: 449 works
Citation information provided by
Web of Science
