Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

Edwards, H. Carter; Trott, Christian R.; Sunderland, Daniel

doi:10.1016/j.jpdc.2014.07.003

Journal Article · July 22, 2014 · Journal of Parallel and Distributed Computing

DOI: https://doi.org/10.1016/j.jpdc.2014.07.003 · OSTI ID: 1106586

Edwards, H. Carter ^[1]; Trott, Christian R. ^[1]; Sunderland, Daniel ^[1]

Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and a diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly fine levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models (e.g., OpenMP, OpenACC, OpenCL) address manycore parallelism but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures and to address a growing list of applications and domain libraries.
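The idea of a polymorphic memory access pattern can be illustrated with a minimal, self-contained C++ sketch. This is not the Kokkos API: the names `RowMajor`, `ColMajor`, `Array2D`, and `fill` are hypothetical and chosen only for illustration. The sketch assumes a two-dimensional array whose index-to-memory mapping is a compile-time policy, so that a kernel written once against `(i, j)` indexing can run over either storage order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies (illustrative only, not Kokkos types).
// Each maps an index (i, j) of an n0 x n1 array to a flat storage offset.
struct RowMajor {  // j varies fastest in memory
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};
struct ColMajor {  // i varies fastest in memory
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;
    }
};

// A 2-D array whose memory layout is selected by a template parameter.
template <class T, class Layout>
class Array2D {
public:
    Array2D(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1) {}

    // User code always indexes with (i, j); the layout decides the offset.
    T& operator()(std::size_t i, std::size_t j) {
        assert(i < n0_ && j < n1_);
        return data_[Layout::offset(i, j, n0_, n1_)];
    }

    std::size_t extent0() const { return n0_; }
    std::size_t extent1() const { return n1_; }
    const T* data() const { return data_.data(); }  // flat storage view

private:
    std::size_t n0_, n1_;
    std::vector<T> data_;
};

// A kernel written once; it does not depend on the storage order.
template <class T, class Layout>
void fill(Array2D<T, Layout>& a) {
    for (std::size_t i = 0; i < a.extent0(); ++i)
        for (std::size_t j = 0; j < a.extent1(); ++j)
            a(i, j) = static_cast<T>(10 * i + j);
}
```

Calling `fill` on an `Array2D<int, RowMajor>` and on an `Array2D<int, ColMajor>` yields the same logical contents through `(i, j)` while the flat storage order differs. Changing the layout parameter changes the access pattern without touching the kernel, which is the property the abstract attributes to Kokkos; the library's real interface and its parallel dispatch are described in the paper and are not reproduced here.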

Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: AC04-94AL85000

OSTI ID: 1106586

Alternate ID(s): OSTI ID: 1556442

Report Number(s): SAND-2013-5603J; PII: S0743731514001257

Journal Information: Journal of Parallel and Distributed Computing, Vol. 74, Issue 12; ISSN 0743-7315

Publisher: Elsevier

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 449 works

Citation information provided by Web of Science

References (10)

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures Augonnet, Cédric; Thibault, Samuel; Namyst, Raymond Concurrency and Computation: Practice and Experience, Vol. 23, Issue 2 https://doi.org/10.1002/cpe.1631	journal	November 2010
hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications Broquedis, François; Clet-Ortega, Jerome; Moreaud, Stephanie 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010), 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing https://doi.org/10.1109/PDP.2010.67	conference	February 2010
A class of parallel tiled linear algebra algorithms for multicore architectures Buttari, Alfredo; Langou, Julien; Kurzak, Jakub Parallel Computing, Vol. 35, Issue 1 https://doi.org/10.1016/j.parco.2008.10.002	journal	January 2009
OmpSs: A Proposal for Programming Heterogeneous Multi-Core Architectures Duran, Alejandro; Ayguadé, Eduard; Badia, Rosa M. Parallel Processing Letters, Vol. 21, Issue 02 https://doi.org/10.1142/S0129626411000151	journal	June 2011
Kokkos Array performance-portable manycore programming model Edwards, H. Carter; Sunderland, Daniel Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '12 https://doi.org/10.1145/2141702.2141703	conference	January 2012
XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures Gautier, Thierry; Lima, Joao V. F.; Maillard, Nicolas 2013 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on Parallel and Distributed Processing https://doi.org/10.1109/IPDPS.2013.66	conference	May 2013
High Performance RDMA-Based MPI Implementation over InfiniBand Liu, Jiuxing; Wu, Jiesheng; Panda, Dhabaleswar K. International Journal of Parallel Programming, Vol. 32, Issue 3 https://doi.org/10.1023/B:IJPP.0000029272.69895.c1	journal	June 2004
Loci: a rule-based framework for parallel multi-disciplinary simulation synthesis Luke, Edward A.; George, Thomas Journal of Functional Programming, Vol. 15, Issue 3 https://doi.org/10.1017/S0956796805005514	journal	May 2005
Hierarchical Task-Based Programming With StarSs Planas, Judit; Badia, Rosa M.; Ayguadé, Eduard The International Journal of High Performance Computing Applications, Vol. 23, Issue 3 https://doi.org/10.1177/1094342009106195	journal	June 2009
Fast Parallel Algorithms for Short-Range Molecular Dynamics Plimpton, Steve Journal of Computational Physics, Vol. 117, Issue 1 https://doi.org/10.1006/jcph.1995.1039	journal	March 1995

Cited By (32)

Thrust2D: A new design abstraction framework for structured grid class of algorithms Sarkar, Santonu; George, Ajai V.; Manoj, Sankar Concurrency and Computation: Practice and Experience, Vol. 30, Issue 19 https://doi.org/10.1002/cpe.4740	journal	July 2018
Classical molecular dynamics on graphics processing unit architectures Jász, Ádám; Rák, Ádám; Ladjánszki, István WIREs Computational Molecular Science, Vol. 10, Issue 2 https://doi.org/10.1002/wcms.1444	journal	August 2019
High Order Anchoring and Reinitialization of Level Set Function for Simulating Interface Motion Ramanuj, Vimal; Sankaran, Ramanan Journal of Scientific Computing, Vol. 81, Issue 3 https://doi.org/10.1007/s10915-019-01076-0	journal	November 2019
Direct simulation Monte Carlo on petaflop supercomputers and beyond Plimpton, S. J.; Moore, S. G.; Borner, A. Physics of Fluids, Vol. 31, Issue 8 https://doi.org/10.1063/1.5108534	journal	August 2019
Large Eddy Simulation of a Supercritical Fuel Jet in Cross Flow using GPU-Acceleration Gottiparthi, Kalyana C.; Sankaran, Ramanan; Ruiz, Anthony M. 54th AIAA Aerospace Sciences Meeting https://doi.org/10.2514/6.2016-1939	conference	January 2016
Evaluating Support for OpenMP Offload Features Diaz, Jose Monsalve; Pophale, Swaroop; Friedline, Kyle Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18 https://doi.org/10.1145/3229710.3229717	conference	January 2018
Compiler Optimizations for Parallel Programs Doerfert, Johannes; Finkel, Hal; Hall, Mary Languages and Compilers for Parallel Computing: 31st International Workshop, LCPC 2018, Salt Lake City, UT, USA, October 9–11, 2018, Revised Selected Papers, p. 112-119 https://doi.org/10.1007/978-3-030-34627-0_9	book	November 2019
Modeling of Dynamic Rock–Fluid Interaction Using Coupled 3-D Discrete Element and Lattice Boltzmann Methods Gardner, Michael; Sitar, Nicholas Rock Mechanics and Rock Engineering, Vol. 52, Issue 12 https://doi.org/10.1007/s00603-019-01857-x	journal	May 2019
A large-scale study of MPI usage in open-source HPC applications Laguna, Ignacio; Marshall, Ryan; Mohror, Kathryn SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/3295500.3356176	conference	November 2019
A High-performance and Portable All-Mach Regime Flow Solver Code with Well-balanced Gravity. Application to Compressible Convection Padioleau, Thomas; Tremblin, Pascal; Audit, Edouard The Astrophysical Journal, Vol. 875, Issue 2 https://doi.org/10.3847/1538-4357/ab0f2c	journal	April 2019
Preparing sparse solvers for exascale computing Anzt, Hartwig; Boman, Erik; Falgout, Rob Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166 https://doi.org/10.1098/rsta.2019.0053	journal	January 2020
Performance Portability of a Multiphysics Finite Element Code Tanis, Craig; Sreenivas, Kidambi; Newman, James C. 2018 Aviation Technology, Integration, and Operations Conference https://doi.org/10.2514/6.2018-2890	conference	June 2018
A Study on the Performance Portability of the Finite Element Assembly Process Within the Albany Land Ice Solver Watkins, Jerry; Tezaur, Irina; Demeshko, Irina Numerical Methods for Flows: FEF 2017 Selected Contributions, p. 177-188 https://doi.org/10.1007/978-3-030-30705-9_16	book	February 2020
Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems Czarnul, Paweł; Proficz, Jerzy; Drypczewski, Krzysztof Scientific Programming, Vol. 2020 https://doi.org/10.1155/2020/4176794	journal	January 2020
HOMMEXX 1.0: a performance-portable atmospheric dynamical core for the Energy Exascale Earth System Model Bertagna, Luca; Deakin, Michael; Guba, Oksana Geoscientific Model Development, Vol. 12, Issue 4 https://doi.org/10.5194/gmd-12-1423-2019	journal	January 2019
Assessing the performance portability of modern parallel programming models using TeaLeaf Martineau, Matthew; McIntosh-Smith, Simon; Gaudin, Wayne Concurrency and Computation: Practice and Experience, Vol. 29, Issue 15 https://doi.org/10.1002/cpe.4117	journal	March 2017
Status and future perspectives for lattice gauge theory calculations to the exascale and beyond Joó, Bálint; Jung, Chulwoo; Christ, Norman H. The European Physical Journal A, Vol. 55, Issue 11 https://doi.org/10.1140/epja/i2019-12919-7	journal	November 2019
Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication Liu, Junhong; He, Xin; Liu, Weifeng International Journal of Parallel Programming, Vol. 47, Issue 3 https://doi.org/10.1007/s10766-018-0604-8	journal	January 2019
InKS: a programming model to decouple algorithm from optimization in HPC codes Ejjaaouani, Ksander; Aumage, Olivier; Bigot, Julien The Journal of Supercomputing, Vol. 76, Issue 6 https://doi.org/10.1007/s11227-019-02950-2	journal	July 2019
Evaluation of performance portability frameworks for the implementation of a particle‐in‐cell code Artigues, Victor; Kormann, Katharina; Rampp, Markus Concurrency and Computation: Practice and Experience, Vol. 32, Issue 11 https://doi.org/10.1002/cpe.5640	journal	December 2019
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond Zenker, Erik; Widera, René; Huebl, Axel High Performance Computing https://doi.org/10.1007/978-3-319-46079-6_21	book	October 2016
Tiling-Based Programming Model for Structured Grids on GPU Clusters Bastem, Burak; Unat, Didem HPCAsia2020: International Conference on High Performance Computing in Asia-Pacific Region, Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region https://doi.org/10.1145/3368474.3368485	conference	January 2020
MPAS-Albany Land Ice (MALI): a variable-resolution ice sheet model for Earth system modeling using Voronoi grids Hoffman, Matthew J.; Perego, Mauro; Price, Stephen F. Geoscientific Model Development, Vol. 11, Issue 9 https://doi.org/10.5194/gmd-11-3747-2018	journal	January 2018
Performance of preconditioned iterative solvers in MFiX–Trilinos for fluidized beds Kotteda, V. M. Krushnarao; Kumar, Vinod; Spotz, William The Journal of Supercomputing, Vol. 74, Issue 8 https://doi.org/10.1007/s11227-018-2415-5	journal	May 2018
Highly scalable discrete-particle simulations with novel coarse-graining: accessing the microscale Mattox, Timothy I.; Larentzos, James P.; Moore, Stan G. Taylor & Francis https://doi.org/10.6084/m9.figshare.6265274	text	January 2018
STEEL-RT: combining single task–single executor model and expanded scheduling to ease heterogeneity exploitation Rey, Antón; Igual, Francisco D.; Prieto-Matías, Manuel The Journal of Supercomputing, Vol. 76, Issue 6 https://doi.org/10.1007/s11227-019-02955-x	journal	August 2019
Portable multi- and many-core performance for finite-difference or finite-element codes – application to the free-surface component of NEMO (NEMOLite2D 1.0) Porter, Andrew R.; Appleyard, Jeremy; Ashworth, Mike Geoscientific Model Development, Vol. 11, Issue 8 https://doi.org/10.5194/gmd-11-3447-2018	journal	January 2018
Early Performance Evaluation of the Hybrid Cluster with Torus Interconnect Aimed at Molecular-Dynamics Simulations Stegailov, Vladimir; Agarkov, Alexander; Biryukov, Sergey Parallel Processing and Applied Mathematics https://doi.org/10.1007/978-3-319-78024-5_29	book	January 2018
Highly scalable discrete-particle simulations with novel coarse-graining: accessing the microscale Mattox, Timothy I.; Larentzos, James P.; Moore, Stan G. Molecular Physics, Vol. 116, Issue 15-16 https://doi.org/10.1080/00268976.2018.1471532	journal	May 2018
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond Zenker, Erik; Widera, René; Huebl, Axel Deutsches Elektronen-Synchrotron, DESY, Hamburg https://doi.org/10.3204/pubdb-2017-03016	text	January 2016
Highly scalable discrete-particle simulations with novel coarse-graining: accessing the microscale Mattox, Timothy I.; Larentzos, James P.; Moore, Stan G. Taylor & Francis https://doi.org/10.6084/m9.figshare.6265274.v1	text	January 2018
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond Zenker, Erik; Widera, René; Huebl, Axel arXiv https://doi.org/10.48550/arxiv.1606.02862	text	January 2016

Similar Records

Manycore Performance-Portability: Kokkos Multidimensional Array Library

Journal Article · January 1, 2012 · Scientific Programming · OSTI ID: 1106586

Edwards, H. Carter; Sunderland, Daniel; Porter, Vicki; +2 more

Case Study of Using Kokkos and SYCLs Performance-Portable Frameworks for Milc-Dslash Benchmark on NVIDIA, AMD and Intel GPUs

Conference · January 1, 2021 · OSTI ID: 1106586

Dufek, Amanda S; Gayatri, Rahulkumar; Mehta, Neil A; +4 more

ASC-ATDM Performance Portability Requirements for 2015-2019

Technical Report · March 1, 2015 · OSTI ID: 1106586

Edwards, Harold C.; Trott, Christian Robert

Related Subjects

97 MATHEMATICS AND COMPUTING
parallel computing
thread parallelism
manycore
GPU
performance portability
multidimensional array
mini-application
