U.S. Department of Energy
Office of Scientific and Technical Information
  1. Singularity-EOS: Performance Portable Equations of State and Mixed Cell Closures

    We present Singularity-EOS, a new performance-portable library for equations of state and related capabilities. Singularity-EOS provides a large set of analytic equations of state, such as the Gruneisen equation of state, and tabulated equation of state data under a unified interface. It also provides support capabilities around these equations of state, such as Python wrappers, solvers for finding pressure-temperature equilibrium between multiple equations of state, and a unique modifier framework, allowing the user to transform a base equation of state, for example by shifting or scaling the specific internal energy. All capabilities are performance portable, meaning they compile and run on both CPU and GPU for a wide variety of architectures.
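The modifier pattern described above can be sketched in a few lines: a base equation of state is wrapped by modifiers that transform the specific internal energy before evaluation. The class names below (`IdealGas`, `Shifted`, `Scaled`) are illustrative only, not the actual Singularity-EOS API.

```python
class IdealGas:
    """Base EOS: P = (gamma - 1) * rho * e."""
    def __init__(self, gamma):
        self.gamma = gamma
    def pressure(self, rho, e):
        return (self.gamma - 1.0) * rho * e

class Shifted:
    """Modifier: shift the specific internal energy by a constant offset."""
    def __init__(self, base, shift):
        self.base, self.shift = base, shift
    def pressure(self, rho, e):
        return self.base.pressure(rho, e - self.shift)

class Scaled:
    """Modifier: scale the specific internal energy."""
    def __init__(self, base, scale):
        self.base, self.scale = base, scale
    def pressure(self, rho, e):
        return self.base.pressure(rho, self.scale * e)

# Modifiers compose: scale first, then shift, then evaluate the base EOS.
eos = Scaled(Shifted(IdealGas(gamma=1.4), shift=0.5), scale=2.0)
```

Because each modifier exposes the same interface as the base EOS, applications can treat a modified EOS exactly like an unmodified one.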

  2. Exploring code portability solutions for HEP with a particle tracking test code

    Traditionally, high energy physics (HEP) experiments have relied on x86 CPUs for the majority of their significant computing needs. As the field looks ahead to the next generation of experiments such as DUNE and the High-Luminosity LHC, the computing demands are expected to increase dramatically. To cope with this increase, it will be necessary to take advantage of all available computing resources, including GPUs from different vendors. A broad landscape of code portability tools—including compiler pragma-based approaches, abstraction libraries, and other tools—allow the same source code to run efficiently on multiple architectures. In this paper, we use a test code taken from a HEP tracking algorithm to compare the performance and experience of implementing different portability solutions. While in several cases portable implementations perform close to the reference code version, we find that the performance varies significantly depending on the details of the implementation. Achieving optimal performance is not easy, even for relatively simple applications such as the test codes considered in this work. Several factors can affect the performance, such as the choice of the memory layout, the memory pinning strategy, and the compiler used. The compilers and tools are being actively developed, so future developments may be critical for their deployment in HEP experiments.
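As an illustration of the memory-layout choice mentioned above, the following minimal sketch contrasts array-of-structures (AoS) and structure-of-arrays (SoA) layouts for track parameters; the field names are hypothetical and do not come from the paper's test code.

```python
# Array-of-Structures: one record per track (convenient, but fields are
# scattered in memory, which hurts coalesced GPU access)
tracks_aos = [{"x": float(i), "y": 2.0 * i, "q": 1.0} for i in range(4)]

# Structure-of-Arrays: one contiguous array per field
# (maps naturally onto coalesced loads and SIMD lanes)
tracks_soa = {
    "x": [float(i) for i in range(4)],
    "y": [2.0 * i for i in range(4)],
    "q": [1.0] * 4,
}

def propagate_aos(tracks, dt):
    # touches every field of every record, even though only x and q are used
    for t in tracks:
        t["x"] += t["q"] * dt

def propagate_soa(soa, dt):
    # reads only the two arrays it needs, in contiguous order
    soa["x"] = [x + q * dt for x, q in zip(soa["x"], soa["q"])]
```

Both layouts compute the same result; the performance difference on GPUs comes entirely from the memory-access pattern.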

  3. MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures

    MAGMA (Matrix Algebra for GPU and Multicore Architectures) is a pivotal open-source library in the landscape of GPU-enabled dense and sparse linear algebra computations. With a repertoire of approximately 750 numerical routines across four precisions, MAGMA is deeply ingrained in the DOE software stack, playing a crucial role in high-performance computing. Notable projects such as ExaConstit, HiOP, MARBL, and STRUMPACK, among others, directly harness the capabilities of MAGMA. In addition, the MAGMA development team has been acknowledged multiple times for contributing to the vendors’ numerical software stacks. Looking back over the time of the Exascale Computing Project (ECP), we highlight how MAGMA has adapted to recent changes in modern HPC systems, especially the growing gap between CPU and GPU compute capabilities, as well as the introduction of low precision arithmetic in modern GPUs. Furthermore, we also describe MAGMA’s direct impact on several ECP projects. Maintaining portable performance across NVIDIA and AMD GPUs, and with current efforts toward supporting Intel GPUs, MAGMA ensures its adaptability and relevance in the ever-evolving landscape of GPU architectures.
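One standard way low-precision arithmetic is exploited in dense linear algebra is mixed-precision iterative refinement: factor and solve in low precision, then recover full accuracy with cheap high-precision residual corrections. The sketch below is a generic illustration of that technique, not MAGMA's API; rounding to IEEE float32 stands in for a low-precision solve.

```python
import struct

def f32(x):
    """Round to IEEE float32, standing in for a low-precision compute unit."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve2(A, b):
    """Direct 2x2 solve by Cramer's rule (the 'factorization' in this toy)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def refine(A, b, iters=3):
    # initial solve in "low precision"
    x = [f32(v) for v in solve2(A, b)]
    for _ in range(iters):
        # residual r = b - A x, computed in full (double) precision
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        # low-precision correction solve, high-precision update
        d = [f32(v) for v in solve2(A, r)]
        x = [x[i] + d[i] for i in range(2)]
    return x
```

Each refinement step shrinks the error roughly by the low-precision unit roundoff, so a few iterations recover double-precision accuracy while most of the arithmetic runs at the faster, lower precision.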

  4. Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL

    SYCL is a portable programming model for heterogeneous computing, so achieving reasonable performance portability with SYCL is important. Toward the goal of better understanding and improving the performance portability of SYCL for machine-learning workloads, we have been developing benchmarks for basic operators in deep neural networks (DNNs). These operators can be offloaded to heterogeneous computing devices such as graphics processing units (GPUs) to speed up computation. In this work, we introduce the benchmarks, evaluate the performance of the operators on GPU-based systems, and describe the causes of the performance gap between the SYCL and Compute Unified Device Architecture (CUDA) kernels. We find that the causes relate to the utilization of the texture cache for read-only data, optimization of memory accesses with strength reduction, shared local-memory accesses, and register usage per thread. We hope that these benchmarking efforts will stimulate discussion and interaction within the community.
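Strength reduction, one of the optimizations named above, replaces an expensive operation computed on every iteration with a cheaper running update. The example below is a generic illustration on flat-array indexing, not code from the SYCL benchmarks.

```python
def row_sums_naive(a, rows, cols):
    """Index computed with a multiply on every access: i * cols + j."""
    sums = [0.0] * rows
    for i in range(rows):
        for j in range(cols):
            sums[i] += a[i * cols + j]
    return sums

def row_sums_reduced(a, rows, cols):
    """Strength-reduced: the per-access multiply becomes a running offset."""
    sums = [0.0] * rows
    base = 0
    for i in range(rows):
        s = 0.0
        for j in range(cols):
            s += a[base + j]
        sums[i] = s
        base += cols  # one addition per row replaces i * cols per element
    return sums
```

In a GPU kernel the same transformation reduces per-thread integer arithmetic and can lower register pressure; compilers often apply it automatically, but not always across the abstraction layers a portability framework introduces.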

  5. Early experiences on the OLCF Frontier system with AthenaPK and Parthenon–Hydro

    The Oak Ridge Leadership Computing Facility (OLCF) has been preparing the nation's first exascale system, Frontier, for production and end users. Frontier is based on HPE Cray's new EX architecture and Slingshot interconnect and features 74 cabinets of 3rd Gen AMD EPYC CPUs optimized for HPC and AI, together with AMD Instinct MI250X accelerators. As a part of this preparation, “real-world” user codes have been selected to help assess the functionality, performance, and usability of the system. This article describes early experiences using the system in collaboration with the Hamburg Observatory for two selected codes, which have since been adopted in the OLCF test harness. Experiences discussed include efforts to resolve performance variability and per-cycle slowdowns. Results are shown for a performance-portable astrophysical magnetohydrodynamics code, AthenaPK, and a mini-application stressing the core functionality of a performance-portable block-structured adaptive mesh refinement framework, Parthenon-Hydro. These results show good scaling characteristics to the full system. At the largest scale, the Parthenon-Hydro miniapp reaches a total of 1.7 × 10^13 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at ≈92% weak-scaling parallel efficiency (starting from a single node, using a second-order finite-volume method).
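The quoted efficiency figure follows from the definition of weak-scaling efficiency: per-node throughput at scale divided by single-node throughput. The single-node rate below is hypothetical, chosen only to illustrate the arithmetic; the abstract does not report it.

```python
def weak_scaling_efficiency(rate_n, nodes, rate_1):
    """Weak scaling: aggregate throughput at scale vs. nodes * single-node rate."""
    return rate_n / (nodes * rate_1)

# 1.7e13 zone-cycles/s on 9216 nodes is from the abstract; the single-node
# rate of 2.0e9 zone-cycles/s is an assumed value picked so the efficiency
# lands near the reported ~92%.
eff = weak_scaling_efficiency(1.7e13, 9216, 2.0e9)
```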

  6. Enabling portable demand flexibility control applications in virtual and real buildings

    Control applications that facilitate Demand Flexibility (DF) are difficult to deploy at scale in existing buildings. The heterogeneity of systems and non-standard naming conventions for metadata describing data points in building automation systems often lead to ad-hoc, building-specific applications. In recent years, several researchers have investigated semantic models to describe the meaning of building data. They suggest that these models can enhance the deployment of building applications, enabling data exchanges among heterogeneous sources and portability across different buildings. However, the studies in question fail to explore these capabilities in the context of controls. This paper proposes a novel semantics-driven framework for developing and deploying portable DF control applications. The design of the framework leverages an iterative design science research methodology, evolving from evidence gathered through simulation and field demonstrations. The framework aims to decouple control applications from specific buildings and control platforms, enabling these control applications to be configured semi-automatically. This allows application developers and researchers to streamline the onboarding of new applications, a process that could otherwise be time-consuming and resource-intensive. The framework has been validated for its capability to facilitate the deployment of control applications sharing the same codebase across diverse virtual and real buildings. The demonstration successfully tested two controls for load shifting and shedding in four virtual buildings using the Building Optimization Testing Framework (BOPTEST) and in one real building using the control platform VOLTTRON. Insights into the current limitations, benefits, and challenges of generalizable controls and semantic models are derived from the deployment efforts and outcomes to guide future research in this field.

  7. CSPlib: A performance portable parallel software toolkit for analyzing complex kinetic mechanisms

    Computational singular perturbation (CSP) is a method for analyzing dynamical systems. It targets the decoupling of fast and slow dynamics using an alternate linear expansion of the right-hand side of the governing equations, based on eigenanalysis of the associated Jacobian matrix. This representation facilitates diagnostic analysis, detection and control of stiffness, and the development of simplified models. For this work, we have implemented CSP in CSPlib, an open-source C++ library that uses the Kokkos parallel programming model to address portability across diverse heterogeneous computing platforms, i.e., multi/many-core CPUs and GPUs. We describe the CSPlib implementation and present its computational performance across different computing platforms using several test problems. Specifically, we test CSPlib's performance for a constant-pressure ignition reactor model on different architectures, including IBM POWER9 and Intel Xeon Skylake CPUs and the NVIDIA V100 GPU, varying the size of the chemical kinetic mechanism. As expected, the Jacobian matrix evaluation, its eigensolution, and matrix inversion are the most expensive computational tasks. GPUs, with their higher-throughput design, perform better for small matrices, where higher occupancy can be sustained, while CPUs benefit more from well-tuned, optimized linear algebra libraries such as OpenBLAS.
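The fast/slow decoupling via Jacobian eigenanalysis can be illustrated on a toy stiff linear system; the hand-rolled 2x2 eigensolver below is purely illustrative and has nothing to do with CSPlib's implementation.

```python
import math

def eig2(J):
    """Eigenvalues of a 2x2 matrix with a real spectrum, via trace/determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = math.sqrt(tr * tr / 4.0 - det)
    return tr / 2.0 - disc, tr / 2.0 + disc

# Stiff linear model dy/dt = J y with one fast mode and one slow mode.
J = [[-1000.0, 0.0], [0.0, -1.0]]
lam_fast, lam_slow = eig2(J)

# CSP-style diagnostics: mode timescales are 1/|lambda|, and their ratio
# quantifies the stiffness that the fast/slow decoupling removes.
timescales = sorted(abs(1.0 / lam) for lam in (lam_fast, lam_slow))
stiffness_ratio = timescales[1] / timescales[0]
```

Once the fast modes are identified, CSP projects them out, leaving a non-stiff slow system that can be integrated with much larger time steps.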

  8. SeeQ: A Programming Model for Portable Data-Driven Building Applications

    This paper introduces SeeQ, a programming model and abstraction framework that facilitates the development of portable data-driven building applications. Data-driven approaches can provide insights into building operations and guide decision-making to achieve operational objectives, yet configuring such applications for each building requires extensive effort and tacit knowledge. In SeeQ, we propose a portable programming model and build a software system that enables self-configuration and execution across diverse buildings. The configuration of each building is captured in a unified data model; in this paper, we work with the Brick ontology without loss of generality. SeeQ focuses on the distinction between the application logic and the configuration of an application against building-specific data inputs and systems. We test the proposed approach by configuring and deploying a diverse range of applications across five heterogeneous real-world buildings. The analysis shows the potential of SeeQ to significantly reduce the effort associated with the delivery of building analytics.
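The separation of application logic from building-specific configuration rests on semantic lookup: the application names a class of points, and the model resolves the building's actual point names. The sketch below uses a toy triple store with Brick-like class names; all identifiers are hypothetical and the real systems query RDF models, not Python lists.

```python
# Toy semantic metadata: (subject, predicate, object) triples in a
# Brick-like vocabulary. Point names differ per building; classes are shared.
triples = [
    ("AHU1_SAT",         "type",      "Supply_Air_Temperature_Sensor"),
    ("AHU1_SAT",         "isPointOf", "AHU1"),
    ("B2_DischargeTemp", "type",      "Supply_Air_Temperature_Sensor"),
    ("B2_DischargeTemp", "isPointOf", "AHU2"),
    ("AHU1_Fan_Cmd",     "type",      "Fan_Command"),
]

def points_of_class(triples, cls):
    """Application logic refers only to the class; the building-specific
    point names are resolved from the semantic model at configuration time."""
    return sorted(s for s, p, o in triples if p == "type" and o == cls)
```

Because the query is written against shared classes rather than local names, the same application code configures itself against any building whose points are tagged with the ontology.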

  9. Performance-Portable GPU Acceleration of the EFIT Tokamak Plasma Equilibrium Reconstruction Code

    This paper presents the steps followed to GPU-offload parts of the core solver of EFIT-AI, an equilibrium reconstruction code suitable for tokamak experiments and burning plasmas. We focus on the fitting procedure, which consists of a Grad–Shafranov (GS) equation inverse solver that calculates equilibrium reconstructions on a grid. We show profiling results of the original CPU-baseline code, as well as the directives used to GPU-offload the most time-consuming function, first to compare OpenACC and OpenMP on NVIDIA and AMD GPUs, and later to assess OpenMP performance portability on NVIDIA, AMD, and Intel GPUs. We compare performance for different spatial grid sizes and show the speedup achieved on NVIDIA A100 (Perlmutter, NERSC), AMD MI250X (Frontier, OLCF), and Intel PVC (Sunspot, ALCF) GPUs. Finally, we draw conclusions and recommendations for achieving high performance portability for an equilibrium reconstruction code on the new HPC architectures.
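The hot loop of an elliptic inverse solver of this kind is typically a grid sweep, which is exactly the structure OpenACC/OpenMP `target` directives are applied to. The minimal Jacobi sketch below is for a generic Poisson-type problem, purely to show the shape of such a loop; it is not the EFIT-AI solver.

```python
def jacobi_step(u, f, n, h):
    """One Jacobi sweep on an n x n grid stored as a flat list.
    The doubly nested loop over interior points is the kind of kernel
    that a directive-based offload targets."""
    v = u[:]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            k = i * n + j
            v[k] = 0.25 * (u[k - 1] + u[k + 1] + u[k - n] + u[k + n]
                           - h * h * f[k])
    return v

def solve(f, n, h, sweeps):
    """Iterate Jacobi sweeps from a zero initial guess (zero boundaries)."""
    u = [0.0] * (n * n)
    for _ in range(sweeps):
        u = jacobi_step(u, f, n, h)
    return u
```

In the offloaded version, each sweep maps to one GPU kernel launch, and keeping `u` and `v` resident on the device between sweeps is what the data-movement directives control.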

  10. Efficient phase-space generation for hadron collider event simulation

    We present a simple yet efficient algorithm for phase-space integration at hadron colliders. Individual mappings consist of a single t-channel combined with any number of s-channel decays, and are constructed using diagrammatic information. The factorial growth in the number of channels is tamed by providing an option to limit the number of s-channel topologies. We provide a publicly available, parallelized code in C++ and test its performance in typical LHC scenarios.
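A standard example of an s-channel mapping is the Breit–Wigner transformation, which flattens a resonance peak so that uniform random numbers sample it efficiently. The sketch below illustrates that generic technique; it is not the paper's code, and the mass and width values in the test are illustrative.

```python
import math
import random

def bw_map(r, m, w, smin, smax):
    """Map uniform r in [0,1) to an invariant mass squared s distributed
    like a Breit-Wigner peak at m^2 with width w (standard s-channel mapping)."""
    ymin = math.atan((smin - m * m) / (m * w))
    ymax = math.atan((smax - m * m) / (m * w))
    y = ymin + r * (ymax - ymin)
    s = m * m + m * w * math.tan(y)
    # Jacobian ds/dr, needed to keep the integral estimate unbiased
    jac = (ymax - ymin) * ((s - m * m) ** 2 + (m * w) ** 2) / (m * w)
    return s, jac

def integrate_bw(m, w, smin, smax, n, seed=1):
    """MC estimate of the Breit-Wigner integral. With this mapping the
    weight jac/((s-m^2)^2 + (m*w)^2) is flat, so it converges immediately."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s, jac = bw_map(rng.random(), m, w, smin, smax)
        total += jac / ((s - m * m) ** 2 + (m * w) ** 2)
    return total / n
```

A multi-channel integrator combines many such mappings, one per diagram topology, which is why limiting the number of s-channel topologies tames the factorial growth in channels.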


Search: All Records, Subject: portability
