OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information
  1. Experimental Characterization of OpenMP Offloading Memory Operations and Unified Shared Memory Support

    The OpenMP specification recently introduced support for unified shared memory, allowing implementations to leverage underlying system software to provide a simpler GPU offloading model in which explicit mapping of variables is optional. Support for this feature is becoming more widely available across OpenMP implementations on several hardware platforms. A deeper understanding of each implementation's execution profile and performance is crucial for applications as they consider the performance-portability implications of adopting a unified-memory offloading programming style. This work introduces a benchmark tool to characterize unified memory support in several OpenMP compilers and runtimes, with emphasis on identifying discrepancies between different OpenMP implementations in how various memory allocation strategies interact with unified shared memory. The benchmark tool is used to characterize OpenMP compilers on three leading High Performance Computing platforms supporting different CPU and device architectures, and to assess the impact of enabling unified shared memory on the performance of memory-bound code, highlighting implementation differences that should be accounted for when applications consider performance portability across platforms and compilers.
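
    The offloading style described above can be illustrated with a minimal sketch (ours, not the paper's benchmark), assuming a compiler and GPU runtime that honor the OpenMP 5.x unified_shared_memory requirement: a plain host allocation is accessed inside a target region without any map clauses.

```c
/* Minimal sketch (not from the paper) of OpenMP offloading under unified
   shared memory. Assumes a compiler/runtime supporting OpenMP 5.x
   unified_shared_memory. */
#include <stdio.h>
#include <stdlib.h>

#pragma omp requires unified_shared_memory

int main(void) {
    const int n = 1 << 20;
    double *a = malloc(n * sizeof(double));   /* plain host allocation */
    for (int i = 0; i < n; ++i) a[i] = 1.0;

    /* With unified shared memory, no map() clauses are needed: the device
       accesses the host allocation through the shared address space
       (page migration or interconnect access, depending on the platform). */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}
```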
  2. Towards a Standard Process Management Infrastructure for Workflows Using Python

    Orchestrating the execution of ensembles of processes lies at the core of scientific workflow engines on large-scale parallel platforms. This is usually handled using platform-specific command-line tools, with limited process management control and potential strain on system resources. The PMIx standard provides a uniform interface to system resources. The low-level C implementation of PMIx has hampered its use in workflow engines, leading to the development of a Python binding that has yet to gain traction. In this paper, we present our work to harden the PMIx Python client, demonstrating its usability using a prototype Python driver to orchestrate the execution of an ensemble of processes. We present experimental results using the prototype on the Summit supercomputer at Oak Ridge National Laboratory. This work lays the foundation for wider adoption of PMIx for workflow engines, and encourages wider support of more PMIx functionality in vendor-provided system software stacks.
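
    For context, the low-level C client interface that the Python binding wraps looks roughly like the hedged sketch below (illustrative only; this is not the prototype driver described in the abstract): a client initializes against its runtime environment, queries a job-level attribute, and finalizes.

```c
/* Minimal PMIx C client sketch (illustrative): init, query job size, finalize. */
#include <stdio.h>
#include <pmix.h>

int main(void) {
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;

    if (PMIx_Init(&myproc, NULL, 0) != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_Init failed\n");
        return 1;
    }

    /* Query a job-level attribute using the wildcard rank of our namespace. */
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    if (PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val) == PMIX_SUCCESS) {
        printf("rank %u of %u\n", myproc.rank, val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```

    The Python binding discussed in the abstract wraps these same client operations so that Python-based workflow engines can call them directly.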
  3. RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources

    Execution of heterogeneous workflows on high-performance computing (HPC) platforms presents unprecedented resource management and execution coordination challenges for runtime systems. Task heterogeneity increases the complexity of resource and execution management, limiting the scalability and efficiency of workflow execution. Resource partitioning and distribution of task execution over partitioned resources promise to address these problems, but we lack an experimental evaluation of their performance at scale. This paper provides a performance evaluation of the Process Management Interface for Exascale (PMIx) and its reference implementation PRRTE on the leadership-class HPC platform Summit, when integrated into a pilot-based runtime system called RADICAL-Pilot. We partition resources across multiple PRRTE Distributed Virtual Machine (DVM) environments, each responsible for launching tasks via the PMIx interface. We experimentally measure workload execution performance in terms of task scheduling/launching rate, the distribution of DVM task placement times, and DVM startup and termination overheads on Summit. The integrated solution with PMIx/PRRTE enables the use of an abstracted, standardized set of interfaces for orchestrating the launch process, dynamic process management, and monitoring. It extends scaling capabilities, overcoming limitations of other launching mechanisms (e.g., JSM/LSF). The different DVM setup configurations explored provide insight into DVM performance and into layouts that leverage it. Our experimental results show that a heterogeneous workload of 65,500 tasks on 2,048 nodes, partitioned across 32 DVMs, runs steadily with resource utilization no lower than 52%. With fewer concurrently executing tasks, resource utilization reaches up to 85%, based on results for a heterogeneous workload of 8,200 tasks on 256 nodes and 2 DVMs.
  4. Adaptive Generation of Training Data for ML Reduced Model Creation

    Machine learning proxy models are often used to speed up or completely replace complex computational models. The greatly reduced and deterministic computational costs enable new use cases such as digital twin control systems and global optimization. The challenge in building these proxy models is generating the training data. A naive uniform sampling of the input space can result in a non-uniform sampling of the output space of a model. This can cause gaps in the training data coverage that miss finer-scale details, resulting in poor accuracy. While larger and larger data sets could eventually fill in these gaps, the computational burden of full-scale simulation codes can make this prohibitive. In this paper, we present an adaptive data generation method that uses uncertainty estimation to identify regions where training data should be augmented. By targeting data generation to areas of need, representative data sets can be generated efficiently. The effectiveness of this method is demonstrated on a simple one-dimensional function and a complex multidimensional physics model.
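
    A toy sketch of the general idea (ours, not the paper's method) is shown below for a one-dimensional function: starting from a coarse uniform grid, new samples are added in the interval whose endpoints differ most in output value. The paper drives refinement with a model-based uncertainty estimate; the output gap used here is only a simple stand-in for that signal.

```c
/* Toy adaptive 1D sampling sketch (illustrative; compile with: cc file.c -lm). */
#include <math.h>
#include <stdio.h>

#define MAX_SAMPLES 64

static double expensive_model(double x) {      /* placeholder "simulation" */
    return tanh(20.0 * (x - 0.5));             /* sharp feature near x = 0.5 */
}

int main(void) {
    double xs[MAX_SAMPLES], ys[MAX_SAMPLES];
    int n = 5;                                  /* initial uniform samples */
    for (int i = 0; i < n; ++i) {
        xs[i] = (double)i / (n - 1);
        ys[i] = expensive_model(xs[i]);
    }
    while (n < 20) {
        /* Pick the interval with the largest output gap (stand-in for an
           uncertainty estimate) and bisect it. */
        int worst = 0;
        double worst_gap = -1.0;
        for (int i = 0; i + 1 < n; ++i) {
            double gap = fabs(ys[i + 1] - ys[i]);
            if (gap > worst_gap) { worst_gap = gap; worst = i; }
        }
        double xnew = 0.5 * (xs[worst] + xs[worst + 1]);
        for (int j = n; j > worst + 1; --j) {   /* insert, keeping xs sorted */
            xs[j] = xs[j - 1];
            ys[j] = ys[j - 1];
        }
        xs[worst + 1] = xnew;
        ys[worst + 1] = expensive_model(xnew);
        ++n;
    }
    for (int i = 0; i < n; ++i)
        printf("%f %f\n", xs[i], ys[i]);
    return 0;
}
```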
  5. RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources

    Execution of heterogeneous workflows on high-performance computing (HPC) platforms presents unprecedented resource management and execution coordination challenges for runtime systems. Task heterogeneity increases the complexity of resource and execution management, limiting the scalability and efficiency of workflow execution. Resource partitioning and distribution of task execution over partitioned resources promise to address these problems, but we lack an experimental evaluation of their performance at scale. Here, this paper provides a performance evaluation of the Process Management Interface for Exascale (PMIx) and its reference implementation PRRTE on the leadership-class HPC platform Summit, when integrated into a pilot-based runtime system called RADICAL-Pilot. We partition resources across multiple PRRTE Distributed Virtual Machine (DVM) environments, each responsible for launching tasks via the PMIx interface. We experimentally measure workload execution performance in terms of task scheduling/launching rate, the distribution of DVM task placement times, and DVM startup and termination overheads on Summit. The integrated solution with PMIx/PRRTE enables the use of an abstracted, standardized set of interfaces for orchestrating the launch process, dynamic process management, and monitoring. It extends scaling capabilities, overcoming limitations of other launching mechanisms (e.g., JSM/LSF). The different DVM setup configurations explored provide insight into DVM performance and into layouts that leverage it. Our experimental results show that a heterogeneous workload of 65,500 tasks on 2,048 nodes, partitioned across 32 DVMs, runs steadily with resource utilization no lower than 52%. With fewer concurrently executing tasks, resource utilization reaches up to 85%, based on results for a heterogeneous workload of 8,200 tasks on 256 nodes and 2 DVMs.
  6. HPC Molecular Simulation Tries Out a New GPU: Experiences on Early AMD Test Systems for the Frontier Supercomputer

    Molecular simulation is an important tool for numerous efforts in physics, chemistry, and the biological sciences. Simulating molecular dynamics requires extremely rapid calculations to enable sufficient sampling of simulated temporal molecular processes. The Hewlett Packard Enterprise (HPE) Cray EX Frontier supercomputer installed at the Oak Ridge Leadership Computing Facility (OLCF) will provide an exascale resource for open science, and will feature graphics processing units (GPUs) from Advanced Micro Devices (AMD). The future LUMI supercomputer in Finland will be based on an HPE Cray EX platform as well. Here we test the ports of several widely used molecular dynamics packages, each of which has made substantial use of acceleration with NVIDIA GPUs, on Spock, the early Cray pre-Frontier testbed system at the OLCF that employs AMD GPUs. These programs are used extensively in industry for pharmaceutical and materials research, as well as in academia, and are frequently deployed on high-performance computing (HPC) systems, including national leadership HPC resources. We find that, in general, performance is competitive and installation is straightforward, even at these early stages in a new GPU ecosystem. Our experiences point to an expanding arena for GPU vendors in HPC for molecular simulation.
  7. Portability for GPU-accelerated molecular docking applications for cloud and HPC: can portable compiler directives provide performance across all platforms?

    High-throughput structure-based screening of drug-like molecules has become a common tool in biomedical research. Recently, acceleration with graphics processing units (GPUs) has provided a large performance boost for molecular docking programs. Both cloud and high-performance computing (HPC) resources have been used for large screens with molecular docking programs; while NVIDIA GPUs have dominated cloud and HPC resources, new vendors such as AMD and Intel are now entering the field, creating the problem of software portability across different GPUs. Ideally, software productivity could be maximized with portable programming models that are able to maintain high performance across architectures. While in many cases compiler directives have been used as an easy way to offload parallel regions of a CPU-based program to a GPU accelerator, they may also be an attractive programming model for providing portability across different GPU vendors, in which case the porting process may proceed in the reverse direction: from low-level, architecture-specific code to higher-level directive-based abstractions. MiniMDock is a new mini-application (miniapp) designed to capture the essential computational kernels found in molecular docking calculations, such as those used in pharmaceutical drug discovery efforts, in order to test different solutions for porting across GPU architectures. Here we extend MiniMDock to GPU offloading with OpenMP directives, and compare its performance to that of kernels using CUDA and HIP on NVIDIA and AMD GPUs, respectively, as well as across different compilers, exploring performance bottlenecks. We document this reverse-porting process, from highly optimized device code to a higher-level version using directives, compare code structure, and describe barriers that were overcome in this effort.
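
    The directive-based offload pattern referred to above can be sketched generically as follows (our illustration, not MiniMDock's actual kernel): each OpenMP team scores one candidate pose while the threads of that team reduce a per-atom energy sum.

```c
/* Generic sketch of a directive-based GPU offload of a docking-style scoring
   loop (illustrative only). */
#include <stdio.h>

#define NPOSES 1024
#define NATOMS 256

int main(void) {
    static float coords[NPOSES][NATOMS];   /* per-pose atom "coordinates" */
    static float scores[NPOSES];
    for (int p = 0; p < NPOSES; ++p)
        for (int a = 0; a < NATOMS; ++a)
            coords[p][a] = 0.001f * (float)(p + a);

    /* One team per pose; threads within a team reduce the pose's energy. */
    #pragma omp target teams distribute map(to: coords) map(from: scores)
    for (int p = 0; p < NPOSES; ++p) {
        float e = 0.0f;
        #pragma omp parallel for reduction(+:e)
        for (int a = 0; a < NATOMS; ++a)
            e += coords[p][a] * coords[p][a];   /* stand-in energy term */
        scores[p] = e;
    }

    printf("score[0] = %f\n", scores[0]);
    return 0;
}
```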
  8. Core-Pedestal Plasma Configurations in Advanced Tokamaks

    Here, several configurations for the core and pedestal plasma are examined for a predefined tokamak design by implementing multiple heating/current drive (H/CD) sources to achieve an optimum configuration of high fusion power in noninductive operation, while maintaining an ideally magnetohydrodynamic (MHD) stable core plasma, using the IPS-FASTRAN framework. IPS-FASTRAN is a component-based, lightweight, coupled simulation framework used to simulate magnetically confined plasmas by integrating a set of high-fidelity codes to construct the plasma equilibrium (EFIT, TOQ, and CHEASE), calculate the turbulent heat and particle transport fluxes (TGLF), model various H/CD systems (TORIC, TORAY, GENRAY, and NUBEAM), model the pedestal pressure and width (EPED), and estimate the ideal MHD stability (DCON). The TGLF core transport model and EPED pedestal model are used to self-consistently predict plasma profiles consistent with ideal MHD stability and H/CD (and bootstrap) current sources. To evaluate the achievable and sustainable plasma beta, varying configurations are produced ranging from the no-wall stability regime to the with-wall stability regime, simultaneously subject to the self-consistent TGLF, EPED, and H/CD source profile predictions that optimize configuration performance. The pedestal density, plasma current, and total injected power are scanned to explore their impact on the target plasma configuration, fusion power, and confinement quality. A set of fully noninductive scenarios is achieved by employing ion-cyclotron, neutral beam injection, helicon, and lower-hybrid H/CDs to provide a broad profile for the total current drive in the core region for a predefined tokamak design. These noninductive scenarios are characterized by high fusion gain (Q ~ 4) and power (Pfus ~ 600 MW), optimum confinement quality (H98 ~ 1.1), and high bootstrap current fraction (fBS ~ 0.7) for a Greenwald fraction below unity. The broad-current-profile configurations identified are stable to low-n kink modes, either because the normalized pressure βN is below the no-wall limit or because a wall is present.
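
    For reference, two of the figures of merit quoted above follow standard definitions (ours, not reproduced from the paper): the normalized beta and the Greenwald density fraction,

    $$\beta_N = \beta_T[\%]\,\frac{a\,[\mathrm{m}]\;B_T\,[\mathrm{T}]}{I_p\,[\mathrm{MA}]}, \qquad f_{GW} = \frac{\bar{n}_e}{n_{GW}}, \quad n_{GW} = \frac{I_p\,[\mathrm{MA}]}{\pi a^2\,[\mathrm{m}^2]} \times 10^{20}\ \mathrm{m}^{-3},$$

    where a is the minor radius, B_T the toroidal field, I_p the plasma current, and n̄_e the line-averaged electron density.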
  9. Application Experiences on a GPU-Accelerated Arm-based HPC Testbed

    This paper assesses and reports the experience of ten teams working to port, validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems, each equipped with a server-class Arm CPU from Ampere Computing and two data center GPUs from NVIDIA Corp. The systems are connected using an InfiniBand interconnect. The selected applications and mini-apps are written in several programming languages and use multiple accelerator-based programming models for GPUs, such as CUDA, OpenACC, and OpenMP offloading. Application porting requires a robust and easy-to-access programming environment, including a variety of compilers and optimized scientific libraries. The goal of this work is to evaluate platform readiness and assess the effort required from developers to deploy well-established scientific workloads on current and future generations of Arm-based GPU-accelerated HPC systems. The reported case studies demonstrate that the current level of maturity and diversity of software and tools is already adequate for large-scale production deployments.
  10. Workflows Community Summit 2022: A Roadmap Revolution

    Scientific workflows have become integral tools in broad scientific computing use cases. Scientific discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from the execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing (often referred to as a computing continuum) and the evolving needs of emerging scientific applications, it is paramount that the development of novel scientific workflows and system functionalities seek to increase the efficiency, resilience, and pervasiveness of existing systems and applications. Specifically, the proliferation of machine learning/artificial intelligence (ML/AI) workflows, the need to process large-scale datasets produced by instruments at the edge, the intensification of near-real-time data processing, support for long-term experiment campaigns, and the emergence of quantum computing as an adjunct to HPC have significantly changed the functional and operational requirements of workflow systems. Workflow systems now need to, for example, support data streams from the edge to the cloud to HPC, enable the management of many small files, allow data reduction while ensuring high accuracy, and orchestrate distributed services (workflows, instruments, data movement, provenance, publication, etc.) across computing and user facilities, among other capabilities. Further, to accelerate science, it is also necessary that these systems implement specifications/standards and APIs for seamless (horizontal and vertical) integration between systems and applications, and that they enable the publication of workflows and their associated products according to the FAIR principles.
...

Search: All Records for Author / Contributor 0000000305541036
