Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

Arumugam, Kamesh

doi:10.2172/1422715


Abstract

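Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of them employ irregular algorithms that exhibit data-dependent control flow and irregular memory accesses. Furthermore, these applications are often iterative with dependencies between steps, which makes it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged-particle beam dynamics is one such application, where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control flow during a single step of the application independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps.

In this dissertation, we present novel machine-learning-based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged-particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though it should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution that improve the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance. We used these optimization techniques and the anticipation strategy to design a cache-aware, memory-efficient parallel algorithm to address the irregularities in the parallel implementation of charged-particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.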
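The idea described in the full abstract (observe earlier time steps, forecast the next step's access pattern, then use the forecast for an optimization decision) can be illustrated with a small sketch. This is not the dissertation's method: exponential smoothing stands in for the supervised learning the abstract mentions, and the names `update_forecast`, `balance`, the grid-cell keys, and the access counts are all hypothetical.

```python
from typing import Dict, List


def update_forecast(forecast: Dict[int, float],
                    observed: Dict[int, int],
                    alpha: float = 0.5) -> Dict[int, float]:
    """Blend the previous forecast with newly observed per-cell access counts.

    Stand-in predictor (exponential smoothing), assumed for illustration only.
    A cell seen for the first time is initialised to its observed count.
    """
    cells = set(forecast) | set(observed)
    updated = {}
    for cell in cells:
        seen = float(observed.get(cell, 0))
        previous = forecast.get(cell, seen)
        updated[cell] = alpha * seen + (1.0 - alpha) * previous
    return updated


def balance(forecast: Dict[int, float], n_workers: int) -> List[List[int]]:
    """Assign cells to workers greedily, heaviest predicted cell first."""
    loads = [0.0] * n_workers
    bins: List[List[int]] = [[] for _ in range(n_workers)]
    # Sort by descending predicted cost; break ties by cell id for determinism.
    for cell, cost in sorted(forecast.items(), key=lambda kv: (-kv[1], kv[0])):
        worker = min(range(n_workers), key=lambda i: loads[i])
        bins[worker].append(cell)
        loads[worker] += cost
    return bins


# Hypothetical access counts per grid cell, observed at two earlier time steps.
history = [
    {0: 8, 1: 2, 2: 2, 3: 4},
    {0: 10, 1: 2, 2: 2, 3: 4},
]

forecast: Dict[int, float] = {}
for observed in history:
    forecast = update_forecast(forecast, observed)

# Work partition for the next time step, derived from the forecast.
plan = balance(forecast, n_workers=2)
```

Here the partition for the upcoming step is decided before that step runs, which is the sense in which a forecast can reduce load imbalance; a real implementation would also have to account for data locality and divergence, which this sketch ignores.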

Authors:

Arumugam, Kamesh [1]

[1] Old Dominion Univ., Norfolk, VA (United States)

Publication Date: May 1, 2017

Research Org.: Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA (United States)

Sponsoring Org.: USDOE Office of Science (SC), Nuclear Physics (NP)

OSTI Identifier: 1422715

Report Number(s): JLAB-ACP-17-2638; DOE/OR/23177-4354

DOE Contract Number: AC05-06OR23177

Resource Type: Thesis/Dissertation

Country of Publication: United States

Language: English

Citation Formats


Arumugam, Kamesh. Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures. United States: N. p., 2017. Web. doi:10.2172/1422715.

Arumugam, Kamesh. Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures. United States. https://doi.org/10.2172/1422715

Arumugam, Kamesh. 2017. "Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures". United States. https://doi.org/10.2172/1422715. https://www.osti.gov/servlets/purl/1422715.
                    
@article{osti_1422715,
  title        = {Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures},
  author       = {Arumugam, Kamesh},
  abstractNote = {Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of them employ irregular algorithms that exhibit data-dependent control flow and irregular memory accesses. Furthermore, these applications are often iterative with dependencies between steps, which makes it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged-particle beam dynamics is one such application, where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control flow during a single step of the application independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps.
In this dissertation, we present novel machine-learning-based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged-particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though it should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution that improve the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance. We used these optimization techniques and the anticipation strategy to design a cache-aware, memory-efficient parallel algorithm to address the irregularities in the parallel implementation of charged-particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.},
  doi          = {10.2172/1422715},
  url          = {https://www.osti.gov/biblio/1422715},
  place        = {United States},
  year         = {2017},
  month        = {5}
}



Similar records in OSTI.GOV collections:

Investigation of Portable Event-Based Monte Carlo Transport Using the NVIDIA Thrust Library

Journal Article Bleile, Ryan; Brantley, Patrick; Dawson, Shawn; ... - Transactions of the American Nuclear Society

Power consumption considerations are driving future high performance computing platforms toward many-core computing architectures. Los Alamos National Laboratory's Trinity machine, available in 2016, will use both Intel Xeon Haswell processors and Intel Xeon Phi Knights Landing many integrated core (MIC) architecture coprocessors. Lawrence Livermore National Laboratory's Sierra machine, available in 2018, will use an IBM PowerPC architecture along with Nvidia graphics processing unit (GPU) architecture accelerators. These different advanced architectures make the computing landscape in upcoming years complex. Traditional approaches to Monte Carlo transport do not work efficiently on these new computing platforms. MIC architectures require vectorization to operate efficiently, ...
Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)

Technical Report Shen, Xipeng

The development of modern processors exhibits two trends that complicate the optimizations of modern software. The first is the increasing sensitivity of processors' throughput to irregularities in computation. With more processors produced through a massive integration of simple cores, future systems will increasingly favor regular data-level parallel computations, but deviate from the needs of applications with complex patterns. Some evidences are already shown on Graphic Processing Units (GPU): Irregular data accesses (e.g., indirect references A[D[i]]) and conditional branches are limiting many GPU applications' performance at a level an order of magnitude lower than the peak of GPU. The second hardware ...
https://doi.org/10.2172/1576175

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Conference Tian, Jiannan; Rivera, Cody; Di, Sheng; ...

Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient Entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, ...
https://doi.org/10.1109/IPDPS49936.2021.00097
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

Journal Article Wang, Bei; Ethier, Stephane; Tang, William; ... - International Journal of High Performance Computing Applications

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of ...
https://doi.org/10.1177/1094342017712059
