OSTI.GOV
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)

Abstract

The development of modern processors exhibits two trends that complicate the optimization of modern software. The first is the increasing sensitivity of processors' throughput to irregularities in computation. With more processors built through the massive integration of simple cores, future systems will increasingly favor regular, data-level parallel computations and deviate from the needs of applications with complex patterns. Evidence has already appeared on Graphics Processing Units (GPUs): irregular data accesses (e.g., indirect references A[D[i]]) and conditional branches limit many GPU applications' performance to a level an order of magnitude below the GPU's peak. The second hardware trend is the growing gap between memory bandwidth and the aggregate speed—that is, the sum of all cores' computing power—of a Chip Multiprocessor (CMP). Despite the capped growth of peak CPU speed, the aggregate speed of a CMP keeps increasing as more cores are integrated into a single chip. It is expected that by 2018, node concurrency in an exascale system will increase by hundreds of times, whereas memory bandwidth will expand by only 10 to 20 times. Consequently, data movement and storage are expected to consume more than 70% of the total system power. Bridging this gap is difficult; the complexities of the modern CMP memory hierarchy make it even harder: data caches become shared among computing units, and the sharing is often non-uniform—whether two computing units share a cache depends on their proximity and the level of the cache. On the recent IBM Power7 architecture, for instance, four hardware contexts (or SMT threads) in a core share the entire memory hierarchy, all cores in one chip share an on-chip L3 cache, and cores across chips share L3 and main memory through off-chip connections.

These two trends complicate the translation of computing power into performance, especially for programs with intensive data accesses or complex patterns in data accesses or control flow paths. Unfortunately, both attributes are present, and will persist, in a class of important applications. For instance, many scientific simulations deal with a large volume of data. Meanwhile, because most real-world processes are non-uniform and evolving (e.g., the evolution of a galaxy or the process of a drug injection), both the computations and the data accesses of these programs tend to be irregular and dynamically changing. Currently, the lack of support for these applications on modern CMPs severely limits their performance. On GPUs, as our recent study shows and other studies echo, severalfold performance improvements are possible when memory accesses or control flows are streamlined for a set of GPU applications. On multicore CPUs, our studies show that traditional locality enhancement, being oblivious to the new features of the multicore memory hierarchy, may even cause large slowdowns for data-intensive dynamic applications. The severity of these issues is expected to worsen as the two hardware trends continue.

Some recent studies try to match software with the trends, but in a limited scope or manner. For irregularities on GPUs, most studies focus on irregularities analyzable through static analysis (e.g., data accesses in regular loops). Dynamic irregularities are harder to address because the needed analysis and transformations typically have to happen at run time. Other research resorts to hardware extensions, whose actual adoption is uncertain given the entailed space cost and complexity.
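To make the irregularity concrete, below is a minimal CUDA sketch of the A[D[i]] pattern (the kernel and all names are illustrative, not taken from the report). Whether the warp's loads coalesce depends entirely on the runtime contents of the index array D, which is why purely static analysis cannot streamline such accesses.

```cuda
// Illustrative gather kernel with an indirect reference A[D[i]].
__global__ void gather(float *out, const float *A, const int *D, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // If the D[i] values are contiguous across the warp, this load
        // coalesces into a few memory transactions; if they are
        // scattered, each thread may trigger its own transaction,
        // cutting effective bandwidth by up to an order of magnitude.
        out[i] = A[D[i]];
    }
}
```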
For data locality, recent years have seen some exploitation of the new memory hierarchy on multicore for performance, but most of it operates on process or thread scheduling rather than on program transformations. Our recent study reveals that program-level transformations may magnify the scheduling benefits by a factor of seven, suggesting that program-level transformation should play a central role in data locality enhancement on modern CMPs. But research in this direction has been sparse, and most of it has focused on data layout or cache performance modeling rather than on program transformations that match the new memory hierarchy features. Overall, it remains an open question how to bridge the gap between dynamic computations and the two prominent properties of modern processors.

The goal of this project is to develop a set of techniques and software tools that enhance the match between memory accesses in dynamic simulations and the prominent features of modern and future CMPs, alleviating the memory performance issues for petascale and exascale computing. This report summarizes the discoveries and products of the project. They include free launch, a new software approach to supporting dynamic parallelism on GPUs; coherence-free multiview, an approach that allows multiple views of a single data object to co-exist in GPU memory during a GPU kernel execution; and algorithmic optimizations to data analytics problems, especially those that involve many distance calculations.

Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory: the former is complicated to program and often subject to load imbalance; the latter suffers large runtime overhead. In this work, we propose free launch, a new software approach that overcomes the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism, and it employs a novel compiler-based code transformation, named subkernel launch removal, to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel-launch-based approach while avoiding its large runtime overhead; its superior load balancing, meanwhile, makes it outperform manual worklist-based techniques by 3X on average. The work was published at the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'2015).
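The following is a minimal, hand-written sketch of what subkernel launch removal does, under the simplifying assumptions noted in the comments. All names are hypothetical, and the adaptive task assignment that free launch layers on top (to balance load across parent threads) is omitted.

```cuda
#define MAX_CHILD 32   // assumed per-parent task-buffer capacity

// Child task body shared by both versions below.
__device__ void doTask(int *tasks, int t) { tasks[t] += 1; }

__global__ void child(int *tasks, int numTasks)
{
    if (threadIdx.x < numTasks) doTask(tasks, threadIdx.x);
}

// (a) Dynamic parallelism: each parent thread launches a subkernel for
// the tasks it discovers. Convenient to write, but every launch pays a
// large runtime cost (requires compute capability 3.5+ and -rdc=true).
__global__ void parentWithLaunch(int *tasks, const int *counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && counts[i] > 0)
        child<<<1, counts[i]>>>(&tasks[i * MAX_CHILD], counts[i]);
}

// (b) After subkernel launch removal: the child body is inlined and the
// parent threads are reused to run the child tasks, so no launch occurs.
__global__ void parentTransformed(int *tasks, const int *counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int t = 0; t < counts[i]; ++t)     // naive assignment shown here;
        doTask(&tasks[i * MAX_CHILD], t);   // free launch rebalances it
}
```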
A GPU system is typically equipped with several types of memory (e.g., global, constant, texture, shared, cache). Data placement determines which data are placed on which type of memory, and it is essential for GPU memory performance. Prior data placement optimizations always require a single view of a data object in memory, which limits their effectiveness. In this work, we propose coherence-free multiview, an approach that allows multiple views of a single data object to co-exist in GPU memory during a GPU kernel execution. We demonstrate that under certain conditions the multiple views can remain incoherent while facilitating enhanced data placement, and we present a theorem and compiler support to ensure the soundness of the usage of coherence-free multiview. We further develop reference-discerning data placement, a new way to enhance data placements on GPUs; it enables more flexible data placements by using coherence-free multiview to leverage the slack in the coherence requirements of some GPU programs (a minimal sketch of the idea appears below). Experiments on three types of GPU systems show that, with less than 200KB of space cost, the new data placement technique provides a 1.6X average (up to 4.27X) speedup.

In addition, we have examined algorithmic optimizations for data analytics problems, especially those that involve many distance calculations. Computing distances among data points is an essential part of many important algorithms in data analytics, graph analysis, and other domains. In each of these domains, developers have spent significant manual effort optimizing algorithms, often through novel applications of the triangle inequality, to minimize the number of distance computations. In this work, we observe that many algorithms across these domains can be generalized as instances of a generic distance-related abstraction. Based on this abstraction, we derive seven principles for correctly applying the triangle inequality to optimize distance-related algorithms. Guided by these findings, we develop the Triangular OPtimizer (TOP), the first software framework able to automatically produce optimized algorithms that match or outperform manually designed algorithms for solving distance-related problems. TOP achieves up to 237X speedups, 2.5X on average. The work has been published at the 32nd International Conference on Machine Learning (ICML'15) and the 41st International Conference on Very Large Data Bases (VLDB'15).
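As an illustration of the multiview idea referenced above, here is a minimal CUDA sketch (all names hypothetical, not from the report): two physical copies of one logical array live in different memories, and different static references in the kernel are directed to different views. The sketch assumes the object is read-only within the kernel, one situation in which leaving the views incoherent is sound.

```cuda
// View 1: a copy in constant memory, which broadcasts efficiently when
// all threads of a warp read the same element.
__constant__ float tableView[4096];

__global__ void applyScale(const float * __restrict__ globalView,  // view 2
                           float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Uniform reference (every thread reads the same element): best
    // served by the constant-memory view.
    float scale = tableView[idx[0]];
    // Divergent reference (each thread reads its own element): served by
    // the global-memory view through the read-only data cache (__ldg).
    float v = __ldg(&globalView[idx[i]]);
    out[i] = scale * v;
}
```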
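Finally, a small host-side sketch (plain C++, compilable under nvcc; all names hypothetical) of the kind of triangle-inequality pruning that TOP automates: given a pivot p and precomputed distances d(p, x), the bound d(q, x) >= |d(q, p) - d(p, x)| lets a nearest-neighbor search skip full distance computations that cannot beat the current best.

```cuda
#include <cmath>

// Euclidean distance between two dim-dimensional points.
static float dist(const float *a, const float *b, int dim)
{
    float s = 0.f;
    for (int k = 0; k < dim; ++k) { float d = a[k] - b[k]; s += d * d; }
    return std::sqrt(s);
}

// Nearest neighbor of q among n points (row-major, dim columns).
// pivotDist[x] holds the precomputed distance d(pivot, x).
int nearest(const float *q, const float *pts, int n, int dim,
            const float *pivot, const float *pivotDist)
{
    float dqp = dist(q, pivot, dim);  // one extra distance, many saved
    float best = INFINITY;
    int bestIdx = -1;
    for (int x = 0; x < n; ++x) {
        // Triangle-inequality lower bound on d(q, x).
        float lower = std::fabs(dqp - pivotDist[x]);
        if (lower >= best) continue;  // prune: cannot beat current best
        float d = dist(q, &pts[x * dim], dim);
        if (d < best) { best = d; bestIdx = x; }
    }
    return bestIdx;
}
```

The pivot distances cost n distance computations once, then amortize across every subsequent query; deriving such bounds correctly and systematically is what the seven principles formalize.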

Authors:
Shen, Xipeng [1]
  1. North Carolina State University, Raleigh, NC (United States)
Publication Date:
November 29, 2019
Research Org.:
North Carolina State University, Raleigh, NC (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1576175
Report Number(s):
DOE-0013700-1
DOE Contract Number:  
SC0013700
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
42 ENGINEERING; Exascale Computing; GPU; Heterogeneous Computing

Citation Formats

Shen, Xipeng. Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report). United States: N. p., 2019. Web. doi:10.2172/1576175.
Shen, Xipeng. Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report). United States. https://doi.org/10.2172/1576175
Shen, Xipeng. 2019. "Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)". United States. https://doi.org/10.2172/1576175. https://www.osti.gov/servlets/purl/1576175.
@article{osti_1576175,
title = {Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)},
author = {Shen, Xipeng},
doi = {10.2172/1576175},
url = {https://www.osti.gov/biblio/1576175},
place = {United States},
year = {2019},
month = {11}
}