OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scaling Graph Community Detection on the Tilera Many-core Architecture

Abstract

In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has been previously studied for structured compute-intensive kernels, benchmarking the platform for data-bound, irregular applications presents significant challenges that have remained unexplored. Community detection is an advanced prototypical graph-theoretic operation with applications in numerous scientific domains including life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques, we demonstrate speedups of up to 46x on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and also show results that have comparable quality and performance to mainstream x86 platforms. To the best of our knowledge, this is the first work addressing graph algorithms on the Tilera platform. This study demonstrates that, through careful design space exploration, low-power many-core platforms like Tilera can be effectively exploited for graph algorithms that embody all the essential characteristics of an irregular application.

Authors:
Chavarría-Miranda, Daniel; Halappanavar, Mahantesh; Kalyanaraman, Anantharaman
Publication Date:
2014-12-01
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1194322
Report Number(s):
PNNL-SA-103170
400470000
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: 21st International Conference on High Performance Computing (HiPC 2014), December 17-20, 2014, Dona Paula, India
Country of Publication:
United States
Language:
English
Subject:
Tilera; community detection

Citation Formats

Chavarría-Miranda, Daniel, Halappanavar, Mahantesh, and Kalyanaraman, Anantharaman. Scaling Graph Community Detection on the Tilera Many-core Architecture. United States: N. p., 2014. Web. doi:10.1109/HiPC.2014.7116708.
Chavarría-Miranda, Daniel, Halappanavar, Mahantesh, & Kalyanaraman, Anantharaman. (2014). Scaling Graph Community Detection on the Tilera Many-core Architecture. United States. doi:10.1109/HiPC.2014.7116708.
Chavarría-Miranda, Daniel, Halappanavar, Mahantesh, and Kalyanaraman, Anantharaman. 2014. "Scaling Graph Community Detection on the Tilera Many-core Architecture". United States. doi:10.1109/HiPC.2014.7116708.
@inproceedings{osti_1194322,
title = {Scaling Graph Community Detection on the Tilera Many-core Architecture},
author = {Chavarría-Miranda, Daniel and Halappanavar, Mahantesh and Kalyanaraman, Anantharaman},
abstractNote = {In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has been previously studied for structured compute-intensive kernels, benchmarking the platform for data-bound, irregular applications presents significant challenges that have remained unexplored. Community detection is an advanced prototypical graph-theoretic operation with applications in numerous scientific domains including life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques, we demonstrate speedups of up to 46x on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and also show results that have comparable quality and performance to mainstream x86 platforms. To the best of our knowledge, this is the first work addressing graph algorithms on the Tilera platform. This study demonstrates that, through careful design space exploration, low-power many-core platforms like Tilera can be effectively exploited for graph algorithms that embody all the essential characteristics of an irregular application.},
doi = {10.1109/HiPC.2014.7116708},
booktitle = {21st International Conference on High Performance Computing (HiPC 2014), December 17-20, 2014, Dona Paula, India},
place = {United States},
year = {2014},
month = dec
}

  • Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor, based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey's de Bruijn graph constructor for short-read, de-novo assembly. Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86 servers. Convey's highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph-based, short-read, de-novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models. JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets with different sizes, from small microbial and fungal genomes to a very large cow rumen metagenome. For genomes with references, we will also present assembly quality comparisons between the two assemblers.
  • Optimizing applications simultaneously for energy and performance is a complex problem. High-performance, parallel, irregular applications are notoriously hard to optimize due to their data-dependent memory accesses, lack of structured locality, and complex data structures and code patterns. Irregular kernels are growing in importance in applications such as machine learning, graph analytics, and combinatorial scientific computing. Performance- and energy-efficient implementation of these kernels on modern, energy-efficient, multicore and many-core platforms is therefore an important and challenging problem. We present results from optimizing two irregular applications, the Louvain method for community detection (Grappolo) and high-performance conjugate gradient (HPCCG), on the Tilera many-core system. We have significantly extended MIT's OpenTuner auto-tuning framework to conduct a detailed study of platform-independent and platform-specific optimizations to improve performance as well as reduce total energy consumption. We explore the optimization design space along three dimensions: memory layout schemes, compiler-based code transformations, and optimization of parallel loop schedules. Using auto-tuning, we demonstrate whole-node energy savings of up to 41% relative to a baseline instantiation, and up to 31% relative to manually optimized variants.
  • In this paper, we have developed a new methodology that takes into consideration the access patterns from a single parallel actor (e.g. a thread), as well as the access patterns of “grouped” parallel actors that share a resource (e.g. a distributed Level 3 cache). We start with a hierarchical tile code for our target machine and apply a series of transformations at the tile level to improve data residence in a given memory hierarchy level. The contributions of this paper include (a) collaborative data restructuring for group reuse and (b) a low-overhead transformation technique to improve access patterns and bring closely connected data elements together. Preliminary results on a many-core architecture, the Tilera TileGX, show promising improvements over optimized OpenMP code (up to 31% increase in GFLOPS) and over our own previous work on fine-grained runtimes (up to 16%) for selected kernels.