Title: An asynchronous traversal engine for graph-based rich metadata management

Abstract

Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenance data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains, but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format.
We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.
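
The contrast the abstract draws between level-synchronous and asynchronous traversal can be illustrated with a minimal single-process Python sketch. This is not the authors' engine or query language: the toy graph, the vertex and edge names, and the functions `neighbors`, `level_synchronous`, and `asynchronous` are all hypothetical, and the LIFO work queue is only an assumed stand-in for messages arriving out of order on a distributed system.

```python
from collections import deque

# Hypothetical toy property graph: vertices carry properties, edges carry a label.
VERTICES = {
    "user:alice": {"type": "user"},
    "job:42": {"type": "job"},
    "file:a.dat": {"type": "file"},
    "file:b.dat": {"type": "file"},
}
EDGES = [  # (source, label, destination)
    ("user:alice", "ran", "job:42"),
    ("job:42", "wrote", "file:a.dat"),
    ("job:42", "wrote", "file:b.dat"),
    ("user:alice", "owns", "file:a.dat"),
]


def neighbors(vertex):
    """Return the destinations of all edges leaving `vertex`."""
    return [dst for src, _label, dst in EDGES if src == vertex]


def level_synchronous(start, max_steps):
    """Expand one whole frontier per step.

    The frontier swap at the end of each step plays the role of the global
    barrier: no vertex of step k+1 is touched until step k is finished.
    """
    depth = {start: 0}
    frontier = [start]
    for step in range(1, max_steps + 1):
        next_frontier = []
        for v in frontier:
            for w in neighbors(v):
                if w not in depth:
                    depth[w] = step
                    next_frontier.append(w)
        frontier = next_frontier  # barrier between traversal steps
    return depth


def asynchronous(start, max_steps):
    """Process each vertex as soon as it is discovered, with no per-step barrier.

    Work items may be handled out of step order, so a vertex is re-expanded
    whenever it is later reached by a shorter path.
    """
    depth = {start: 0}
    work = deque([(start, 0)])
    while work:
        v, d = work.pop()  # LIFO pop mimics out-of-order arrival
        if d > depth[v] or d == max_steps:
            continue  # stale work item, or traversal length limit reached
        for w in neighbors(v):
            if w not in depth or d + 1 < depth[w]:
                depth[w] = d + 1
                work.append((w, d + 1))
    return depth
```

Both functions return the same mapping from reached vertex to step count; the difference is only in scheduling. In the synchronous version a single slow vertex or server delays every other participant at the barrier, which is the straggler effect the abstract describes, whereas the asynchronous version lets independent branches proceed at their own pace.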

Authors:
Dai, Dong [1]; Carns, Philip [2]; Ross, Robert B. [2]; Jenkins, John [2]; Muirhead, Nicholas [1]; Chen, Yong [1]
  1. Texas Tech Univ., Lubbock, TX (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
Publication Date:
2016-06-23
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC); National Science Foundation (NSF)
OSTI Identifier:
1333002
Alternate Identifier(s):
OSTI ID: 1359729
Grant/Contract Number:  
AC02-06CH11357; CCF-1409946; CNS-1263183
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Parallel Computing
Additional Journal Information:
Journal Volume: 58; Journal Issue: C; Journal ID: ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; graph partitioning; graph traversal; parallel file systems; property graph; rich metadata management

Citation Formats

Dai, Dong, Carns, Philip, Ross, Robert B., Jenkins, John, Muirhead, Nicholas, and Chen, Yong. An asynchronous traversal engine for graph-based rich metadata management. United States: N. p., 2016. Web. doi:10.1016/j.parco.2016.06.002.
Dai, Dong, Carns, Philip, Ross, Robert B., Jenkins, John, Muirhead, Nicholas, & Chen, Yong. An asynchronous traversal engine for graph-based rich metadata management. United States. doi:10.1016/j.parco.2016.06.002.
Dai, Dong, Carns, Philip, Ross, Robert B., Jenkins, John, Muirhead, Nicholas, and Chen, Yong. 2016. "An asynchronous traversal engine for graph-based rich metadata management". United States. doi:10.1016/j.parco.2016.06.002. https://www.osti.gov/servlets/purl/1333002.
@article{osti_1333002,
title = {An asynchronous traversal engine for graph-based rich metadata management},
author = {Dai, Dong and Carns, Philip and Ross, Robert B. and Jenkins, John and Muirhead, Nicholas and Chen, Yong},
abstractNote = {Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenance data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. 
We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.},
doi = {10.1016/j.parco.2016.06.002},
journal = {Parallel Computing},
number = {C},
volume = 58,
place = {United States},
year = {2016},
month = jun
}
