Scalable Pattern Matching in Metadata Graphs via Constraint Checking
- Univ. of British Columbia, Vancouver, BC (Canada)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: they do not scale to large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures, which rely on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and are typically less expensive to compute than searching for the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches, that is, the complete set of all vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, which makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables tradeoffs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.
We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than those used in the past for similar problems. This article serves two purposes: First, it synthesizes the knowledge accumulated during a long-term project. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice: interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque and TriAD. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations, and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.
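As a rough illustration of the pruning idea described in the abstract, the Python sketch below repeatedly checks one simple kind of constraint (a vertex can play a template role only if its label matches and its neighbors can cover every neighboring role) and removes whatever fails, until nothing changes. This is a minimal sequential sketch under stated assumptions, not the HavoqGT-based implementation: the function name `prune_by_local_constraints`, the dictionary-based graph encoding, and the toy graph are all hypothetical. Because it checks only neighborhood-level constraints, it never removes a vertex or edge of a true match, but for general templates it may retain elements that belong to no match; reaching the 100% precision described above would require further constraints beyond this sketch.

```python
def prune_by_local_constraints(graph, labels, template, template_labels):
    """Iteratively prune an undirected labeled graph (hypothetical helper).

    graph / template: dict mapping a vertex to the set of its neighbors.
    labels / template_labels: dict mapping a vertex to its label.
    Returns (surviving vertices, surviving edges as (u, v) pairs with u < v).
    """
    # candidates[v] holds the template vertices ("roles") v may still play.
    candidates = {
        v: {q for q in template if template_labels[q] == labels[v]}
        for v in graph
    }

    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        for v in graph:
            for q in list(candidates[v]):
                # Constraint: every template neighbor r of q must be playable
                # by at least one neighbor u of v.
                satisfied = all(
                    any(r in candidates[u] for u in graph[v])
                    for r in template[q]
                )
                if not satisfied:
                    candidates[v].discard(q)
                    changed = True

    vertices = {v for v, roles in candidates.items() if roles}
    # Keep an edge only if its endpoints can play two adjacent template roles.
    edges = {
        (u, v)
        for u in vertices
        for v in graph[u]
        if v in vertices and u < v
        and any(r in template[q] for q in candidates[u] for r in candidates[v])
    }
    return vertices, edges


# Toy template: a path with labels A - B - C (assumed for illustration).
template = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
template_labels = {"a": "A", "b": "B", "c": "C"}

# Toy background graph: 1-2-3 is a match; 4, 5, 6 cannot complete one.
graph = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5, 6}, 5: {4}, 6: {4}}
labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "C"}

vertices, edges = prune_by_local_constraints(graph, labels, template, template_labels)
print(sorted(vertices))  # [1, 2, 3]
print(sorted(edges))     # [(1, 2), (2, 3)]
```

In the toy graph, vertex 4 survives the first pass because its neighbor 5 still looks like a valid B; it is removed only after 5 has been eliminated, which is why the check is iterated rather than applied once.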
- Research Organization:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- Grant/Contract Number:
- AC52-07NA27344
- OSTI ID:
- 1769153
- Report Number(s):
- LLNL-JRNL-817625; 1022376
- Journal Information:
- ACM Transactions on Parallel Computing, Vol. 8, Issue 1; ISSN 2329-4949
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English