
Title: Scalable Pattern Matching in Metadata Graphs via Constraint Checking

Journal Article · ACM Transactions on Parallel Computing
DOI: https://doi.org/10.1145/3434391 · OSTI ID: 1769153
 [1];  [1];  [1];  [2];  [2]
  1. Univ. of British Columbia, Vancouver, BC (Canada)
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: they do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and are typically less expensive to compute than searching the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches, that is, the complete set of all vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, which makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables tradeoffs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.
We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than those used in the past for similar problems. This article serves two purposes: First, it synthesises the knowledge accumulated during a long-term project. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice, interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque and TriAD. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations, and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.
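
The pruning idea described in the abstract can be illustrated with a small sketch. This is not the article's pipeline or the HavoqGT implementation; it is a minimal, single-machine Python illustration that assumes an undirected labeled graph and checks only two kinds of constraints: a vertex's label must appear in the template, and each template neighbor of a candidate role must be matched by some surviving background neighbor. The function name `prune_by_local_constraints` and all parameter names are hypothetical. Because only these local constraints are checked, the result keeps every vertex that participates in a match but may also keep some that do not; the precision guarantee stated in the abstract belongs to the article's full pipeline, not to this sketch.

```python
from collections import defaultdict


def prune_by_local_constraints(bg_edges, bg_labels, tpl_edges, tpl_labels):
    """Iteratively drop background vertices that violate local constraints.

    Returns a dict mapping each surviving background vertex to the set of
    template vertices (roles) it may still match.
    """
    # Build undirected adjacency sets for the background graph and template.
    bg_adj = defaultdict(set)
    for u, v in bg_edges:
        bg_adj[u].add(v)
        bg_adj[v].add(u)
    tpl_adj = defaultdict(set)
    for a, b in tpl_edges:
        tpl_adj[a].add(b)
        tpl_adj[b].add(a)

    # Label constraint: a vertex is a candidate only for same-label roles.
    candidates = {}
    for v, label in bg_labels.items():
        roles = {t for t, t_label in tpl_labels.items() if t_label == label}
        if roles:
            candidates[v] = roles

    # Neighbor constraint, repeated until no vertex or role is eliminated.
    changed = True
    while changed:
        changed = False
        for v in list(candidates):
            kept = set()
            for t in candidates[v]:
                # Every template neighbor of role t needs a surviving
                # background neighbor of v that can still play that role.
                if all(
                    any(tn in candidates.get(w, ()) for w in bg_adj[v])
                    for tn in tpl_adj[t]
                ):
                    kept.add(t)
            if kept != candidates[v]:
                changed = True
                if kept:
                    candidates[v] = kept
                else:
                    del candidates[v]
    return candidates


# Template: a path a - b - c (template vertices 0, 1, 2).
template_labels = {0: "a", 1: "b", 2: "c"}
template_edges = [(0, 1), (1, 2)]

# Background graph: 1-2-3 matches the template; 4-5-6 does not (6 is "d").
background_labels = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b", 6: "d"}
background_edges = [(1, 2), (2, 3), (4, 5), (5, 6)]

survivors = prune_by_local_constraints(
    background_edges, background_labels, template_edges, template_labels
)
print(sorted(survivors))
```

In this toy input, vertex 5 is eliminated first because none of its neighbors can play the "c" role, and vertex 4 is eliminated in the next pass because its only "b" neighbor is gone. This cascading elimination is what makes the checking iterative. Each check reads only a vertex and its neighbors, which is the property that lets such constraints be expressed in a vertex-centric style.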

Research Organization:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1769153
Report Number(s):
LLNL-JRNL-817625; 1022376
Journal Information:
ACM Transactions on Parallel Computing, Vol. 8, Issue 1; ISSN 2329-4949
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
