Title: Scalable Pattern Matching in Metadata Graphs via Constraint Checking

Abstract

Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: They do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and typically are less expensive to compute than searching the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches: the complete set of all vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables tradeoffs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.
We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, and at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than used in the past for similar problems. This article serves two purposes: First, it synthesizes the knowledge accumulated during a long-term project. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice, interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque and TriAD. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.
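
The pruning idea described in the abstract can be illustrated with a small, single-machine sketch. This is not the authors' implementation or the HavoqGT API; the function names (`build_adjacency`, `prune`), the data layout, and the toy graph are all assumptions made for illustration. The sketch checks only local constraints (vertex labels and neighbor labels), so it never discards a vertex or edge that participates in a match, but it may keep extra ones when the template has cycles or repeated labels; the article describes further constraint checks for reaching full precision.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def prune(graph_edges, graph_labels, template_edges, template_labels):
    """Iteratively drop vertices/edges that violate local template constraints.

    Hypothetical helper: cand[v] holds the template vertices that background
    vertex v could still match. A candidate q survives only if every template
    neighbor of q is a candidate of at least one background neighbor of v.
    """
    g_adj = build_adjacency(graph_edges)
    t_adj = build_adjacency(template_edges)

    # Start from label equality: v may match any template vertex with its label.
    cand = {
        v: {q for q, lab in template_labels.items() if lab == graph_labels[v]}
        for v in graph_labels
    }

    # Repeat until no candidate is eliminated (a fixed point).
    changed = True
    while changed:
        changed = False
        for v in graph_labels:
            for q in list(cand[v]):
                satisfied = all(
                    any(q2 in cand[u] for u in g_adj[v]) for q2 in t_adj[q]
                )
                if not satisfied:
                    cand[v].discard(q)
                    changed = True

    vertices = {v for v, c in cand.items() if c}
    # Keep an edge only if its endpoints can match adjacent template vertices.
    kept_edges = {
        (u, v)
        for u, v in graph_edges
        if any(q2 in cand[v] for q in cand[u] for q2 in t_adj[q])
    }
    return vertices, kept_edges


# Toy example (assumed data): template is the labeled path A - B - C.
template_labels = {"a": "A", "b": "B", "c": "C"}
template_edges = [("a", "b"), ("b", "c")]
graph_labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "D"}
graph_edges = [(1, 2), (2, 3), (4, 5), (5, 6)]

vertices, kept_edges = prune(
    graph_edges, graph_labels, template_edges, template_labels
)
print(sorted(vertices))    # [1, 2, 3]
print(sorted(kept_edges))  # [(1, 2), (2, 3)]
```

In the toy graph, vertex 5 has no neighbor labeled C, so it is eliminated first; vertex 4 then loses its only B neighbor and is eliminated in the next pass. This cascading effect is why the constraints are iterated rather than checked once.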

Authors:
Reza, Tahsin [1]; Halawa, Hassan [1]; Ripeanu, Matei [1]; Sanders, Geoffrey [2]; Pearce, Roger A. [2]
  1. Univ. of British Columbia, Vancouver, BC (Canada)
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
2021-01-04
Research Org.:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1769153
Report Number(s):
LLNL-JRNL-817625
Journal ID: ISSN 2329-4949; 1022376
Grant/Contract Number:  
AC52-07NA27344
Resource Type:
Accepted Manuscript
Journal Name:
ACM Transactions on Parallel Computing
Additional Journal Information:
Journal Volume: 8; Journal Issue: 1; Journal ID: ISSN 2329-4949
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Computer systems organization→Distributed architectures; Information systems→Data mining; Mathematics of computing→Graph algorithms; Pattern matching; Subgraph isomorphism; Graph processing; Distributed computing

Citation Formats

Reza, Tahsin, Halawa, Hassan, Ripeanu, Matei, Sanders, Geoffrey, and Pearce, Roger A. Scalable Pattern Matching in Metadata Graphs via Constraint Checking. United States: N. p., 2021. Web. doi:10.1145/3434391.
Reza, Tahsin, Halawa, Hassan, Ripeanu, Matei, Sanders, Geoffrey, & Pearce, Roger A. Scalable Pattern Matching in Metadata Graphs via Constraint Checking. United States. https://doi.org/10.1145/3434391
Reza, Tahsin, Halawa, Hassan, Ripeanu, Matei, Sanders, Geoffrey, and Pearce, Roger A. 2021. "Scalable Pattern Matching in Metadata Graphs via Constraint Checking". United States. https://doi.org/10.1145/3434391. https://www.osti.gov/servlets/purl/1769153.
@article{osti_1769153,
title = {Scalable Pattern Matching in Metadata Graphs via Constraint Checking},
author = {Reza, Tahsin and Halawa, Hassan and Ripeanu, Matei and Sanders, Geoffrey and Pearce, Roger A.},
doi = {10.1145/3434391},
journal = {ACM Transactions on Parallel Computing},
number = 1,
volume = 8,
place = {United States},
year = {2021},
month = {1}
}

Works referenced in this record:

Taming verification hardness: an efficient algorithm for testing subgraph isomorphism
journal, August 2008

  • Shang, Haichuan; Zhang, Ying; Lin, Xuemin
  • Proceedings of the VLDB Endowment, Vol. 1, Issue 1
  • DOI: 10.14778/1453856.1453899

Fast Graph Pattern Matching
conference, April 2008

  • Cheng, Jiefeng; Yu, Jeffrey Xu; Ding, Bolin
  • 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008)
  • DOI: 10.1109/ICDE.2008.4497500

Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees
conference, June 2020

  • Reza, Tahsin; Ripeanu, Matei; Sanders, Geoffrey
  • SIGMOD/PODS '20: International Conference on Management of Data, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
  • DOI: 10.1145/3318464.3380566

What is Twitter, a social network or a news media?
conference, January 2010

  • Kwak, Haewoon; Lee, Changhyun; Park, Hosung
  • Proceedings of the 19th international conference on World wide web - WWW '10
  • DOI: 10.1145/1772690.1772751

Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks
conference, January 2011

  • Boldi, Paolo; Rosa, Marco; Santini, Massimo
  • Proceedings of the 20th international conference on World wide web - WWW '11
  • DOI: 10.1145/1963405.1963488

Continuous pattern detection over billion-edge graph using distributed framework
conference, March 2014

  • Gao, Jun; Zhou, Chang; Zhou, Jiashuai
  • 2014 IEEE 30th International Conference on Data Engineering (ICDE)
  • DOI: 10.1109/ICDE.2014.6816681

node2vec: Scalable Feature Learning for Networks
conference, January 2016

  • Grover, Aditya; Leskovec, Jure
  • Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16
  • DOI: 10.1145/2939672.2939754

Pregel: a system for large-scale graph processing
conference, January 2010

  • Malewicz, Grzegorz; Austern, Matthew H.; Bik, Aart J. C.
  • Proceedings of the 2010 international conference on Management of data - SIGMOD '10
  • DOI: 10.1145/1807167.1807184

Group formation in large social networks: membership, growth, and evolution
conference, January 2006

  • Backstrom, Lars; Huttenlocher, Dan; Kleinberg, Jon
  • Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '06
  • DOI: 10.1145/1150402.1150412

Fractal: A General-Purpose Graph Pattern Mining System
conference, June 2019

  • Dias, Vinicius; Teixeira, Carlos H. C.; Guedes, Dorgival
  • SIGMOD/PODS '19: International Conference on Management of Data, Proceedings of the 2019 International Conference on Management of Data
  • DOI: 10.1145/3299869.3319875

RolX: structural role extraction & mining in large graphs
conference, January 2012

  • Henderson, Keith; Gallagher, Brian; Eliassi-Rad, Tina
  • Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12
  • DOI: 10.1145/2339530.2339723

A large time-aware web graph
journal, November 2008


GraMi: frequent subgraph and pattern mining in a single large graph
journal, March 2014

  • Elseidy, Mohammed; Abdelhamid, Ehab; Skiadopoulos, Spiros
  • Proceedings of the VLDB Endowment, Vol. 7, Issue 7
  • DOI: 10.14778/2732286.2732289

Exemplar or matching: modeling DCJ problems with unequal content genome data
journal, August 2015

  • Yin, Zhaoming; Tang, Jijun; Schaeffer, Stephen W.
  • Journal of Combinatorial Optimization, Vol. 32, Issue 4
  • DOI: 10.1007/s10878-015-9940-4

TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing
conference, June 2014

  • Gurajada, Sairam; Seufert, Stephan; Miliaraki, Iris
  • SIGMOD/PODS'14: International Conference on Management of Data, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
  • DOI: 10.1145/2588555.2610511

Efficient subgraph matching on billion node graphs
journal, May 2012

  • Sun, Zhao; Wang, Hongzhi; Wang, Haixun
  • Proceedings of the VLDB Endowment, Vol. 5, Issue 9
  • DOI: 10.14778/2311906.2311907

Towards Interactive Pattern Search in Massive Graphs
conference, June 2020

  • Reza, Tahsin; Ripeanu, Matei; Sanders, Geoffrey
  • SIGMOD/PODS '20: International Conference on Management of Data, Proceedings of the 3rd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA)
  • DOI: 10.1145/3398682.3399166

PGX.D/Async: A Scalable Distributed Graph Pattern Matching Engine
conference, May 2017

  • Roth, Nicholas P.; Trigonakis, Vasileios; Hong, Sungpack
  • SIGMOD/PODS'17: International Conference on Management of Data, Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems
  • DOI: 10.1145/3078447.3078454

Counting triangles and the curse of the last reducer
conference, January 2011

  • Suri, Siddharth; Vassilvitskii, Sergei
  • Proceedings of the 20th international conference on World wide web - WWW '11
  • DOI: 10.1145/1963405.1963491

GraphFrames: an integrated API for mixing graph and relational queries
conference, January 2016

  • Dave, Ankur; Jindal, Alekh; Li, Li Erran
  • Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems - GRADES '16
  • DOI: 10.1145/2960414.2960416

Classification of software behaviors for failure detection: a discriminative pattern mining approach
conference, January 2009

  • Lo, David; Cheng, Hong; Han, Jiawei
  • Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09
  • DOI: 10.1145/1557019.1557083

Distributed graph pattern matching
conference, January 2012

  • Ma, Shuai; Cao, Yang; Huai, Jinpeng
  • Proceedings of the 21st international conference on World Wide Web - WWW '12
  • DOI: 10.1145/2187836.2187963

In-Memory Graph Databases for Web-Scale Data
journal, March 2015

  • Castellana, Vito Giovanni; Morari, Alessandro; Weaver, Jesse
  • Computer, Vol. 48, Issue 3
  • DOI: 10.1109/MC.2015.74

On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures
journal, September 2007

  • Low, Diana H. P.; Veeravalli, Bharadwaj; Bader, David A.
  • Journal of Parallel and Distributed Computing, Vol. 67, Issue 9
  • DOI: 10.1016/j.jpdc.2007.03.007

PGX.D: a fast distributed graph processing engine
conference, November 2015

  • Hong, Sungpack; Depner, Siegfried; Manhardt, Thomas
  • SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/2807591.2807620

Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs
conference, September 2017

  • Reza, Tahsin; Klymko, Christine; Ripeanu, Matei
  • 2017 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2017.85

G-Tries: a data structure for storing and finding subgraphs
journal, February 2013


Arabesque: a system for distributed graph mining
conference, October 2015

  • Teixeira, Carlos H. C.; Fonseca, Alexandre J.; Serafini, Marco
  • SOSP '15: ACM SIGOPS 25th Symposium on Operating Systems Principles, Proceedings of the 25th Symposium on Operating Systems Principles
  • DOI: 10.1145/2815400.2815410

Efficient Processing of Large Graphs via Input Reduction
conference, May 2016

  • Kusum, Amlan; Vora, Keval; Gupta, Rajiv
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
  • DOI: 10.1145/2907294.2907312

Enabling real time data analysis
journal, September 2010

  • Srivastava, Divesh; Golab, Lukasz; Greer, Rick
  • Proceedings of the VLDB Endowment, Vol. 3, Issue 1-2
  • DOI: 10.14778/1920841.1920843

Inexact graph matching for structural pattern recognition
journal, May 1983


PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution
conference, November 2018

  • Reza, Tahsin; Ripeanu, Matei; Tripoul, Nicolas
  • SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2018.00024

Inexact subgraph isomorphism in MapReduce
journal, February 2013


A distributed vertex-centric approach for pattern matching in massive graphs
conference, October 2013


G-Miner: an efficient task-oriented graph mining system
conference, April 2018

  • Chen, Hongzhi; Liu, Miao; Zhao, Yunjian
  • EuroSys '18: Thirteenth EuroSys Conference 2018, Proceedings of the Thirteenth EuroSys Conference
  • DOI: 10.1145/3190508.3190545

GraphMat: high performance graph analytics made productive
journal, July 2015

  • Sundaram, Narayanan; Satish, Nadathur; Patwary, Md Mostofa Ali
  • Proceedings of the VLDB Endowment, Vol. 8, Issue 11
  • DOI: 10.14778/2809974.2809983

Implementing sparse matrix-vector multiplication on throughput-oriented processors
conference, January 2009

  • Bell, Nathan; Garland, Michael
  • Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09
  • DOI: 10.1145/1654059.1654078

Practical graph isomorphism, II
journal, January 2014


Biomolecular network motif counting and discovery by color coding
journal, June 2008


Real-time twitter recommendation: online motif detection in large dynamic graphs
journal, August 2014

  • Gupta, Pankaj; Satuluri, Venu; Grewal, Ajeet
  • Proceedings of the VLDB Endowment, Vol. 7, Issue 13
  • DOI: 10.14778/2733004.2733010

Fast Connected Components Computation in Large Graphs by Vertex Pruning
journal, March 2017

  • Lulli, Alessandro; Carlini, Emanuele; Dazzi, Patrizio
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 3
  • DOI: 10.1109/TPDS.2016.2591038

A survey and experimental comparison of distributed SPARQL engines for very large RDF data
journal, September 2017

  • Abdelaziz, Ibrahim; Harbi, Razen; Khayyat, Zuhair
  • Proceedings of the VLDB Endowment, Vol. 10, Issue 13
  • DOI: 10.14778/3151106.3151109

Turbo iso: towards ultrafast and robust subgraph isomorphism search in large graph databases
conference, January 2013

  • Han, Wook-Shin; Lee, Jinsoo; Lee, Jeong-Hoon
  • Proceedings of the 2013 international conference on Management of data - SIGMOD '13
  • DOI: 10.1145/2463676.2465300

Fast best-effort pattern matching in large attributed graphs
conference, January 2007

  • Tong, Hanghang; Faloutsos, Christos; Gallagher, Brian
  • Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07
  • DOI: 10.1145/1281192.1281271

Efficient distributed subgraph similarity matching
journal, March 2015


Graph pattern matching: from intractable to polynomial time
journal, September 2010

  • Fan, Wenfei; Li, Jianzhong; Ma, Shuai
  • Proceedings of the VLDB Endowment, Vol. 3, Issue 1-2
  • DOI: 10.14778/1920841.1920878

QFrag: distributed graph search via subgraph isomorphism
conference, September 2017

  • Serafini, Marco; De Francisci Morales, Gianmarco; Siganos, Georgos
  • SoCC '17: ACM Symposium on Cloud Computing, Proceedings of the 2017 Symposium on Cloud Computing
  • DOI: 10.1145/3127479.3131625

Diversified top-k graph pattern matching
journal, August 2013


A (sub)graph isomorphism algorithm for matching large graphs
journal, October 2004

  • Cordella, L. P.; Foggia, P.; Sansone, C.
  • IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, Issue 10
  • DOI: 10.1109/TPAMI.2004.75

PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly
journal, May 2013

  • Liu, Xing; Pande, Pushkar R.; Meyerhenke, Henning
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 24, Issue 5
  • DOI: 10.1109/TPDS.2012.190

An Algorithm for Subgraph Isomorphism
journal, January 1976


R-MAT: A Recursive Model for Graph Mining
conference, December 2013

  • Chakrabarti, Deepayan; Zhan, Yiping; Faloutsos, Christos
  • Proceedings of the 2004 SIAM International Conference on Data Mining
  • DOI: 10.1137/1.9781611972740.43

Distributed quiescence detection in multiagent negotiation
conference, January 2000

  • Wellman, M. P.; Walsh, W. E.
  • Proceedings Fourth International Conference on MultiAgent Systems
  • DOI: 10.1109/ICMAS.2000.858469

KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations
journal, May 2017

  • Vora, Keval; Gupta, Rajiv; Xu, Guoqing
  • ACM SIGARCH Computer Architecture News, Vol. 45, Issue 1
  • DOI: 10.1145/3093337.3037748

Thirty Years of Graph Matching in Pattern Recognition
journal, May 2004

  • Conte, D.; Foggia, P.; Sansone, C.
  • International Journal of Pattern Recognition and Artificial Intelligence, Vol. 18, Issue 03
  • DOI: 10.1142/s0218001404003228

A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
dataset, January 2015