Scalable Pattern Matching in Metadata Graphs via Constraint Checking
Abstract
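Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: They do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and typically are less expensive to compute than searching the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches, that is, the complete set of all vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables tradeoffs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.
We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, and at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than used in the past for similar problems. This article serves two purposes: First, it synthesises the knowledge accumulated during a long-term project. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice, interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque and TriAD. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.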
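The pruning idea in the abstract can be illustrated with a small, single-machine sketch. It is not the authors' implementation: the function name `prune_by_local_constraints`, the adjacency-dictionary representation, and the toy data in the usage note are all assumptions made for illustration. The sketch applies only one kind of constraint, a local neighbourhood check: a background vertex may play the role of a template vertex only if every template neighbour of that role is covered by at least one of its own neighbours. Repeating the check until nothing changes never removes a vertex that belongs to a match, but for templates with cycles it can leave extra vertices behind, so further constraints would be needed to reach full precision.

```python
def prune_by_local_constraints(graph, labels, template, template_labels):
    """Iteratively prune a labeled, undirected background graph.

    graph, template: dict mapping each vertex to the set of its neighbours.
    labels, template_labels: dict mapping each vertex to its label.
    Returns (active_vertices, active_edges); edges are sorted tuples.
    Illustrative sketch only (hypothetical helper, not the paper's code).
    """
    # Start with every template role whose label matches the vertex label.
    candidates = {
        v: {q for q in template if template_labels[q] == labels[v]}
        for v in graph
    }

    changed = True
    while changed:  # repeat until a fixed point is reached
        changed = False
        for v in graph:
            for q in list(candidates[v]):
                # v can keep role q only if each template neighbour of q
                # is a candidate role of at least one neighbour of v.
                satisfied = all(
                    any(q2 in candidates[u] for u in graph[v])
                    for q2 in template[q]
                )
                if not satisfied:
                    candidates[v].discard(q)
                    changed = True

    active_vertices = {v for v, roles in candidates.items() if roles}

    # Keep an edge only if its endpoints can play adjacent template roles.
    active_edges = set()
    for v in active_vertices:
        for u in graph[v]:
            if u in active_vertices and any(
                q2 in template[q]
                for q in candidates[v]
                for q2 in candidates[u]
            ):
                active_edges.add(tuple(sorted((v, u))))

    return active_vertices, active_edges
```

As a usage example under the same assumptions, take a path template with labels a, b, c and a background graph in which vertices 1, 2, 3 form such a path while vertices 4 and 5 form only an a-b pair: calling the function with `graph = {1: {2}, 2: {1, 3}, 3: {2, 6}, 4: {5}, 5: {4}, 6: {3}}` removes 5 first (no c-labeled neighbour), then 4 (its only b-labeled neighbour is gone), and keeps vertices 1, 2, 3 with the two edges between them.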
- Authors:
- Reza, Tahsin; Halawa, Hassan; Ripeanu, Matei; Sanders, Geoffrey; Pearce, Roger A.
- Univ. of British Columbia, Vancouver, BC (Canada)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Publication Date:
- 2021-01-04
- Research Org.:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1769153
- Report Number(s):
- LLNL-JRNL-817625; Journal ID: ISSN 2329-4949; 1022376
- Grant/Contract Number:
- AC52-07NA27344
- Resource Type:
- Accepted Manuscript
- Journal Name:
- ACM Transactions on Parallel Computing
- Additional Journal Information:
- Journal Volume: 8; Journal Issue: 1; Journal ID: ISSN 2329-4949
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Computer systems organization→Distributed architectures; Information systems→Data mining; Mathematics of computing→Graph algorithms; Pattern matching; Subgraph isomorphism; Graph processing; Distributed computing
Citation Formats
Reza, Tahsin, Halawa, Hassan, Ripeanu, Matei, Sanders, Geoffrey, and Pearce, Roger A. Scalable Pattern Matching in Metadata Graphs via Constraint Checking. United States: N. p., 2021. Web. doi:10.1145/3434391.
Reza, Tahsin, Halawa, Hassan, Ripeanu, Matei, Sanders, Geoffrey, & Pearce, Roger A. Scalable Pattern Matching in Metadata Graphs via Constraint Checking. United States. https://doi.org/10.1145/3434391
Reza, Tahsin, Halawa, Hassan, Ripeanu, Matei, Sanders, Geoffrey, and Pearce, Roger A. 2021. "Scalable Pattern Matching in Metadata Graphs via Constraint Checking". United States. https://doi.org/10.1145/3434391. https://www.osti.gov/servlets/purl/1769153.
@article{osti_1769153,
title = {Scalable Pattern Matching in Metadata Graphs via Constraint Checking},
author = {Reza, Tahsin and Halawa, Hassan and Ripeanu, Matei and Sanders, Geoffrey and Pearce, Roger A.},
abstractNote = {Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: They do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and typically are less expensive to compute than searching the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches, that is, the complete set of all vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables tradeoffs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.
We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, and at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than used in the past for similar problems. This article serves two purposes: First, it synthesises the knowledge accumulated during a long-term project. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice, interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque and TriAD. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.},
doi = {10.1145/3434391},
journal = {ACM Transactions on Parallel Computing},
number = 1,
volume = 8,
place = {United States},
year = {2021},
month = {1}
}