Scalable Pattern Matching in Metadata Graphs via Constraint Checking

Reza, Tahsin; Halawa, Hassan; Ripeanu, Matei; Sanders, Geoffrey; Pearce, Roger A.

doi:10.1145/3434391

Journal Article · January 4, 2021 · ACM Transactions on Parallel Computing

DOI: https://doi.org/10.1145/3434391 · OSTI ID: 1769153

Reza, Tahsin [1]; Halawa, Hassan [1]; Ripeanu, Matei [1]; Sanders, Geoffrey [2]; Pearce, Roger A. [2]

[1] Univ. of British Columbia, Vancouver, BC (Canada)
[2] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: they do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. Moreover, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize. In this article we present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex and edge participating in a match has to meet a set of constraints implicitly specified by the search template. These constraints can be verified independently and typically are less expensive to compute than searching the full template. The pipeline we propose generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, thus reducing the background graph to a subgraph that is the union of all template matches, that is, the complete set of all vertices and edges that participate in at least one match. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration, match counting, or computing vertex/edge centrality. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. This technique (i) enables highly scalable pattern matching in metadata (labeled) graphs; (ii) supports arbitrary patterns with 100% precision; (iii) enables tradeoffs between precision and time-to-solution, while always selecting all vertices and edges that participate in matches, thus offering 100% recall; and (iv) supports a set of popular data analytics scenarios.
We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs, respectively, and at scales (1,024 nodes / 36,864 cores) orders of magnitude larger than those used in the past for similar problems. This article serves two purposes: First, it synthesises the knowledge accumulated during a long-term project. Second, it presents new system features, usage scenarios, optimizations, and comparisons with related work that strengthen the confidence that pattern matching based on iterative pruning via constraint checking is an effective and scalable approach in practice. The new contributions include the following: (i) We demonstrate the ability of the constraint checking approach to efficiently support two additional search scenarios that often emerge in practice, interactive incremental search and exploratory search. (ii) We empirically compare our solution with two additional state-of-the-art systems, Arabesque and TriAD. (iii) We show the ability of our solution to accommodate a more diverse range of datasets with varying properties, e.g., scale, skewness, label distribution, and match frequency. (iv) We introduce or extend a number of system features (e.g., work aggregation, load balancing, and the ability to cap the generated traffic) and design optimizations and demonstrate their advantages with respect to improving performance and scalability. (v) We present bottleneck analysis and insights into artifacts that influence performance. (vi) We present a theoretical complexity argument that motivates the performance gains we observe.
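
The pruning idea described in the abstract can be illustrated with a small, single-machine sketch. It is not the authors' HavoqGT implementation: the function name `prune_by_local_constraints`, the data layout, and the toy graph are all assumptions made for illustration. The sketch checks only simple local constraints (label agreement plus the presence of suitably labeled neighbors), so for templates with cycles or repeated labels it may keep vertices that a full pipeline would eliminate; it preserves recall but does not by itself guarantee precision.

```python
from collections import defaultdict


def prune_by_local_constraints(edges, labels, template_edges, template_labels):
    """Iteratively drop vertices and edges that cannot take part in any match.

    Hypothetical helper for illustration. A background vertex v may play the
    role of template vertex q if their labels agree and, for every template
    neighbor q2 of q, v has at least one neighbor that may still play q2.
    """
    # Undirected adjacency lists for the background graph and the template.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    t_adj = defaultdict(set)
    for a, b in template_edges:
        t_adj[a].add(b)
        t_adj[b].add(a)

    # candidates[v] holds the template vertices that v may still match.
    candidates = {
        v: {q for q, q_label in template_labels.items() if q_label == label}
        for v, label in labels.items()
    }

    # Repeat until a full pass eliminates nothing (a fixed point).
    changed = True
    while changed:
        changed = False
        for v in list(candidates):
            for q in list(candidates[v]):
                satisfied = all(
                    any(q2 in candidates.get(w, ()) for w in adj[v])
                    for q2 in t_adj[q]
                )
                if not satisfied:
                    candidates[v].discard(q)
                    changed = True

    kept_vertices = {v for v, roles in candidates.items() if roles}
    # Keep an edge only if its endpoints can play two adjacent template roles.
    kept_edges = {
        (u, v)
        for u, v in edges
        if any(
            q2 in t_adj[q]
            for q in candidates.get(u, ())
            for q2 in candidates.get(v, ())
        )
    }
    return kept_vertices, kept_edges


# Toy example: the template is the labeled path A - B - C.
template_labels = {0: "A", 1: "B", 2: "C"}
template_edges = [(0, 1), (1, 2)]

# Background graph: 1-2-3 matches the template; 4-5-6 does not (6 has label D).
labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "D"}
edges = [(1, 2), (2, 3), (4, 5), (5, 6)]

kept_vertices, kept_edges = prune_by_local_constraints(
    edges, labels, template_edges, template_labels
)
print(sorted(kept_vertices))  # [1, 2, 3]
print(sorted(kept_edges))     # [(1, 2), (2, 3)]
```

In the toy run, vertex 5 is eliminated first because it has no neighbor labeled C, and vertex 4 is eliminated in the next pass because its only B neighbor is gone. This cascading effect is why the check is iterated to a fixed point. Each check reads only a vertex and its neighbors, which is what makes a vertex-centric, distributed formulation plausible.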

Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: AC52-07NA27344

OSTI ID: 1769153

Report Number(s): LLNL-JRNL-817625; 1022376

Journal Information: ACM Transactions on Parallel Computing, Vol. 8, Issue 1; ISSN 2329-4949

Publisher: Association for Computing Machinery

Country of Publication: United States

Language: English

References (57)

Taming verification hardness: an efficient algorithm for testing subgraph isomorphism Shang, Haichuan; Zhang, Ying; Lin, Xuemin Proceedings of the VLDB Endowment, Vol. 1, Issue 1 https://doi.org/10.14778/1453856.1453899	journal	August 2008
Fast Graph Pattern Matching Cheng, Jiefeng; Yu, Jeffrey Xu; Ding, Bolin 2008 IEEE 24th International Conference on Data Engineering (ICDE 2008) https://doi.org/10.1109/ICDE.2008.4497500	conference	April 2008
Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees Reza, Tahsin; Ripeanu, Matei; Sanders, Geoffrey SIGMOD/PODS '20: International Conference on Management of Data, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data https://doi.org/10.1145/3318464.3380566	conference	June 2020
What is Twitter, a social network or a news media? Kwak, Haewoon; Lee, Changhyun; Park, Hosung Proceedings of the 19th international conference on World wide web - WWW '10 https://doi.org/10.1145/1772690.1772751	conference	January 2010
Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks Boldi, Paolo; Rosa, Marco; Santini, Massimo Proceedings of the 20th international conference on World wide web - WWW '11 https://doi.org/10.1145/1963405.1963488	conference	January 2011
Continuous pattern detection over billion-edge graph using distributed framework Gao, Jun; Zhou, Chang; Zhou, Jiashuai 2014 IEEE 30th International Conference on Data Engineering (ICDE) https://doi.org/10.1109/ICDE.2014.6816681	conference	March 2014
node2vec: Scalable Feature Learning for Networks Grover, Aditya; Leskovec, Jure Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16 https://doi.org/10.1145/2939672.2939754	conference	January 2016
Pregel: a system for large-scale graph processing Malewicz, Grzegorz; Austern, Matthew H.; Bik, Aart J. C. Proceedings of the 2010 international conference on Management of data - SIGMOD '10 https://doi.org/10.1145/1807167.1807184	conference	January 2010
Group formation in large social networks: membership, growth, and evolution Backstrom, Lars; Huttenlocher, Dan; Kleinberg, Jon Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '06 https://doi.org/10.1145/1150402.1150412	conference	January 2006
Fractal: A General-Purpose Graph Pattern Mining System Dias, Vinicius; Teixeira, Carlos H. C.; Guedes, Dorgival SIGMOD/PODS '19: International Conference on Management of Data, Proceedings of the 2019 International Conference on Management of Data https://doi.org/10.1145/3299869.3319875	conference	June 2019
RolX: structural role extraction & mining in large graphs Henderson, Keith; Gallagher, Brian; Eliassi-Rad, Tina Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12 https://doi.org/10.1145/2339530.2339723	conference	January 2012
A large time-aware web graph Boldi, Paolo; Santini, Massimo; Vigna, Sebastiano ACM SIGIR Forum, Vol. 42, Issue 2 https://doi.org/10.1145/1480506.1480511	journal	November 2008
GraMi: frequent subgraph and pattern mining in a single large graph Elseidy, Mohammed; Abdelhamid, Ehab; Skiadopoulos, Spiros Proceedings of the VLDB Endowment, Vol. 7, Issue 7 https://doi.org/10.14778/2732286.2732289	journal	March 2014
Exemplar or matching: modeling DCJ problems with unequal content genome data Yin, Zhaoming; Tang, Jijun; Schaeffer, Stephen W. Journal of Combinatorial Optimization, Vol. 32, Issue 4 https://doi.org/10.1007/s10878-015-9940-4	journal	August 2015
TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing Gurajada, Sairam; Seufert, Stephan; Miliaraki, Iris SIGMOD/PODS'14: International Conference on Management of Data, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data https://doi.org/10.1145/2588555.2610511	conference	June 2014
Efficient subgraph matching on billion node graphs Sun, Zhao; Wang, Hongzhi; Wang, Haixun Proceedings of the VLDB Endowment, Vol. 5, Issue 9 https://doi.org/10.14778/2311906.2311907	journal	May 2012
Towards Interactive Pattern Search in Massive Graphs Reza, Tahsin; Ripeanu, Matei; Sanders, Geoffrey SIGMOD/PODS '20: International Conference on Management of Data, Proceedings of the 3rd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) https://doi.org/10.1145/3398682.3399166	conference	June 2020
PGX.D/Async: A Scalable Distributed Graph Pattern Matching Engine Roth, Nicholas P.; Trigonakis, Vasileios; Hong, Sungpack SIGMOD/PODS'17: International Conference on Management of Data, Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems https://doi.org/10.1145/3078447.3078454	conference	May 2017
Counting triangles and the curse of the last reducer Suri, Siddharth; Vassilvitskii, Sergei Proceedings of the 20th international conference on World wide web - WWW '11 https://doi.org/10.1145/1963405.1963491	conference	January 2011
GraphFrames: an integrated API for mixing graph and relational queries Dave, Ankur; Jindal, Alekh; Li, Li Erran Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems - GRADES '16 https://doi.org/10.1145/2960414.2960416	conference	January 2016
Classification of software behaviors for failure detection: a discriminative pattern mining approach Lo, David; Cheng, Hong; Han, Jiawei Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09 https://doi.org/10.1145/1557019.1557083	conference	January 2009
Distributed graph pattern matching Ma, Shuai; Cao, Yang; Huai, Jinpeng Proceedings of the 21st international conference on World Wide Web - WWW '12 https://doi.org/10.1145/2187836.2187963	conference	January 2012
In-Memory Graph Databases for Web-Scale Data Castellana, Vito Giovanni; Morari, Alessandro; Weaver, Jesse Computer, Vol. 48, Issue 3 https://doi.org/10.1109/MC.2015.74	journal	March 2015
On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures Low, Diana H. P.; Veeravalli, Bharadwaj; Bader, David A. Journal of Parallel and Distributed Computing, Vol. 67, Issue 9 https://doi.org/10.1016/j.jpdc.2007.03.007	journal	September 2007
PGX.D: a fast distributed graph processing engine Hong, Sungpack; Depner, Siegfried; Manhardt, Thomas SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2807591.2807620	conference	November 2015
Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs Reza, Tahsin; Klymko, Christine; Ripeanu, Matei 2017 IEEE International Conference on Cluster Computing (CLUSTER) https://doi.org/10.1109/CLUSTER.2017.85	conference	September 2017
G-Tries: a data structure for storing and finding subgraphs Ribeiro, Pedro; Silva, Fernando Data Mining and Knowledge Discovery, Vol. 28, Issue 2 https://doi.org/10.1007/s10618-013-0303-4	journal	February 2013
Arabesque: a system for distributed graph mining Teixeira, Carlos H. C.; Fonseca, Alexandre J.; Serafini, Marco SOSP '15: ACM SIGOPS 25th Symposium on Operating Systems Principles, Proceedings of the 25th Symposium on Operating Systems Principles https://doi.org/10.1145/2815400.2815410	conference	October 2015
Efficient Processing of Large Graphs via Input Reduction Kusum, Amlan; Vora, Keval; Gupta, Rajiv HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing https://doi.org/10.1145/2907294.2907312	conference	May 2016
Enabling real time data analysis Srivastava, Divesh; Golab, Lukasz; Greer, Rick Proceedings of the VLDB Endowment, Vol. 3, Issue 1-2 https://doi.org/10.14778/1920841.1920843	journal	September 2010
Inexact graph matching for structural pattern recognition Bunke, H.; Allermann, G. Pattern Recognition Letters, Vol. 1, Issue 4 https://doi.org/10.1016/0167-8655(83)90033-8	journal	May 1983
PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution Reza, Tahsin; Ripeanu, Matei; Tripoul, Nicolas SC18: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2018.00024	conference	November 2018
Inexact subgraph isomorphism in MapReduce Plantenga, Todd Journal of Parallel and Distributed Computing, Vol. 73, Issue 2 https://doi.org/10.1016/j.jpdc.2012.10.005	journal	February 2013
A distributed vertex-centric approach for pattern matching in massive graphs Fard, Arash; Nisar, M. Usman; Ramaswamy, Lakshmish 2013 IEEE International Conference on Big Data https://doi.org/10.1109/BigData.2013.6691601	conference	October 2013
G-Miner: an efficient task-oriented graph mining system Chen, Hongzhi; Liu, Miao; Zhao, Yunjian EuroSys '18: Thirteenth EuroSys Conference 2018, Proceedings of the Thirteenth EuroSys Conference https://doi.org/10.1145/3190508.3190545	conference	April 2018
GraphMat: high performance graph analytics made productive Sundaram, Narayanan; Satish, Nadathur; Patwary, Md Mostofa Ali Proceedings of the VLDB Endowment, Vol. 8, Issue 11 https://doi.org/10.14778/2809974.2809983	journal	July 2015
Implementing sparse matrix-vector multiplication on throughput-oriented processors Bell, Nathan; Garland, Michael Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09 https://doi.org/10.1145/1654059.1654078	conference	January 2009
Practical graph isomorphism, II McKay, Brendan D.; Piperno, Adolfo Journal of Symbolic Computation, Vol. 60 https://doi.org/10.1016/j.jsc.2013.09.003	journal	January 2014
Biomolecular network motif counting and discovery by color coding Alon, N.; Dao, P.; Hajirasouliha, I. Bioinformatics, Vol. 24, Issue 13 https://doi.org/10.1093/bioinformatics/btn163	journal	June 2008
Real-time twitter recommendation: online motif detection in large dynamic graphs Gupta, Pankaj; Satuluri, Venu; Grewal, Ajeet Proceedings of the VLDB Endowment, Vol. 7, Issue 13 https://doi.org/10.14778/2733004.2733010	journal	August 2014
Fast Connected Components Computation in Large Graphs by Vertex Pruning Lulli, Alessandro; Carlini, Emanuele; Dazzi, Patrizio IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 3 https://doi.org/10.1109/TPDS.2016.2591038	journal	March 2017
A survey and experimental comparison of distributed SPARQL engines for very large RDF data Abdelaziz, Ibrahim; Harbi, Razen; Khayyat, Zuhair Proceedings of the VLDB Endowment, Vol. 10, Issue 13 https://doi.org/10.14778/3151106.3151109	journal	September 2017
Turbo_iso: towards ultrafast and robust subgraph isomorphism search in large graph databases Han, Wook-Shin; Lee, Jinsoo; Lee, Jeong-Hoon Proceedings of the 2013 international conference on Management of data - SIGMOD '13 https://doi.org/10.1145/2463676.2465300	conference	January 2013
Fast best-effort pattern matching in large attributed graphs Tong, Hanghang; Faloutsos, Christos; Gallagher, Brian Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07 https://doi.org/10.1145/1281192.1281271	conference	January 2007
Efficient distributed subgraph similarity matching Yuan, Ye; Wang, Guoren; Xu, Jeffery Yu The VLDB Journal, Vol. 24, Issue 3 https://doi.org/10.1007/s00778-015-0381-6	journal	March 2015
Graph pattern matching: from intractable to polynomial time Fan, Wenfei; Li, Jianzhong; Ma, Shuai Proceedings of the VLDB Endowment, Vol. 3, Issue 1-2 https://doi.org/10.14778/1920841.1920878	journal	September 2010
QFrag: distributed graph search via subgraph isomorphism Serafini, Marco; De Francisci Morales, Gianmarco; Siganos, Georgos SoCC '17: ACM Symposium on Cloud Computing, Proceedings of the 2017 Symposium on Cloud Computing https://doi.org/10.1145/3127479.3131625	conference	September 2017
Diversified top-k graph pattern matching Fan, Wenfei; Wang, Xin; Wu, Yinghui Proceedings of the VLDB Endowment, Vol. 6, Issue 13 https://doi.org/10.14778/2536258.2536263	journal	August 2013
A (sub)graph isomorphism algorithm for matching large graphs Cordella, L. P.; Foggia, P.; Sansone, C. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, Issue 10 https://doi.org/10.1109/TPAMI.2004.75	journal	October 2004
PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly Liu, Xing; Pande, Pushkar R.; Meyerhenke, Henning IEEE Transactions on Parallel and Distributed Systems, Vol. 24, Issue 5 https://doi.org/10.1109/TPDS.2012.190	journal	May 2013
An Algorithm for Subgraph Isomorphism Ullmann, J. R. Journal of the ACM, Vol. 23, Issue 1 https://doi.org/10.1145/321921.321925	journal	January 1976
R-MAT: A Recursive Model for Graph Mining Chakrabarti, Deepayan; Zhan, Yiping; Faloutsos, Christos Proceedings of the 2004 SIAM International Conference on Data Mining https://doi.org/10.1137/1.9781611972740.43	conference	December 2013
Distributed quiescence detection in multiagent negotiation Wellman, M. P.; Walsh, W. E. Proceedings Fourth International Conference on MultiAgent Systems https://doi.org/10.1109/ICMAS.2000.858469	conference	January 2000
KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations Vora, Keval; Gupta, Rajiv; Xu, Guoqing ACM SIGARCH Computer Architecture News, Vol. 45, Issue 1 https://doi.org/10.1145/3093337.3037748	journal	May 2017
Thirty Years of Graph Matching in Pattern Recognition Conte, D.; Foggia, P.; Sansone, C. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 18, Issue 03 https://doi.org/10.1142/s0218001404003228	journal	May 2004
Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks Boldi, Paolo; Rosa, Marco; Santini, Massimo arXiv https://doi.org/10.48550/arxiv.1011.5425	preprint	January 2010
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs Choudhury, Sutanay; Holder, Lawrence; Chin, George OpenProceedings.org https://doi.org/10.5441/002/edbt.2015.15	dataset	January 2015

Related Subjects

97 MATHEMATICS AND COMPUTING
Computer systems organization → Distributed architectures
Information systems → Data mining
Mathematics of computing → Graph algorithms
Pattern matching
Subgraph isomorphism
Graph processing
Distributed computing
