Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning

Cappelletti, Luca; Rekerle, Lauren; Fontana, Tommaso; Hansen, Peter; Casiraghi, Elena; Ravanmehr, Vida; Mungall, Christopher J.; Yang, Jeremy J.; Spranger, Leonard; Karlebach, Guy; Caufield, J. Harry; Carmody, Leigh; Coleman, Ben; Oprea, Tudor I.; Reese, Justin; Valentini, Giorgio; Robinson, Peter N.

doi:10.1093/bioadv/vbae036

Journal Article · March 4, 2024 · Bioinformatics Advances

DOI: https://doi.org/10.1093/bioadv/vbae036 · OSTI ID: 2375469

Author affiliations, in author order: [1]; [2]; [1]; [2]; [3]; [2]; [4]; [5]; [6]; [2]; [4]; [2]; [7]; [8]; [4]; [9]; [10]

[1] Univ. degli Studi di Milano (Italy)
[2] Jackson Laboratory for Genomic Medicine, Farmington, CT (United States)
[3] Univ. degli Studi di Milano (Italy); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
[4] Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
[5] Univ. of New Mexico, Albuquerque, NM (United States)
[6] Freie Univ., Berlin (Germany)
[7] Jackson Laboratory for Genomic Medicine, Farmington, CT (United States); Univ. of Connecticut, Farmington, CT (United States)
[8] Univ. of New Mexico, Albuquerque, NM (United States)
[9] Univ. degli Studi di Milano (Italy); European Laboratory for Learning and Intelligent Systems (ELLIS) (Europe)
[10] Jackson Laboratory for Genomic Medicine, Farmington, CT (United States); Univ. of Connecticut, Farmington, CT (United States); European Laboratory for Learning and Intelligent Systems (ELLIS) (Europe); Universitätsmedizin Berlin (Germany)

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations, called embeddings, of nodes and other graph elements. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes.

Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then the performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement.

Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
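The contrast between uniform and degree-aware negative sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is in the repository linked above). The function names `sample_negative_edges` and `mean_endpoint_degree` are hypothetical, the graph is assumed to be undirected, and drawing each endpoint with probability proportional to its degree is only one simple way to bring the node degree distribution of the negatives closer to that of the positives.

```python
import random
from collections import Counter


def sample_negative_edges(edges, n_samples, degree_aware=True, seed=0):
    """Sample distinct node pairs that are not positive edges (illustrative only).

    Assumes an undirected graph given as a list of (u, v) tuples.
    """
    rng = random.Random(seed)
    positive = {frozenset(e) for e in edges}
    degree = Counter(node for edge in edges for node in edge)
    nodes = sorted(degree)

    # Guard against an endless loop when too many negatives are requested.
    n = len(nodes)
    available = n * (n - 1) // 2 - sum(1 for p in positive if len(p) == 2)
    if n_samples > available:
        raise ValueError("not enough non-edges to sample from")

    # Degree-aware: each endpoint is drawn with probability proportional to
    # its degree, i.e. to how often it occurs as an endpoint of a positive edge.
    # Uniform (weights=None): every node is equally likely.
    weights = [degree[node] for node in nodes] if degree_aware else None

    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        if u != v and pair not in positive:  # reject self-loops and known edges
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


def mean_endpoint_degree(pairs, edges):
    """Average degree (in the positive graph) over all endpoints of `pairs`."""
    degree = Counter(node for edge in edges for node in edge)
    return sum(degree[u] + degree[v] for u, v in pairs) / (2 * len(pairs))


# Toy graph (an assumption for illustration): three hubs (0, 1, 2),
# each linked to 20 of the 60 leaf nodes 3..62.
edges = [(leaf % 3, leaf) for leaf in range(3, 63)]

uniform = sample_negative_edges(edges, 100, degree_aware=False, seed=1)
aware = sample_negative_edges(edges, 100, degree_aware=True, seed=1)

print("positives:", mean_endpoint_degree(edges, edges))
print("uniform negatives:", mean_endpoint_degree(uniform, edges))
print("degree-aware negatives:", mean_endpoint_degree(aware, edges))
```

On a hub-dominated toy graph like this one, uniformly sampled negatives consist mostly of low-degree nodes, whereas every positive edge touches a hub. A classifier could then separate the two sets using node degree alone, which is the kind of inflated validation performance the abstract warns about.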

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: National Cancer Institute (NCI); National Institutes of Health (NIH); USDOE Office of Science (SC), Biological and Environmental Research (BER)

Grant/Contract Number: AC02-05CH11231

Journal Information: Bioinformatics Advances, Vol. 4, Issue 1; ISSN 2635-0041

Publisher: Oxford University Press

Country of Publication: United States

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
