U.S. Department of Energy
Office of Scientific and Technical Information

Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning

Journal Article · Bioinformatics Advances
  1. Univ. degli Studi di Milano (Italy)
  2. Jackson Laboratory for Genomic Medicine, Farmington, CT (United States)
  3. Univ. degli Studi di Milano (Italy); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
  4. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
  5. Univ. of New Mexico, Albuquerque, NM (United States)
  6. Freie Univ., Berlin (Germany)
  7. Jackson Laboratory for Genomic Medicine, Farmington, CT (United States); Univ. of Connecticut, Farmington, CT (United States)
  8. Univ. of New Mexico, Albuquerque, NM (United States)
  9. Univ. degli Studi di Milano (Italy); European Laboratory for Learning and Intelligent Systems (ELLIS) (Europe)
  10. Jackson Laboratory for Genomic Medicine, Farmington, CT (United States); Univ. of Connecticut, Farmington, CT (United States); European Laboratory for Learning and Intelligent Systems (ELLIS) (Europe); Universitätsmedizin Berlin (Germany)

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations, called embeddings, of nodes and other graph elements. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes.

Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then model performance as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement.

Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
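The contrast between uniform and degree-aware negative sampling can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes one plausible reading of "degree-aware", namely that each endpoint of a negative edge is drawn with probability proportional to its degree in the positive graph. The toy graph and the names `canonical`, `node_degrees`, `sample_negatives`, and `mean_endpoint_degree` are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter


def canonical(u, v):
    """Return an undirected edge as an ordered tuple so (u, v) == (v, u)."""
    return (u, v) if u <= v else (v, u)


def node_degrees(edges):
    """Count how many positive edges touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def sample_negatives(edges, k, rng, degree_aware):
    """Sample k distinct node pairs that are not positive edges.

    degree_aware=False: endpoints are drawn uniformly over nodes.
    degree_aware=True:  endpoints are drawn from the flat list of edge
    endpoints, so a node is picked with probability proportional to its
    degree (an assumption of this sketch).
    """
    positives = {canonical(u, v) for u, v in edges}
    nodes = sorted({n for e in edges for n in e})
    endpoints = [n for e in edges for n in e]
    pool = endpoints if degree_aware else nodes
    negatives = set()
    for _ in range(1000 * k):  # bounded rejection sampling
        if len(negatives) == k:
            break
        u, v = rng.choice(pool), rng.choice(pool)
        pair = canonical(u, v)
        if u != v and pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


def mean_endpoint_degree(pairs, deg):
    """Average degree over both endpoints of every pair."""
    return sum(deg[u] + deg[v] for u, v in pairs) / (2 * len(pairs))


# Hypothetical toy graph: node 0 is a hub linked to nodes 1..50,
# and nodes 1..99 form a sparse chain.
edges = [(0, i) for i in range(1, 51)] + [(i, i + 1) for i in range(1, 99)]
deg = node_degrees(edges)

rng = random.Random(0)
uniform_neg = sample_negatives(edges, 200, rng, degree_aware=False)
aware_neg = sample_negatives(edges, 200, rng, degree_aware=True)

print("positives      :", round(mean_endpoint_degree(edges, deg), 2))
print("uniform neg.   :", round(mean_endpoint_degree(uniform_neg, deg), 2))
print("degree-aware   :", round(mean_endpoint_degree(aware_neg, deg), 2))
```

In a graph like this, uniformly sampled negatives rarely involve the hub, so their endpoint degrees tend to sit well below those of the positive edges, and a classifier could separate the two classes on degree alone. Drawing endpoints in proportion to degree moves the negatives' degree distribution closer to that of the positives, which is the imbalance the abstract describes.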

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER); National Institutes of Health (NIH); National Cancer Institute (NCI)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2375469
Journal Information:
Bioinformatics Advances, Vol. 4, Issue 1; ISSN 2635-0041
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
