DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Integrating multimodal data through interpretable heterogeneous ensembles

Journal Article · Bioinformatics Advances

Motivation: Integrating multimodal data is an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities but may lose exclusive local information. The alternative late integration approach, which can address this challenge, has not been systematically studied for biomedical problems.

Results: We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms and uses heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on two problems: predicting protein function from multimodal STRING data, and predicting mortality due to coronavirus disease 2019 (COVID-19) from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test measurements (blood urea nitrogen and calcium), vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrate the effectiveness of the EI framework for biomedical data integration and predictive modeling.
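The late integration idea described in the abstract can be illustrated with a minimal sketch: fit one local model per modality, then combine the local predictions into a global prediction. This is not the authors' implementation. The nearest-centroid scorer, the plain averaging used as the ensemble step, the function names, and the toy "labs" and "vitals" values are all assumptions made for illustration only.

```python
from statistics import mean


def fit_centroid_model(X, y):
    """Fit a nearest-centroid scorer on ONE modality (illustrative local model).

    Returns a function mapping a feature vector to a score in [0, 1],
    where higher means closer to the positive-class centroid.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    c_pos = [mean(col) for col in zip(*pos)]
    c_neg = [mean(col) for col in zip(*neg)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    def score(x):
        d_pos, d_neg = dist(x, c_pos), dist(x, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def fit_late_integration(modalities, y):
    """Train one local model per modality and average their scores.

    Averaging stands in for the heterogeneous ensemble step; a real
    system could use a different learner per modality and a learned
    combiner instead.
    """
    local_models = {name: fit_centroid_model(X, y) for name, X in modalities.items()}

    def predict(sample_by_modality):
        return mean(local_models[name](x) for name, x in sample_by_modality.items())

    return predict


# Hypothetical toy data: two modalities describing the same four samples.
train = {
    "labs": [[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]],
    "vitals": [[0.9], [0.8], [0.2], [0.3]],
}
labels = [0, 0, 1, 1]

predict = fit_late_integration(train, labels)
high = predict({"labs": [3.1, 0.6], "vitals": [0.25]})
low = predict({"labs": [1.0, 2.0], "vitals": [0.9]})
print(f"positive-like sample: {high:.2f}, negative-like sample: {low:.2f}")
```

Because each modality keeps its own model, information that only one modality carries is not averaged away in a shared representation before learning; only the predictions are combined.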

Research Organization:
National Renewable Energy Laboratory (NREL), Golden, CO (United States)
Sponsoring Organization:
National Institutes of Health (NIH); USDOE
Grant/Contract Number:
AC36-08GO28308
OSTI ID:
1898016
Report Number(s):
NREL/JA-2700-84525; MainId:85298; UUID:aae3b503-8f10-4591-9705-5947a4338735; MainAdminID:67973
Journal Information:
Bioinformatics Advances, Vol. 2, Issue 1; ISSN 2635-0041
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
