DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks

Abstract

As time progresses and technology improves, biological data sets are continuously increasing in size. New methods and new implementations of existing methods are needed to keep pace with this increase. In this paper, we present a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF). This new implementation enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs. Using this implementation, we also present a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more. We compare the new implementation of iRF with the previous R version and analyze its time to completion on two of the world’s fastest supercomputers, Summit and Titan. We also show iRF-LOOP’s ability to capture biologically significant results when creating Predictive Expression Networks. This new implementation of iRF will enable the analysis of biological data sets at scales that were previously not possible.

Authors:
Cliff, Ashley; Romero, Jonathon; Kainer, David; Walker, Angelica; Furches, Anna; Jacobson, Daniel
Publication Date:
2019-12-02
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
OSTI Identifier:
1576840
Alternate Identifier(s):
OSTI ID: 1631222
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Published Article
Journal Name:
Genes
Additional Journal Information:
Journal Name: Genes; Journal Volume: 10; Journal Issue: 12; Journal ID: ISSN 2073-4425
Publisher:
MDPI
Country of Publication:
Switzerland
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; Random Forest; Iterative Random Forest; Gene Expression Networks; high-performance computing; X-AI-based eQTL

Citation Formats

Cliff, Ashley, Romero, Jonathon, Kainer, David, Walker, Angelica, Furches, Anna, and Jacobson, Daniel. A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. Switzerland: N. p., 2019. Web. doi:10.3390/genes10120996.
Cliff, Ashley, Romero, Jonathon, Kainer, David, Walker, Angelica, Furches, Anna, & Jacobson, Daniel. A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. Switzerland. https://doi.org/10.3390/genes10120996
Cliff, Ashley, Romero, Jonathon, Kainer, David, Walker, Angelica, Furches, Anna, and Jacobson, Daniel. 2019. "A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks". Switzerland. https://doi.org/10.3390/genes10120996.
@article{osti_1576840,
title = {A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks},
author = {Cliff, Ashley and Romero, Jonathon and Kainer, David and Walker, Angelica and Furches, Anna and Jacobson, Daniel},
abstractNote = {As time progresses and technology improves, biological data sets are continuously increasing in size. New methods and new implementations of existing methods are needed to keep pace with this increase. In this paper, we present a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF). This new implementation enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs. Using this implementation, we also present a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more. We compare the new implementation of iRF with the previous R version and analyze its time to completion on two of the world’s fastest supercomputers, Summit and Titan. We also show iRF-LOOP’s ability to capture biologically significant results when creating Predictive Expression Networks. This new implementation of iRF will enable the analysis of biological data sets at scales that were previously not possible.},
doi = {10.3390/genes10120996},
journal = {Genes},
number = 12,
volume = 10,
place = {Switzerland},
year = {2019},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.3390/genes10120996

Citation Metrics:
Cited by: 14 works
Citation information provided by
Web of Science

Figures / Tables:

Figure 1: The diagram shows the process of iRF-LOOP for a set of Expression profiles, creating a Predictive Expression Network. Each gene is independently treated as the target for an iRF run, with all other genes as predictors. iRF provides importance scores of each predictor gene, and creates network edge weights between target and predictors. These importance scores are then combined into an edge matrix, providing a value for each possible connection, from which a network can be generated. Generally, the weights are thresholded at some value, determined through other means, and only edges with large enough weights are included in the final network. Due to the inherent directionality of a prediction, the edges are weighted, and not likely to be symmetric.
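
The leave-one-out loop described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative outline only, not the authors' HPC implementation: the function names (abs_correlation_scorer, loop_edge_matrix, threshold_edges) and the toy data are hypothetical, and the scorer is a deliberately simple stand-in (absolute Pearson correlation, normalized per target) used in place of an actual iRF run. Any function that returns one importance score per predictor could be substituted for it.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Profile = Sequence[float]
# A scorer takes (target profile, predictor profiles) and returns one
# importance score per predictor. In iRF-LOOP this role is played by iRF.
Scorer = Callable[[Profile, List[Profile]], List[float]]


def abs_correlation_scorer(target: Profile, predictors: List[Profile]) -> List[float]:
    """Stand-in for an iRF run (assumption, NOT iRF): |Pearson r| per
    predictor, normalized so the scores for one target sum to 1."""

    def pearson(x: Profile, y: Profile) -> float:
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        # Constant profiles carry no signal; score them as 0.
        return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

    raw = [abs(pearson(target, p)) for p in predictors]
    total = sum(raw)
    return [r / total if total > 0 else 0.0 for r in raw]


def loop_edge_matrix(
    expression: Dict[str, Profile],
    scorer: Scorer = abs_correlation_scorer,
) -> Dict[Tuple[str, str], float]:
    """Treat each gene in turn as the target, score all other genes as
    predictors, and collect the scores as (predictor, target) edge weights."""
    genes = list(expression)
    edges: Dict[Tuple[str, str], float] = {}
    for target in genes:
        predictors = [g for g in genes if g != target]
        scores = scorer(expression[target], [expression[g] for g in predictors])
        for predictor, score in zip(predictors, scores):
            edges[(predictor, target)] = score
    return edges


def threshold_edges(
    edges: Dict[Tuple[str, str], float], cutoff: float
) -> Dict[Tuple[str, str], float]:
    """Keep only edges whose weight is at least the chosen cutoff."""
    return {edge: w for edge, w in edges.items() if w >= cutoff}


if __name__ == "__main__":
    # Toy expression profiles (made-up numbers, one value per sample).
    toy_expression = {
        "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
        "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
        "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
    all_edges = loop_edge_matrix(toy_expression)
    for (predictor, target), weight in sorted(threshold_edges(all_edges, 0.5).items()):
        print(f"{predictor} -> {target}: {weight:.3f}")
```

Because each target's scores are computed in a separate run, the weight for (geneA, geneC) generally differs from the weight for (geneC, geneA), which mirrors the asymmetry noted in the caption. Each target is also independent of the others, so the runs could in principle be distributed across workers; the cutoff of 0.5 here is arbitrary and would need to be chosen by other means.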
