A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
Abstract
As time progresses and technology improves, biological data sets are continuously increasing in size. New methods and new implementations of existing methods are needed to keep pace with this increase. In this paper, we present a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF). This new implementation enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs. Using this implementation, we also present a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more. We compare the new implementation of iRF with the previous R version and analyze its time to completion on two of the world’s fastest supercomputers, Summit and Titan. We also show iRF-LOOP’s ability to capture biologically significant results when creating Predictive Expression Networks. This new implementation of iRF will enable the analysis of biological data sets at scales that were previously not possible.
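The abstract names iRF-LOOP without spelling out the loop itself. As a rough illustration, the sketch below assumes the usual leave-one-out formulation: each gene in turn is treated as the prediction target, the remaining genes serve as predictors, and the per-target importance scores become directed edge weights (predictor to target). Everything named here is hypothetical: `loop_network`, `abs_correlation_scores`, the toy expression values, and the per-target normalization to a sum of 1. The scorer is a deliberately simple stand-in (absolute Pearson correlation), not iRF; a real run would plug in importance scores from an iRF model fit for each target.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]  # rows = samples, columns = features
Scorer = Callable[[Matrix, Sequence[float]], List[float]]


def _pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def abs_correlation_scores(X: Matrix, y: Sequence[float]) -> List[float]:
    """Stand-in importance scorer (NOT iRF): |r| of each column of X with y."""
    n_features = len(X[0])
    return [abs(_pearson([row[j] for row in X], y)) for j in range(n_features)]


def loop_network(
    expr: Matrix,
    genes: Sequence[str],
    scorer: Scorer = abs_correlation_scores,
) -> Dict[Tuple[str, str], float]:
    """Leave-one-out prediction loop over genes.

    For each target gene, score all other genes as predictors and store the
    scores as directed edges (predictor, target). Scores are normalized to
    sum to 1 per target, which is an assumption made for this sketch.
    """
    edges: Dict[Tuple[str, str], float] = {}
    for t, target in enumerate(genes):
        y = [row[t] for row in expr]                        # target expression
        predictors = [i for i in range(len(genes)) if i != t]
        X = [[row[i] for i in predictors] for row in expr]  # all other genes
        scores = scorer(X, y)
        total = sum(scores)
        for i, score in zip(predictors, scores):
            edges[(genes[i], target)] = score / total if total > 0 else 0.0
    return edges


# Toy data (hypothetical): 5 samples x 3 genes; geneB is exactly 2 * geneA.
genes = ["geneA", "geneB", "geneC"]
expr = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 3.0],
]

network = loop_network(expr, genes)
for (src, dst), weight in sorted(network.items()):
    print(f"{src} -> {dst}: {weight:.3f}")
```

Because the scorer is passed in as a parameter, the loop structure stays the same when the stand-in is replaced by a model-based importance function. Each target is scored independently of the others, so the per-gene fits can in principle be distributed across compute nodes.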
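- Authors:
- Cliff, Ashley; Romero, Jonathon; Kainer, David; Walker, Angelica; Furches, Anna; Jacobson, Daniel
- Publication Date:
- December 2, 2019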
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Biological and Environmental Research (BER)
- OSTI Identifier:
- 1576840
- Alternate Identifier(s):
- OSTI ID: 1631222
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Published Article
- Journal Name:
- Genes
- Additional Journal Information:
- Journal Name: Genes; Journal Volume: 10; Journal Issue: 12; Journal ID: ISSN 2073-4425
- Publisher:
- MDPI
- Country of Publication:
- Switzerland
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; Random Forest; Iterative Random Forest; Gene Expression Networks; high-performance computing; X-AI-based eQTL
Citation Formats
Cliff, Ashley, Romero, Jonathon, Kainer, David, Walker, Angelica, Furches, Anna, and Jacobson, Daniel. A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. Switzerland: N. p., 2019. Web. doi:10.3390/genes10120996.
Cliff, Ashley, Romero, Jonathon, Kainer, David, Walker, Angelica, Furches, Anna, & Jacobson, Daniel. A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. Switzerland. https://doi.org/10.3390/genes10120996
Cliff, Ashley, Romero, Jonathon, Kainer, David, Walker, Angelica, Furches, Anna, and Jacobson, Daniel. 2019. "A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks". Switzerland. https://doi.org/10.3390/genes10120996.
@article{osti_1576840,
title = {A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks},
author = {Cliff, Ashley and Romero, Jonathon and Kainer, David and Walker, Angelica and Furches, Anna and Jacobson, Daniel},
doi = {10.3390/genes10120996},
journal = {Genes},
number = 12,
volume = 10,
place = {Switzerland},
year = {2019},
month = {12}
}
https://doi.org/10.3390/genes10120996