DOE PAGES: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Iterative random forests to discover predictive and stable high-order interactions

Abstract

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
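
The stability criterion mentioned in the abstract (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `stability_scores` and `stable_interactions`, the 0.5 default threshold as a parameter, and the toy feature names `x1`, `x2`, `x3` are all assumptions made for illustration, and the step that actually recovers interactions from each forest (the RIT search) is not shown.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List

Interaction = FrozenSet[str]


def stability_scores(
    replicates: List[Iterable[Iterable[str]]],
) -> Dict[Interaction, float]:
    """Fraction of bootstrap replicates in which each interaction was returned."""
    counts: Counter = Counter()
    for recovered in replicates:
        # An interaction is an unordered feature set; count it once per replicate.
        counts.update({frozenset(features) for features in recovered})
    n = len(replicates)
    return {interaction: c / n for interaction, c in counts.items()}


def stable_interactions(
    replicates: List[Iterable[Iterable[str]]],
    threshold: float = 0.5,
) -> Dict[Interaction, float]:
    """Keep interactions returned in strictly more than `threshold` of replicates."""
    scores = stability_scores(replicates)
    return {i: s for i, s in scores.items() if s > threshold}


# Hypothetical toy input: interactions recovered in four bootstrap replicates.
replicates = [
    [("x1", "x2"), ("x1", "x3")],
    [("x2", "x1")],
    [("x1", "x2", "x3")],
    [("x1", "x2"), ("x1", "x2", "x3")],
]
stable = stable_interactions(replicates)
```

With this toy input, the pair `{x1, x2}` appears in three of four replicates and is kept, while the third-order set `{x1, x2, x3}` appears in exactly half and is dropped because the comparison is strict.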

Authors:
Basu, Sumanta; Kumbier, Karl; Brown, James B.; Yu, Bin
Publication Date:
Sponsoring Org.:
USDOE
OSTI Identifier:
1417528
Grant/Contract Number:  
DOE DE-AC02-05CH11231; SC0017069
Resource Type:
Published Article
Journal Name:
Proceedings of the National Academy of Sciences of the United States of America
Additional Journal Information:
Journal Name: Proceedings of the National Academy of Sciences of the United States of America; Journal Volume: 115; Journal Issue: 8; Journal ID: ISSN 0027-8424
Publisher:
Proceedings of the National Academy of Sciences
Country of Publication:
United States
Language:
English

Citation Formats

Basu, Sumanta, Kumbier, Karl, Brown, James B., and Yu, Bin. Iterative random forests to discover predictive and stable high-order interactions. United States: N. p., 2018. Web. doi:10.1073/pnas.1711236115.
Basu, Sumanta, Kumbier, Karl, Brown, James B., & Yu, Bin. Iterative random forests to discover predictive and stable high-order interactions. United States. doi:10.1073/pnas.1711236115.
Basu, Sumanta, Kumbier, Karl, Brown, James B., and Yu, Bin. Fri . "Iterative random forests to discover predictive and stable high-order interactions". United States. doi:10.1073/pnas.1711236115.
@article{osti_1417528,
title = {Iterative random forests to discover predictive and stable high-order interactions},
author = {Basu, Sumanta and Kumbier, Karl and Brown, James B. and Yu, Bin},
doi = {10.1073/pnas.1711236115},
journal = {Proceedings of the National Academy of Sciences of the United States of America},
number = 8,
volume = 115,
place = {United States},
year = {2018},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
DOI: 10.1073/pnas.1711236115

Citation Metrics:
Cited by: 4 works
Citation information provided by
Web of Science
