Iterative random forests to discover predictive and stable high-order interactions
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853; Department of Statistical Science, Cornell University, Ithaca, NY 14853; Data Driven Decisions Department, Preminon LLC, Antioch, CA 94531
- Statistics Department, University of California, Berkeley, CA 94720
- Data Driven Decisions Department, Preminon LLC, Antioch, CA 94531; Statistics Department, University of California, Berkeley, CA 94720; Centre for Computational Biology, School of Biosciences, University of Birmingham, Edgbaston B15 2TT, United Kingdom; Molecular Ecosystems Biology Department, Biosciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
- Data Driven Decisions Department, Preminon LLC, Antioch, CA 94531; Statistics Department, University of California, Berkeley, CA 94720; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
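The stability criterion mentioned in the abstract (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `stability_scores` and `stable_interactions` are hypothetical, the replicate data are invented for illustration, and the sketch covers only the final filtering step, not random forest training, feature reweighting, or the random intersection trees.

```python
from collections import Counter


def stability_scores(replicate_interactions):
    """Return, for each interaction, the fraction of replicates that recovered it.

    `replicate_interactions` is a list with one entry per bootstrap replicate;
    each entry is an iterable of interactions, and each interaction is an
    iterable of feature names. Feature order within an interaction is ignored.
    """
    n_replicates = len(replicate_interactions)
    counts = Counter()
    for found in replicate_interactions:
        # Use a set so an interaction is counted at most once per replicate.
        counts.update({frozenset(interaction) for interaction in found})
    return {inter: c / n_replicates for inter, c in counts.items()}


def stable_interactions(replicate_interactions, threshold=0.5):
    """Keep interactions recovered in strictly more than `threshold` of replicates."""
    scores = stability_scores(replicate_interactions)
    return {inter: s for inter, s in scores.items() if s > threshold}


# Invented toy replicates (assumption: not results from the paper).
replicates = [
    [("Zld", "Gt"), ("Zld", "Gt", "Twi")],
    [("Gt", "Zld")],
    [("Zld", "Gt", "Twi"), ("Zld", "Twi")],
    [("Zld", "Gt")],
]

stable = stable_interactions(replicates)
for interaction, score in stable.items():
    print(sorted(interaction), score)
```

In this toy input the pairwise interaction appears in three of four replicates and passes the filter, while the third-order interaction appears in exactly half and is excluded, because the abstract's wording ("more than half") implies a strict inequality.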
- Research Organization:
- Preminon, LLC, Antioch, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC); National Institutes of Health (NIH) National Human Genome Research Institute (NHGRI); US Army Research Office (ARO); US Department of the Navy, Office of Naval Research (ONR); National Science Foundation (NSF); Center for Science of Information; National Library of Medicine
- Grant/Contract Number:
- DOE DE-AC02-05CH11231; SC0017069; AC02-05CH11231 U01HG007031; W911NF1710005; N00014-16-1-2664; R00 HG006698; DMS-1613002; T32LM012417
- OSTI ID:
- 1417528
- Alternate ID(s):
- OSTI ID: 1625005
- Journal Information:
- Journal Name: Proceedings of the National Academy of Sciences of the United States of America; Vol. 115; Journal Issue: 8; ISSN 0027-8424
- Publisher:
- Proceedings of the National Academy of Sciences
- Country of Publication:
- United States
- Language:
- English