DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A practical guide to methods controlling false discoveries in computational biology

Abstract

Background: In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology.

Results: Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses.

Conclusions: Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
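Illustration (not part of the published abstract): the abstract notes that classic FDR methods take only p values as input. As a minimal sketch of what that means in practice, the Python below implements the Benjamini-Hochberg step-up procedure, a widely used classic FDR method. The choice of this procedure, the function name `benjamini_hochberg`, and the example p values are assumptions made for illustration; the abstract does not name the specific methods that were benchmarked.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure.
    It uses only the p values, with no covariate information.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p values, sorted from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p values, chosen only to demonstrate the call.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                    0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p_values, alpha=0.05))
```

With these hypothetical inputs only the two smallest p values are rejected. A covariate-aware method of the kind the abstract describes would additionally use side information about each hypothesis to weight or group the tests, which this sketch does not attempt.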

Authors:
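Korthauer, Keegan; Kimes, Patrick K.; Duvallet, Claire; Reyes, Alejandro; Subramanian, Ayshwarya; Teng, Mingxiang; Shukla, Chinmay; Alm, Eric J.; Hicks, Stephanie C.
Publication Date:
2019-06-04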
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
OSTI Identifier:
1618920
Alternate Identifier(s):
OSTI ID: 1626945
Grant/Contract Number:  
AC02-05CH11231; U41HG004059; R02HG005220; R01GM083084; R01GM103552; R00HG009007; 2018-183142; 2018-183201; 2018-183560; P30CA076292
Resource Type:
Published Article
Journal Name:
Genome Biology (Online)
Additional Journal Information:
Journal Name: Genome Biology (Online) Journal Volume: 20 Journal Issue: 1; Journal ID: ISSN 1474-760X
Publisher:
Springer Science + Business Media
Country of Publication:
United Kingdom
Language:
English
Subject:
Biotechnology & Applied Microbiology; Genetics & Heredity; Multiple hypothesis testing; False discovery rate; RNA-seq; ScRNA-seq; ChIP-seq; Microbiome; GWAS; Gene set analysis

Citation Formats

Korthauer, Keegan, Kimes, Patrick K., Duvallet, Claire, Reyes, Alejandro, Subramanian, Ayshwarya, Teng, Mingxiang, Shukla, Chinmay, Alm, Eric J., and Hicks, Stephanie C. A practical guide to methods controlling false discoveries in computational biology. United Kingdom: N. p., 2019. Web. doi:10.1186/s13059-019-1716-1.
Korthauer, Keegan, Kimes, Patrick K., Duvallet, Claire, Reyes, Alejandro, Subramanian, Ayshwarya, Teng, Mingxiang, Shukla, Chinmay, Alm, Eric J., & Hicks, Stephanie C. (2019). A practical guide to methods controlling false discoveries in computational biology. United Kingdom. https://doi.org/10.1186/s13059-019-1716-1
Korthauer, Keegan, Kimes, Patrick K., Duvallet, Claire, Reyes, Alejandro, Subramanian, Ayshwarya, Teng, Mingxiang, Shukla, Chinmay, Alm, Eric J., and Hicks, Stephanie C. 2019. "A practical guide to methods controlling false discoveries in computational biology". United Kingdom. https://doi.org/10.1186/s13059-019-1716-1.
@article{osti_1618920,
title = {A practical guide to methods controlling false discoveries in computational biology},
author = {Korthauer, Keegan and Kimes, Patrick K. and Duvallet, Claire and Reyes, Alejandro and Subramanian, Ayshwarya and Teng, Mingxiang and Shukla, Chinmay and Alm, Eric J. and Hicks, Stephanie C.},
abstractNote = {Background: In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Results: Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Conclusions: Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.},
doi = {10.1186/s13059-019-1716-1},
journal = {Genome Biology (Online)},
number = 1,
volume = 20,
place = {United Kingdom},
year = {2019},
month = {6}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.1186/s13059-019-1716-1

Citation Metrics:
Cited by: 150 works
Citation information provided by Web of Science

