
Structure-informed clustering for population stratification in association studies

Journal Article · BMC Bioinformatics
 [1];  [2];  [3];  [4];  [4]
  1. IBM, Yorktown Heights, NY (United States). Thomas J. Watson Research Center
  2. IBM, Yorktown Heights, NY (United States). Thomas J. Watson Research Center; Purdue University, West Lafayette, IN (United States)
  3. Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
  4. Purdue University, West Lafayette, IN (United States)
Background: Identifying variants associated with complex traits is challenging in genetic association studies because of linkage disequilibrium (LD) between genetic variants and population stratification unrelated to disease risk. Existing methods for population structure correction use principal component analysis or linear mixed models with a random effect when modeling associations between a trait of interest and genetic markers. However, due to stringent significance thresholds and latent interactions between the markers, these methods often fail to detect genuinely associated variants.

Results: To overcome this, we propose CluStrat, which corrects for complex, arbitrarily structured populations while leveraging the LD-induced distances between genetic markers. It performs agglomerative hierarchical clustering using the Mahalanobis distance covariance matrix of the markers. In simulation studies, we show that our method outperforms existing methods in detecting true causal variants. Applying CluStrat to the WTCCC2 and UK Biobank cohorts, we found biologically relevant associations in schizophrenia and myocardial infarction. CluStrat was also able to correct for population structure in polygenic adaptation of height in Europeans.

Conclusions: CluStrat highlights the advantages of biologically relevant distance metrics, such as the Mahalanobis distance, which captures the cryptic interactions within populations in the presence of LD better than the Euclidean distance.
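The clustering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' CluStrat implementation: it assumes that samples are clustered on pairwise Mahalanobis distances computed from the marker-by-marker covariance, and it uses a naive average-linkage merge. The function names, the ridge term, the linkage choice, and the toy allele frequencies are all assumptions made for illustration.

```python
import numpy as np


def mahalanobis_distances(X, cov=None, ridge=1e-3):
    """Pairwise Mahalanobis distances between the rows of X (samples x markers).

    If cov is None, the marker-by-marker covariance is estimated from X.
    The small ridge term is an assumption of this sketch; it keeps the
    covariance invertible when markers are strongly correlated.
    """
    X = np.asarray(X, dtype=float)
    if cov is None:
        cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov = np.asarray(cov, dtype=float) + ridge * np.eye(X.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = X[:, None, :] - X[None, :, :]          # shape (n, n, markers)
    d2 = np.einsum("ijk,kl,ijl->ij", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))           # clip tiny negative round-off


def agglomerative_average(D, n_clusters):
    """Naive average-linkage agglomerative clustering on a distance matrix D."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all members of the two clusters.
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(D.shape[0], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Toy data: two hypothetical subpopulations with different allele frequencies.
rng = np.random.default_rng(0)
freq_a = np.array([0.1, 0.2, 0.8, 0.9])
freq_b = np.array([0.9, 0.8, 0.2, 0.1])
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(30, 4)),        # genotype counts in {0, 1, 2}
    rng.binomial(2, freq_b, size=(30, 4)),
])
D = mahalanobis_distances(genotypes)
labels = agglomerative_average(D, n_clusters=2)
print(np.bincount(labels))                        # cluster sizes
```

Unlike the Euclidean distance, the Mahalanobis distance weights each direction by the inverse covariance, so a difference that runs against the correlation between two markers counts for more than one that runs along it. Passing `cov=np.eye(m)` with `ridge=0.0` recovers the Euclidean distance, which gives a direct way to compare the two metrics on the same data. In a real analysis the number of markers far exceeds what this naive loop and dense inverse can handle, so the sketch is only meant to show the mechanics.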
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
National Science Foundation (NSF); USDOE
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
2471413
Journal Information:
BMC Bioinformatics, Vol. 24, Issue 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
