DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits

Abstract

The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the number of markers greatly exceeded the number of training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
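
The ensemble predictions mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible combination rule, an unweighted mean of each algorithm's prediction per line, which may differ from the rule used in the paper. The function names, algorithm labels, and trait values below are hypothetical and made up for illustration; they are not data or code from the study.

```python
from math import sqrt
from statistics import mean


def ensemble_predict(predictions_by_algorithm):
    """Average per-line predictions across algorithms (unweighted mean).

    Assumption: every algorithm predicted the same lines in the same order.
    """
    columns = list(predictions_by_algorithm.values())
    if not columns:
        raise ValueError("need predictions from at least one algorithm")
    n_lines = len(columns[0])
    if any(len(col) != n_lines for col in columns):
        raise ValueError("all algorithms must predict the same lines")
    # zip(*columns) groups the predictions for one line across algorithms
    return [mean(values) for values in zip(*columns)]


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted trait values."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_o * var_p)


# Made-up predictions for four held-out lines (illustrative only)
toy_predictions = {
    "rrBLUP": [1.0, 2.0, 3.0, 4.0],
    "SVR_rbf": [1.5, 1.5, 3.5, 3.5],
    "RF": [0.5, 2.5, 2.5, 4.5],
}
toy_observed = [1.2, 1.9, 3.1, 4.2]

combined = ensemble_predict(toy_predictions)
accuracy = pearson_r(toy_observed, combined)
```

Averaging tends to cancel errors that individual algorithms make in different directions, which is one plausible reason a combined prediction can be more stable across traits than any single algorithm.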

Authors:
Azodi, Christina B. [1]; Bolger, Emily [2]; McCarren, Andrew [2]; Roantree, Mark [2]; de los Campos, Gustavo [1]; Shiu, Shin-Han [1]
  1. Michigan State Univ., East Lansing, MI (United States)
  2. Dublin City Univ. (Ireland)
Publication Date:
2019-09-18
Research Org.:
Great Lakes Bioenergy Research Center (GLBRC), Madison, WI (United States); Univ. of Wisconsin, Madison, WI (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
OSTI Identifier:
1637329
Grant/Contract Number:  
SC0018409
Resource Type:
Accepted Manuscript
Journal Name:
G3
Additional Journal Information:
Journal Volume: 9; Journal Issue: 11; Journal ID: ISSN 2160-1836
Publisher:
Genetics Society of America
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; genomic selection; artificial neural network; genotype-to-phenotype; genomic prediction; genpred; shared data resources

Citation Formats

Azodi, Christina B., Bolger, Emily, McCarren, Andrew, Roantree, Mark, de los Campos, Gustavo, and Shiu, Shin-Han. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. United States: N. p., 2019. Web. doi:10.1534/g3.119.400498.
Azodi, Christina B., Bolger, Emily, McCarren, Andrew, Roantree, Mark, de los Campos, Gustavo, & Shiu, Shin-Han. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. United States. https://doi.org/10.1534/g3.119.400498
Azodi, Christina B., Bolger, Emily, McCarren, Andrew, Roantree, Mark, de los Campos, Gustavo, and Shiu, Shin-Han. 2019. "Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits". United States. https://doi.org/10.1534/g3.119.400498. https://www.osti.gov/servlets/purl/1637329.
@article{osti_1637329,
title = {Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits},
author = {Azodi, Christina B. and Bolger, Emily and McCarren, Andrew and Roantree, Mark and de los Campos, Gustavo and Shiu, Shin-Han},
abstractNote = {The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the number of markers greatly exceeded the number of training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.},
doi = {10.1534/g3.119.400498},
journal = {G3},
number = 11,
volume = 9,
place = {United States},
year = {2019},
month = {9}
}

Citation Metrics:
Cited by: 68 works (citation information provided by Web of Science)

Figures / Tables:

Figure 1: Algorithms used and compared in past GP studies and algorithms and data included in the GP benchmark. (A) Number of times a GP algorithm was utilized (diagonal) or directly compared to other GP algorithms (lower triangle) out of 91 publications published between 2012-2018 (Table S1). GP algorithms were included if they were utilized in > 1 study. (B) A graphical representation of the GP algorithms included in the study and their relationship to each other. Colors designate if the algorithm identifies only linear (orange) or linear and non-linear (green) relationships. The placement of each algorithm on the tree designates (qualitatively) the relationship between different algorithms. The labels at each branch provide more information about how algorithms in that branch differ from others. rrBLUP, ridge regression Best Linear Unbiased Predictor; BRR, Bayesian Ridge Regression; BA, BayesA; BB, BayesB; BL, Bayesian LASSO; SVR, Support Vector Regression (kernel type: lin, linear; poly, polynomial; rbf, radial basis function); RF, Random Forest; GTB, Gradient Tree Boosting; ANN, Artificial Neural Network; CNN, Convolutional Neural Network. (C) Species and traits included in the benchmark with training population types and sizes and marker types and numbers for each dataset. NAM: Nested Association Mapping. DM: partial diallel mating. GBS: genotyping by sequencing. SNP: single nucleotide polymorphism. HT: height. FT: flowering time. YLD: yield. GM: grain moisture. R8: time to R8 developmental stage. DBH: diameter at breast height. DE: wood density. ST: standability.

Works referenced in this record:

Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree
journal, March 2009


Genomic-Assisted Prediction of Genetic Value With Semiparametric Procedures
journal, April 2006


Genomic selection of agronomic traits in hybrid rice using an NCII population
journal, May 2018


Recurrent Neural Networks for Sequential Phenotype Prediction in Genomics
conference, December 2015

  • Pouladi, Farhad; Salehinejad, Hojjat; Gilani, Amir Mohammad
  • 2015 International Conference on Developments of E-Systems Engineering (DeSE)
  • DOI: 10.1109/DeSE.2015.52

Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods
journal, August 2010

  • De Los Campos, Gustavo; Gianola, Daniel; Rosa, Guilherme J. M.
  • Genetics Research, Vol. 92, Issue 4
  • DOI: 10.1017/S0016672310000285

Extensive Genetic Diversity is Present within North American Switchgrass Germplasm
journal, January 2018


Application of neural networks with back-propagation to genome-enabled prediction of complex traits in Holstein-Friesian and German Fleckvieh cattle
journal, March 2015

  • Ehret, Anita; Hochstuhl, David; Gianola, Daniel
  • Genetics Selection Evolution, Vol. 47, Issue 1
  • DOI: 10.1186/s12711-015-0097-5

Applications of Machine Learning Methods to Genomic Selection in Breeding Wheat for Rust Resistance
journal, July 2018


Insights into the Maize Pan-Genome and Pan-Transcriptome
journal, January 2014

  • Hirsch, Candice N.; Foerster, Jillian M.; Johnson, James M.
  • The Plant Cell, Vol. 26, Issue 1
  • DOI: 10.1105/tpc.113.119982

Genome-Enabled Prediction Models for Yield Related Traits in Chickpea
journal, November 2016

  • Roorkiwal, Manish; Rathore, Abhishek; Das, Roma R.
  • Frontiers in Plant Science, Vol. 7
  • DOI: 10.3389/fpls.2016.01666

A deep convolutional neural network approach for predicting phenotypes from genotypes
journal, August 2018


Genome-wide prediction of discrete traits using bayesian regressions and machine learning
journal, February 2011

  • González-Recio, Oscar; Forni, Selma
  • Genetics Selection Evolution, Vol. 43, Issue 1
  • DOI: 10.1186/1297-9686-43-7

Genetic architecture of complex traits in plants
journal, April 2007


A Ranking Approach to Genomic Selection
journal, June 2015


Genome-enabled prediction of genetic values using radial basis function neural networks
journal, May 2012

  • González-Camacho, J. M.; de los Campos, G.; Pérez, P.
  • Theoretical and Applied Genetics, Vol. 125, Issue 4
  • DOI: 10.1007/s00122-012-1868-9

LASSO with cross-validation for genomic selection
journal, December 2009


Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding
journal, June 2012


Early Stopping - But When?
book, January 1998


Regularization and variable selection via the elastic net
journal, April 2005


Dominance and Epistasis Interactions Revealed as Important Variants for Leaf Traits of Maize NAM Population
journal, June 2018


Optimising Genomic Selection in Wheat: Effect of Marker Density, Population Size and Population Structure on Prediction Accuracy
journal, July 2018

  • Norman, Adam; Taylor, Julian; Edwards, James
  • G3: Genes|Genomes|Genetics, Vol. 8, Issue 9
  • DOI: 10.1534/g3.118.200311

Efficiency of multi-trait, indirect, and trait-assisted genomic selection for improvement of biomass sorghum
journal, December 2017

  • Fernandes, Samuel B.; Dias, Kaio O. G.; Ferreira, Daniel F.
  • Theoretical and Applied Genetics, Vol. 131, Issue 3
  • DOI: 10.1007/s00122-017-3033-y

Accuracy of breeding values of 'unrelated' individuals predicted by dense SNP genotyping
journal, June 2009


Genomic selection accuracies within and between environments and small breeding groups in white spruce
journal, January 2014


Genomic Selection for Crop Improvement
journal, January 2009


Handling limited datasets with neural networks in medical applications: A small-data approach
journal, January 2017


Diversity and population structure of northern switchgrass as revealed through exome capture sequencing
journal, November 2015

  • Evans, Joseph; Crisovan, Emily; Barry, Kerrie
  • The Plant Journal, Vol. 84, Issue 4
  • DOI: 10.1111/tpj.13041

SMURF: Genomic mapping of fungal secondary metabolite clusters
journal, September 2010

  • Khaldi, Nora; Seifuddin, Fayaz T.; Turner, Geoff
  • Fungal Genetics and Biology, Vol. 47, Issue 9
  • DOI: 10.1016/j.fgb.2010.06.003

Benchmarking parametric and machine learning models for genomic prediction of complex traits
dataset, January 2019


Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP
journal, January 2011


Does genomic selection have a future in plant breeding?
journal, September 2013


Can Deep Learning Improve Genomic Prediction of Complex Human Traits?
journal, August 2018


Predictive ability of subsets of single nucleotide polymorphisms with and without parent average in US Holsteins
journal, December 2010

  • Vazquez, A. I.; Rosa, G. J. M.; Weigel, K. A.
  • Journal of Dairy Science, Vol. 93, Issue 12
  • DOI: 10.3168/jds.2010-3335

NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
journal, September 2010

  • Valiev, M.; Bylaska, E. J.; Govind, N.
  • Computer Physics Communications, Vol. 181, Issue 9, p. 1477-1489
  • DOI: 10.1016/j.cpc.2010.04.018

A bootstrap evaluation of the effect of data splitting on financial time series
journal, January 1998

  • LeBaron, B.; Weigend, A. S.
  • IEEE Transactions on Neural Networks, Vol. 9, Issue 1
  • DOI: 10.1109/72.655043

Extension of the bayesian alphabet for genomic selection
journal, May 2011

  • Habier, David; Fernando, Rohan L.; Kizilkaya, Kadir
  • BMC Bioinformatics, Vol. 12, Issue 1
  • DOI: 10.1186/1471-2105-12-186

Accelerating the Switchgrass (Panicum virgatum L.) Breeding Cycle Using Genomic Selection Approaches
journal, November 2014


machine.
journal, October 2001


Genomic Selection in Plant Breeding: A Comparison of Models
journal, January 2012


The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets
journal, January 2013

  • González-Recio, O.; Jiménez-Montero, J. A.; Alenda, R.
  • Journal of Dairy Science, Vol. 96, Issue 1
  • DOI: 10.3168/jds.2012-5630

Genomic selection: genome-wide prediction in plant improvement
journal, September 2014


Assessing Predictive Properties of Genome-Wide Selection in Soybeans
journal, June 2016

  • Xavier, Alencar; Muir, William M.; Rainey, Katy Martin
  • G3: Genes|Genomes|Genetics, Vol. 6, Issue 8
  • DOI: 10.1534/g3.116.032268

Prediction of body mass index in mice using dense molecular markers and a regularized neural network
journal, April 2011


Marker-assisted selection to improve drought adaptation in maize: the backcross approach, perspectives, limitations, and alternatives
journal, November 2006

  • Ribaut, J. -M.; Ragot, M.
  • Journal of Experimental Botany, Vol. 58, Issue 2
  • DOI: 10.1093/jxb/erl214

The Pfam protein families database: towards a more sustainable future
journal, December 2015

  • Finn, Robert D.; Coggill, Penelope; Eberhardt, Ruth Y.
  • Nucleic Acids Research, Vol. 44, Issue D1
  • DOI: 10.1093/nar/gkv1344

A comparison of statistical methods for genomic selection in a mice population
journal, January 2012

  • Neves, Haroldo HR; Carvalheiro, Roberto; Queiroz, Sandra A.
  • BMC Genetics, Vol. 13, Issue 1
  • DOI: 10.1186/1471-2156-13-100

Deep learning for biology
journal, February 2018


Data and Theory Point to Mainly Additive Genetic Variance for Complex Traits
journal, February 2008


Application of high-dimensional feature selection: evaluation for genomic prediction in man
journal, May 2015

  • Bermingham, M. L.; Pong-Wong, R.; Spiliopoulou, A.
  • Scientific Reports, Vol. 5, Issue 1
  • DOI: 10.1038/srep10312

Ensemble Methods in Machine Learning
book, January 2000


Deep learning for computational biology
journal, July 2016

  • Angermueller, Christof; Pärnamaa, Tanel; Parts, Leopold
  • Molecular Systems Biology, Vol. 12, Issue 7
  • DOI: 10.15252/msb.20156651

Genome-enabled prediction using probabilistic neural network classifiers
journal, March 2016

  • González-Camacho, Juan Manuel; Crossa, José; Pérez-Rodríguez, Paulino
  • BMC Genomics, Vol. 17, Issue 1
  • DOI: 10.1186/s12864-016-2553-1

A comparison of five methods to predict genomic breeding values of dairy bulls from genome-wide SNP markers
journal, December 2009

  • Moser, Gerhard; Tier, Bruce; Crump, Ron E.
  • Genetics Selection Evolution, Vol. 41, Issue 1
  • DOI: 10.1186/1297-9686-41-56

Genome-Wide Regression and Prediction with the BGLR Statistical Package
journal, July 2014


Comparison of whole-genome prediction models for traits with contrasting genetic architecture in a diversity panel of maize inbred lines
journal, January 2012

  • Riedelsheimer, Christian; Technow, Frank; Melchinger, Albrecht E.
  • BMC Genomics, Vol. 13, Issue 1
  • DOI: 10.1186/1471-2164-13-452

Accuracy of Genomic Prediction in Switchgrass ( Panicum virgatum L.) Improved by Accounting for Linkage Disequilibrium
journal, February 2016

  • Ramstein, Guillaume P.; Evans, Joseph; Kaeppler, Shawn M.
  • G3: Genes|Genomes|Genetics, Vol. 6, Issue 4
  • DOI: 10.1534/g3.115.024950

Genomic selection: prediction of accuracy and maximisation of long term response
journal, August 2008


Application of support vector regression to genome-assisted prediction of quantitative traits
journal, July 2011

  • Long, Nanye; Gianola, Daniel; Rosa, Guilherme J. M.
  • Theoretical and Applied Genetics, Vol. 123, Issue 7
  • DOI: 10.1007/s00122-011-1648-y

Random Forests
journal, January 2001


Genetic Diversity of a Maize Association Population with Restricted Phenology
journal, January 2011


Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps
journal, April 2001


High heritability does not imply accurate prediction under the small additive effects hypothesis
preprint, January 2020


Works referencing / citing this record:

Benchmarking parametric and machine learning models for genomic prediction of complex traits
dataset, January 2019


A Multiple-Trait Bayesian Lasso for Genome-Enabled Analysis and Prediction of Complex Traits
journal, February 2020


Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.