DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold

Abstract

While template-free protein structure prediction protocols now produce good quality models for many targets, modelling failure remains common. For these methods to be useful it is important that users can both choose the best model from the hundreds to thousands of models that are commonly generated for a target, and determine whether this model is likely to be correct. We have developed Random Forest Quality Assessment (RFQAmodel), which assesses whether models produced by a protein structure prediction pipeline have the correct fold. RFQAmodel uses a combination of existing quality assessment scores with two predicted contact map alignment scores. These alignment scores are able to identify correct models for targets that are not otherwise captured. Our classifier was trained on a large set of protein domains that are structurally diverse and evenly balanced in terms of protein features known to have an effect on modelling success, and then tested on a second set of 244 protein domains with a similar spread of properties. When models for each target in this second set were ranked according to the RFQAmodel score, the highest-ranking model had a high-confidence RFQAmodel score for 67 modelling targets, of which 52 had the correct fold. At the other end of the scale RFQAmodel correctly predicted that for 59 targets the highest-ranked model was incorrect. In comparisons to other methods we found that RFQAmodel is better able to identify correct models for targets where only a few of the models are correct. We found that RFQAmodel achieved a similar performance on the model sets for CASP12 and CASP13 free-modelling targets. Finally, by iteratively generating models and running RFQAmodel until a model is produced that is predicted to be correct with high confidence, we demonstrate how such a protocol can be used to focus computational efforts on difficult modelling targets.
RFQAmodel and the accompanying data can be downloaded from http://opig.stats.ox.ac.uk/resources.
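
The iterative protocol mentioned at the end of the abstract (generate models, score them, stop once the top-ranked model is predicted correct with high confidence) can be sketched roughly as below. This is an illustration only, not the authors' implementation: `generate_batch`, `score_model`, the 0.5 cutoff and the batch budget are hypothetical placeholders, and each model is scored independently here, which may be a simplification of how RFQAmodel computes its score.

```python
from typing import Callable, List, Optional, Tuple

Model = str  # placeholder type, e.g. a path to a predicted structure file


def iterate_until_confident(
    generate_batch: Callable[[int], List[Model]],  # hypothetical model generator
    score_model: Callable[[Model], float],         # hypothetical quality scorer
    high_confidence: float = 0.5,  # assumed cutoff, not taken from the paper
    max_batches: int = 10,         # assumed compute budget
) -> Tuple[Optional[Model], float, int]:
    """Generate batches of models until the best score reaches the cutoff.

    Returns (best model so far, its score, number of batches generated).
    """
    best_model: Optional[Model] = None
    best_score = float("-inf")
    batches_run = 0
    for batch_index in range(max_batches):
        batches_run += 1
        # Score every new model and keep track of the top-ranked one.
        for model in generate_batch(batch_index):
            score = score_model(model)
            if score > best_score:
                best_model, best_score = model, score
        # Stop early once the top-ranked model is a high-confidence prediction.
        if best_score >= high_confidence:
            break
    return best_model, best_score, batches_run
```

With a loop of this shape, targets that reach the cutoff early stop consuming compute, so the remaining budget goes to the targets that are still unresolved.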

Authors:
West, Clare E. [1]; de Oliveira, Saulo H. P. [2]; Deane, Charlotte M. [1]
  1. Oxford Univ. (United Kingdom)
  2. Stanford Univ., CA (United States)
Publication Date:
Research Org.:
SLAC National Accelerator Lab., Menlo Park, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1633852
Grant/Contract Number:  
AC02-76SF00515
Resource Type:
Accepted Manuscript
Journal Name:
PLoS ONE
Additional Journal Information:
Journal Volume: 14; Journal Issue: 10; Journal ID: ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES

Citation Formats

West, Clare E., de Oliveira, Saulo H. P., and Deane, Charlotte M. RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold. United States: N. p., 2019. Web. doi:10.1371/journal.pone.0218149.
West, Clare E., de Oliveira, Saulo H. P., & Deane, Charlotte M. RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold. United States. https://doi.org/10.1371/journal.pone.0218149
West, Clare E., de Oliveira, Saulo H. P., and Deane, Charlotte M. Mon . "RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold". United States. https://doi.org/10.1371/journal.pone.0218149. https://www.osti.gov/servlets/purl/1633852.
@article{osti_1633852,
title = {RFQAmodel: Random Forest Quality Assessment to identify a predicted protein structure in the correct fold},
author = {West, Clare E. and de Oliveira, Saulo H. P. and Deane, Charlotte M.},
abstractNote = {While template-free protein structure prediction protocols now produce good quality models for many targets, modelling failure remains common. For these methods to be useful it is important that users can both choose the best model from the hundreds to thousands of models that are commonly generated for a target, and determine whether this model is likely to be correct. We have developed Random Forest Quality Assessment (RFQAmodel), which assesses whether models produced by a protein structure prediction pipeline have the correct fold. RFQAmodel uses a combination of existing quality assessment scores with two predicted contact map alignment scores. These alignment scores are able to identify correct models for targets that are not otherwise captured. Our classifier was trained on a large set of protein domains that are structurally diverse and evenly balanced in terms of protein features known to have an effect on modelling success, and then tested on a second set of 244 protein domains with a similar spread of properties. When models for each target in this second set were ranked according to the RFQAmodel score, the highest-ranking model had a high-confidence RFQAmodel score for 67 modelling targets, of which 52 had the correct fold. At the other end of the scale RFQAmodel correctly predicted that for 59 targets the highest-ranked model was incorrect. In comparisons to other methods we found that RFQAmodel is better able to identify correct models for targets where only a few of the models are correct. We found that RFQAmodel achieved a similar performance on the model sets for CASP12 and CASP13 free-modelling targets. Finally, by iteratively generating models and running RFQAmodel until a model is produced that is predicted to be correct with high confidence, we demonstrate how such a protocol can be used to focus computational efforts on difficult modelling targets. 
RFQAmodel and the accompanying data can be downloaded from http://opig.stats.ox.ac.uk/resources.},
doi = {10.1371/journal.pone.0218149},
journal = {PLoS ONE},
number = 10,
volume = 14,
place = {United States},
year = {2019},
month = {10}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Figures / Tables:

Fig 1: Number of targets out of the 244 targets in our Training set for which a correct model was produced and selected as the highest-ranked model according to 13 methods. Three SAINT2 scores (SAINT2, SAINT2_Contact and SAINT2_Raw), seven existing quality assessment scores (ProQ3D, ProQRosCenD, ProQRosFAD, Pcons, PcombC, ProQ2D and PPV), and three predicted contact map alignment scores (EigenTHREADER, Map_align and map_length) are shown, as well as all methods combined (“Consensus”) and the total number of targets with a correct model (“Total Successes”), for three Beff bins and across all bins. The total number of targets in each Beff bin is indicated with a dashed line.
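
The per-method counts described in this caption could be computed along the following lines. This is a minimal sketch under stated assumptions: the record layout, the function name `count_top_ranked_successes`, the convention that a higher score means a better model, and the use of TM-score >= 0.5 to define a correct model are all assumptions made for illustration and are not confirmed by this record.

```python
from typing import Dict, List, Tuple

# One record per model: (target id, score from one method, TM-score to the native)
Record = Tuple[str, float, float]


def count_top_ranked_successes(
    records: List[Record],
    correct_tm: float = 0.5,  # assumed threshold for a "correct" model
) -> Tuple[int, int]:
    """Return (targets whose top-scored model is correct,
    targets with at least one correct model)."""
    top: Dict[str, Tuple[float, float]] = {}  # target -> (best score, its TM-score)
    has_correct: Dict[str, bool] = {}         # target -> any correct model produced
    for target, score, tm in records:
        # Assumes higher score = better model for the method being evaluated.
        if target not in top or score > top[target][0]:
            top[target] = (score, tm)
        has_correct[target] = has_correct.get(target, False) or tm >= correct_tm
    selected = sum(1 for _, tm in top.values() if tm >= correct_tm)
    total = sum(1 for ok in has_correct.values() if ok)
    return selected, total
```

Running this once per scoring method gives that method's count, and the second returned value corresponds to the "Total Successes" figure, which is the same for every method.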

Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.