
SciTech Connect

Title: Improving the chances of successful protein structure determination with a random forest classifier

Using an extended set of protein features calculated separately for protein surface and interior, a new version of XtalPred based on a random forest classifier achieves a significant improvement in predicting the success of structure determination from the primary amino-acid sequence. Obtaining diffraction-quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of the protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472–2482] was developed. XtalPred classifies proteins into five ‘crystallization classes’ based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface is tested. The new XtalPred-RF (random forest) achieves significant improvement of the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.
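The abstract names hydrophobicity and amino-acid composition among the features supplied to the classifier. As a rough illustration of what such sequence-derived features can look like, the sketch below computes a mean Kyte-Doolittle hydropathy and a 20-element composition vector for a whole sequence. It is not the XtalPred-RF feature set: the function names are hypothetical, the choice of the Kyte-Doolittle scale is an assumption, and the record describes features calculated separately for the predicted surface and interior, which this example does not attempt.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale; using this particular scale is an
# assumption made for illustration, the record does not name one.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def standard_residues(sequence):
    """Return the standard amino-acid letters of a sequence, uppercased."""
    residues = [r for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def mean_hydropathy(sequence):
    """Average hydropathy over all standard residues (hypothetical helper)."""
    residues = standard_residues(sequence)
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


def composition(sequence):
    """Fraction of each of the 20 standard amino acids (hypothetical helper)."""
    residues = standard_residues(sequence)
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def feature_vector(sequence):
    """Mean hydropathy followed by the composition in alphabetical order."""
    comp = composition(sequence)
    return [mean_hydropathy(sequence)] + [comp[aa] for aa in AMINO_ACIDS]
```

A vector of this kind, one per target, is the sort of input a random forest implementation would be trained on, together with a label recording whether structure determination succeeded.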
Authors:
 [1] ;  [2] ; ;  [1] ;  [2] ;  [2]
  1. Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037 (United States)
  2. (United States)
Publication Date:
OSTI Identifier:
22351312
Resource Type:
Journal Article
Resource Relation:
Journal Name: Acta Crystallographica. Section D: Biological Crystallography; Journal Volume: 70; Journal Issue: Pt 3; Other Information: PMCID: PMC3949519; PMID: 24598732; PUBLISHER-ID: wd5222; OAI: oai:pubmedcentral.nih.gov:3949519; Copyright (c) International Union of Crystallography 2014; Country of input: International Atomic Energy Agency (IAEA)
Country of Publication:
Denmark
Language:
English
Subject:
75 CONDENSED MATTER PHYSICS, SUPERCONDUCTIVITY AND SUPERFLUIDITY; ALGORITHMS; CRYSTALLIZATION; CRYSTALS; DIFFRACTION; ENTROPY; FORECASTING; POTENTIALS; PROTEIN STRUCTURE; PROTEINS; SURFACES; TESTING