
Title: Improving the chances of successful protein structure determination with a random forest classifier

Journal Article · Acta Crystallographica. Section D: Biological Crystallography
  1. Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037 (United States)

Using an extended set of protein features calculated separately for the protein surface and interior, a new version of XtalPred based on a random forest classifier achieves a significant improvement in predicting the success of structure determination from the primary amino-acid sequence.

Obtaining diffraction-quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of a protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472–2482] was developed. XtalPred classifies proteins into five ‘crystallization classes’ based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features, such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface, is tested. The new XtalPred-RF (random forest) achieves a significant improvement in the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2; it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.
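
To make the ideas in the abstract concrete, the following is a minimal sketch of two sequence-derived features of the kind a crystallization predictor might consume (amino-acid composition and mean hydropathy), together with the arithmetic behind the statement that a 100% improvement is twofold. It is not the XtalPred-RF feature set or implementation: the function names are hypothetical, the features are computed over the whole sequence rather than separately for predicted surface and interior, and the Kyte-Doolittle hydropathy scale is an assumed choice of hydrophobicity measure.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
# Using this particular scale is an assumption made for illustration.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in an uppercase sequence."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in KYTE_DOOLITTLE}


def mean_hydropathy(seq):
    """Return the average Kyte-Doolittle hydropathy of the sequence.

    Assumes the sequence contains only the 20 standard one-letter codes.
    """
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def relative_improvement(new_rate, old_rate):
    """Return the relative improvement in percent (100% means twofold)."""
    return 100.0 * (new_rate - old_rate) / old_rate
```

For example, with purely hypothetical success rates, `relative_improvement(0.4, 0.2)` evaluates to a 100% improvement, i.e. the new rate is twice the old one. Feature vectors built from functions like these could then be passed to any random forest implementation.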

OSTI ID: 22351312
Journal Information: Acta Crystallographica. Section D: Biological Crystallography, Vol. 70, Issue Pt 3; Other Information: PMCID: PMC3949519; PMID: 24598732; PUBLISHER-ID: wd5222; OAI: oai:pubmedcentral.nih.gov:3949519; Copyright (c) International Union of Crystallography 2014; Country of input: International Atomic Energy Agency (IAEA); ISSN 0907-4449
Country of Publication: Denmark
Language: English