OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Complete fold annotation of the human proteome using a novel structural feature space

Abstract

Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Finally, our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.

Authors:
Middleton, Sarah A. [1]; Illuminati, Joseph [2]; Kim, Junhyong [3]
  1. Univ. of Pennsylvania, Philadelphia, PA (United States). Genomics and Computational Biology Program
  2. Univ. of Pennsylvania, Philadelphia, PA (United States). Dept. of Computer Science
  3. Univ. of Pennsylvania, Philadelphia, PA (United States). Genomics and Computational Biology Program; Univ. of Pennsylvania, Philadelphia, PA (United States). Dept. of Biology
Publication Date:
2017-04-13
Research Org.:
Krell Inst., Ames, IA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1366516
Grant/Contract Number:
FG02-97ER25308
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Scientific Reports
Additional Journal Information:
Journal Volume: 7; Journal ID: ISSN 2045-2322
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES

Citation Formats

Middleton, Sarah A., Illuminati, Joseph, and Kim, Junhyong. Complete fold annotation of the human proteome using a novel structural feature space. United States: N. p., 2017. Web. doi:10.1038/srep46321.
Middleton, Sarah A., Illuminati, Joseph, & Kim, Junhyong (2017). Complete fold annotation of the human proteome using a novel structural feature space. United States. doi:10.1038/srep46321.
Middleton, Sarah A., Illuminati, Joseph, and Kim, Junhyong. 2017. "Complete fold annotation of the human proteome using a novel structural feature space". United States. doi:10.1038/srep46321. https://www.osti.gov/servlets/purl/1366516.
@article{osti_1366516,
title = {Complete fold annotation of the human proteome using a novel structural feature space},
author = {Middleton, Sarah A. and Illuminati, Joseph and Kim, Junhyong},
abstractNote = {Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Finally, our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.},
doi = {10.1038/srep46321},
journal = {Scientific Reports},
volume = 7,
place = {United States},
year = {2017},
month = {Apr}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

  • Human identification from biological material is largely dependent on the ability to characterize genetic polymorphisms in DNA. Unfortunately, DNA can degrade in the environment, sometimes below the level at which it can be amplified by PCR. Protein however is chemically more robust than DNA and can persist for longer periods. Protein also contains genetic variation in the form of single amino acid polymorphisms. These can be used to infer the status of non-synonymous single nucleotide polymorphism alleles. To demonstrate this, we used mass spectrometry-based shotgun proteomics to characterize hair shaft proteins in 66 European-American subjects. A total of 596 single nucleotide polymorphism alleles were correctly imputed in 32 loci from 22 genes of subjects’ DNA and directly validated using Sanger sequencing. Estimates of the probability of resulting individual non-synonymous single nucleotide polymorphism allelic profiles in the European population, using the product rule, resulted in a maximum power of discrimination of 1 in 12,500. Imputed non-synonymous single nucleotide polymorphism profiles from European–American subjects were considerably less frequent in the African population (maximum likelihood ratio = 11,000). The converse was true for hair shafts collected from an additional 10 subjects with African ancestry, where some profiles were more frequent in the African population. Genetically variant peptides were also identified in hair shaft datasets from six archaeological skeletal remains (up to 260 years old). Furthermore, this study demonstrates that quantifiable measures of identity discrimination and biogeographic background can be obtained from detecting genetically variant peptides in hair shaft protein, including hair from bioarchaeological contexts.
  • Automated multidimensional capillary liquid chromatography-tandem mass spectrometry (LC-MS/MS) has been increasingly applied in various large scale proteome profiling efforts. However, comprehensive global proteome analysis remains technically challenging due to issues associated with sample complexity and dynamic range of protein abundances, which is particularly apparent in mammalian biological systems. We report here the application of a high efficiency cysteinyl-peptide enrichment (CPE) approach to the global proteome analysis of human mammary epithelial cells (HMECs) which significantly improved both sequence coverage of protein identifications and the overall proteome coverage. The cysteinyl-peptides were specifically enriched by using a thiol-specific covalent resin, fractionated by strong cation exchange chromatography, and subsequently analyzed by reversed-phase capillary LC-MS/MS. An HMEC tryptic digest without CPE was also fractionated and analyzed under the same conditions for comparison. The combined analyses of HMEC tryptic digests with and without CPE resulted in a total of 14,416 confidently identified peptides covering 4,294 different proteins with an estimated 10% gene coverage of the human genome. By using the high efficiency CPE, an additional 1,096 relatively low abundance proteins were identified, resulting in 34.3% increase in proteome coverage; 1,390 proteins were observed with increased sequence coverage. Comparative protein distribution analyses revealed that the CPE method is not biased by protein molecular weight, pI, gene location, cellular location, or biological functions. These results demonstrate that the use of the CPE approach provides improved efficiency in comprehensive proteome-wide analyses of highly complex mammalian biological systems.