OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data

Journal Article · Proteomics
 [1];  [1];  [2];  [1];  [3];  [3];  [3];  [3];  [2]
  1. Univ. of California, San Diego, CA (United States). Dept. of Electrical and Computer Engineering
  2. Univ. of California, San Diego, CA (United States). Dept. of Computer Science and Engineering
  3. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular sub-typing of cancers and the discovery of novel biomarkers. The availability of genomics technologies (mainly whole-genome and exome sequencing, and transcript sampling via RNA-seq, collectively referred to as NGS) has fueled recent studies on somatic mutation discovery. However, the vision is challenged by the complexity, redundancy, and errors in genomic data, and the difficulty of investigating the proteome using only genomic approaches. Recently, combinations of proteomic and genomic technologies have been increasingly employed. However, the complexity and redundancy of NGS data remain a challenge for proteogenomics, and various trade-offs must be made to allow the searches to take place. This paper discusses two such trade-offs, relating to large database search and FDR calculations, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any mass spectrometry sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database, which contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) were searched against the database. By applying the most conservative FDR measure, we identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and non-sample-recruited mutations, which emphasize the strength of our approach.
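
The abstract reports peptide identifications at a 1% FDR threshold but does not state how the FDR was computed. As a rough illustration only, the following Python sketch shows a generic target-decoy estimate, a common way to pick a score cutoff for a given FDR in database search. The function name, the (score, is_decoy) input layout, and the decoys/targets estimator are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def score_cutoff_at_fdr(
    psms: List[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    Hypothetical input layout: each item is (score, is_decoy), where a
    higher score means a better peptide-spectrum match. The FDR at a
    cutoff is estimated as (#decoys) / (#targets) among matches scoring
    at or above it. Returns None if no cutoff satisfies the threshold.
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the accepted list while the estimate stays in bounds.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


if __name__ == "__main__":
    # Toy data, invented for illustration: three target hits and one decoy hit.
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(score_cutoff_at_fdr(toy, max_fdr=0.01))
```

In a proteogenomic setting the same estimate could be applied separately to known and novel peptide matches, which is generally stricter for the much smaller novel class; whether the paper's "most conservative FDR measure" corresponds to this is not stated in the abstract.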

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States). Environmental Molecular Sciences Laboratory (EMSL)
Sponsoring Organization:
USDOE; National Institutes of Health (NIH); National Science Foundation (NSF)
Grant/Contract Number:
AC05-76RL01830; DGE-0504645; U24-CA-160019; P41GM103493
OSTI ID:
1166875
Report Number(s):
PNNL-SA-105664; 46206; 48135; 400412000
Journal Information:
Proteomics, Vol. 14, Issue 23-24; ISSN 1615-9853
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 48 works
Citation information provided by
Web of Science

Cited By (12)

Proteogenomics from a bioinformatics angle: A growing field journal December 2015
Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology journal December 2018
Comprehensive analysis of human protein N-termini enables assessment of various protein forms journal July 2017
FusionPro, a Versatile Proteogenomic Tool for Identification of Novel Fusion Transcripts and Their Potential Translation Products in Cancer Cells journal June 2019
Onco-proteogenomics: Multi-omics level data integration for accurate phenotype prediction journal August 2017
Origins and clinical relevance of proteoforms in pediatric malignancies journal February 2019
High throughput discovery of protein variants using proteomics informed by transcriptomics journal April 2018
Proteogenomic annotation of the Chinese hamster reveals extensive novel translation events and endogenous retroviral elements journal November 2018
Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification journal December 2016
Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins journal October 2017
CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms journal January 2016
Proteogenomic analysis prioritises functional single nucleotide variants in cancer samples journal September 2017