DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data

Abstract

Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular sub-typing of cancers and to the discovery of novel biomarkers. The availability of genomics technologies (mainly whole-genome and exome sequencing, and transcript sampling via RNA-seq, collectively referred to as NGS) has fueled recent studies on somatic mutation discovery. However, the vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome using only genomic approaches. Recently, combinations of proteomic and genomic technologies have been increasingly employed. However, the complexity and redundancy of NGS data remain a challenge for proteogenomics, and various trade-offs must be made to allow the searches to take place. This paper provides a discussion of two such trade-offs, relating to large database search and FDR calculations, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any mass spectrometry sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database, which contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) were searched against the database. By applying the most conservative FDR measure, we identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and non-sample-recruited mutations, which emphasize the strength of our approach.

Authors:
 [1];  [1];  [2];  [1];  [3];  [3];  [3];  [3];  [2]
  1. Univ. of California, San Diego, CA (United States). Dept. of Electrical and Computer Engineering
  2. Univ. of California, San Diego, CA (United States). Dept. of Computer Science and Engineering
  3. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Publication Date:
2014-11-17
Research Org.:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States). Environmental Molecular Sciences Laboratory (EMSL)
Sponsoring Org.:
USDOE; National Institutes of Health (NIH); National Science Foundation (NSF)
OSTI Identifier:
1166875
Report Number(s):
PNNL-SA-105664
Journal ID: ISSN 1615-9853; 46206; 48135; 400412000
Grant/Contract Number:  
AC05-76RL01830; DGE-0504645; U24-CA-160019; P41GM103493
Resource Type:
Accepted Manuscript
Journal Name:
Proteomics
Additional Journal Information:
Journal Volume: 14; Journal Issue: 23-24; Journal ID: ISSN 1615-9853
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 60 APPLIED LIFE SCIENCES; Proteogenomics; Ovarian cancer; Mutated peptide identification; Cancer; MS

Citation Formats

Woo, Sunghee, Cha, Seong Won, Na, Seungjin, Guest, Clark, Liu, Tao, Smith, Richard D., Rodland, Karin D., Payne, Samuel H., and Bafna, Vineet. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data. United States: N. p., 2014. Web. doi:10.1002/pmic.201400206.
Woo, Sunghee, Cha, Seong Won, Na, Seungjin, Guest, Clark, Liu, Tao, Smith, Richard D., Rodland, Karin D., Payne, Samuel H., & Bafna, Vineet. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data. United States. https://doi.org/10.1002/pmic.201400206
Woo, Sunghee, Cha, Seong Won, Na, Seungjin, Guest, Clark, Liu, Tao, Smith, Richard D., Rodland, Karin D., Payne, Samuel H., and Bafna, Vineet. 2014. "Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data". United States. https://doi.org/10.1002/pmic.201400206. https://www.osti.gov/servlets/purl/1166875.
@article{osti_1166875,
title = {Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data},
author = {Woo, Sunghee and Cha, Seong Won and Na, Seungjin and Guest, Clark and Liu, Tao and Smith, Richard D. and Rodland, Karin D. and Payne, Samuel H. and Bafna, Vineet},
doi = {10.1002/pmic.201400206},
journal = {Proteomics},
number = {23-24},
volume = {14},
place = {United States},
year = {2014},
month = {11}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 48 works
Citation information provided by
Web of Science

Works referenced in this record:

Integrated genomic analyses of ovarian carcinoma
journal, June 2011


Correlation between Protein and mRNA Abundance in Yeast
journal, March 1999

  • Gygi, Steven P.; Rochon, Yvan; Franza, B. Robert
  • Molecular and Cellular Biology, Vol. 19, Issue 3
  • DOI: 10.1128/MCB.19.3.1720

Correlation of mRNA and protein abundance in the developing maize leaf
journal, April 2014

  • Ponnala, Lalit; Wang, Yupeng; Sun, Qi
  • The Plant Journal, Vol. 78, Issue 3
  • DOI: 10.1111/tpj.12482

A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics
journal, March 2011

  • Li, Jing; Su, Zengliu; Ma, Ze-Qiang
  • Molecular & Cellular Proteomics, Vol. 10, Issue 5
  • DOI: 10.1074/mcp.M110.006536

CanProVar: a human cancer proteome variation database
journal, March 2010

  • Li, Jing; Duncan, Dexter T.; Zhang, Bing
  • Human Mutation, Vol. 31, Issue 3
  • DOI: 10.1002/humu.21176

Protein Identification Using Customized Protein Sequence Databases Derived from RNA-Seq Data
journal, December 2011

  • Wang, Xiaojing; Slebos, Robbert J. C.; Wang, Dong
  • Journal of Proteome Research, Vol. 11, Issue 2
  • DOI: 10.1021/pr200766z

TopHat: discovering splice junctions with RNA-Seq
journal, March 2009


TopHat-Fusion: an algorithm for discovery of novel fusion transcripts
journal, January 2011


TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
journal, January 2013


The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
journal, July 2010


A framework for variation discovery and genotyping using next-generation DNA sequencing data
journal, April 2011

  • DePristo, Mark A.; Banks, Eric; Poplin, Ryan
  • Nature Genetics, Vol. 43, Issue 5
  • DOI: 10.1038/ng.806

Proteogenomic Database Construction Driven from Large Scale RNA-seq Data
journal, July 2013

  • Woo, Sunghee; Cha, Seong Won; Merrihew, Gennifer
  • Journal of Proteome Research, Vol. 13, Issue 1
  • DOI: 10.1021/pr400294c

Improving gene annotation using peptide mass spectrometry
journal, January 2007


Discovery and revision of Arabidopsis genes by proteogenomics
journal, December 2008

  • Castellana, N. E.; Payne, S. H.; Shen, Z.
  • Proceedings of the National Academy of Sciences, Vol. 105, Issue 52
  • DOI: 10.1073/pnas.0811066106

Proteogenomics to discover the full coding content of genomes: A computational perspective
journal, October 2010


An Automated Proteogenomic Method Uses Mass Spectrometry to Reveal Novel Genes in Zea mays
journal, October 2013

  • Castellana, Natalie E.; Shen, Zhouxin; He, Yupeng
  • Molecular & Cellular Proteomics, Vol. 13, Issue 1
  • DOI: 10.1074/mcp.M113.031260

Novel peptide identification from tandem mass spectra using ESTs and sequence database compression
journal, January 2007


The Sequence Alignment/Map format and SAMtools
journal, June 2009


The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search
journal, September 2010

  • Kim, Sangtae; Mischerikow, Nikolai; Bandeira, Nuno
  • Molecular & Cellular Proteomics, Vol. 9, Issue 12
  • DOI: 10.1074/mcp.M110.003731

Ensembl 2013
journal, November 2012

  • Flicek, Paul; Ahmed, Ikhlak; Amode, M. Ridwan
  • Nucleic Acids Research, Vol. 41, Issue D1
  • DOI: 10.1093/nar/gks1236

dbSNP: the NCBI database of genetic variation
journal, January 2001


De novo derivation of proteomes from transcriptomes for transcript and protein identification
journal, November 2012

  • Evans, Vanessa C.; Barker, Gary; Heesom, Kate J.
  • Nature Methods, Vol. 9, Issue 12
  • DOI: 10.1038/nmeth.2227

Proteogenomic Analysis of Bacteria and Archaea: A 46 Organism Case Study
journal, November 2011


customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search
journal, September 2013


Ancient genomes reveal social and genetic structure of Late Neolithic Switzerland
journal, April 2020

  • Furtwängler, Anja; Rohrlach, A. B.; Lamnidis, Thiseas C.
  • Nature Communications, Vol. 11, Issue 1
  • DOI: 10.1038/s41467-020-15560-x

SNHG7 is a lncRNA oncogene controlled by Insulin-like Growth Factor signaling through a negative feedback loop to tightly regulate proliferation
journal, May 2020


Integrated genomic analyses of ovarian carcinoma
text, January 2011

  • Perou, Charles
  • The University of North Carolina at Chapel Hill University Libraries
  • DOI: 10.17615/hvp3-wg08

Comprehensive molecular portraits of human breast tumours
text, January 2012

  • Perou, Charles
  • The University of North Carolina at Chapel Hill University Libraries
  • DOI: 10.17615/hyeb-c392

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
text, January 2013


Works referencing / citing this record:

Proteogenomics from a bioinformatics angle: A growing field
journal, December 2015

  • Menschaert, Gerben; Fenyö, David
  • Mass Spectrometry Reviews, Vol. 36, Issue 5
  • DOI: 10.1002/mas.21483

Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology
journal, December 2018


Comprehensive analysis of human protein N-termini enables assessment of various protein forms
journal, July 2017


FusionPro, a Versatile Proteogenomic Tool for Identification of Novel Fusion Transcripts and Their Potential Translation Products in Cancer Cells
journal, June 2019

  • Kim, Chae-Yeon; Na, Keun; Park, Saeram
  • Molecular & Cellular Proteomics, Vol. 18, Issue 8
  • DOI: 10.1074/mcp.ra119.001456

Onco-proteogenomics: Multi-omics level data integration for accurate phenotype prediction
journal, August 2017

  • Dimitrakopoulos, Lampros; Prassas, Ioannis; Diamandis, Eleftherios P.
  • Critical Reviews in Clinical Laboratory Sciences, Vol. 54, Issue 6
  • DOI: 10.1080/10408363.2017.1384446

Origins and clinical relevance of proteoforms in pediatric malignancies
journal, February 2019


High throughput discovery of protein variants using proteomics informed by transcriptomics
journal, April 2018

  • Saha, Shyamasree; Matthews, David A.; Bessant, Conrad
  • Nucleic Acids Research, Vol. 46, Issue 10
  • DOI: 10.1093/nar/gky295

Proteogenomic annotation of the Chinese hamster reveals extensive novel translation events and endogenous retroviral elements
journal, November 2018

  • Li, Shangzhong; Cha, Seong Won; Hefner, Kelly
  • Journal of Proteome Research
  • DOI: 10.1101/468181

Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification
journal, December 2016


Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins
journal, October 2017


CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms
journal, January 2016

  • Krasnov, George S.; Dmitriev, Alexey A.; Melnikova, Nataliya V.
  • Nucleic Acids Research, Vol. 44, Issue 7
  • DOI: 10.1093/nar/gkv1478

Proteogenomic analysis prioritises functional single nucleotide variants in cancer samples
journal, September 2017

