Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data
Abstract
Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular sub-typing of cancers and to the discovery of novel biomarkers. The availability of genomics technologies (mainly whole-genome and exome sequencing, and transcript sampling via RNA-seq, collectively referred to as NGS) has fueled recent studies on somatic mutation discovery. However, the vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome using only genomic approaches. Recently, combinations of proteomic and genomic technologies have been increasingly employed. However, the complexity and redundancy of NGS data remain a challenge for proteogenomics, and various trade-offs must be made to allow the searches to take place. This paper provides a discussion of two such trade-offs, relating to large database search and FDR calculations, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any mass spectrometry sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database which contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) were searched against the database. By applying the most conservative FDR measure, we have identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and non-sample-recruited mutations, which emphasize the strength of our approach.
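The abstract refers to FDR calculations at a 1% threshold, applied to known and novel peptides. This record does not state which estimator the authors used, so the following is only a minimal sketch of the widely used target-decoy estimate (decoy hits divided by target hits above a score cutoff). The function names `fdr_threshold` and `accepted_targets`, the `(score, is_decoy)` tuple layout, and the convention that higher scores are better are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Each peptide-spectrum match (PSM) is (score, is_decoy); higher score = better.
# This layout is a hypothetical convention, not taken from the paper.
PSM = Tuple[float, bool]


def fdr_threshold(psms: List[PSM], max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR (decoys / targets
    at or above the cutoff) stays within max_fdr, or None if no cutoff works."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the cutoff downward while the estimate is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


def accepted_targets(psms: List[PSM], max_fdr: float = 0.01) -> List[float]:
    """Scores of target PSMs that pass the FDR-controlled cutoff."""
    cutoff = fdr_threshold(psms, max_fdr)
    if cutoff is None:
        return []
    return [s for s, is_decoy in psms if not is_decoy and s >= cutoff]
```

One possible stricter variant, again an assumption rather than the paper's stated procedure, is to call `accepted_targets` separately on the known-peptide and novel-peptide matches, so that the much larger pool of known identifications cannot mask a higher error rate among the novel ones.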
- Authors:
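- Woo, Sunghee; Cha, Seong Won; Na, Seungjin; Guest, Clark; Liu, Tao; Smith, Richard D.; Rodland, Karin D.; Payne, Samuel H.; Bafna, Vineet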
- Univ. of California, San Diego, CA (United States). Dept. of Electrical and Computer Engineering
- Univ. of California, San Diego, CA (United States). Dept. of Computer Science and Engineering
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Publication Date:
- 2014-11-17
- Research Org.:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States). Environmental Molecular Sciences Laboratory (EMSL)
- Sponsoring Org.:
- USDOE; National Institutes of Health (NIH); National Science Foundation (NSF)
- OSTI Identifier:
- 1166875
- Report Number(s):
- PNNL-SA-105664
Journal ID: ISSN 1615-9853; 46206; 48135; 400412000
- Grant/Contract Number:
- AC05-76RL01830; DGE-0504645; U24-CA-160019; P41GM103493
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Proteomics
- Additional Journal Information:
- Journal Volume: 14; Journal Issue: 23-24; Journal ID: ISSN 1615-9853
- Publisher:
- Wiley
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; 60 APPLIED LIFE SCIENCES; Proteogenomics; Ovarian cancer; Mutated peptide identification; Cancer; MS
Citation Formats
Woo, Sunghee, Cha, Seong Won, Na, Seungjin, Guest, Clark, Liu, Tao, Smith, Richard D., Rodland, Karin D., Payne, Samuel H., and Bafna, Vineet. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data. United States: N. p., 2014.
Web. doi:10.1002/pmic.201400206.
Woo, Sunghee, Cha, Seong Won, Na, Seungjin, Guest, Clark, Liu, Tao, Smith, Richard D., Rodland, Karin D., Payne, Samuel H., & Bafna, Vineet. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data. United States. https://doi.org/10.1002/pmic.201400206
Woo, Sunghee, Cha, Seong Won, Na, Seungjin, Guest, Clark, Liu, Tao, Smith, Richard D., Rodland, Karin D., Payne, Samuel H., and Bafna, Vineet. 2014. "Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data". United States. https://doi.org/10.1002/pmic.201400206. https://www.osti.gov/servlets/purl/1166875.
@article{osti_1166875,
title = {Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data},
author = {Woo, Sunghee and Cha, Seong Won and Na, Seungjin and Guest, Clark and Liu, Tao and Smith, Richard D. and Rodland, Karin D. and Payne, Samuel H. and Bafna, Vineet},
abstractNote = {Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular sub-typing of cancers and to the discovery of novel biomarkers. The availability of genomics technologies (mainly whole-genome and exome sequencing, and transcript sampling via RNA-seq, collectively referred to as NGS) has fueled recent studies on somatic mutation discovery. However, the vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome using only genomic approaches. Recently, combinations of proteomic and genomic technologies have been increasingly employed. However, the complexity and redundancy of NGS data remain a challenge for proteogenomics, and various trade-offs must be made to allow the searches to take place. This paper provides a discussion of two such trade-offs, relating to large database search and FDR calculations, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any mass spectrometry sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database which contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) were searched against the database. By applying the most conservative FDR measure, we have identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and non-sample-recruited mutations, which emphasize the strength of our approach.},
doi = {10.1002/pmic.201400206},
journal = {Proteomics},
number = {23-24},
volume = 14,
place = {United States},
year = {2014},
month = {11}
}
Works referenced in this record:
Integrated genomic analyses of ovarian carcinoma
journal, June 2011
- The Cancer Genome Atlas Research Network
- Nature, Vol. 474, Issue 7353, p. 609-615
Correlation between Protein and mRNA Abundance in Yeast
journal, March 1999
- Gygi, Steven P.; Rochon, Yvan; Franza, B. Robert
- Molecular and Cellular Biology, Vol. 19, Issue 3
Correlation of mRNA and protein abundance in the developing maize leaf
journal, April 2014
- Ponnala, Lalit; Wang, Yupeng; Sun, Qi
- The Plant Journal, Vol. 78, Issue 3
A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics
journal, March 2011
- Li, Jing; Su, Zengliu; Ma, Ze-Qiang
- Molecular & Cellular Proteomics, Vol. 10, Issue 5
CanProVar: a human cancer proteome variation database
journal, March 2010
- Li, Jing; Duncan, Dexter T.; Zhang, Bing
- Human Mutation, Vol. 31, Issue 3
Protein Identification Using Customized Protein Sequence Databases Derived from RNA-Seq Data
journal, December 2011
- Wang, Xiaojing; Slebos, Robbert J. C.; Wang, Dong
- Journal of Proteome Research, Vol. 11, Issue 2
TopHat: discovering splice junctions with RNA-Seq
journal, March 2009
- Trapnell, Cole; Pachter, Lior; Salzberg, Steven L.
- Bioinformatics, Vol. 25, Issue 9
TopHat-Fusion: an algorithm for discovery of novel fusion transcripts
journal, January 2011
- Kim, Daehwan; Salzberg, Steven L.
- Genome Biology, Vol. 12, Issue 8
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
journal, January 2013
- Kim, Daehwan; Pertea, Geo; Trapnell, Cole
- Genome Biology, Vol. 14, Issue 4
The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
journal, July 2010
- McKenna, A.; Hanna, M.; Banks, E.
- Genome Research, Vol. 20, Issue 9
A framework for variation discovery and genotyping using next-generation DNA sequencing data
journal, April 2011
- DePristo, Mark A.; Banks, Eric; Poplin, Ryan
- Nature Genetics, Vol. 43, Issue 5
Proteogenomic Database Construction Driven from Large Scale RNA-seq Data
journal, July 2013
- Woo, Sunghee; Cha, Seong Won; Merrihew, Gennifer
- Journal of Proteome Research, Vol. 13, Issue 1
Improving gene annotation using peptide mass spectrometry
journal, January 2007
- Tanner, S.; Shen, Z.; Ng, J.
- Genome Research, Vol. 17, Issue 2
Discovery and revision of Arabidopsis genes by proteogenomics
journal, December 2008
- Castellana, N. E.; Payne, S. H.; Shen, Z.
- Proceedings of the National Academy of Sciences, Vol. 105, Issue 52
Proteogenomics to discover the full coding content of genomes: A computational perspective
journal, October 2010
- Castellana, Natalie; Bafna, Vineet
- Journal of Proteomics, Vol. 73, Issue 11
An Automated Proteogenomic Method Uses Mass Spectrometry to Reveal Novel Genes in Zea mays
journal, October 2013
- Castellana, Natalie E.; Shen, Zhouxin; He, Yupeng
- Molecular & Cellular Proteomics, Vol. 13, Issue 1
Novel peptide identification from tandem mass spectra using ESTs and sequence database compression
journal, January 2007
- Edwards, Nathan J.
- Molecular Systems Biology, Vol. 3, Issue 1
The Sequence Alignment/Map format and SAMtools
journal, June 2009
- Li, H.; Handsaker, B.; Wysoker, A.
- Bioinformatics, Vol. 25, Issue 16
The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search
journal, September 2010
- Kim, Sangtae; Mischerikow, Nikolai; Bandeira, Nuno
- Molecular & Cellular Proteomics, Vol. 9, Issue 12
Ensembl 2013
journal, November 2012
- Flicek, Paul; Ahmed, Ikhlak; Amode, M. Ridwan
- Nucleic Acids Research, Vol. 41, Issue D1
dbSNP: the NCBI database of genetic variation
journal, January 2001
- Sherry, S. T.
- Nucleic Acids Research, Vol. 29, Issue 1
De novo derivation of proteomes from transcriptomes for transcript and protein identification
journal, November 2012
- Evans, Vanessa C.; Barker, Gary; Heesom, Kate J.
- Nature Methods, Vol. 9, Issue 12
Proteogenomic Analysis of Bacteria and Archaea: A 46 Organism Case Study
journal, November 2011
- Venter, Eli; Smith, Richard D.; Payne, Samuel H.
- PLoS ONE, Vol. 6, Issue 11
customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search
journal, September 2013
- Wang, Xiaojing; Zhang, Bing
- Bioinformatics, Vol. 29, Issue 24
Comprehensive molecular portraits of human breast tumours
text, January 2012
- Perou, Charles
- The University of North Carolina at Chapel Hill University Libraries
Works referencing / citing this record:
Proteogenomics from a bioinformatics angle: A growing field
journal, December 2015
- Menschaert, Gerben; Fenyö, David
- Mass Spectrometry Reviews, Vol. 36, Issue 5
Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology
journal, December 2018
- Low, Teck Yew; Mohtar, M. Aiman; Ang, Mia Yang
- Proteomics, Vol. 19, Issue 10
Comprehensive analysis of human protein N-termini enables assessment of various protein forms
journal, July 2017
- Yeom, Jeonghun; Ju, Shinyeong; Choi, YunJin
- Scientific Reports, Vol. 7, Issue 1
FusionPro, a Versatile Proteogenomic Tool for Identification of Novel Fusion Transcripts and Their Potential Translation Products in Cancer Cells
journal, June 2019
- Kim, Chae-Yeon; Na, Keun; Park, Saeram
- Molecular & Cellular Proteomics, Vol. 18, Issue 8
Onco-proteogenomics: Multi-omics level data integration for accurate phenotype prediction
journal, August 2017
- Dimitrakopoulos, Lampros; Prassas, Ioannis; Diamandis, Eleftherios P.
- Critical Reviews in Clinical Laboratory Sciences, Vol. 54, Issue 6
Origins and clinical relevance of proteoforms in pediatric malignancies
journal, February 2019
- Lorentzian, Amanda; Uzozie, Anuli; Lange, Philipp F.
- Expert Review of Proteomics, Vol. 16, Issue 3
High throughput discovery of protein variants using proteomics informed by transcriptomics
journal, April 2018
- Saha, Shyamasree; Matthews, David A.; Bessant, Conrad
- Nucleic Acids Research, Vol. 46, Issue 10
Proteogenomic annotation of the Chinese hamster reveals extensive novel translation events and endogenous retroviral elements
journal, November 2018
- Li, Shangzhong; Cha, Seong Won; Hefner, Kelly
- Journal of Proteome Research
Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification
journal, December 2016
- Li, Honglan; Joh, Yoon Sung; Kim, Hyunwoo
- BMC Genomics, Vol. 17, Issue S13
Deep transcriptome annotation enables the discovery and functional characterization of cryptic small proteins
journal, October 2017
- Samandi, Sondos; Roy, Annie V.; Delcourt, Vivian
- eLife, Vol. 6
CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms
journal, January 2016
- Krasnov, George S.; Dmitriev, Alexey A.; Melnikova, Nataliya V.
- Nucleic Acids Research, Vol. 44, Issue 7
Proteogenomic analysis prioritises functional single nucleotide variants in cancer samples
journal, September 2017
- Ma, Shiyong; Menon, Ranjeeta; Poulos, Rebecca C.
- Oncotarget, Vol. 8, Issue 56