Title: Proteogenomic strategies for identification of aberrant cancer peptides using large-scale Next Generation Sequencing data

Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular sub-typing of cancers and to the discovery of novel biomarkers. The availability of genomics technologies (mainly whole-genome and exome sequencing, and transcript sampling via RNA-seq, collectively referred to as NGS) has fueled recent studies on somatic mutation discovery. However, this vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome using only genomic approaches. Recently, combinations of proteomic and genomic technologies have been increasingly employed. However, the complexity and redundancy of NGS data remain a challenge for proteogenomics, and various trade-offs must be made to allow the searches to take place. This paper discusses two such trade-offs, relating to large database search and FDR calculations, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any mass spectrometry sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database, which contained 2,787,062 novel splice junctions, 38,464 deletions, 1105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) was searched against the database. By applying the most conservative FDR measure, we have identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and non-sample-recruited mutations, which emphasize the strength of our approach.
 [1] ;  [1] ;  [2] ;  [1] ;  [3] ;  [3] ;  [3] ;  [3] ;  [2]
  1. Univ. of California, San Diego, CA (United States). Dept. of Electrical and Computer Engineering
  2. Univ. of California, San Diego, CA (United States). Dept. of Computer Science and Engineering
  3. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
Journal ID: ISSN 1615-9853; 46206; 48135; 400412000
Grant/Contract Number:
AC05-76RL01830; DGE-0504645; U24-CA-160019; P41GM103493
Accepted Manuscript
Journal Name:
Additional Journal Information:
Journal Volume: 14; Journal Issue: 23-24; Journal ID: ISSN 1615-9853
Research Org:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Environmental Molecular Sciences Laboratory (EMSL)
Sponsoring Org:
USDOE; National Institutes of Health (NIH); National Science Foundation (NSF)
Country of Publication:
United States
Subject:
59 BASIC BIOLOGICAL SCIENCES; 60 APPLIED LIFE SCIENCES; Proteogenomics; Ovarian cancer; Mutated peptide identification; Cancer; MS