dbPepVar is a new proteogenomics database that combines genetic variation information from dbSNP with protein sequences from NCBI's RefSeq. We used it to perform a pan-cancer analysis (ovarian, colorectal, breast, and prostate) on public mass spectrometry datasets to identify the genetic variations and genes present in the analyzed samples. As a result, we identified 2,661 variant peptides in breast cancer (BrCa), 2,411 in colorectal cancer (CrCa), 3,726 in ovarian cancer (OvCa), and 2,543 in prostate cancer (PrCa).
Compared to other approaches, our database contains a greater diversity of variants, including missense and nonsense mutations, loss of the termination codon, insertions, deletions (of any size), frameshifts, and mutations that alter the translation start site. In addition, for each protein, only the variant peptides derived from enzymatic cleavage (i.e., with trypsin) are included, subject to criteria on peptide size, allele frequency, and affected region of the protein. In our approach, mass spectrometry (MS) data are searched against the dbPepVar variant database and the reference database separately. The two outputs are then compared and filtered by the scores obtained against each database. Using public MS data from four types of cancer, we identified mostly cancer-specific SNPs; shared mutations were also present, but in smaller numbers.
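The comparison of the two search outputs can be illustrated with a minimal sketch. The exact filtering rule is not stated here, so the rule below is an assumption: a variant match is kept when its score reaches a threshold and exceeds the score the same spectrum obtained against the reference database. The function name, the dictionary layout, the `min_score` parameter, and the example scores are all hypothetical.

```python
def select_variant_matches(variant_hits, reference_hits, min_score=0.0):
    """Keep variant matches that outscore the reference match for the same spectrum.

    Both inputs map a spectrum ID to a (peptide, score) pair, where a higher
    score means a better match. This rule is an assumed reading of
    "compared and filtered by the scores", not the published procedure.
    """
    selected = {}
    for spectrum_id, (peptide, score) in variant_hits.items():
        if score < min_score:
            continue  # variant match too weak on its own
        reference = reference_hits.get(spectrum_id)
        # Keep the match if the spectrum has no reference match at all,
        # or if the variant peptide explains the spectrum better.
        if reference is None or score > reference[1]:
            selected[spectrum_id] = (peptide, score)
    return selected


# Hypothetical search results for three spectra.
variant_hits = {
    "scan_1": ("TACIAK", 42.0),
    "scan_2": ("QILFVK", 18.5),
    "scan_3": ("TAYIANQR", 30.0),
}
reference_hits = {
    "scan_1": ("TAYIAK", 25.0),
    "scan_2": ("QISFVK", 40.0),
}
kept = select_variant_matches(variant_hits, reference_hits, min_score=20.0)
```

In this toy input, scan_2 is discarded because its variant score falls below the threshold, while scan_3 is kept because it has no competing reference match.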
The dbPepVar FASTA file construction process:
A) First, the reference protein is mutated according to dbSNP information. The mutated peptides are then located in the resulting mutated protein.
B) A list of the mutated peptides is generated for each protein present in RefSeq.
C) The final FASTA file is generated by concatenating the mutated peptides of each protein into a new theoretical sequence.
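Steps A to C can be sketched for the simplest case, a missense substitution. This is an illustration under stated assumptions, not the dbPepVar pipeline itself: the trypsin rule (cleave after K or R unless followed by P, no missed cleavages), the size limits, the function names, the toy sequence, and the protein identifier are all hypothetical.

```python
def tryptic_peptides(protein):
    """Digest a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    if start < len(protein):
        peptides.append((start, protein[start:]))  # C-terminal peptide
    return peptides


def apply_missense(protein, position, alt):
    """Step A: substitute one residue (position is 1-based)."""
    index = position - 1
    return protein[:index] + alt + protein[index + 1:]


def variant_peptides(reference, mutated, min_len, max_len):
    """Step A, continued: peptides of the mutated digest absent from the reference digest."""
    reference_set = {pep for _, pep in tryptic_peptides(reference)}
    return [pep for _, pep in tryptic_peptides(mutated)
            if pep not in reference_set and min_len <= len(pep) <= max_len]


def build_entry(protein_id, reference, substitutions, min_len=5, max_len=35):
    """Steps B and C: collect the variant peptides and concatenate them into one FASTA entry."""
    collected = []
    for position, alt in substitutions:
        mutated = apply_missense(reference, position, alt)
        for pep in variant_peptides(reference, mutated, min_len, max_len):
            if pep not in collected:
                collected.append(pep)
    if not collected:
        return ""
    return f">{protein_id}\n{''.join(collected)}\n"


# Toy protein with two hypothetical substitutions: Y5C and S13L.
reference = "MKTAYIAKQRQISFVK"
entry = build_entry("NP_000001.1", reference, [(5, "C"), (13, "L")])
print(entry)
```

Because each tryptic peptide in this example ends in K or R, digesting the concatenated sequence again yields the same variant peptides, which is what makes the concatenation usable as a search database. Insertions, deletions, frameshifts, and stop-codon changes would need their own mutation functions.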
dbPepVar provides a log file containing information about the mutated peptides. The header fields, delimited by tabs, are the protein identifier (RefSeq), the SNP identifier, and the position of the peptide in the reference protein. Each entry contains the sequences of the reference peptide and the mutated peptide. Each mutation type is stored in a separate file, and the missense and nonsense mutations are available in the Minor Allele Frequency (MAF) files.
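A reader for such a log file might look like the sketch below. The exact line layout is not specified here, so this assumes each entry spans three lines: a tab-delimited header (optionally prefixed with ">"), then the reference peptide, then the mutated peptide. The layout, the `LogEntry` type, the `parse_log` function, and the sample identifiers are hypothetical and should be checked against the actual files.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    protein_id: str         # RefSeq protein identifier
    snp_id: str             # dbSNP identifier
    position: str           # kept as text, since the position format is not specified
    reference_peptide: str
    mutated_peptide: str


def parse_log(lines):
    """Yield one LogEntry per assumed three-line block (header, reference, mutated)."""
    content = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(content) - 2, 3):
        fields = content[i].lstrip(">").split("\t")
        protein_id, snp_id, position = fields[:3]
        yield LogEntry(protein_id, snp_id, position,
                       content[i + 1].strip(), content[i + 2].strip())


# Hypothetical log content with a single entry.
sample = [
    "NP_000001.1\trs0000001\t3-8\n",
    "TAYIAK\n",
    "TACIAK\n",
]
entries = list(parse_log(sample))
```

With a real file, the same function can be applied to an open file handle, for example `parse_log(open(path))`, where `path` points to one of the per-mutation-type log files.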