Slide1
Protein and gene model inference based on statistical modeling in k-partite graphs
Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann
Slide2
Problem Description
Given peptides and scores/probabilities, infer the set of proteins present in the sample.
PERFGKLMQK
MLLTDFSSAWCR
FFRDESQINNR
TGYIPPPLJMGKR
Protein A
Protein B
Protein C
Slide3
Previous Approaches
N-peptides rule
ProteinProphet (Nesvizhskii et al. 2003. Anal Chem)
  Assumes peptide scores are correct.
Nested mixture model (Li et al. 2010. Ann Appl Statist)
  Rescores peptides while doing the protein inference.
  Does not allow shared peptides.
  Peptide scores are independent.
Hierarchical statistical model (Shen et al. 2008. Bioinformatics)
  Allows for shared peptides.
  Assumes PSM scores for the same peptide are independent.
  Impractical on normal datasets.
MSBayesPro (Li et al. 2009. J Comput Biol)
  Uses peptide detectabilities to determine peptide priors.
Slide4
Markovian Inference of Proteins and Gene Models (MIPGEM)
Inclusion of shared/degenerate peptides in the model.
Treats peptide scores/probabilities as random values
Model allows dependence of peptide scores.
Inference of gene models.
Slide5
Why scores as random values?
PERFGKLMQK
MLLTDFSSAWCR
FFRDESQINNR
TGYIPPPLJMGKR
Protein A
Protein B
Protein C
Slide6
Building the bipartite graph
Slide7
Shared peptides
Slide8
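The bipartite peptide-protein graph and its shared peptides from the preceding slides can be sketched in a few lines of Python. The peptide sequences are the ones shown earlier in the deck, but the edges between peptides and proteins are assumed for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed edges: which proteins each peptide sequence maps to.
# The sequences come from the slides; the assignments are illustrative only.
peptide_to_proteins = {
    "PERFGKLMQK": ["Protein A"],
    "MLLTDFSSAWCR": ["Protein A", "Protein B"],
    "FFRDESQINNR": ["Protein B"],
    "TGYIPPPLJMGKR": ["Protein C"],
}


def build_bipartite_graph(peptide_to_proteins):
    """Return both adjacency views of the peptide-protein graph."""
    # neighbors[i]: set of proteins adjacent to peptide i
    neighbors = {pep: set(prots) for pep, prots in peptide_to_proteins.items()}
    # peptides_of[j]: set of peptides adjacent to protein j
    peptides_of = defaultdict(set)
    for pep, prots in neighbors.items():
        for prot in prots:
            peptides_of[prot].add(pep)
    return neighbors, dict(peptides_of)


def shared_peptides(neighbors):
    """Peptides with more than one neighbouring protein (shared/degenerate)."""
    return sorted(pep for pep, prots in neighbors.items() if len(prots) > 1)


neighbors, peptides_of = build_bipartite_graph(peptide_to_proteins)
shared = shared_peptides(neighbors)
```

With these assumed edges, only the peptide attached to both Protein A and Protein B counts as shared; the others each identify a single protein.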
Definitions
Let p_i be the score/probability of peptide i. I is the set of all peptides.
Let Z_j be the indicator variable for protein j. J is the set of all proteins.
Slide9
Simple Probability Rules
Slide10
Bayes Rule
Prior probability on the protein being present
Joint probability of seeing these peptide scores
Probability of observing these peptide scores given that the protein is present
Slide11
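The three labelled terms on the Bayes Rule slide fit the standard form of Bayes' rule. Written for a single protein j in the notation of the Definitions slide, a plausible reconstruction (the slide's own equation may group the terms differently) is:

```latex
P\bigl(Z_j = 1 \mid \{p_i\}_{i \in I}\bigr)
  = \frac{P\bigl(\{p_i\}_{i \in I} \mid Z_j = 1\bigr)\; P(Z_j = 1)}
         {P\bigl(\{p_i\}_{i \in I}\bigr)}
```

Here P(Z_j = 1) is the prior on the protein being present, the denominator is the joint probability of the peptide scores, and the first factor in the numerator is the probability of the scores given that the protein is present.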
Assumptions
Prior probabilities of proteins are independent.
Dependencies can be included with a little more effort.
This does not mean that proteins are independent.
Slide12
Assumptions
Connected components are independent.
Slide13
Assumptions
Peptide scores are independent given their neighboring proteins.
Ne(i) is the set of proteins connected to peptide i in the graph.
I_r is the set of peptides belonging to the r-th connected component.
R(I_r) is the set of proteins connected to peptides in I_r.
Slide14
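The sets I_r and R(I_r) defined on the preceding slide can be computed with a breadth-first search over the bipartite graph. The sketch below is a minimal illustration; the input graph and the function name are made up, and the `ne` dictionary plays the role of Ne(i).

```python
from collections import deque


def connected_components(ne):
    """ne maps each peptide i to Ne(i), its set of neighbouring proteins.

    Returns a list of (I_r, R_Ir) pairs: the peptides of component r and
    the proteins connected to those peptides.
    """
    # Reverse adjacency: protein -> peptides attached to it.
    peptides_of = {}
    for pep, prots in ne.items():
        for prot in prots:
            peptides_of.setdefault(prot, set()).add(pep)

    seen = set()
    components = []
    for start in sorted(ne):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        I_r, R_Ir = set(), set()
        while queue:
            pep = queue.popleft()
            I_r.add(pep)
            for prot in ne[pep]:
                R_Ir.add(prot)
                # Any peptide sharing this protein is in the same component.
                for other in peptides_of[prot]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        components.append((I_r, R_Ir))
    return components


# Hypothetical graph: pep2 is shared between proteins A and B.
ne = {
    "pep1": {"A"},
    "pep2": {"A", "B"},
    "pep3": {"B"},
    "pep4": {"C"},
}
components = connected_components(ne)
```

Because components are assumed independent, each (I_r, R(I_r)) pair can then be processed on its own, which keeps any enumeration over protein states small.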
Assumptions
Conditional peptide probabilities are modeled by a mixture model.
The specific mixture model they use is based on the peptide scores used (from PeptideProphet).
Slide15
Bayes Rule
Prior probability on the protein being present
Joint probability of seeing these peptide scores
Probability of observing these peptide scores given that the protein is present
Slide16
Joint peptide score distribution
Assumption: peptides in different components are independent.
I_r is the set of peptides in component r.
R(I_r) is the set of proteins connected to peptides in I_r.
Slide17
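Combining the stated assumptions (independent components, independent protein priors, and peptide scores independent given their neighboring proteins) with the law of total probability gives one way to write the joint peptide score distribution. This is a reconstruction from those assumptions, not a copy of the slide's equation:

```latex
P\bigl(\{p_i\}_{i \in I}\bigr)
  = \prod_{r} P\bigl(\{p_i\}_{i \in I_r}\bigr),
\qquad
P\bigl(\{p_i\}_{i \in I_r}\bigr)
  = \sum_{\{z_j\}_{j \in R(I_r)}}
    \Bigl[\prod_{j \in R(I_r)} P(Z_j = z_j)\Bigr]
    \Bigl[\prod_{i \in I_r} P\bigl(p_i \mid \{z_j\}_{j \in \mathrm{Ne}(i)}\bigr)\Bigr]
```

The sum runs over all 0/1 configurations of the proteins in R(I_r), so its cost grows exponentially with the number of proteins in a component but not with the size of the whole graph.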
Conditional Probability
Mixture model
Slide18
Conditional Probability
Mixture model
Slide19
f1(x): pdf of P(p_i | {z_j})
median
Slide20
Choosing b1 and b2
Seek to maximize the log likelihood of observing the peptide scores.
Slide21
Choosing b1 and b2
It turns out:
Slide22
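To make the idea of choosing b1 and b2 by maximum likelihood concrete, here is a sketch under a strong assumption: the density is taken to be a two-level step function on [0, 1], with height b1 below a breakpoint m and height b2 at or above it. This stand-in form, the function names, and the scores are all hypothetical and should not be read as the authors' f1.

```python
import math


def fit_two_level_density(scores, m):
    """Closed-form MLE of (b1, b2) for the ASSUMED density
    f(x) = b1 if x < m else b2 on [0, 1].

    The normalisation b1*m + b2*(1-m) = 1 makes the maximiser the fraction
    of scores on each side divided by the width of that side.
    """
    if not 0.0 < m < 1.0:
        raise ValueError("breakpoint m must lie strictly between 0 and 1")
    n = len(scores)
    n_low = sum(1 for x in scores if x < m)
    b1 = n_low / (n * m)
    b2 = (n - n_low) / (n * (1.0 - m))
    return b1, b2


def log_likelihood(scores, m, b1, b2):
    """Log likelihood of the scores under the assumed two-level density.

    Requires b1 > 0 and b2 > 0 wherever scores fall.
    """
    return sum(math.log(b1 if x < m else b2) for x in scores)


# Hypothetical peptide scores and breakpoint.
scores = [0.2, 0.9, 0.95, 0.8, 0.99]
m = 0.5
b1, b2 = fit_two_level_density(scores, m)
ll_fitted = log_likelihood(scores, m, b1, b2)
ll_uniform = log_likelihood(scores, m, 1.0, 1.0)
```

The fitted heights put more mass above the breakpoint because four of the five scores lie there, and the fitted log likelihood is at least as large as that of the uniform density (b1 = b2 = 1), as expected for a maximum likelihood estimate.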
Conditional Protein Probabilities
Slide23
Conditional Protein Probabilities
Slide24
Conditional Protein Probabilities (NEC Correction)
Slide25
Conditional Protein Probabilities
Slide26
Conditional Protein Probabilities
Slide27
Conditional Protein Probabilities
Slide28
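For a small connected component, the conditional protein probabilities from the preceding slides can be computed by brute-force enumeration of all protein presence/absence configurations. The sketch below makes several assumptions that are not taken from the slides: a peptide's score density depends only on whether at least one neighboring protein is present, the densities `f0` and `f1` are simple linear stand-ins, and the scores and priors are invented.

```python
from itertools import product


def protein_posteriors(scores, ne, prior, f0, f1):
    """Posterior P(Z_j = 1 | scores) for every protein in one component.

    scores: {peptide: p_i}
    ne:     {peptide: set of neighbouring proteins}
    prior:  {protein: P(Z_j = 1)}, treated as independent
    f0/f1:  ASSUMED score densities when no / at least one neighbouring
            protein is present
    """
    proteins = sorted(prior)
    total = 0.0
    present = {j: 0.0 for j in proteins}
    for config in product([0, 1], repeat=len(proteins)):
        z = dict(zip(proteins, config))
        # Independent priors on the proteins.
        weight = 1.0
        for j in proteins:
            weight *= prior[j] if z[j] else 1.0 - prior[j]
        # Peptide scores are independent given their neighbouring proteins.
        for i, p_i in scores.items():
            any_present = any(z[j] for j in ne[i])
            weight *= f1(p_i) if any_present else f0(p_i)
        total += weight
        for j in proteins:
            if z[j]:
                present[j] += weight
    return {j: present[j] / total for j in proteins}


# Hypothetical component: pep1 is unique to A, pep2 is shared by A and B.
scores = {"pep1": 0.9, "pep2": 0.8}
ne = {"pep1": {"A"}, "pep2": {"A", "B"}}
prior = {"A": 0.5, "B": 0.5}
posteriors = protein_posteriors(
    scores, ne, prior,
    f0=lambda x: 2.0 * (1.0 - x),  # stand-in: low scores more likely
    f1=lambda x: 2.0 * x,          # stand-in: high scores more likely
)
```

In this toy case protein A ends up with a much higher posterior than protein B: the shared peptide is already explained by A, which also has a high-scoring unique peptide, so B gains little beyond its prior.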
Shared Peptides
Slide29
Shared Peptides
Slide30
Shared Peptides
If the shared peptide has p_i ≥ median
Slide31
Shared Peptides
If the shared peptide has p_i < median
Slide32
Gene Model Inference
Slide33
Gene Model Inference
Assume a gene model, X, has only protein sequences which belong to the same connected component.
Peptide 1
Peptide 2
Peptide 3
Peptide 4
Protein A
Protein B
Gene X
Slide34
Gene Model Inference
Assume a gene model, X, has only protein sequences which belong to the same connected component.
R(X) is the set of proteins with edges to X.
I_r(X) is the set of peptides with edges to proteins with edges to X.
Slide35
Gene Model Inference
Gene model, X, has proteins from different connected components of the peptide-protein graph.
Peptide 1
Peptide 2
Peptide 3
Peptide 4
Protein A
Protein B
Gene X
Slide36
Gene Model Inference
Gene model, X, has proteins from different connected components of the peptide-protein graph.
R_l(X) is the set of proteins with edges to X in component l.
I_l(X) is the set of peptides with edges to proteins with edges to X in component l.
Slide37
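One way to turn the two gene model cases into formulas is to assume that a gene model X counts as present exactly when at least one of its proteins is present. That definition is an assumption made here for illustration; the conditioning sets follow the notation of the preceding slides. For a gene model inside a single connected component:

```latex
P\bigl(X = 1 \mid \{p_i\}_{i \in I_r(X)}\bigr)
  = 1 - P\bigl(Z_j = 0 \;\; \forall j \in R(X) \mid \{p_i\}_{i \in I_r(X)}\bigr)
```

When the proteins of X are spread over several components, the independence of connected components lets the "all absent" probability factor over the components l:

```latex
P\bigl(X = 1 \mid \{p_i\}\bigr)
  = 1 - \prod_{l} P\bigl(Z_j = 0 \;\; \forall j \in R_l(X) \mid \{p_i\}_{i \in I_l(X)}\bigr)
```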
Datasets
Mixture of 18 purified proteins
Mixture of 49 proteins (Sigma49)
Drosophila melanogaster
Saccharomyces cerevisiae (~4200 proteins)
Arabidopsis thaliana (~4580 gene models)
Slide38
Comparisons with other tools
Small datasets with a known answer
Mix of 18 proteins
Sigma49
Slide39
Sigma49
Comparisons with other tools
One hit wonders
Sigma49, no one hit wonders
Slide40
Comparison with other tools
The Arabidopsis thaliana dataset has many proteins with high sequence similarity.
Slide41
Splice isoforms
Slide42
Conclusion + Criticism
Developed a model for protein and gene model inference.
Comparisons with other tools do not justify the complexity:
A small FP rate at the expense of many FNs is not valued equally in all applications.
Discards some useful information, such as the number of spectra per peptide.
Assumptions of parsimony from pruning may be too aggressive.