Protein and gene model inference based on statistical model

Uploaded On 2016-06-15

Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann. Problem description: given peptides and scores/probabilities, infer the set of proteins present in the sample.

Presentation Transcript

Slide1

Protein and gene model inference based on statistical modeling in k-partite graphs

Sarah Gerster, Ermir Qeli, Christian H. Ahrens, and Peter Bühlmann

Slide2

Problem Description

Given peptides and scores/probabilities, infer the set of proteins present in the sample.

[Figure: example peptides PERFGKLMQK, MLLTDFSSAWCR, FFRDESQINNR, TGYIPPPLJMGKR and candidate proteins Protein A, Protein B, Protein C]

Slide3

Previous Approaches

N-peptides rule

ProteinProphet (Nesvizhskii et al. 2003. Anal Chem)

  Assumes peptide scores are correct.

Nested mixture model (Li et al. 2010. Ann Appl Statist)

  Rescores peptides while doing the protein inference.

  Does not allow shared peptides.

  Peptide scores are independent.

Hierarchical statistical model (Shen et al. 2008. Bioinformatics)

  Allows for shared peptides.

  Assumes PSM scores for the same peptide are independent.

  Impractical on normal datasets.

MSBayesPro (Li et al. 2009. J Comput Biol)

  Uses peptide detectabilities to determine peptide priors.

Slide4

Markovian Inference of Proteins and Gene Models (MIPGEM)

Inclusion of shared/degenerate peptides in the model.

Treats peptide scores/probabilities as random values

Model allows dependence of peptide scores.

Inference of gene models

Slide5

Why scores as random values?

[Figure: example peptides PERFGKLMQK, MLLTDFSSAWCR, FFRDESQINNR, TGYIPPPLJMGKR and candidate proteins Protein A, Protein B, Protein C]

Slide6

Building the bipartite graph

Slide7

Shared peptides

Slide8

Definitions

Let p_i be the score/probability of peptide i. I is the set of all peptides.

Let Z_j be the indicator variable for protein j. J is the set of all proteins.

Slide9

Simple Probability Rules

Slide10

Bayes Rule
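In the p_i and Z_j notation of the Definitions slide, Bayes' rule for a single protein j can be sketched as follows. Conditioning on the full set of peptide scores is an assumption about the slide's intent. The labels that follow name the prior P(Z_j = 1), the denominator, and the likelihood, in that order.

```latex
P\bigl(Z_j = 1 \mid \{p_i\}_{i \in I}\bigr)
  = \frac{P\bigl(\{p_i\}_{i \in I} \mid Z_j = 1\bigr)\, P(Z_j = 1)}
         {P\bigl(\{p_i\}_{i \in I}\bigr)}
```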

Prior probability on the protein being present

Joint probability of seeing these peptide scores

Probability of observing these peptide scores given that the protein is present

Slide11

Assumptions

Prior probabilities of proteins are independent

Dependencies can be included with a little more effort.

This does not mean that proteins are independent.

Slide12

Assumptions

Connected components are independent

Slide13

Assumptions

Peptide scores are independent given their neighboring proteins.
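A minimal Python sketch of what "neighboring proteins" and "connected components" mean on a toy peptide-protein graph. The edges in PEPTIDE_TO_PROTEINS are hypothetical (the slides do not say which peptide maps to which protein), and the function names are illustrative only, not taken from the authors' software.

```python
from collections import defaultdict

# Hypothetical toy edges (peptide -> proteins containing it).
# The slides do not state the real mapping; this is for illustration.
PEPTIDE_TO_PROTEINS = {
    "PERFGKLMQK": {"Protein A"},
    "MLLTDFSSAWCR": {"Protein A", "Protein B"},  # a shared peptide
    "FFRDESQINNR": {"Protein B"},
    "TGYIPPPLJMGKR": {"Protein C"},
}


def neighbors(peptide, edges=PEPTIDE_TO_PROTEINS):
    """Ne(i): the set of proteins connected to peptide i."""
    return set(edges[peptide])


def connected_components(edges=PEPTIDE_TO_PROTEINS):
    """Return one (I_r, R(I_r)) pair per connected component."""
    # Invert the mapping so we can walk protein -> peptides as well.
    protein_to_peptides = defaultdict(set)
    for pep, prots in edges.items():
        for prot in prots:
            protein_to_peptides[prot].add(pep)

    seen = set()
    components = []
    for start in sorted(edges):  # sorted for a deterministic order
        if start in seen:
            continue
        peptides, proteins = set(), set()
        stack = [start]
        seen.add(start)
        while stack:  # depth-first walk over the bipartite graph
            pep = stack.pop()
            peptides.add(pep)
            for prot in edges[pep]:
                proteins.add(prot)
                for other in protein_to_peptides[prot]:
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
        components.append((peptides, proteins))
    return components
```

With these toy edges, the shared peptide ties Protein A and Protein B into one component, while Protein C forms its own. Under the independence assumption, each component can be handled separately.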

Ne(i) is the set of proteins connected to peptide i in the graph.

I_r is the set of peptides belonging to the rth connected component.

R(I_r) is the set of proteins connected to peptides in I_r.

Slide14

Assumptions

Conditional peptide probabilities are modeled by a mixture model.

The specific mixture model the authors use is based on the peptide scores used (from PeptideProphet).

Slide15

Bayes Rule

Prior probability on the protein being present

Joint probability of seeing these peptide scores

Probability of observing these peptide scores given that the protein is present

Slide16

Joint peptide score distribution

Assumption: peptides in different components are independent
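A sketch of how the stated assumptions combine, not necessarily the slide's exact expression. Independence of components gives the product over r. Within a component, independent protein priors and conditionally independent peptide scores give the sum over all presence/absence configurations z_j of the proteins in R(I_r). Ne(i) is the set of proteins connected to peptide i.

```latex
P\bigl(\{p_i\}_{i \in I}\bigr) = \prod_{r} P\bigl(\{p_i\}_{i \in I_r}\bigr),
\qquad
P\bigl(\{p_i\}_{i \in I_r}\bigr)
  = \sum_{\{z_j\}_{j \in R(I_r)}}
    \Bigl[\prod_{j \in R(I_r)} P(Z_j = z_j)\Bigr]
    \Bigl[\prod_{i \in I_r} P\bigl(p_i \mid \{z_j\}_{j \in Ne(i)}\bigr)\Bigr]
```

The sum has 2^{|R(I_r)|} terms, so the cost is driven by the largest connected component rather than by the total number of proteins.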

I_r is the set of peptides in component r.

R(I_r) is the set of proteins connected to peptides in I_r.

Slide17

Conditional Probability

Mixture model

Slide18

Conditional Probability

Mixture model

Slide19

[Figure: f1(x), the pdf of P(p_i | {z_j}), with the median marked]

Slide20

Choosing b1 and b2

Seek to maximize the log likelihood of observing the peptide scores.

Slide21

Choosing b1 and b2

It turns out:

Slide22

Conditional Protein Probabilities

Slide23

Conditional Protein Probabilities

Slide24

Conditional Protein Probabilities (NEC Correction)

Slide25

Conditional Protein Probabilities

Slide26

Conditional Protein Probabilities

Slide27

Conditional Protein Probabilities

Slide28

Shared Peptides

Slide29

Shared Peptides

Slide30

Shared Peptides

If the shared peptide has p_i ≥ median

Slide31

Shared Peptides

If the shared peptide has p_i < median

Slide32

Gene Model Inference

Slide33

Gene Model Inference

Assume a gene model, X, has only protein sequences which belong to the same connected component.

[Figure: Peptide 1, Peptide 2, Peptide 3, Peptide 4; Protein A, Protein B; Gene X]

Slide34

Gene Model Inference

Assume a gene model, X, has only protein sequences which belong to the same connected component.
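One way to write the resulting gene model probability, under the assumption (not stated explicitly in the transcript) that a gene model counts as present when at least one of its proteins is present. R(X) and I_r(X) are the sets defined in the lines that follow.

```latex
P\bigl(X = 1 \mid \{p_i\}_{i \in I_r(X)}\bigr)
  = 1 - P\bigl(Z_j = 0 \;\; \forall j \in R(X) \mid \{p_i\}_{i \in I_r(X)}\bigr)
```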

R(X) is the set of proteins with edges to X.

I_r(X) is the set of peptides with edges to proteins with edges to X.

Slide35

Gene Model Inference

Gene model, X, has proteins from different connected components of the peptide-protein graph.

[Figure: Peptide 1, Peptide 2, Peptide 3, Peptide 4; Protein A, Protein B; Gene X]

Slide36

Gene Model Inference

Gene model, X, has proteins from different connected components of the peptide-protein graph.
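A sketch of how this case could factorize, again assuming a gene model counts as present when at least one of its proteins is present. Because connected components are assumed independent, the probability that every protein of X is absent becomes a product over the components l that X touches. R_l(X) and I_l(X) are the sets defined in the lines that follow.

```latex
P\bigl(X = 1 \mid \{p_i\}\bigr)
  = 1 - \prod_{l} P\bigl(Z_j = 0 \;\; \forall j \in R_l(X) \mid \{p_i\}_{i \in I_l(X)}\bigr)
```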

R_l(X) is the set of proteins with edges to X in component l.

I_l(X) is the set of peptides with edges to proteins with edges to X in component l.

Slide37

Datasets

Mixture of 18 purified proteins

Mixture of 49 proteins (Sigma49)

Drosophila melanogaster

Saccharomyces cerevisiae (~4200 proteins)

Arabidopsis thaliana (~4580 gene models)

Slide38

Comparisons with other tools

Small datasets with a known answer

Mix of 18 proteins

Sigma49

Slide39

Sigma49

Comparisons with other tools

One hit wonders

Sigma49 no one hit wonders

Slide40

Comparison with other tools

The Arabidopsis thaliana dataset has many proteins with high sequence similarity.

Slide41

Splice isoforms

Slide42

Conclusion + Criticism

Developed a model for protein and gene model inference.

Comparisons with other tools do not justify the added complexity:

The value of a small false-positive (FP) rate at the expense of many false negatives (FN) is not shared by all applications.

Discards some useful information, such as the number of spectra per peptide.

Assumptions of parsimony from pruning may be too aggressive.