Expectation Maximization Meets Sampling in Motif Finding


Presentation Transcript

Slide1

Expectation Maximization Meets Sampling in Motif Finding

Zhizhuo Zhang

Slide2

Outline

Review of Mixture Model and EM algorithm

Importance Sampling

Re-sampling EM

Extending EM

Integrate Other Features

Result

Slide3

Review Motif Finding: Mixture modeling

Given a dataset $X$, a motif model $\theta$, and a background model $\theta_0$, the likelihood of the observed data is defined as

$$P(X \mid \theta, \theta_0, \lambda) = \prod_{i=1}^{N}\left[\lambda\, P(X_i \mid \theta) + (1-\lambda)\, P(X_i \mid \theta_0)\right],$$

where $\lambda$ is the prior probability that a site is a binding site. Optimizing this likelihood directly is NP-hard; the EM algorithm solves the problem through the concept of missing data. The missing datum $Z_i$ is a Boolean flag indicating whether site $i$ is a binding site:

Motif component: $\lambda\, P(X_i \mid \theta)$

Background component: $(1-\lambda)\, P(X_i \mid \theta_0)$

Slide4

Review Motif Finding: EM

E-step: compute the expected value of each missing flag given the current parameters:

$$Z_i^{(t)} = \frac{\lambda^{(t)}\, P(X_i \mid \theta^{(t)})}{\lambda^{(t)}\, P(X_i \mid \theta^{(t)}) + (1-\lambda^{(t)})\, P(X_i \mid \theta_0)}$$

M-step: re-estimate the parameters from the $Z$-weighted data:

$$\lambda^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} Z_i^{(t)}, \qquad \theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} Z_i^{(t)} \log P(X_i \mid \theta)$$
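A minimal sketch of this EM loop in Python (NumPy). The uniform 0-order background, the random Dirichlet start, and the pseudocounts are illustrative assumptions, not necessarily the presentation's exact choices.

```python
import numpy as np

BASES = "ACGT"

def em_motif(sites, W, iters=50, lam=0.1, pseudo=0.5):
    """Two-component mixture EM over length-W sites.
    theta: W x 4 PWM; background: uniform 0-order model."""
    X = np.array([[BASES.index(b) for b in s] for s in sites])  # N x W
    N = len(X)
    theta = np.random.default_rng(0).dirichlet(np.ones(4), size=W)  # random start
    log_bg = W * np.log(0.25)              # uniform background log-likelihood
    for _ in range(iters):
        # E-step: posterior P(Z_i = 1 | X_i) under the current parameters
        log_motif = np.log(theta)[np.arange(W), X].sum(axis=1)
        p_motif = lam * np.exp(log_motif)
        z = p_motif / (p_motif + (1.0 - lam) * np.exp(log_bg))
        # M-step: mixing weight and PWM from Z-weighted base counts
        lam = z.mean()
        counts = np.full((W, 4), pseudo)
        for j in range(W):
            np.add.at(counts[j], X[:, j], z)
        theta = counts / counts.sum(axis=1, keepdims=True)
    return theta, lam
```

Slide5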

Pros and Cons

Pros:

Pure Probabilistic Modeling

EM is a well-known method

The complexity of each iteration is linear

Cons:

In each iteration, it examines all sites (most of which are background sites)

EM is sensitive to its starting condition

The motif length is assumed to be given

Slide6

Sampling Idea (1)

Simple example: 20 A's and 10 B's:

AAAAAAAAAAAAAAAAAAAABBBBBBBBBB

Let's define a sampling function Q(x), where Q(x)=1 when x is sampled, e.g. P(Q(A)=1)=0.1 and P(Q(B)=1)=0.2.

The sampled data might be: AABB

We can recover the original counts from "AABB":

2 A's in sample / 0.1 = 20 A's in original

2 B's in sample / 0.2 = 10 B's in original
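A minimal sketch of this count-recovery idea in Python, using the toy symbols and rates from the slide:

```python
from collections import Counter
import random

def sample(data, q):
    """Keep each item x independently with probability q[x]."""
    return [x for x in data if random.random() < q[x]]

def recover_counts(sampled, q):
    """Importance-sampling estimate: divide each observed count by its
    sampling probability to estimate the count in the original data."""
    return {x: n / q[x] for x, n in Counter(sampled).items()}

data = "A" * 20 + "B" * 10
q = {"A": 0.1, "B": 0.2}
print(recover_counts(sample(data, q), q))  # on average: {'A': 20.0, 'B': 10.0}
```

Slide7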

Sampling Idea (2)

Almost any sampling function can recover the statistics of the original data, which is known as "importance sampling".

We can define a good sampling function on the sequence data that prefers to sample binding sites over background sites.

Owing to the motif model's greater parameter complexity, it needs more samples than the background model to achieve the same level of accuracy.

Slide8

Re-sampling EM

Given a sampling function $Q(\cdot)$ and sampled data $X_Q$:

E-step: the same as in the original EM.

M-step: each sampled site's contribution is reweighted by $1/Q(X_i)$ so that the estimates recover the statistics of the full data:

$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{i \in X_Q} \frac{Z_i^{(t)}}{Q(X_i)} \log P(X_i \mid \theta)$$
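A sketch of how such a reweighted M-step could look in Python, reusing the PWM-by-weighted-counts update from the earlier EM sketch; the array names and the $\lambda$ estimate are illustrative assumptions, not the presentation's exact formulation.

```python
import numpy as np

def resampling_em_mstep(X_q, z, q, W, pseudo=0.5):
    """M-step on sampled sites only. X_q: N x W integer-coded sites,
    z: posterior motif probabilities from the E-step, q: each site's
    sampling probability Q(X_i). Dividing by q makes the weighted
    counts unbiased estimates of the counts over the full dataset."""
    w = z / q                               # importance weights Z_i / Q(X_i)
    total = (1.0 / q).sum()                 # unbiased estimate of total #sites
    lam = w.sum() / total                   # unbiased mixing-weight estimate
    counts = np.full((W, 4), pseudo)
    for j in range(W):
        np.add.at(counts[j], X_q[:, j], w)  # reweighted base counts per column
    theta = counts / counts.sum(axis=1, keepdims=True)
    return theta, lam
```

Slide9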

Re-sampling EM

Slide10

Re-sampling EM

How do we find a good sampling function?

Intuitively, the motif PWM itself is the natural sampling function, but we cannot know the motif PWM beforehand.

Fortunately, an approximate PWM model already does a good job in practice.

Slide11

How to find a good approximate PWM?

Unknown length

Unknown distribution

Slide12

Extending EM

Start from all over-represented 5-mers (see the sketch below).

As before, we seek a motif model (PWM) containing the given 5-mer that maximizes the likelihood of the observed data.

We define an extending EM process that optimizes which flanking columns are included in the final PWM.
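As a sketch of the seeding step, under the simple assumption that "over-represented" means a high observed/expected ratio against a 0-order background (the presentation's exact criterion may differ):

```python
from collections import Counter

def overrepresented_kmers(seqs, k=5, min_ratio=2.0):
    """Score every k-mer by observed count / expected count under a
    0-order (single-base frequency) background; return those whose
    ratio exceeds min_ratio, most over-represented first."""
    kmers, base, total_pos = Counter(), Counter(), 0
    for s in seqs:
        base.update(s)
        for i in range(len(s) - k + 1):
            kmers[s[i:i + k]] += 1
            total_pos += 1
    n = sum(base.values())
    freq = {b: c / n for b, c in base.items()}
    ratio = {}
    for kmer, obs in kmers.items():
        p = 1.0
        for b in kmer:
            p *= freq[b]                    # expected k-mer probability
        if obs / (p * total_pos) >= min_ratio:
            ratio[kmer] = obs / (p * total_pos)
    return sorted(ratio, key=ratio.get, reverse=True)
```

Slide13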

Extending EM

Imagine we have a length-25 PWM $\theta$ with the 5-mer q = "ACTTG" in the middle (columns 11-15), which is wide enough to target any motif of up to 15 bp ($W_{max}$): with ten uniform flanking columns on each side, the 5-mer can sit anywhere within a motif of that length.

Pos  1     2     …   10    11    12    13    14    15    16    …   24    25
A    0.25  0.25  …   0.25  1     0     0     0     0     0.25  …   0.25  0.25
C    0.25  0.25  …   0.25  0     1     0     0     0     0.25  …   0.25  0.25
G    0.25  0.25  …   0.25  0     0     0     0     1     0.25  …   0.25  0.25
T    0.25  0.25  …   0.25  0     0     1     1     0     0.25  …   0.25  0.25
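A small sketch of building such a seeded PWM in Python; seed_pwm is a hypothetical helper, and the hard 0/1 columns mirror the table above, though in practice one might smooth them with pseudocounts:

```python
import numpy as np

BASES = "ACGT"

def seed_pwm(kmer, flank=10):
    """Build a (2*flank + len(kmer)) x 4 PWM: uniform 0.25 flanking
    columns around a deterministic encoding of the seed k-mer."""
    W = 2 * flank + len(kmer)
    pwm = np.full((W, 4), 0.25)
    for j, b in enumerate(kmer):
        pwm[flank + j] = 0.0
        pwm[flank + j, BASES.index(b)] = 1.0
    return pwm

pwm = seed_pwm("ACTTG")  # 25 x 4; "ACTTG" occupies columns 11-15 (1-based)
```

Slide14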

Extending EM

We use two indices to maintain the start and end of the real motif within the PWM.

Slide15

Extending EM

The M-step is the same as in the original EM, but we also need to decide which columns to include.

The increase in log-likelihood from including column $j$ determines whether it is kept; see the sketch below.
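The slide's formula is not reproduced in this transcript; the sketch below shows one plausible scoring, stated purely as an assumption: column $j$ is kept when the $Z$-weighted log-likelihood gain of its learned distribution over the background is positive (or exceeds a threshold).

```python
import numpy as np

def column_gain(X, z, theta_j, bg, j):
    """Assumed scoring, not the slide's exact formula: Z-weighted
    log-likelihood gain from modeling column j with the learned
    distribution theta_j (length 4, smoothed so no entry is zero)
    instead of the background distribution bg."""
    b = X[:, j]                 # base index at column j for every site
    return float(np.sum(z * (np.log(theta_j[b]) - np.log(bg[b]))))
```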

Slide16

Consider other features in EM

Other features

Positional Bias

Strand Bias

Sequence Rank Bias

We integrate them into the mixture model

New likelihood ratio

A Boolean variable determines whether each feature is included

Slide17

Consider other features in EM

If the feature data are modeled as multinomial, a chi-square test is used to decide whether a feature should be included.

The multinomial parameters φ can also be learned in the M-step.
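A sketch of such a test using SciPy; binning a feature (e.g., position within the sequence) into multinomial categories and testing against a uniform null are assumed details, not necessarily the presentation's setup:

```python
import numpy as np
from scipy.stats import chisquare

def feature_is_informative(counts, null_probs, alpha=0.05):
    """Chi-square goodness-of-fit test: if the feature's observed
    multinomial counts deviate significantly from the uninformative
    null distribution, the feature is worth including."""
    expected = np.asarray(null_probs) * np.sum(counts)
    _, p = chisquare(counts, f_exp=expected)
    return p < alpha

# e.g. motif-site counts over 5 equal position bins vs. a uniform null
print(feature_is_informative([40, 25, 10, 8, 7], [0.2] * 5))
```

Slide18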

All together

Slide19

PWM Model

Position Prior Model

Peak Rank Prior Model

Slide20

Simulation Result

Slide21

Simulation Result

Slide22

Real Data Result

163 ChIP-seq datasets

Compared with 6 popular motif finders

Half for training, half for testing

Slide23

Real Data Result

De novo AP1 Model

De novo FOXA1 Model

De novo ER Model

Slide24

Conclusion

SEME can:

perform EM on biased sampled data while estimating the parameters unbiasedly

vary the PWM size during the EM procedure by starting from a short 5-mer

automatically learn and select other feature information during the EM iterations

Slide25