Slide1
Expectation Maximization Meets Sampling in Motif Finding
Zhizhuo Zhang
Slide2
Outline
Review of Mixture Model and EM algorithm
Importance Sampling
Re-sampling EM
Extending EM
Integrate Other Features
Result
Slide3
Review Motif Finding: Mixture modeling
Given a dataset X, a motif model θ, and a background model θ0, the likelihood of the observed data X (with mixing weight λ) is defined as:
P(X | θ, θ0, λ) = ∏i [ λ·P(xi | θ) + (1−λ)·P(xi | θ0) ]
Optimizing this likelihood directly is NP-hard; the EM algorithm solves the problem with the concept of missing data. Assume the missing data Zi is a Boolean flag marking whether site i is a binding site:
Motif Component
Background Component
Slide4
Review Motif Finding: EM
E-step: compute the posterior Zi = λ·P(xi | θ) / [ λ·P(xi | θ) + (1−λ)·P(xi | θ0) ]
M-step: re-estimate θ and λ from the expected counts given the Zi
Slide5
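The E- and M-steps above can be sketched in code. This is a minimal illustration of the two-component mixture EM, not the actual implementation behind the slides; the toy sites, the pseudocount, and all names are my own assumptions.

```python
BASES = "ACGT"

def site_prob(site, pwm):
    """P(site | motif PWM): product of per-column base probabilities."""
    p = 1.0
    for j, b in enumerate(site):
        p *= pwm[j][BASES.index(b)]
    return p

def bg_prob(site, bg):
    """P(site | background): product of one shared base distribution."""
    p = 1.0
    for b in site:
        p *= bg[BASES.index(b)]
    return p

def em_step(sites, pwm, bg, lam, pseudo=0.1):
    # E-step: posterior Z_i that site i is a motif occurrence
    z = []
    for s in sites:
        pm = lam * site_prob(s, pwm)
        pb = (1.0 - lam) * bg_prob(s, bg)
        z.append(pm / (pm + pb))
    # M-step: re-estimate each PWM column from expected base counts
    counts = [[pseudo] * 4 for _ in range(len(pwm))]
    for zi, s in zip(z, sites):
        for j, b in enumerate(s):
            counts[j][BASES.index(b)] += zi
    new_pwm = [[c / sum(col) for c in col] for col in counts]
    return new_pwm, sum(z) / len(z)

# Toy run: 8 planted "ACGT" sites among background-like sites
sites = ["ACGT"] * 8 + ["TTTT", "GGGG"]
pwm = [[0.25] * 4 for _ in range(4)]
bg = [0.25] * 4
lam = 0.5
for _ in range(20):
    pwm, lam = em_step(sites, pwm, bg, lam)
```

After a few iterations the first PWM column converges toward 'A' and λ toward the planted motif fraction of 0.8.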
Pros and Cons
Pros:
Pure Probabilistic Modeling
EM is a well-known method
The complexity of each iteration is linear
Cons:
In each iteration, it examines all the sites (most are background sites)
EM is sensitive to its starting condition
The length of the motif is assumed to be given
Slide6
Sampling Idea (1)
Simple example: 20 A's and 10 B's
AAAAAAAAAAAAAAAAAAAABBBBBBBBBB
Let's define a sampling function Q(x), with Q(x) = 1 when x is sampled, e.g.:
P(Q(A)=1) = 0.1, P(Q(B)=1) = 0.2
The sampled data may be: AABB
We can recover the original counts from "AABB":
2 A's in sample / 0.1 = 20 A's in original; 2 B's in sample / 0.2 = 10 B's in original
Slide7
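The recovery trick above is easy to check numerically. A small sketch (the trial count and seed are arbitrary choices of mine): draw the biased sample many times, reweight every kept symbol by 1/Q(x), and the averaged estimates land back on the true counts of 20 A's and 10 B's.

```python
import random

random.seed(0)
data = "A" * 20 + "B" * 10
q = {"A": 0.1, "B": 0.2}             # sampling probabilities P(Q(x)=1)

# Average the 1/Q-reweighted counts over many independent samples
trials = 2000
tot = {"A": 0.0, "B": 0.0}
for _ in range(trials):
    for x in data:
        if random.random() < q[x]:   # x survives the biased sampling
            tot[x] += 1.0 / q[x]     # reweight to undo the bias
avg = {k: v / trials for k, v in tot.items()}
# avg["A"] is close to 20, avg["B"] is close to 10
```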
Sampling Idea (2)
Almost any sampling function can recover the statistics of the original data; this is known as "importance sampling".
We can define a good sampling function on the sequence data, one that prefers to sample binding sites over background sites.
Because of its greater parameter complexity, the motif model needs more samples than the background model to achieve the same level of accuracy.
Slide8
Re-sampling EM
Given a sampling function Q(·) and the sampled data X_Q:
E-step: the same as in the original EM
M-step: like the original EM, but each sampled site's expected counts are divided by its sampling probability Q(xi), so the estimates remain unbiased
Slide9
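The slides' M-step formula is an image in the original deck, so the following is only a plausible sketch of the importance-weighted update: each sampled site contributes its posterior scaled by 1/Q, the standard importance-sampling correction. The example sites, posteriors, and probabilities are invented for illustration.

```python
BASES = "ACGT"

def weighted_m_step(samples, z, q, pseudo=0.1):
    """M-step on a biased sample: site i contributes its posterior z[i]
    scaled by 1/q[i], so the PWM estimate stays unbiased."""
    width = len(samples[0])
    counts = [[pseudo] * 4 for _ in range(width)]
    for site, zi, qi in zip(samples, z, q):
        w = zi / qi                      # importance weight
        for j, base in enumerate(site):
            counts[j][BASES.index(base)] += w
    return [[c / sum(col) for c in col] for col in counts]

# A likely motif site sampled with prob 0.5, a background-like one with 0.1
pwm = weighted_m_step(["ACGT", "TTTT"], z=[0.9, 0.1], q=[0.5, 0.1])
```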
Re-sampling EM
Slide10
Re-sampling EM
How to find a good sampling function?
Intuitively, the motif PWM itself is the natural sampling function, but it is impossible to know the motif PWM beforehand.
Fortunately, an approximate PWM model can already do a good job in practice.
Slide11
How to find a good approximating PWM?
Unknown length
Unknown distribution
Slide12
Extending EM
Start from all over-represented 5-mers
Similarly, we find a motif model (PWM) containing the given 5-mer that maximizes the likelihood of the observed data.
We define an extending EM process that optimizes which flanking columns are included in the final PWM.
Slide13
Extending EM
Imagine we have a length-25 PWM θ with the 5-mer q = "ACTTG" in the middle, which is wide enough to target any motif shorter than 15 bp (Wmax).

Pos  1     2     …  10    11  12  13  14  15  16    …  24    25
A    0.25  0.25  …  0.25  1   0   0   0   0   0.25  …  0.25  0.25
C    0.25  0.25  …  0.25  0   1   0   0   0   0.25  …  0.25  0.25
G    0.25  0.25  …  0.25  0   0   0   0   1   0.25  …  0.25  0.25
T    0.25  0.25  …  0.25  0   0   1   1   0   0.25  …  0.25  0.25
Slide14
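Constructing such a seed PWM is mechanical. A small sketch (the function name and defaults are my own; the width 25 and seed "ACTTG" match the table above): uniform 0.25 columns everywhere, with indicator columns for the seed 5-mer in the middle.

```python
BASES = "ACGT"

def seed_pwm(kmer, total_width=25):
    """Uniform columns, with the seed k-mer's bases fixed (prob 1) in the
    middle, as in the length-25 'ACTTG' example."""
    pwm = [[0.25] * 4 for _ in range(total_width)]
    start = (total_width - len(kmer)) // 2   # 0-based index 10 = position 11
    for offset, base in enumerate(kmer):
        column = [0.0] * 4
        column[BASES.index(base)] = 1.0
        pwm[start + offset] = column
    return pwm

pwm = seed_pwm("ACTTG")
# pwm[10] (position 11) is [1.0, 0.0, 0.0, 0.0]: the 'A' of "ACTTG"
```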
Extending EM
We use two indices to maintain the start and the end of the real motif PWM.
Slide15
Extending EM
The M-step is the same as in the original EM, but we also need to decide which columns should be included.
The decision is based on the increase in log-likelihood obtained by including column j.
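The slide's gain formula is an image in the original deck; a common way to score this decision, used here as an assumed reconstruction, is the expected log-likelihood-ratio gain of modeling column j with its own distribution instead of the background: sum over bases b of n_jb · log(θ_jb / θ0_b), where n_jb are expected base counts and θ_jb = n_jb / N.

```python
import math

def column_gain(exp_counts, bg):
    """Log-likelihood increase from giving a flanking column its own
    distribution theta (theta_b = n_b / N) instead of the background."""
    n = sum(exp_counts)
    gain = 0.0
    for n_b, bg_b in zip(exp_counts, bg):
        if n_b > 0:
            gain += n_b * math.log((n_b / n) / bg_b)
    return gain

bg = [0.25] * 4
informative = column_gain([90, 4, 3, 3], bg)   # strongly 'A'-biased column
uniform = column_gain([25, 25, 25, 25], bg)    # background-like column: gain 0
```

A column matching the background contributes no gain and is left out; a strongly biased column yields a large positive gain and is included.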
Slide16
Consider other features in EM
Other features
Positional Bias
Strand Bias
Sequence Rank Bias
We integrate them into the mixture model:
A new likelihood ratio
A Boolean variable to determine whether or not to include each feature
Slide17
Consider other features in EM
If the feature data are modeled as a multinomial, a chi-square test is used to decide whether a feature should be included:
The multinomial parameters φ can also be learned in the M-step.
Slide18
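As a concrete sketch of the chi-square decision (the bin counts and the 4-bin positional feature are invented; 7.815 is the standard chi-square critical value for alpha = 0.05 with 3 degrees of freedom):

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic for a multinomial feature."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Positional-bias feature: binding-site counts in 4 equal-width bins.
# Under "no positional bias", each bin expects total/4 sites.
observed = [40, 25, 20, 15]
expected = [sum(observed) / 4.0] * 4

stat = chi_square_stat(observed, expected)   # 14.0 for these counts
CRITICAL_05_DF3 = 7.815                      # alpha = 0.05, df = 4 - 1 = 3
include_feature = stat > CRITICAL_05_DF3     # True: keep the feature
```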
All together
Slide19
PWM Model
Position Prior Model
Peak Rank Prior Model
Slide20
Simulation Result
Slide21
Simulation Result
Slide22
Real Data Result
163 ChIP-seq datasets
Compare with 6 popular motif finders.
Half of the datasets for training, half for testing
Slide23
Real Data Result
De novo AP1 Model
De novo FOXA1 Model
De novo ER Model
Slide24
Conclusion
SEME can:
perform EM on biased sampled data yet estimate parameters without bias
vary the PWM size during the EM procedure by starting with a short 5-mer
automatically learn and select other feature information during EM iterations
Slide25