
Presentation Transcript

Slide 1

A modeling Example: CpG islands in DNA sequences

[Figure: the eight model states A+, C+, G+, T+ and A-, C-, G-, T-]

Slide 2

Methylation & Silencing

One way cells differentiate is methylation: addition of CH3 to C-nucleotides
Silences genes in the region
CG (denoted CpG) often mutates to TG when methylated
In each cell, one copy of X is silenced; methylation plays a role
Methylation is inherited during cell division

Slide 3

Example: CpG Islands

CpG dinucleotides in the genome are frequently methylated
(We write CpG so as not to confuse it with the C-G base pair)
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands

Slide 4

Example: CpG Islands

In CpG islands, CG is more frequent
Other pairs (AA, AG, AT, …) have different frequencies

Question: detect CpG islands computationally

Slide 5

A model of CpG Islands – (1) Architecture

[Figure: two groups of states: A+, C+, G+, T+ (CpG island) and A-, C-, G-, T- (not CpG island)]

Slide 6

A model of CpG Islands – (2) Transitions

How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits only its own nucleotide)
Transition probabilities within CpG islands: established from known CpG islands (training set)
Transition probabilities within other regions: established from known non-CpG islands (training set)

Note: the transitions out of each state add up to one—no room is left for transitions between (+) and (-) states

+     A     C     G     T
A   .180  .274  .426  .120
C   .171  .368  .274  .188
G   .161  .339  .375  .125
T   .079  .355  .384  .182

-     A     C     G     T
A   .300  .205  .285  .210
C   .322  .298  .078  .302
G   .248  .246  .298  .208
T   .177  .239  .292  .292

(each row sums to 1)
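For concreteness, here is a minimal Python sketch (the names and the use of numpy are my own, not from the slides) that encodes the two tables and checks that each row is a probability distribution:

```python
import numpy as np

# Row/column order A, C, G, T, matching the tables above.
# Entry [u, v] is the probability that nucleotide v follows u.
TRANS_PLUS = np.array([     # inside CpG islands (+)
    [.180, .274, .426, .120],
    [.171, .368, .274, .188],
    [.161, .339, .375, .125],
    [.079, .355, .384, .182],
])

TRANS_MINUS = np.array([    # outside CpG islands (-)
    [.300, .205, .285, .210],
    [.322, .298, .078, .302],
    [.248, .246, .298, .208],
    [.177, .239, .292, .292],
])

# Each row must sum to 1 (up to the 3-decimal rounding of the tables).
assert np.allclose(TRANS_PLUS.sum(axis=1), 1.0, atol=0.002)
assert np.allclose(TRANS_MINUS.sum(axis=1), 1.0, atol=0.002)
```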

Slide 7

Log Likelihoods—Telling "CpG Island" from "Non-CpG Island"

Another way to see the effect of the transitions: log-likelihood ratios

L(u, v) = log2[ P(uv | +) / P(uv | -) ]

      A       C       G       T
A   -0.740  +0.419  +0.580  -0.803
C   -0.913  +0.302  +1.812  -0.685
G   -0.624  +0.461  +0.331  -0.730
T   -1.169  +0.573  +0.393  -0.679

Given a region x = x1…xN, a quick-&-dirty way to decide whether the entire x is CpG:

P(x is CpG) > P(x is not CpG)  ⇔  Σi L(xi, xi+1) > 0
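A short sketch of this quick-&-dirty test (function names are hypothetical): sum L(xi, xi+1) over all adjacent pairs and call the region a CpG island if the total is positive. The base-2 logs match the table above (e.g., log2(.274/.078) ≈ +1.812 for C→G).

```python
import numpy as np

NUC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

# L(u, v) = log2[ P(uv | +) / P(uv | -) ], copied from the table above.
LLR = np.array([
    [-0.740, +0.419, +0.580, -0.803],
    [-0.913, +0.302, +1.812, -0.685],
    [-0.624, +0.461, +0.331, -0.730],
    [-1.169, +0.573, +0.393, -0.679],
])

def cpg_score(x: str) -> float:
    """Sum of L(x_i, x_{i+1}) over all adjacent pairs in x."""
    return float(sum(LLR[NUC[u], NUC[v]] for u, v in zip(x, x[1:])))

def is_cpg_island(x: str) -> bool:
    """Quick-&-dirty decision rule: positive total score => CpG island."""
    return cpg_score(x) > 0

print(cpg_score("CGCGCG"), is_cpg_island("CGCGCG"))  # strongly positive
```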

Slide 8

A model of CpG Islands – (2) Transitions

What about transitions between (+) and (-) states? They affect:
Avg. length of a CpG island
Avg. separation between two CpG islands

[Figure: two states X and Y; X stays in X with probability p and moves to Y with probability 1-p; Y stays with probability q and moves to X with probability 1-q]

Length distribution of region X:
P[lX = 1] = 1-p
P[lX = 2] = p(1-p)
…
P[lX = k] = p^(k-1) (1-p)

E[lX] = 1/(1-p)

Geometric distribution, with mean 1/(1-p)
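A tiny simulation (my own sketch, not part of the slides) makes the geometric claim concrete: with self-transition probability p, the average stay in X comes out near 1/(1-p):

```python
import random

def stay_length(p: float, rng: random.Random) -> int:
    """Steps spent in state X: stay with probability p, leave with 1-p."""
    length = 1
    while rng.random() < p:
        length += 1
    return length

rng = random.Random(0)
p = 0.9
mean = sum(stay_length(p, rng) for _ in range(100_000)) / 100_000
print(mean)  # close to 1/(1-p) = 10
```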

Slide 9

Applications of the model

Given a DNA region x, the Viterbi algorithm predicts the locations of CpG islands
Given a nucleotide xi (say xi = A), the Viterbi parse tells whether xi is in a CpG island in the most likely general scenario
The Forward/Backward algorithms can calculate P(xi is in a CpG island) = P(πi = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands: π̂i = argmaxk P(πi = k | x)

Advantage/Disadvantage?
Advantage: each nucleotide is more likely to be called correctly
Disadvantage: the overall parse will be "choppy"—CpG islands too short
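As a sketch of the first application, here is a generic log-space Viterbi (a minimal implementation of my own, not from the slides). For the CpG model the states would be A+ … T-, each emission probability is 1 for the state's own letter and 0 otherwise, and the transition matrix must also include the (+)/(-) switching probabilities discussed on the previous slide:

```python
import math

def viterbi(x, states, init, trans, emit):
    """Most likely state path for observations x.

    init[k]     : P(pi_1 = k)
    trans[k][l] : P(pi_{i+1} = l | pi_i = k)
    emit[k][b]  : P(x_i = b | pi_i = k)
    Computed in log space to avoid underflow; zeros become -inf.
    """
    def lg(p):
        return math.log(p) if p > 0 else float('-inf')

    V = [{k: lg(init[k]) + lg(emit[k][x[0]]) for k in states}]
    back = []
    for b in x[1:]:
        prev, col, ptr = V[-1], {}, {}
        for l in states:
            best = max(states, key=lambda k: prev[k] + lg(trans[k][l]))
            col[l] = prev[best] + lg(trans[best][l]) + lg(emit[l][b])
            ptr[l] = best
        V.append(col)
        back.append(ptr)
    # Trace back from the highest-scoring final state.
    path = [max(states, key=lambda k: V[-1][k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```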

Slide 10

What if a new genome comes?

We just sequenced the porcupine genome
We know CpG islands play the same role in this genome
However, we have no known CpG islands for porcupines
We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model? LEARNING

Slide 11

Learning

Re-estimate the parameters of the model based on training data

Slide 12

Two learning scenarios

1. Estimation when the "right answer" is known
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

2. Estimation when the "right answer" is unknown
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x | θ)

Slide 13

1. When the right answer is known

Given x = x1…xN for which the true path π = π1…πN is known, define:

Akl = # times the k→l transition occurs in π
Ek(b) = # times state k in π emits b in x

We can show that the maximum-likelihood parameters θ (the ones that maximize P(x | θ)) are:

akl = Akl / Σi Aki        ek(b) = Ek(b) / Σc Ek(c)
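A minimal sketch of these counts-and-normalize estimates (the naming is my own; pi is the known state path):

```python
def ml_estimate(x, pi, states, alphabet):
    """Maximum-likelihood a_kl and e_k(b) from x with known state path pi."""
    A = {k: {l: 0.0 for l in states} for k in states}    # A_kl
    E = {k: {b: 0.0 for b in alphabet} for k in states}  # E_k(b)
    for k, l in zip(pi, pi[1:]):
        A[k][l] += 1
    for k, b in zip(pi, x):
        E[k][b] += 1
    a, e = {}, {}
    for k in states:
        n_trans, n_emit = sum(A[k].values()), sum(E[k].values())
        a[k] = {l: A[k][l] / n_trans if n_trans else 0.0 for l in states}
        e[k] = {b: E[k][b] / n_emit if n_emit else 0.0 for b in alphabet}
    return a, e
```

On the casino example of the next slide, ml_estimate("2156123623", "FFFFFFFFFF", "FL", "123456") reproduces aFF = 1, aFL = 0 and the emission frequencies shown there.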

Slide 14

1. When the right answer is known

Intuition: when we know the underlying states, the best estimate is the normalized frequency of transitions & emissions that occur in the training data

Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable
0 probabilities – BAD

Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F

Then:
aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1

Slide 15

Pseudocounts

Solution for small training sets: add pseudocounts

Akl = # times the k→l transition occurs in π + rkl
Ek(b) = # times state k in π emits b in x + rk(b)

rkl, rk(b) are pseudocounts representing our prior belief
Larger pseudocounts → strong prior belief
Small pseudocounts (ε < 1): just to avoid 0 probabilities
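In the sketch from slide 13, pseudocounts just change the starting value of each count from 0 to r, so every denominator is positive and no probability can be exactly 0. A minimal variant (my naming; uniform r for brevity, whereas the slide allows a separate rkl or rk(b) per entry):

```python
def estimate_with_pseudocounts(x, pi, states, alphabet,
                               r_trans=1.0, r_emit=1.0):
    """Like ml_estimate, but every count starts at its pseudocount."""
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for k, l in zip(pi, pi[1:]):
        A[k][l] += 1
    for k, b in zip(pi, x):
        E[k][b] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states}
         for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet}
         for k in states}
    return a, e
```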

Slide 16

Pseudocounts

Example: dishonest casino
We will observe the player for one day, 600 rolls

Reasonable pseudocounts:
r0F = r0L = rF0 = rL0 = 1
rFL = rLF = rFF = rLL = 1
rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)
rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)

The numbers above are arbitrary – assigning priors is an art

Slide 17

2. When the right answer is unknown

We don't know the true Akl, Ek(b)

Idea:
We estimate our "best guess" of what Akl, Ek(b) are
(Or we start with random / uniform values)
We update the parameters of the model based on our guess
We repeat

Slide 18

2. When the right answer is unknown

Starting with our best guess of a model M with parameters θ:
Given x = x1…xN for which the true π = π1…πN is unknown,
we can get to a provably more likely parameter set θ, i.e., a θ that increases the probability P(x | θ)

Principle: EXPECTATION MAXIMIZATION
1. Estimate Akl, Ek(b) in the training data
2. Update θ according to Akl, Ek(b)
3. Repeat 1 & 2, until convergence

Slide 19

Estimating new parameters

To estimate Akl (assume "| θCURRENT" in all formulas below):

At each position i of sequence x, find the probability that transition k→l is used:

P(πi = k, πi+1 = l | x) = [1/P(x)] P(πi = k, πi+1 = l, x1…xN) = Q/P(x)

where
Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
  = P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
  = P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
  = P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
  = bl(i+1) el(xi+1) akl fk(i)

So:
                               fk(i) akl el(xi+1) bl(i+1)
P(πi = k, πi+1 = l | x, θ) = ––––––––––––––––––––––––––––
                                    P(x | θCURRENT)
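In code, the final identity is a single line once the forward and backward tables exist. A sketch with assumed conventions: x is 1-based (x[0] unused), f[i][k] = fk(i) = P(x1…xi, πi = k), and b[i][k] = bk(i) = P(xi+1…xN | πi = k), both computed by the standard forward/backward recursions (not shown here):

```python
def transition_posterior(i, k, l, x, f, b, a, e, px):
    """P(pi_i = k, pi_{i+1} = l | x, theta) for the CURRENT parameters.

    a[k][l] : transition probability, e[l][c] : emission probability,
    px      : P(x | theta), e.g. sum(f[N][k] for k in states).
    """
    return f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
```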

Slide 20

Estimating new parameters

So Akl is the expected number of times the k→l transition is used, given the current θ:

Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi [ fk(i) akl el(xi+1) bl(i+1) / P(x | θ) ]

Similarly,

Ek(b) = [1/P(x | θ)] Σ{i | xi = b} fk(i) bk(i)

[Figure: the sequence x1 ……… xi-1 xi xi+1 xi+2 ……… xN, with fk(i) covering the prefix up to xi in state k, then the transition akl, the emission el(xi+1), and bl(i+1) covering the suffix]
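Continuing the sketch above (same 1-based conventions), the expected counts are plain sums of those posteriors:

```python
def expected_counts(x, states, alphabet, f, b, a, e, px):
    """E-step: expected transition counts A[k][l] and emission counts
    E[k][b] under the current parameters, via the formulas above."""
    N = len(x) - 1                      # x[1..N] hold the symbols
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {c: 0.0 for c in alphabet} for k in states}
    for k in states:
        for l in states:
            A[k][l] = sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
                          for i in range(1, N)) / px
        for c in alphabet:
            E[k][c] = sum(f[i][k] * b[i][k]
                          for i in range(1, N + 1) if x[i] == c) / px
    return A, E
```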

Slide 21

The Baum-Welch Algorithm

Initialization: pick the best guess for the model parameters (or arbitrary ones)

Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b), given θCURRENT
4. Calculate the new model parameters θNEW: akl, ek(b)
5. Calculate the new log-likelihood P(x | θNEW)
   (GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION)

Until P(x | θ) does not change much
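Assembled into a loop, with forward() and backward() as assumed helpers that return the f and b tables (plus P(x | θ)) used in the sketches above:

```python
import math

def baum_welch(x, states, alphabet, a, e, init, tol=1e-4, max_iter=200):
    """EM loop: expected counts (E-step), renormalize (M-step),
    stop when log P(x | theta) stops improving."""
    prev_ll = float('-inf')
    for _ in range(max_iter):
        f, px = forward(x, states, a, e, init)   # assumed helper
        b = backward(x, states, a, e)            # assumed helper
        A, E = expected_counts(x, states, alphabet, f, b, a, e, px)
        # M-step: new parameters are the normalized expected counts.
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in states}
             for k in states}
        e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet}
             for k in states}
        ll = math.log(px)                        # non-decreasing under EM
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return a, e
```

In practice the raw forward/backward values underflow on long sequences, so real implementations scale them per position or work in log space.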

Slide 22

The Baum-Welch Algorithm

Time complexity: (# iterations) × O(K²N)

Guaranteed to increase the log likelihood P(x | θ)
Not guaranteed to find the globally best parameters: converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining

Slide 23

Alternative: Viterbi Training

Initialization: same
Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence

Notes:
Not guaranteed to increase P(x | θ)
Guaranteed to increase P(x | θ, π*)
In general, worse performance than Baum-Welch
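A matching sketch of Viterbi training, reusing the viterbi() sketch from slide 9 and estimate_with_pseudocounts() from slide 15 (both my own constructions):

```python
def viterbi_training(x, states, alphabet, a, e, init, max_iter=100):
    """Iterate: decode the single best path pi*, then re-estimate the
    parameters from hard counts along pi*; stop when pi* stops changing."""
    prev = None
    for _ in range(max_iter):
        path = viterbi(x, states, init, a, e)
        if path == prev:        # convergence: the parse is stable
            break
        a, e = estimate_with_pseudocounts(x, path, states, alphabet)
        prev = path
    return a, e
```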
