Linear-Space Alignment


Presentation Transcript

Slide 1

Linear-Space Alignment

Slide 2

Linear-space alignment

Using 2 columns of space, we can compute, for k = 0…N, F(M/2, k) and F^r(M/2, N – k), PLUS the backpointers.

[Figure: the sequence x = x1…xM is split at column M/2. F aligns the prefix x1…xM/2 against y1…yN running forward; F^r aligns the suffix xM/2+1…xM against y1…yN running in reverse.]

Slide 3

Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + F^r(M/2, N – k). Also, we can trace the path exiting column M/2 from k*.

[Figure: columns 0, 1, …, M/2 and M/2+1, …, M, M+1; the optimal path exits column M/2 at k* and enters column M/2+1 at k*+1.]

Slide 4

Linear-space alignment

Iterate this procedure to the left and right!

[Figure: the two subproblems, of size M/2 × k* on the left and M/2 × (N – k*) on the right.]

Slide 5

Linear-space alignment

Hirschberg's linear-space algorithm:

MEMALIGN(l, l', r, r'):   (aligns xl…xl' with yr…yr')
1. Let h = ⌊(l' – l)/2⌋
2. Find (in time O((l' – l) × (r' – r)), space O(r' – r)) the optimal path, Lh, entering column h – 1 and exiting column h. Let
     k1 = position at column h – 2 where Lh enters
     k2 = position at column h + 1 where Lh exits
3. MEMALIGN(l, h – 2, r, k1)
4. Output Lh
5. MEMALIGN(h + 1, l', k2, r')

Top-level call: MEMALIGN(1, M, 1, N)
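As a concrete illustration of steps 1–5, here is a minimal Python sketch of the same divide-and-conquer idea. This is a hypothetical implementation, not the course's code: the scoring (match +1, mismatch –1, gap –1) and the names nw_score and hirschberg are assumptions for illustration.

    # Minimal sketch of Hirschberg's linear-space alignment.
    # Assumed scoring: match +1, mismatch -1, gap -1 (not specified on the slides).

    def nw_score(x, y, match=1, mismatch=-1, gap=-1):
        """Last row of the Needleman-Wunsch DP matrix, using two rows of space."""
        prev = [j * gap for j in range(len(y) + 1)]
        for i in range(1, len(x) + 1):
            curr = [i * gap] + [0] * len(y)
            for j in range(1, len(y) + 1):
                s = match if x[i - 1] == y[j - 1] else mismatch
                curr[j] = max(prev[j - 1] + s,    # align x[i-1] with y[j-1]
                              prev[j] + gap,      # gap in y
                              curr[j - 1] + gap)  # gap in x
            prev = curr
        return prev

    def hirschberg(x, y):
        """Optimal global alignment of x and y using O(len(y)) working space."""
        if not x:
            return "-" * len(y), y
        if not y:
            return x, "-" * len(x)
        if len(x) == 1:
            # Base case: place the single character opposite its best position in y.
            j = max(0, y.find(x))
            return "-" * j + x + "-" * (len(y) - j - 1), y
        mid = len(x) // 2
        f = nw_score(x[:mid], y)                    # forward scores F(mid, k)
        r = nw_score(x[mid:][::-1], y[::-1])[::-1]  # reverse scores F^r(mid, N-k)
        k = max(range(len(y) + 1), key=lambda j: f[j] + r[j])  # split point k*
        xl, yl = hirschberg(x[:mid], y[:k])
        xr, yr = hirschberg(x[mid:], y[k:])
        return xl + xr, yl + yr

Note that nw_score is exactly the "2 columns of space" computation from Slide 2, and k is the split point k* from Slide 3.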

Slide 6

Linear-space alignment

Time and space analysis of Hirschberg's algorithm:

To compute the optimal path at the middle column, for a box of size M × N:
  Space: 2N
  Time: cMN, for some constant c

Then the left and right calls cost c(M/2 × k* + M/2 × (N – k*)) = cMN/2.

All recursive calls together:
  Total time: cMN + cMN/2 + cMN/4 + … = 2cMN = O(MN)
  Total space: O(N) for computation, O(N + M) to store the optimal alignment
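Written out, the total time is a geometric series: at recursion depth d the subproblem rectangles are disjoint, so their total area is at most MN/2^d. In LaTeX:

    T(M,N) \;\le\; \sum_{d=0}^{\infty} \frac{cMN}{2^d}
           \;=\; cMN \cdot \frac{1}{1 - \frac{1}{2}}
           \;=\; 2cMN \;=\; O(MN)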

Slide 7

Heuristic Local Aligners

- The basic indexing & extension technique
- Indexing: techniques to improve sensitivity
  - Pairs of words, patterns
- Systems for local alignment

Slide 8

Indexing-based local alignment

- Dictionary: all words of length k (~10)
- Alignment initiated between words with alignment score ≥ T (typically T = k)
- Alignment: ungapped extensions until the score drops below a statistical threshold
- Output: all local alignments with score > statistical threshold

[Figure: the query is scanned against the database (DB); word hits seed local alignments.]
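A minimal Python sketch of the dictionary step (build_index and find_seeds are illustrative names, not any real aligner's API; this is the T = k case, where only identical words reach the threshold):

    from collections import defaultdict

    def build_index(db, k=10):
        """Map every length-k word of the database to the positions where it occurs."""
        index = defaultdict(list)
        for i in range(len(db) - k + 1):
            index[db[i:i + k]].append(i)
        return index

    def find_seeds(query, index, k=10):
        """Yield (query position, db position) pairs sharing an exact k-word."""
        for j in range(len(query) - k + 1):
            for i in index.get(query[j:j + k], []):
                yield j, i

    # Example use: seeds = list(find_seeds(query, build_index(db)))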

Slide 9

Indexing-based local alignment—Extensions

A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A

- Gapped extensions until threshold
- Extensions with gaps until score < C below best score so far

Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
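The drop-off rule ("extend until the score falls more than C below the best seen so far") can be sketched as follows. An ungapped rightward extension is shown for brevity, and the +1/–3 scoring and function name are assumptions:

    def extend_right(query, db, j, i, k, match=1, mismatch=-3, drop=10):
        """Extend a seed (query[j:j+k] == db[i:i+k]) rightward without gaps,
        stopping once the score falls more than `drop` below the best so far."""
        score = best = k * match   # the seed contributes k matches
        best_ext = ext = 0
        while j + k + ext < len(query) and i + k + ext < len(db):
            score += match if query[j + k + ext] == db[i + k + ext] else mismatch
            ext += 1
            if score > best:
                best, best_ext = score, ext
            elif score < best - drop:
                break              # drop-off termination
        return best, best_ext      # best score and how far the extension reached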

Slide 10

Sensitivity-Speed Tradeoff

[Figure: sensitivity vs. speed; long words (k = 15) are fast but less sensitive, short words (k = 7) are sensitive but slow. Kent WJ, Genome Research 2002.]

Slide 11

Sensitivity-Speed Tradeoff

Methods to improve sensitivity/speed:
- Using pairs of words
- Using inexact words
- Patterns—non-consecutive positions

Example:
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCACGGACCAGTGACTACTCTGATTCCCAG……

……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCGCCGACGAGTGATTACACAGATTGCCAG……

TTTGATTACACAGAT
T G TT CAC G   (a pattern hitting non-consecutive positions)

Slide 12

Measured improvement

[Figure: Kent WJ, Genome Research 2002.]

Slide 13

Non-consecutive words—Patterns

Patterns increase the likelihood of at least one match within a long conserved region.

[Figure: example word pairs with 3, 5, or 7 positions in common under consecutive patterns, and 6 in common under a non-consecutive pattern.]

On a 100-long, 70% conserved region:

                         Consecutive   Non-consecutive
Expected # hits:             1.07           0.97
Prob[at least one hit]:      0.30           0.47
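Probabilities like these can be checked by simulation under the usual model where each position of the region matches independently. A rough Monte Carlo sketch follows; the spaced seed used is the well-known weight-11 PatternHunter seed, an assumption here since the slide does not give the exact pattern (or the precise parameters behind the quoted numbers):

    import random

    def hit_prob(pattern, region_len=100, p_match=0.7, trials=20000):
        """Monte Carlo estimate of Prob[at least one pattern hit] in a region
        whose positions each match independently with probability p_match."""
        span, hits = len(pattern), 0
        for _ in range(trials):
            m = [random.random() < p_match for _ in range(region_len)]
            if any(all(m[i + j] for j, care in enumerate(pattern) if care)
                   for i in range(region_len - span + 1)):
                hits += 1
        return hits / trials

    consecutive = [1] * 11                           # 11 consecutive required positions
    spaced = [int(c) for c in "111010010100110111"]  # weight 11, length 18
    print(hit_prob(consecutive), hit_prob(spaced))   # spaced should come out higher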

Slide 14

Advantage of Patterns

[Figure: seeds spanning 11, 11, and 10 positions compared.]

Slide 15

Multiple patterns

K patterns:
- Takes K times longer to scan
- Patterns can complement one another

Computational problem:
- Given: a model (probability distribution) for homology between two regions
- Find: the best set of K patterns that maximizes Prob(at least one match)

[Figure: several complementary patterns matched against TTTGATTACACAGAT.]

Buhler et al., RECOMB 2003; Sun & Buhler, RECOMB 2004

How long does it take to search the query?

Slide 16

Variants of BLAST

- NCBI BLAST: search the universe. http://www.ncbi.nlm.nih.gov/BLAST/
- MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html
  - Optimized to align very similar sequences
  - Works best when k = 4i ≤ 16
  - Linear gap penalty
- WU-BLAST (Wash U BLAST): http://blast.wustl.edu/
  - Very good optimizations
  - Good set of features & command-line arguments
- BLAT: http://genome.ucsc.edu/cgi-bin/hgBlat
  - Faster, less sensitive than BLAST
  - Good for aligning huge numbers of queries
- CHAOS: http://www.cs.berkeley.edu/~brudno/chaos
  - Uses inexact k-mers; sensitive
- PatternHunter: http://www.bioinformaticssolutions.com/products/ph/index.php
  - Uses patterns instead of k-mers
- BlastZ: http://www.psc.edu/general/software/packages/blastz/
  - Uses patterns; good for finding genes
- Typhon: http://typhon.stanford.edu
  - Uses multiple alignments to improve the sensitivity/speed tradeoff

Slide 17

Hidden Markov Models

[Figure: the HMM trellis — a column of states 1, 2, …, K at each time step, emitting symbols x1, x2, x3, ….]

Slide 18

Outline for our next topic

- Hidden Markov models – the theory
- Probabilistic interpretation of alignments using HMMs

Later in the course:
- Applications of HMMs to biological sequence modeling and discovery of features such as genes

Slide 19

Example: The Dishonest Casino

A casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

The casino player switches back and forth between the fair and loaded die once every 20 turns.

Game:
1. You bet $1
2. You roll (always with a fair die)
3. The casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2

Slide 20

Question # 1 – Evaluation

GIVEN: a sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION: How likely is this sequence, given our model of how the casino works?

This is the EVALUATION problem in HMMs. Prob = 1.3 × 10^-35
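Evaluation is solved by the forward algorithm, which sums over all state paths. A minimal sketch for the casino model (the uniform start probabilities are an assumption — the slides do not give a0):

    def forward_prob(rolls, states=('F', 'L')):
        """P(rolls | model) via the forward algorithm, for the casino HMM."""
        a0 = {'F': 0.5, 'L': 0.5}                    # start probs (assumed uniform)
        a = {('F', 'F'): 0.95, ('F', 'L'): 0.05,     # transitions from the slide
             ('L', 'L'): 0.95, ('L', 'F'): 0.05}
        e = {'F': {r: 1 / 6 for r in '123456'},      # fair-die emissions
             'L': {**{r: 1 / 10 for r in '12345'}, '6': 1 / 2}}  # loaded-die emissions
        f = {k: a0[k] * e[k][rolls[0]] for k in states}
        for x in rolls[1:]:
            f = {k: e[k][x] * sum(f[j] * a[(j, k)] for j in states) for k in states}
        return sum(f.values())

    print(forward_prob("1245526462146146136136661664661636616366163616515615115146123562344"))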

Slide 21

Question # 2 – Decoding

GIVEN: a sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die?

This is the DECODING question in HMMs.

[Figure: the roll sequence annotated FAIR, then LOADED, then FAIR.]
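Decoding is solved by the Viterbi algorithm (developed later in the course). A minimal log-space sketch, reusing the casino parameters above and again assuming a uniform start distribution:

    import math

    def viterbi(rolls, states=('F', 'L')):
        """Most probable hidden state path (F = fair, L = loaded) via Viterbi."""
        a0 = {'F': 0.5, 'L': 0.5}
        a = {('F', 'F'): 0.95, ('F', 'L'): 0.05,
             ('L', 'L'): 0.95, ('L', 'F'): 0.05}
        e = {'F': {r: 1 / 6 for r in '123456'},
             'L': {**{r: 1 / 10 for r in '12345'}, '6': 1 / 2}}
        v = {k: math.log(a0[k] * e[k][rolls[0]]) for k in states}
        back = []
        for x in rolls[1:]:
            ptr, nv = {}, {}
            for k in states:
                j = max(states, key=lambda s: v[s] + math.log(a[(s, k)]))
                ptr[k] = j
                nv[k] = v[j] + math.log(a[(j, k)]) + math.log(e[k][x])
            back.append(ptr)
            v = nv
        state = max(states, key=v.get)   # best final state; trace pointers back
        path = [state]
        for ptr in reversed(back):
            state = ptr[state]
            path.append(state)
        return ''.join(reversed(path))

    print(viterbi("1245526462146146136136661664661636616366163616515615115146123562344"))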

Slide 22

Question # 3 – Learning

GIVEN: a sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION: How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back?

This is the LEARNING question in HMMs. Prob(6) = 64%

Slides 23–24

The dishonest casino model

[Figure: two states, FAIR and LOADED, each with self-transition probability 0.95 and switch probability 0.05.]

P(1|F) = 1/6    P(1|L) = 1/10
P(2|F) = 1/6    P(2|L) = 1/10
P(3|F) = 1/6    P(3|L) = 1/10
P(4|F) = 1/6    P(4|L) = 1/10
P(5|F) = 1/6    P(5|L) = 1/10
P(6|F) = 1/6    P(6|L) = 1/2

Slide 25

An HMM is memory-less

At each time step t, the only thing that affects future states is the current state πt.

Slide 26

Definition of a hidden Markov model

Definition: A hidden Markov model (HMM) consists of:

- An alphabet Σ = { b1, b2, …, bM }
- A set of states Q = { 1, …, K }
- Transition probabilities between any two states:
    aij = transition prob from state i to state j
    ai1 + … + aiK = 1, for all states i = 1…K
- Start probabilities a0i:
    a01 + … + a0K = 1
- Emission probabilities within each state:
    ek(b) = P(xi = b | πi = k)
    ek(b1) + … + ek(bM) = 1, for all states k = 1…K

(End probabilities ai0, as in Durbin, are not needed.)

Slide 27

An HMM is memory-less

At each time step t, the only thing that affects future states is the current state πt:

P(πt+1 = k | "whatever happened so far")
  = P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt)
  = P(πt+1 = k | πt)

Slide 28

An HMM is memory-less

At each time step t, the only thing that affects xt is the current state πt:

P(xt = b | "whatever happened so far")
  = P(xt = b | π1, π2, …, πt, x1, x2, …, xt-1)
  = P(xt = b | πt)

Slide 29

A parse of a sequence

Given a sequence x = x1……xN, a parse of x is a sequence of states π = π1, ……, πN.

[Figure: the trellis of states 1, 2, …, K at each position, emitting x1, x2, x3, ….]

Slide 30

Generating a sequence by the model

Given an HMM, we can generate a sequence of length n as follows:

1. Start at state π1 according to prob a0π1
2. Emit letter x1 according to prob eπ1(x1)
3. Go to state π2 according to prob aπ1π2
4. … until emitting xn

[Figure: the trellis; for example, start transition a02 leads into state 2, which emits x1 with probability e2(x1).]
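A minimal generation sketch for the casino model following these steps (the uniform start distribution is, as before, an assumption):

    import random

    def generate(n, seed=None):
        """Generate n rolls and the hidden state path from the casino HMM."""
        rng = random.Random(seed)
        trans = {'F': (('F', 'L'), (0.95, 0.05)),
                 'L': (('L', 'F'), (0.95, 0.05))}
        emit = {'F': ('123456', (1 / 6,) * 6),
                'L': ('123456', (0.1, 0.1, 0.1, 0.1, 0.1, 0.5))}
        state = rng.choice('FL')                        # pi_1 ~ a0 (assumed uniform)
        rolls, path = [], []
        for _ in range(n):
            path.append(state)
            faces, probs = emit[state]
            rolls.append(rng.choices(faces, probs)[0])  # emit x_t ~ e_state
            nxt, probs = trans[state]
            state = rng.choices(nxt, probs)[0]          # pi_{t+1} ~ a_{state, .}
        return ''.join(rolls), ''.join(path)

    rolls, path = generate(50, seed=0)
    print(rolls)
    print(path)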