Sequence Local Alignment using Directed Acyclic Word Graph


min-jolicoeur
Uploaded On 2017-03-15



Presentation Transcript

Slide1

Sequence Local Alignment using Directed Acyclic Word Graph

Do Huy Hoang

Slide2

Sequence Alignment

Slide3

Sequence Similarity

Alignment

Arrange DNA/Protein sequences to show the similarity

“-” denotes an insertion/deletion event

Slide4

Other variations

Edit distance

Longest common substring

Affine gap scoring

Using a scoring matrix (BLOSUM, PAM)

Slide5

Alignment score computation

Needleman–Wunsch

Dynamic programming

Slide6
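The Needleman–Wunsch dynamic programme named above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name and the default scores (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for the example, and only the optimal score is returned, not the alignment itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


print(needleman_wunsch("GATTACA", "GATTACA"))
```

The table has (M+1) x (N+1) cells and each cell is filled in constant time, which is the MN time and memory that the other methods in this talk try to improve on.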

Other variations

| Name | Problem | Worst time | Average time | Memory |
|---|---|---|---|---|
| Four Russian | Edit distance 1,0 | $MN/\log N$ | <not good> | $MN$ |
| Ukkonen | Global edit (linear cost) | $ND$ | $N + D^2$ | $D^2$ |
| Waterman | Local alignment | $MN$ | $MN$ | $MN$ |
| Tree tree | Local alignment | $M^2N^2$ | <close to $M^2N^2$> | |
| BWTSW | Meaningful local alignment | $MN^2$ | $MN^{0.68}$ | |

Slide7

Local alignment

Find the best alignment between a substring of one sequence and a substring of the other

Slide8
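The local alignment problem stated above is classically solved with the Smith–Waterman recurrence, which is Needleman–Wunsch with every cell clamped at zero. The sketch below is an illustration under assumptions: the function name is hypothetical, and a single linear gap penalty (`gap=-5`) stands in for any affine gap model.

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-5):
    """Return the best local alignment score between substrings of a and b.

    Assumption: linear gap penalty; default scores are illustrative.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start afresh at any cell.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman("xxACGTxx", "yyACGTyy"))
```

Only two rows are kept because the score alone is wanted; recovering the aligned substrings would need the full table or a traceback structure.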

BWTSW

Slide9

BWTSW

Motivation

Scoring 75% similarity

Most entries of the local alignment table are zero

Meaningful alignment

Suffix tree

Meaningful alignment

Meaningful alignment with gap

How good is it?

Slide10

Meaningful alignment (1)

Sequence similarity sometimes implies functional similarity.

Biologists are usually NOT interested in sequences with less than 70% similarity.

BLAST score

Match = 1

Mismatch = -3

Open Gap = -5

Extending gap = -2

Slide11

Meaningful alignment (2)

BLAST score

Match = 1

Mismatch = -3

Open Gap = -5

Extending Gap = -2

More than 75% of the positions must match for the score to be non-zero (with Match = 1 and Mismatch = -3)

Slide12
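A quick check of the match fraction implied by these scores, assuming a gap-free alignment with L columns of which m are matches (L and m are names introduced here for the calculation):

```latex
\begin{align*}
\text{score} &= 1 \cdot m - 3\,(L - m) = 4m - 3L, \\
\text{score} > 0 &\iff m > \tfrac{3}{4} L .
\end{align*}
```

Every gap column lowers the score further, so an alignment that contains gaps needs an even higher fraction of matches to stay positive.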

Meaningful alignment (3)

BLAST score

Match = 1

Mismatch = -3

Open Gap = -5

Extending Gap = -2

How many non-zero entries are there in the local alignment DP table?

Slide13

How to improve?

Idea:

Do not store zero-score entries

Use the suffix tree to prune early

Slide14

BWTSW details

FM index for suffix tree representation

Prune zero entries

Store the DP vector as a linked list

Slide15
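The idea of keeping only the non-zero DP entries can be illustrated without any index structure. The sketch below is not BWTSW: it has no FM index or suffix-tree pruning, it still scans every column, and it uses a dictionary where the slide mentions a linked list. It only shows that dropping non-positive cells leaves the best score unchanged. The function name and the linear gap penalty are assumptions.

```python
def sparse_local_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Local alignment that stores only the positive cells of each DP row.

    Returns (best_score, stored_cells). Assumption: linear gap penalty.
    """
    best, stored = 0, 0
    prev = {}  # column index -> positive score in the previous row
    for ch in text:
        cur = {}
        for j in range(1, len(pattern) + 1):
            sub = match if ch == pattern[j - 1] else mismatch
            # A missing entry stands for a cell whose score is zero.
            score = max(
                prev.get(j - 1, 0) + sub,
                prev.get(j, 0) + gap,
                cur.get(j - 1, 0) + gap,
            )
            if score > 0:
                cur[j] = score
                best = max(best, score)
        stored += len(cur)
        prev = cur
    return best, stored


best, stored = sparse_local_alignment("xxACGTxx", "yyACGTyy")
print(best, stored, "of", 8 * 8, "cells stored")
```

Under the BLAST-style scores only a handful of the 64 cells are ever stored in this example, which is the effect the pruning relies on.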

Analysis

Text length = N

Pattern length = M

Alphabet size = σ

Slide16

Average running time (1)

Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0:

$F(L) = |\{(S_1, S_2) : \mathrm{Len}(S_1) = \mathrm{Len}(S_2) = L,\ \mathrm{Score}(S_1, S_2) > 0\}|$

F(L) counts the number of pairs with 75% identity.

$F(L) = \sum_{i=0}^{L/4} \binom{L}{i} (\sigma - 1)^i$

$F(L) \le k_1 k_2^{L}$

$F(\log N) \le k_3 N^{0.68}$

Slide17
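The sum for F(L) is easy to evaluate numerically. The sketch below reads it as the number of strings S2 that differ from one fixed S1 in at most L/4 positions (choose i mismatch positions, then one of the σ-1 other letters at each); that reading, the function name, the default σ = 4 for DNA, and rounding L/4 down are all assumptions of this example.

```python
from math import comb


def count_close_strings(length, sigma=4):
    """Evaluate sum_{i=0}^{L/4} C(L, i) * (sigma - 1)^i with L/4 rounded down."""
    return sum(comb(length, i) * (sigma - 1) ** i for i in range(length // 4 + 1))


for length in (4, 8, 16):
    f = count_close_strings(length)
    # Fraction of all sigma^L strings that stay within the mismatch budget.
    print(length, f, f / 4 ** length)
```

The printed fraction shrinks quickly as L grows, which is why long random strings rarely keep a positive score.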

Average running time (2)

Given S1, $\Pr(\mathrm{Score}(S_1, S_2) > 0 \mid S_1) = F(L) / \sigma^{L}$

For M < log(N)

The number of entries is $O(M \cdot F(M)) < O(\log N \cdot F(\log N))$

For M > log(N)

$O(M \cdot N \cdot F(M) / \sigma^{L})$

On average

Time $= O(M \cdot F(\log N)) = O(M N^{0.68})$

Slide18

DAWG

Slide19

Possible improvement of BWTSW

Worst case running time $O(N^2 M)$

When M = N

$O(M N^{0.68} + M^3)$ when the pattern is a substring of the text

What about ST vs. ST?

Slide20

What we used in BWTSW is a suffix trie (not a suffix tree).

#Prove it#

A suffix trie has $O(N^2)$ nodes

A DAWG is a similar structure with O(N) nodes

Slide21
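The quadratic size of a suffix trie can be seen by building one explicitly. The sketch below uses nested dictionaries as trie nodes; the function name is hypothetical and the construction is the naive one, inserting every suffix character by character.

```python
def suffix_trie_size(text):
    """Build the suffix trie of text as nested dicts and return its node count."""
    root = {}
    nodes = 1  # the root
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


print(suffix_trie_size("abcbc"))
print(suffix_trie_size("abcdefghij"))
```

Each node other than the root corresponds to one distinct substring, so a string of N distinct letters gives N(N+1)/2 + 1 nodes.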

DAWG (1)

Slide22

DAWG (2)

DAWG: Directed Acyclic Word Graph

A DAWG is an acyclic automaton that recognizes all the substrings of the given string.

Slide23

DAWG (3)

Example:

DAWG of “abcbc”

Slide24

DAWG (4)

End-set view

Slide25

Trivial DAWG construction

Using end-set classes

Slide26
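The end-set view can be made concrete by brute force: two substrings belong to the same DAWG state exactly when they end at the same set of positions. The sketch below groups substrings that way. It is a cubic-time illustration only, with a hypothetical function name, and it lists the states without building the edges.

```python
from collections import defaultdict


def end_set_classes(w):
    """Group the non-empty substrings of w by their set of (1-based) end positions."""
    end_sets = defaultdict(set)
    for start in range(len(w)):
        for end in range(start + 1, len(w) + 1):
            end_sets[w[start:end]].add(end)
    classes = defaultdict(list)
    for sub, ends in end_sets.items():
        classes[frozenset(ends)].append(sub)
    return classes


for ends, subs in sorted(end_set_classes("abcbc").items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ends), sorted(subs))
```

For abcbc this yields seven classes; together with the source state for the empty string that makes eight DAWG states.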

DAWG properties

For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states and at most 3|w|-4 edges

Slide27
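The size bounds above can be checked on small inputs with the standard online construction of the DAWG (also called the suffix automaton). The sketch below is a textbook version of that construction, not code from the talk; `build_dawg` and `recognizes` are names chosen for this example.

```python
def build_dawg(w):
    """Online DAWG (suffix automaton) construction.

    Returns (nxt, link, length): per-state transition dicts, suffix links and
    the length of the longest string reaching each state. State 0 is the source.
    """
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over its shorter strings.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def recognizes(nxt, s):
    """Return True if s labels a path from the source, i.e. is a substring."""
    state = 0
    for ch in s:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


nxt, link, length = build_dawg("abcbc")
print(len(nxt), "states,", sum(len(t) for t in nxt), "edges")
```

For abcbc the construction produces 8 states and 9 edges, within the bounds 2|w|-1 = 9 and 3|w|-4 = 11.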

D(w) and ST(w^R)

There is a map between the nodes of the DAWG and those of the implicit ST(w^R)

Example: w = abcbc, w^R = cbcba

Store the DAWG using the ST, which uses only o(N) bits

[Figure; surviving labels: a, a, b, cb, cba, a, cba]

Slide28

D(w) and ST(w^R) (2)

List all incoming edges of node q in D(w) using ST(w^R)

Slide29

Local Alignment using DAWG

Basis

Induction

Slide30
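The basis and induction formulas did not survive in this transcript. The sketch below shows one standard way such a recurrence can be phrased, and it should be read as an assumption rather than as the presenter's formulation: `score[q][j]` is the best score of aligning any string that reaches DAWG state `q` against any suffix of `pattern[:j]`. The basis is the source state with score 0 everywhere; the induction pushes scores along each DAWG edge. The function names, the linear gap penalty and the absence of pruning are all choices made for this example.

```python
def build_dawg(w):
    """Standard online suffix-automaton construction; returns (nxt, length)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, length


def dawg_local_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score of text vs pattern, computed over the DAWG of text."""
    nxt, length = build_dawg(text)
    m = len(pattern)
    score = [[float("-inf")] * (m + 1) for _ in nxt]
    score[0] = [0] * (m + 1)  # basis: the empty string at the source state
    best = 0
    # Every edge goes to a state with a larger length, so this is a topological order.
    for p in sorted(range(len(nxt)), key=length.__getitem__):
        row = score[p]
        for j in range(1, m + 1):  # skip pattern[j-1] (gap on the text side)
            row[j] = max(row[j], row[j - 1] + gap)
        best = max(best, max(row))
        for ch, q in nxt[p].items():  # induction: extend by one text character
            target = score[q]
            for j in range(m + 1):
                cand = row[j] + gap  # ch aligned to a gap
                if j > 0:
                    sub = match if ch == pattern[j - 1] else mismatch
                    cand = max(cand, row[j - 1] + sub)
                if cand > target[j]:
                    target[j] = cand
    return best


print(dawg_local_alignment("acgtacgt", "cgaacg"))
```

Each edge is visited once and costs O(M) work, so with O(N) edges this unpruned version runs in O(MN) time. Because every substring of the text is spelled by some path from the source, the maximum over all states equals the Smith–Waterman score under the same scoring.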

Extensions

Meaningful alignment using DAWG

Prune the nodes whose score is less than zero

Shortest-path pruning style

Cache log(N) nodes

The worst-case running time is then M*N*log(N); the average case is the same for M << N.