Slide1
Sequence Local Alignment using Directed Acyclic Word Graph
Do Huy Hoang
Slide2
Sequence Alignment
Slide3
Sequence Similarity
Alignment
Arrange DNA/protein sequences to show their similarity
A gap symbol (typically “-”) denotes an insertion/deletion event
Slide4
Other variations
Edit distance
Longest common substring
Affine gap scoring
Using a scoring matrix (BLOSUM, PAM)
Slide5
Alignment score computation
Needleman–Wunsch
Dynamic programming
Slide6
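The Needleman–Wunsch dynamic programming named under alignment score computation can be sketched as follows. This is a minimal sketch, not the presenter's code: the function name and the default scores (match 1, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap penalty.

    The scoring values are illustrative assumptions, not from the slides.
    """
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows are kept at a time, so the memory is proportional to the shorter dimension passed as `b`, while the time stays proportional to the product of the two lengths.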
Other variations
Name          | Problem                    | Worst time  | Average time         | Memory
Four Russians | Edit distance 1,0          | $MN/\log N$ | <not good>           | $MN$
Ukkonen       | Global edit (linear cost)  | $ND$        | $N + D^2$            | $D^2$
Waterman      | Local alignment            | $MN$        | $MN$                 | $MN$
Tree tree     | Local alignment            | $M^2 N^2$   | <close to $M^2 N^2$> |
BWTSW         | Meaningful local alignment | $M N^2$     | $M N^{0.68}$         |
Slide7
Local alignment
Find the best alignment between two substrings, one from each sequence
Slide8
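Local alignment as defined above (the best-scoring pair of substrings) is classically computed with the Smith–Waterman recurrence, listed as "Waterman" in the comparison table. The sketch below is illustrative only; the function name and the default scores are assumptions.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of a and any of b.

    Scores are illustrative assumptions with a linear gap penalty.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at 0 lets an alignment start anywhere.
            cell = max(0,
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # an alignment may also end anywhere
        prev = curr
    return best


print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

The only differences from the global recurrence are the clamp at zero and taking the maximum over all cells instead of the last cell.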
BWTSW
Slide9
BWTSW
Motivation
Scoring 75% similarity
Most entries of the local alignment table are zero
Meaningful alignment
Suffix tree
Meaningful alignment
Meaningful alignment with gap
How good is it?
Slide10
Meaningful alignment (1)
Sequence similarity sometimes implies functional similarity.
Biologists are usually NOT interested in sequences with less than 70% similarity.
BLAST score
Match = 1
Mismatch = -3
Open Gap = -5
Extending gap = -2
Slide11
Meaningful alignment (2)
BLAST score
Match = 1
Mismatch = -3
Open Gap = -5
Extending Gap = -2
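More than 75% of the positions must match for an ungapped alignment to have a nonzero score (matches must outnumber mismatches by more than 3 to 1)
Slide12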
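With Match = 1 and Mismatch = -3, an ungapped alignment with m matches and x mismatches scores m - 3x. The small sketch below, with hypothetical helper names, finds the smallest number of matches that gives a positive score at a given alignment length; gaps are ignored here as a simplifying assumption.

```python
MATCH, MISMATCH = 1, -3  # BLAST scores from the slide


def ungapped_score(matches, mismatches):
    """Score of an alignment made only of matches and mismatches."""
    return MATCH * matches + MISMATCH * mismatches


def min_matches_for_positive(length):
    """Smallest match count with a positive ungapped score, or None."""
    for m in range(length + 1):
        if ungapped_score(m, length - m) > 0:
            return m
    return None


for length in (20, 100):
    print(length, min_matches_for_positive(length))
```

For length 100 the threshold is 76 matches, since exactly 75 matches and 25 mismatches score zero.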
Meaningful alignment (3)
BLAST score
Match = 1
Mismatch = -3
Open Gap = -5
Extending Gap = -2
How many nonzero entries are there in the local alignment DP table?
Slide13
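One way to get a feel for how many entries of the local alignment DP table are nonzero is to count them directly. The sketch below uses the BLAST scores from the slides with an affine gap recurrence (Gotoh style); it assumes a gap of length k costs 5 + 2k, which is one common reading of "open" and "extend" and may differ from the presenter's convention. The function name and the random test sequences are illustrative.

```python
import random

NEG = float("-inf")


def count_nonzero_entries(text, pattern, match=1, mismatch=-3,
                          gap_open=-5, gap_extend=-2):
    """Return (nonzero cells, total cells) of the local alignment table."""
    n, m = len(text), len(pattern)
    h_prev = [0] * (n + 1)      # best score ending at each cell, previous row
    f_prev = [NEG] * (n + 1)    # best score ending in a vertical gap
    nonzero = 0
    for i in range(1, m + 1):
        h = [0] * (n + 1)
        f = [NEG] * (n + 1)
        e = NEG                 # best score ending in a horizontal gap
        for j in range(1, n + 1):
            e = max(e + gap_extend, h[j - 1] + gap_open + gap_extend)
            f[j] = max(f_prev[j] + gap_extend,
                       h_prev[j] + gap_open + gap_extend)
            sub = match if pattern[i - 1] == text[j - 1] else mismatch
            h[j] = max(0, h_prev[j - 1] + sub, e, f[j])
            if h[j] > 0:
                nonzero += 1
        h_prev, f_prev = h, f
    return nonzero, m * n


random.seed(0)
text = "".join(random.choice("ACGT") for _ in range(200))
pattern = "".join(random.choice("ACGT") for _ in range(50))
nonzero, total = count_nonzero_entries(text, pattern)
print(f"{nonzero} of {total} entries are nonzero")
```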
How to improve?
Idea:
Do not store zero-score entries
Use a suffix tree to prune early
Slide14
BWTSW details
FM index for suffix tree representation
Prune zero entries
Store the DP vector using a linked list
Slide15
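The idea of keeping only the nonzero entries of each DP vector can be illustrated without the FM index. The sketch below is not BWTSW: it still scans every column, uses a Python dict in place of the linked list, and uses a linear gap penalty of -7 (the cost of a length-1 gap under Open Gap = -5, Extending Gap = -2) as a simplification. All names are hypothetical.

```python
def sparse_local_alignment(text, pattern, match=1, mismatch=-3, gap=-7):
    """Local alignment score, storing only the positive cells of each row."""
    best = 0
    prev = {}  # column -> positive score in the previous row
    for i in range(1, len(pattern) + 1):
        curr = {}
        for j in range(1, len(text) + 1):
            sub = match if pattern[i - 1] == text[j - 1] else mismatch
            # A missing key stands for a zero entry.
            score = max(prev.get(j - 1, 0) + sub,
                        prev.get(j, 0) + gap,
                        curr.get(j - 1, 0) + gap)
            if score > 0:
                curr[j] = score
                best = max(best, score)
        prev = curr
    return best


print(sparse_local_alignment("TTACGTACGTT", "ACGTACG"))
```

Because a missing key is read as zero, the result equals that of the dense table; only the storage changes.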
Analysis
Text length = N
Pattern length = M
Alphabet size = σ
Slide16
Average running time (1)
Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0:
$F(L) = |\{(S_1,S_2) : \mathrm{Len}(S_1)=\mathrm{Len}(S_2)=L,\ \mathrm{Score}(S_1,S_2)>0\}|$
F(L) counts the number of pairs with 75% identity.
$F(L) = \sum_{i=0}^{L/4} \binom{L}{i} (\sigma-1)^i$, where $\sigma$ is the alphabet size
$F(L) \le k_1 k_2^L$
$F(\log N) \le k_3 N^{0.68}$
Slide17
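The sum defining F(L) can be evaluated directly for small L. The sketch below assumes the sum reads as the binomial coefficient times (σ - 1)^i with σ the alphabet size (4 for DNA), which counts the strings that differ from one fixed string of length L in at most L/4 positions; the function name and the rounding of L/4 down are assumptions.

```python
from math import comb


def count_similar(L, sigma=4):
    """Sum over i = 0..L//4 of C(L, i) * (sigma - 1)**i.

    Counts strings of length L with at most L//4 mismatches
    against one fixed string over an alphabet of size sigma.
    """
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))


for L in (4, 8, 16):
    print(L, count_similar(L), 4 ** L)
```

Comparing the second column with σ^L in the third shows how small the fraction of positively scoring strings becomes as L grows.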
Average running time (2)
Given S1, $\Pr(\mathrm{Score}(S_1,S_2) > 0 \mid S_1) = F(L)/\sigma^L$, where $\sigma$ is the alphabet size
For M < log(N)
The number of entries is $O(M \cdot F(M)) < O(\log N \cdot F(\log N))$
For M > log(N)
$O(M \cdot N \cdot F(M) / \sigma^L)$
On average
Time $= O(M \cdot F(\log N)) = M N^{0.68}$
Slide18
DAWG
Slide19
Possible improvement of BWTSW
Worst-case running time: $O(N^2 M)$
When M = N
$O(M N^{0.68} + M^3)$ when the pattern is a substring of the text
What about ST vs. ST?
Slide20
What BWTSW uses is a suffix trie (not a suffix tree).
#Prove it#
A suffix trie has $O(N^2)$ nodes
A DAWG is a similar structure with $O(N)$ nodes
Slide21
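The quadratic size of a suffix trie is easy to observe: it has one node per distinct substring, plus the root. The sketch below builds the trie naively with nested dicts and counts its nodes; the function name is hypothetical and the construction is for illustration only.

```python
def suffix_trie_node_count(s):
    """Number of nodes (including the root) in the suffix trie of s."""
    root = {}
    count = 1  # the root
    for start in range(len(s)):
        node = root
        # Insert the suffix s[start:] one character at a time.
        for ch in s[start:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


for word in ("aaaa", "abcbc", "abcd"):
    print(word, suffix_trie_node_count(word))
```

A string with all distinct characters, such as "abcd", reaches the maximum of N(N+1)/2 + 1 nodes.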
DAWG (1)
Slide22
DAWG (2)
DAWG: Directed Acyclic Word Graph
A DAWG is an acyclic automaton that recognizes all the substrings of the given string.
Slide23
DAWG (3)
Example:
DAWG of “abcbc”
Slide24
DAWG (4)
End-set view
Slide25
Trivial DAWG construction
Using end-set classes
Slide26
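The trivial construction from end-set classes can be sketched as follows: compute, for every distinct substring, the set of positions where an occurrence ends, and merge substrings with equal end-sets into one state. This brute-force version is an assumption about what the "trivial" construction looks like; the function name is hypothetical.

```python
from collections import defaultdict


def dawg_by_end_sets(w):
    """Build DAWG states and edges of w by grouping substrings by end-set."""
    # End-set of a substring: 1-based positions where an occurrence ends.
    end_sets = defaultdict(set)
    end_sets[""] = set(range(len(w) + 1))  # the empty string ends everywhere
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            end_sets[w[i:j]].add(j)

    # One state per distinct end-set.
    state_of = {sub: frozenset(ends) for sub, ends in end_sets.items()}
    states = set(state_of.values())

    # Appending character c to substring y leads from class(y) to class(yc).
    edges = set()
    for sub, state in state_of.items():
        if sub:
            edges.add((state_of[sub[:-1]], sub[-1], state))
    return states, edges


states, edges = dawg_by_end_sets("abcbc")
print(len(states), len(edges))
```

For "abcbc" this gives 8 states and 9 edges; for example "bc" and "c" share the end-set {3, 5} and collapse into one state.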
DAWG properties
For |w| > 2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states and 3|w|-4 edges
Slide27
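The size bounds of 2|w|-1 states and 3|w|-4 edges can be checked on small inputs. The sketch below uses the standard online suffix-automaton construction with suffix links, a swapped-in technique that is not necessarily the construction used in the slides; all function names are hypothetical.

```python
def build_dawg(w):
    """Return the transition table (list of dicts) of the DAWG of w."""
    nxt, link, length = [{}], [-1], [0]  # state 0 is the source
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt


def accepts(nxt, s):
    """True if s can be read from the source, i.e. s is a substring."""
    state = 0
    for ch in s:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


def size(nxt):
    """Return (number of states, number of edges)."""
    return len(nxt), sum(len(t) for t in nxt)


dawg = build_dawg("abcbc")
print(size(dawg), accepts(dawg, "cbc"), accepts(dawg, "ca"))
```

For "abcbc" the automaton has 8 states and 9 edges, within the bounds of 9 and 11 given by the property above.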
D(w) and ST($w^R$)
There is a map between the nodes of the DAWG and the implicit ST($w^R$)
Example: w = abcbc, $w^R$ = cbcba
Store the DAWG using the ST, which uses only o(N) bits
[Figure: suffix tree of $w^R$ = cbcba, with edge labels a, b, cb, cba]
Slide28
D(w) and ST($w^R$) (2)
List all incoming edges of node q in D(w) using ST($w^R$)
Slide29
Local Alignment using DAWG
Basis
Induction
Slide30
Extensions
Meaningful alignment using DAWG
Prune the nodes whose score is less than zero
Shortest-path-style pruning
Cache log(N) nodes
The worst-case running time is M*N*log(N); the average case is the same for M << N.