Slide1
Dynamic Edit Distance Table under a General Weighted Cost Function
Heikki Hyyrö (University of Tampere, Finland)
Kazuyuki Narisawa (Kyushu University, Japan)
and Shunsuke Inenaga (Kyushu University, Japan)
Slide2
Contents
Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary
Slide3
Contents
Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary
Slide4
Edit Distance
The minimum total cost d for transforming string x[1:n] into y[1:m].
Example
x = prague, y = passage, Ins. = Del. = Sub. = 1
prague → passage in four edit operations
Edit Distance = Sub. + Ins. + Ins. + Del. = 1 + 1 + 1 + 1 = 4
Edit Operation    Cost
Insertion         Ins. = δ(ε, b)
Deletion          Del. = δ(a, ε)
Substitution      Sub. = δ(a, b)
Slide5
Dynamic Programming
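A minimal Python sketch of the dynamic programming computation, assuming one fixed cost per operation type. The function name `edit_distance_table` and the parameters `ins`, `dele`, and `sub` are illustrative and do not come from the slides.

```python
def edit_distance_table(x, y, ins=1, dele=1, sub=1):
    """Return table D where D[i][j] is the cost of turning x[:i] into y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele          # delete every character of x[:i]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins           # insert every character of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if x[i - 1] == y[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,       # deletion
                          D[i][j - 1] + ins,        # insertion
                          D[i - 1][j - 1] + match)  # substitution or match
    return D


D = edit_distance_table("prague", "passage")
print(D[-1][-1])  # 4, the value of the prague/passage example
```

Filling the whole table this way takes O(nm) time, which is the cost of the naïve method when the table is recomputed from scratch.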
[Figure: dynamic programming table]
Slide6
Contents
Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary
Slide7
Right Increment/Decrement
Right I/D of Edit Distance
input : D of strings A and B
output : D’ of strings A and B’ ( B = B’a or Ba = B’ )
easy to compute
insert or delete right column of D → D’ : O(m)
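The right increment and decrement can be sketched in Python as follows. This is only an illustration: the table is assumed to be stored as a list of columns, and the names `delta`, `empty_table`, `right_increment`, and `right_decrement` are hypothetical.

```python
def delta(x, y, ins=1, dele=1, sub=1):
    """Illustrative cost function; the empty string stands for ε."""
    if x == "":
        return ins
    if y == "":
        return dele
    return 0 if x == y else sub


def empty_table(a):
    """Table of A against the empty string: the single column D[0..m][0]."""
    col = [0]
    for ch in a:
        col.append(col[-1] + delta(ch, ""))
    return [col]


def right_increment(cols, a, ch):
    """Append the column for B' = B + ch, using only the last column: O(m)."""
    last = cols[-1]
    col = [last[0] + delta("", ch)]
    for i, ai in enumerate(a, start=1):
        col.append(min(col[i - 1] + delta(ai, ""),    # delete a_i
                       last[i] + delta("", ch),       # insert ch
                       last[i - 1] + delta(ai, ch)))  # substitute or match
    cols.append(col)


def right_decrement(cols):
    """Drop the last column: the table of A and B without its last character."""
    cols.pop()


a = "prague"
cols = empty_table(a)
for ch in "passage":
    right_increment(cols, a, ch)
print(cols[-1][-1])  # 4
```

No existing column is touched, because every entry depends only on entries to its left and above.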
[Figure labels: decrement, increment]
Slide8
Left Increment/Decrement
Left I/D of ED
input : D of strings A and B
output : D’ of strings A and B’ ( B = aB’ or aB = B’ )
difficult to compute
values on the left side affect the values on the right side
[Figure labels: decrement, increment]
Slide9
Contribution
Propose an efficient algorithm for the Left I/D problem with any nonnegative integer costs
Left I/D problem
input : ED table D of strings A and B
output : ED table D’ of strings A and B’
B = aB’ (decrement)
B’ = aB (increment)
costs of operations are nonnegative integers
Slide10
Applications
Cyclic String Comparison [Landau et al. 1998]
Computing Approximate Periods [Schmidt 1998]
Edit distance for sliding window
String Kernel based on Edit distance
a kernel is a mapping to a high-dimensional feature space
used in Support Vector Machines (classifiers)
Slide11
Contents
Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary
Slide12
Related Work
naïve method
compute D’ from scratch
O(nm) time
Kim & Park algorithm [2004]
Each operation has cost 1
Compute difference representation DR of table D
Using Change Table Ch
O(n+m) time
Slide13
Definition
Left Increment/Decrement Problem
input : DR table of strings A and B
output : DR’ table of strings A and B’
B = aB’ (decrement)
B’ = aB (increment)
Each cost (Ins., Del., Sub.) is a nonnegative integer
Kim & Park algorithm : each cost is 1
Slide14
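Difference Representation
DR[i, j].U = D[i, j] – D[i–1, j] (the entry minus its upper neighbor)
DR[i, j].L = D[i, j] – D[i, j–1] (the entry minus its left neighbor)
Slide15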
DR’ – DR
We need not update all cells
Slide16
Change Table
Ch[i, j] = D’[i, j] – D[i, j]
cost = 1
values in Ch : –1, 0, 1
Ch is separated into three areas
Slide17
Affected Entries
entries where DR’[i, j] ≠ DR[i, j]
they must be updated
affected entries are along the borders of the three areas in Ch
Slide18
Sketch of Kim & Park Algorithm
Update affected entries
scan borders in Ch, computing Ch and DR’
Time Complexity : O(n+m)
Slide19
Contents
Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary
Slide20
General Costs
Ch can be separated into more than three areas
the number of areas depends on the costs
the values are not limited to –1, 0, 1
Kim & Park algorithm is specialized to the three-area case
cannot be applied with general costs
Example
Ins. = 2, Del. = 2, Sub. = 1
Slide21
Our Algorithm
Update only affected entries
without Ch
compute only DR’.U and DR’.L
Time complexity : O(min{c(n+m), nm})
c is the maximum cost
[Figure panels: DR’.U – DR.U, DR’.L – DR.L, D’ – D]
Slide22
Affected Entry
DR’[i, j] ≠ DR[i, j]
Kim & Park Algorithm
computes DR’ and Ch for computing Affected Entry
Our Algorithm
compute affected entry by only DR table
use following lemma
DR’[i, j] is Affected Entry ⇔ DR’[i–1, j].L ≠ DR[i–1, j].L or DR’[i, j–1].U ≠ DR[i, j–1].U
Slide23
comparison of pseudo codes
Our Algorithm
for i = 1 to m do
    prev⊿[i] = i + 1; DR[i, 1].U = δ(a_i, ε);
i = 1; j = 1; DR[0, j].L = δ(ε, b_j); currIdx = 1; prevIdx = 1;
while i ≦ m and j ≦ n do
    while i ≦ m do
        x = DR[i-1, j].L; y = DR[i, j-1].U;
        z = min{x + δ(a_i, ε), y + δ(ε, b_j), δ(a_i, b_j)};
        new.L = z - y; new.U = z - x;
        old.L = DR[i, j].L; old.U = DR[i, j].U;
        DR[i, j].L = new.L; DR[i, j].U = new.U;
        if old.U ≠ new.U then
            curr⊿[currIdx] = i; currIdx = currIdx + 1;
        i = i + 1;
        if old.L = new.L then
            now = i;
            repeat
                i = prev⊿[prevIdx]; prevIdx = prevIdx + 1;
            until i ≧ now
    curr⊿[currIdx] = m + 1;
    Interchange the roles of the tables curr⊿ and prev⊿;
    currIdx = 1; i = prev⊿[1]; prevIdx = 2; j = j + 1;
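The idea of updating only affected entries can be illustrated with the Python sketch below. It is not the authors' implementation. It assumes the DR table is stored as a list of columns of (U, L) pairs, that a left increment prepends the character `ch` to B, and it uses the costs Ins. = 2, Del. = 2, Sub. = 1 from the General Costs example. All function and variable names are hypothetical. An entry is recomputed only when the U value to its left or the L value above it has changed, as in the lemma.

```python
INS, DEL, SUB = 2, 2, 1  # assumed example costs


def delta(x, y):
    """Illustrative cost function; the empty string stands for ε."""
    if x == "":
        return INS
    if y == "":
        return DEL
    return 0 if x == y else SUB


def cell(x, y, ai, bj):
    """(U, L) of entry (i, j) from x = DR[i-1, j].L and y = DR[i, j-1].U."""
    z = min(x + delta(ai, ""), y + delta("", bj), delta(ai, bj))
    return (z - x, z - y)


def full_column(a, bj, left):
    """Compute a whole column for character bj from its left neighbor."""
    col = [(0, delta("", bj))]  # row 0: only L is meaningful
    for i, ai in enumerate(a, start=1):
        col.append(cell(col[i - 1][1], left[i][0], ai, bj))
    return col


def build(a, b):
    """DR table from scratch, column by column: cols[j][i] = (U, L)."""
    cols = [[(0, 0)] + [(delta(ai, ""), 0) for ai in a]]  # column 0: only U
    for bj in b:
        cols.append(full_column(a, bj, cols[-1]))
    return cols


def left_increment(cols, a, b, ch):
    """Update cols in place from the DR table of (a, b) to that of (a, ch + b)."""
    m = len(a)
    new = full_column(a, ch, cols[0])
    # rows where the U value seen from the right differs from before
    changed = [i for i in range(1, m + 1) if new[i][0] != cols[0][i][0]]
    cols.insert(1, new)
    j = 2
    while changed and j < len(cols):
        col, left, bj = cols[j], cols[j - 1], b[j - 2]
        nxt, k, i = [], 0, changed[0]
        while i <= m:
            old = col[i]
            col[i] = cell(col[i - 1][1], left[i][0], a[i - 1], bj)
            if col[i][0] != old[0]:
                nxt.append(i)        # U changed: affects the next column
            if col[i][1] != old[1]:
                i += 1               # L changed: affects the row below
                continue
            while k < len(changed) and changed[k] <= i:
                k += 1               # skip to the next row with a changed left U
            if k == len(changed):
                break
            i = changed[k]
        changed = nxt
        j += 1
    return cols


a, b = "prague", "assage"
cols = build(a, b)
left_increment(cols, a, b, "p")
print(cols == build(a, "p" + b))  # True
```

The update stops as soon as a column has no changed U value, since no entry to its right can then be affected. The sketch uses `list.insert` for the column shift for brevity, so it does not by itself achieve the time bound stated on the slides.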
Kim & Park Algorithm
Let k be the smallest index in A such that A[k] = B[1]
i_{-1} = 0; j_{-1} = 1; i_1 = k; j_1 = 0; f(0) = 0; g(0) = k;
finished_{-1} = false; finished_1 = false;
while (finished_{-1} = false) or (finished_1 = false) do
    if i_{-1} < i_1 – 1 then /* case1 */
        if j_{-1} > j_1 + 1 then
            if j_{-1} > j_1 + 1 then X = -1;
            else X = 0;
            Y = 0;
        else
            if f(i_{-1}) < j_{-1} then X = -1;
            else if g(j_1) ≦ i_{-1} then X = 1;
            else X = 0;
            if g(j_1) ≦ i_{-1} + 1 then Y = 1;
            else Y = 0;
        Z = -1;
        Ch[i_{-1}+1, j_{-1}] = min{ -DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1}, -DR[i_{-1}+1, j_{-1}+1].U + Z + 1, -DR[i_{-1}+1, j_{-1}+1].L + Y + 1};
        DR’[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U – Ch[i_{-1}+1, j_{-1}] + Z;
        DR’[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L – Ch[i_{-1}+1, j_{-1}] + Y;
        if Ch[i_{-1}+1, j_{-1}] = -1 then i_{-1} = i_{-1} + 1; f(i_{-1}) = j_{-1};
        else j_{-1} = j_{-1} + 1;
    else if j_1 < j_{-1} - 1 then /* case2 */
        if i_1 > i_{-1} + 1 then
            if g(j_1) < i_1 then X = 1;
            else X = 0;
            Y = 0;
        else
            if g(j_1) < i_1 then X = 1;
            else if f(i_{-1}) ≦ j_1 then X = 0;
            else X = 0;
            if f(i_1 - 1) ≦ j_1 + 1 then Y = -1;
            else Y = 0;
        Z = 1;
        Ch[i_1, j_1+1] = min{ -DR[i_1, j_1+2].UL + X + δ_{i_1, j_1+2}, -DR[i_1, j_1+2].U + Y + 1, -DR[i_1, j_1+2].L + Z + 1};
        DR’[i_1, j_1+1].U = DR[i_1, j_{-1}+2].U – Ch[i_1, j_1+1] + Y;
        DR’[i_1, j_1+1].L = DR[i_1, j_{-1}+2].L – Ch[i_1, j_1+1] + Z;
        if Ch[i_1, j_1+1] = 1 then j_1 = j_1 + 1; g(j_1) = i_1;
        else i_1 = i_1 + 1;
    else /* case3 */
        if f(i_{-1}) < j_{-1} then X = -1;
        else if g(j_1) ≦ i_{-1} then X = 1;
        else X = 0;
        Y = -1; Z = 1;
        Ch[i_{-1}+1, j_{-1}] = min{ -DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1}, -DR[i_{-1}+1, j_{-1}+1].U + Y + 1, -DR[i_{-1}+1, j_{-1}+1].L + Z + 1};
        DR’[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U – Ch[i_{-1}+1, j_{-1}] + Y;
        DR’[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L – Ch[i_{-1}+1, j_{-1}] + Z;
        if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
        else if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
        else j_{-1} = j_{-1} + 1; i_1 = i_1 + 1;
    if (i_{-1} = m) or (j_{-1} = n) then
        finished_{-1} = true;
    if (i_1 = m+1) or (j_1 = n-1) then
        finished_1 = true;
Slide24
comparison of behaviors
[Figure panels: our algorithm, Kim & Park algorithm]
Slide25
Contents
Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary
Slide26
Experiments
strings A[1:m] and B[1:m]
Total time of computing representations of edit distance between A and B[j:m] for j = m, m–1, …, 1
left incremental computation
Machine Specifications
CentOS Linux
Xeon 3.0GHz
16GB memory
Slide27
Experiment 1
Time comparison with naïve algorithm
costs chosen randomly
Insertion = 137, Deletion = 116, Substitution = 242
Random data
alphabet size 2, 3, …, 52
string length 100, 200, …, 5000
Slide28
Result 1
Slide29
Result 1
Slide30
Experiment 2
Time comparison with Kim & Park algorithm
costs
Insertion = Deletion = Substitution = 1
Random data
alphabet size 2, 3, …, 52
string length 100, 200, …, 5000
Slide31
Result 2
Slide32
Result 2
Slide33
Experiment 3
Time comparison with naïve algorithm
Corpus
English (Reuters news)
costs : Insertion = 137, Deletion = 116, Substitution = 242
string length : 1000, 2000, 3000, 4000, 5000
Protein data (Canterbury corpus: E.coli)
costs proposed in [Kurtz 1996]
string length : 1000, 2000, 3000, 4000, 5000
δ    ε    A    C    G    T
ε    0    3    3    3    3
A    3    0    2    1    3
C    3    2    0    2    1
G    3    1    2    0    2
T    3    2    1    2    0
Slide34
Result 3
English News
length    Time (seconds)
          Our algorithm    Naïve algorithm
1000      0.04             1.50
2000      0.27             12.0
3000      0.71             40.4
4000      1.36             97.1
5000      2.29             189
Protein Data
length    Time (seconds)
          Our algorithm    Naïve algorithm
1000      0.01             1.43
2000      0.09             11.5
3000      0.23             38.8
4000      0.43             92.8
5000      0.70             181
Slide35
Summary
Algorithm for Left I/D problem
nonnegative integer costs
O( min{c(n+m), nm} )
c is the maximum cost
experimentally fast
                     Our Algorithm           Naïve Algorithm    Kim & Park Algorithm
Costs                Nonnegative integer     Real number        1
Tables to compute    DR                      D                  DR and Ch
Time Complexity      O( min{c(n+m), nm} )    O(nm)              O(n+m)
Source code          Simple                  Simple             Cumbersome
Speed                Fast                    Very Slow          Slower
Slide36
Slide37
Slide38
Related Work
naïve method
compute D’ from scratch
O(nm) time
Kim & Park algorithm [2004]
Each operation has cost 1
Compute difference representation DR → DR’
Using Change Table Ch
O(n+m) time
[Figure: D → D’ (naïve) and DR, Ch → DR’, Ch (Kim & Park), each followed by Edit Distance; time labels: O(nm), O(nm), O(1), O(n+m), O(n+m)]