Uploaded On 2016-03-16

Presentation Transcript

Slide1

Dynamic Edit Distance Table under a General Weighted Cost Function

Heikki Hyyrö (University of Tampere, Finland)
Kazuyuki Narisawa (Kyushu University, Japan)
Shunsuke Inenaga (Kyushu University, Japan)

Slide2

Contents

Edit Distance
Left Increment/Decrement Edit Distance Problem
Related Work
Our Algorithm
Experiments
Summary

Slide3

Slide4

Edit Distance

minimum total cost d for transforming string x[1:n] to y[1:m]

Example
x = prague, y = passage, Ins. = Del. = Sub. = 1
prague ⇓ ⇓ ⇓ ⇓ passage
Edit Distance = Sub. + Ins. + Ins. + Del. = 1 + 1 + 1 + 1 = 4

Edit Operation   Cost
Insertion        Ins. = δ(ε, b)
Deletion         Del. = δ(a, ε)
Substitution     Sub. = δ(a, b)

Slide5

Dynamic Programming
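The dynamic programming table can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `edit_distance_table` and the three cost callbacks `ins`, `dele`, and `sub` are hypothetical names standing in for δ(ε, b), δ(a, ε), and δ(a, b), and `sub` is assumed to return 0 for equal characters.

```python
def edit_distance_table(x, y, ins, dele, sub):
    """Return the (len(x)+1) x (len(y)+1) table D, where D[i][j] is the
    minimum cost of transforming x[:i] into y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: delete every character of x
        D[i][0] = D[i - 1][0] + dele(x[i - 1])
    for j in range(1, m + 1):          # first row: insert every character of y
        D[0][j] = D[0][j - 1] + ins(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + dele(x[i - 1]),              # deletion
                D[i][j - 1] + ins(y[j - 1]),               # insertion
                D[i - 1][j - 1] + sub(x[i - 1], y[j - 1]), # substitution or match
            )
    return D


# Unit costs, as in the prague/passage example (assumed callbacks).
unit_ins = lambda b: 1
unit_del = lambda a: 1
unit_sub = lambda a, b: 0 if a == b else 1

D = edit_distance_table("prague", "passage", unit_ins, unit_del, unit_sub)
print(D[-1][-1])  # 4
```

Filling the whole table takes O(nm) time, which is the cost of the naïve method discussed later.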

[Figure: dynamic programming table D]

Slide6

Slide7

Right Increment/Decrement

Right I/D of Edit Distance
input: D of strings A and B
output: D’ of strings A and B’ (B = B’a or Ba = B’)
easy to compute
  insert or delete the right column of D → D’: O(m)
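Why the right-hand case is easy can be seen in a short Python sketch. It assumes a layout the slides only imply: rows of D are indexed by A (length m) and columns by B, so a new rightmost column has m + 1 entries that depend only on the previous column. The function names and cost callbacks are hypothetical.

```python
def build_table(a, b, ins, dele, sub):
    """Plain O(|a||b|) table; rows follow a, columns follow b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = D[i - 1][0] + dele(a[i - 1])
    for j in range(1, len(b) + 1):
        D[0][j] = D[0][j - 1] + ins(b[j - 1])
        for i in range(1, len(a) + 1):
            D[i][j] = min(D[i - 1][j] + dele(a[i - 1]),
                          D[i][j - 1] + ins(b[j - 1]),
                          D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]))
    return D


def right_increment(D, a, new_char, ins, dele, sub):
    """Append the column for B' = B + new_char in place, in O(m) time."""
    D[0].append(D[0][-1] + ins(new_char))
    for i in range(1, len(a) + 1):
        # Row i-1 already holds its new entry; row i does not yet.
        D[i].append(min(D[i - 1][-1] + dele(a[i - 1]),       # cell above, new column
                        D[i][-1] + ins(new_char),            # cell to the left
                        D[i - 1][-2] + sub(a[i - 1], new_char)))  # diagonal cell
    return D


def right_decrement(D):
    """Drop the last column in place (B = B'a), also in O(m) time."""
    for row in D:
        row.pop()
    return D


costs = (lambda c: 1, lambda c: 1, lambda p, q: 0 if p == q else 1)
D = build_table("prague", "passag", *costs)
right_increment(D, "prague", "e", *costs)
```

No existing entry changes, because every cell depends only on cells above it and to its left.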

[Figure: decrement and increment]

Slide8

Left Increment/Decrement

Left I/D of ED
input: D of strings A and B
output: D’ of strings A and B’ (B = aB’ or aB = B’)
difficult to compute
  values on the left side affect the values on the right side

[Figure: decrement and increment]

Slide9

Contribution

Propose an efficient algorithm for the Left I/D problem with any nonnegative integer costs

Left I/D problem
input: ED table D of strings A and B
output: ED table D’ of strings A and B’
  B = aB’ (decrement)
  B’ = aB (increment)
costs of operations are nonnegative integers

Slide10

Applications

Cyclic String Comparison [Landau et al. 1998]
Computing Approximate Periods [Schmidt 1998]
Edit distance for a sliding window
String kernel based on edit distance
  a kernel is a mapping to a high-dimensional feature space
  used in Support Vector Machines (classifiers)

Slide11

Slide12

Related Work

naïve method
  compute D from scratch
  O(nm) time
Kim & Park algorithm [2004]
  each operation has cost 1
  compute the difference representation DR of table D
  using the change table Ch
  O(n+m) time

Slide13

Definition

Left Increment/Decrement Problem
input: DR table of strings A and B
output: DR’ table of strings A and B’
  B = aB’ (decrement)
  B’ = aB (increment)
Each cost (Ins., Del., Sub.) is a nonnegative integer
Kim & Park algorithm: each cost is 1

Slide14

Difference Representation

DR[i, j].U = D[i, j] – D[i–1, j] (under minus upper)
DR[i, j].L = D[i, j] – D[i, j–1] (right minus left)

Slide15

[Figure: tables DR and DR’]

We need not update all cells

Slide16

Change Table

Ch[i, j] = D’[i, j] – D[i, j]
cost = 1
  values in Ch: –1, 0, 1
  Ch is separated into three areas

Slide17

Affected Entries

entries where DR’[i, j] ≠ DR[i, j]
they must be updated
affected entries lie along the borders of the three areas in Ch

Slide18

Sketch of Kim & Park Algorithm

Update affected entries
  scan the borders in Ch, computing Ch and DR’
Time complexity: O(n+m)

Slide19

Slide20

General Costs

Ch can be separated into more than three areas
  the number of areas depends on the costs
  the values are not limited to –1, 0, 1
Kim & Park algorithm is specialized to the three-area case
  it cannot be applied with general costs

Example: Ins. = 2, Del. = 2, Sub. = 1

Slide21
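The effect of general costs on the change table Ch can be checked with a small Python sketch. It rests on one assumption about alignment that the slides do not spell out: for a left decrement B = aB’, entry Ch[i][j] compares D’[i][j] with the entry of D that ends at the same character of B, that is D[i][j+1]. All function names are hypothetical.

```python
def build_table(a, b, ins, dele, sub):
    """Plain O(|a||b|) edit distance table; rows follow a, columns follow b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = D[i - 1][0] + dele(a[i - 1])
    for j in range(1, len(b) + 1):
        D[0][j] = D[0][j - 1] + ins(b[j - 1])
        for i in range(1, len(a) + 1):
            D[i][j] = min(D[i - 1][j] + dele(a[i - 1]),
                          D[i][j - 1] + ins(b[j - 1]),
                          D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]))
    return D


def change_table(a, b, ins, dele, sub):
    """Ch for the left decrement b -> b[1:], aligned on the same character of b
    (an assumed indexing convention)."""
    D = build_table(a, b, ins, dele, sub)
    D2 = build_table(a, b[1:], ins, dele, sub)
    return [[D2[i][j] - D[i][j + 1] for j in range(len(b))]
            for i in range(len(a) + 1)]


def values(ch):
    """Set of distinct values that occur in a change table."""
    return {v for row in ch for v in row}


unit = (lambda c: 1, lambda c: 1, lambda p, q: 0 if p == q else 1)
general = (lambda c: 2, lambda c: 2, lambda p, q: 0 if p == q else 1)  # Ins. = Del. = 2, Sub. = 1

unit_values = values(change_table("prague", "passage", *unit))
general_values = values(change_table("prague", "passage", *general))
```

With unit costs every value lies in {–1, 0, 1}. With Ins. = Del. = 2 the top row alone already contains –2, so a scheme built around three areas no longer covers all cases.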

Our Algorithm

Update only affected entries
  without Ch
  compute only DR’.U and DR’.L
Time complexity: O(min{c(n+m), nm})
  c is the maximum cost

[Figure: DR’.U – DR.U, DR’.L – DR.L, D’ – D]

Slide22

Affected Entry

DR’[i, j] ≠ DR[i, j]

Kim & Park Algorithm
  computes DR’ and Ch to find the affected entries

Our Algorithm
  computes the affected entries using only the DR table
  uses the following lemma:
  DR’[i, j] is an affected entry ⇔ DR’[i–1, j].L ≠ DR[i–1, j].L or DR’[i, j–1].U ≠ DR[i, j–1].U

Slide23

comparison of pseudo codes

Our Algorithm

for i = 1 to m do
  prevΔ[i] = i + 1; DR[i, 1].U = δ(a_i, ε);
i = 1; j = 1; DR[0, j].L = δ(ε, b_j); currIdx = 1; prevIdx = 1;
while i ≤ m and j ≤ n do
  while i ≤ m do
    x = DR[i-1, j].L; y = DR[i, j-1].U;
    z = min{x + δ(a_i, ε), y + δ(ε, b_j), δ(a_i, b_j)};
    new.L = z - y; new.U = z - x;
    old.L = DR[i, j].L; old.U = DR[i, j].U;
    DR[i, j].L = new.L; DR[i, j].U = new.U;
    if old.U ≠ new.U then
      currΔ[currIdx] = i; currIdx = currIdx + 1;
    i = i + 1;
    if old.L = new.L then
      now = i;
      repeat
        i = prevΔ[prevIdx]; prevIdx = prevIdx + 1;
      until i ≥ now
  currΔ[currIdx] = m + 1;
  Interchange the roles of the tables currΔ and prevΔ;
  currIdx = 1; i = prevΔ[1]; prevIdx = 2; j = j + 1;
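The idea behind the pseudocode can be tried out with the following Python sketch of a left decrement. It is an illustration under stated assumptions, not the authors' implementation: the DR table is stored as a list of columns, each a dict with lists `U` and `L` (a hypothetical layout), the function names are invented, and changed rows in the first column are detected by comparison instead of treating every row as changed. Removing the first column from a Python list costs O(n), which a real implementation would avoid.

```python
def build_dr(a, b, ins, dele, sub):
    """Build DR from scratch: one column per prefix of b, rows follow a."""
    m = len(a)
    # Boundary column: U[i] = D[i,0] - D[i-1,0] = deletion cost of a_i; L is undefined.
    cols = [{"U": [None] + [dele(ch) for ch in a], "L": [None] * (m + 1)}]
    for bj in b:
        left_u = cols[-1]["U"]
        U, L = [None] * (m + 1), [ins(bj)] + [None] * m
        for i in range(1, m + 1):
            x, y = L[i - 1], left_u[i]
            z = min(x + dele(a[i - 1]), y + ins(bj), sub(a[i - 1], bj))
            L[i], U[i] = z - y, z - x
        cols.append({"U": U, "L": L})
    return cols


def left_decrement(cols, a, b, ins, dele, sub):
    """Turn DR of (a, b) into DR of (a, b[1:]) in place; assumes len(b) >= 1."""
    m = len(a)
    cols.pop(0)                 # old column 1 becomes the new boundary column
    first = cols[0]
    changed = []                # rows whose U changed in the previous column
    for i in range(1, m + 1):
        if first["U"][i] != dele(a[i - 1]):
            first["U"][i] = dele(a[i - 1])
            changed.append(i)
    first["L"] = [None] * (m + 1)
    for j in range(1, len(cols)):
        if not changed:
            break               # no change can propagate further right
        bj, U, L, left_u = b[j], cols[j]["U"], cols[j]["L"], cols[j - 1]["U"]
        now_changed, k, i = [], 0, changed[0]
        while i <= m:
            x, y = L[i - 1], left_u[i]
            z = min(x + dele(a[i - 1]), y + ins(bj), sub(a[i - 1], bj))
            l_changed = (z - y) != L[i]
            if (z - x) != U[i]:
                now_changed.append(i)
            L[i], U[i] = z - y, z - x
            if l_changed:
                i += 1          # the cell below sees a changed upper neighbour
            else:
                while k < len(changed) and changed[k] <= i:
                    k += 1
                if k == len(changed):
                    break
                i = changed[k]  # next row whose left neighbour changed
        changed = now_changed
    return cols


def distance(cols):
    """D[m, n]: sum of L along row 0 plus sum of U down the last column."""
    return sum(c["L"][0] for c in cols[1:]) + sum(cols[-1]["U"][1:])


unit = (lambda c: 1, lambda c: 1, lambda p, q: 0 if p == q else 1)
cols = build_dr("prague", "passage", *unit)
left_decrement(cols, "prague", "passage", *unit)   # now describes prague vs. assage
```

Rows that are never visited keep their old values. That is safe because an entry whose upper L and left U are both unchanged is recomputed from identical inputs.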

Kim & Park Algorithm

Let k be the smallest index in A such that A[k] = B[1]
i_{-1} = 0; j_{-1} = 1; i_1 = k; j_1 = 0; f(0) = 0; g(0) = k;
finished_{-1} = false; finished_1 = false;
while (finished_{-1} = false) or (finished_1 = false) do
  if i_{-1} < i_1 1 then /* case1 */
    if j_{-1} > j_1 + 1 then
      if j_{-1} > j_1 + 1 then X = -1;
      else X = 0;
      Y = 0;
    else
      if f(i_{-1}) < j_{-1} then X = -1;
      else if g(j_1) i_{-1} then X = 1;
      else X = 0;
      if g(j_1) i_{-1} + 1 then Y = 1;
      else Y = 0;
    Z = -1;
    Ch[i_{-1}+1, j_{-1}] = min{ -DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1}, -DR[i_{-1}+1, j_{-1}+1].U + Z + 1, -DR[i_{-1}+1, j_{-1}+1].L + Y + 1 };
    DR’[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U Ch[i_{-1}+1, j_{-1}] + Z;
    DR’[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L Ch[i_{-1}+1, j_{-1}] + Y;
    if Ch[i_{-1}+1, j_{-1}] = -1 then i_{-1} = i_{-1} + 1; f(i_{-1}) = j_{-1};
    else j_{-1} = j_{-1} + 1;
  else if j_1 < j_{-1} - 1 then /* case2 */
    if i_1 > i_{-1} + 1 then
      if g(j_1) < i_1 then X = 1;
      else X = 0;
      Y = 0;
    else
      if g(j_1) < i_1 then X = 1;
      else if f(i_{-1}) j_1 then X = 0;
      else X = 0;
      if f(i_{1-1}) j_1 + 1 then Y = -1;
      else Y = 0;
    Z = 1;
    Ch[i_1, j_1+1] = min{ -DR[i_1, j_1+2].UL + X + δ_{i_1, j_1+2}, -DR[i_1, j_1+2].U + Y + 1, -DR[i_1, j_1+2].L + Z + 1 };
    DR’[i_1, j_1+1].U = DR[i_1, j_{-1}+2].U Ch[i_1, j_1+1] + Y;
    DR’[i_1, j_1+1].L = DR[i_1, j_{-1}+2].L Ch[i_1, j_1+1] + Z;
    if Ch[i_1, j_1+1] = 1 then j_1 = j_1 + 1; g(j_1) = i_1;
    else i_1 = i_1 + 1;
  else /* case3 */
    if f(i_{-1}) < j_{-1} then X = -1;
    else if g(j_1) i_{-1} then X = 1;
    else X = 0;
    Y = -1; Z = 1;
    Ch[i_{-1}+1, j_{-1}] = min{ -DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1}, -DR[i_{-1}+1, j_{-1}+1].U + Y + 1, -DR[i_{-1}+1, j_{-1}+1].L + Z + 1 };
    DR’[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U Ch[i_{-1}+1, j_{-1}] + Y;
    DR’[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L Ch[i_{-1}+1, j_{-1}] + Z;
    if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
    else if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
    else j_{-1} = j_{-1} + 1; i_1 = i_1 + 1;
  if (i_{-1} = m) or (j_{-1} = n) then
    finished_{-1} = true;
  if (i_1 = m+1) or (j_1 = n-1) then
    finished_1 = true;

Slide24

comparison of behaviors

[Figure: behavior of our algorithm and of the Kim & Park algorithm]

Slide25

Slide26

Experiments

strings A[1:m] and B[1:m]
Total time of computing representations of edit distance between A and B[j:m] for j = m, m–1, …, 1
  left incremental computation

Machine Specifications
  CentOS Linux
  Xeon 3.0 GHz
  16 GB memory

Slide27

Experiment 1

Time comparison with naïve algorithm
costs chosen randomly
  Insertion = 137, Deletion = 116, Substitution = 242
Random data
  alphabet size 2, 3, …, 52
  string length 100, 200, …, 5000

Slide28

Result 1

Slide29

Result 1

Slide30

Experiment 2

Time comparison with Kim & Park algorithm
costs
  Insertion = Deletion = Substitution = 1
Random data
  alphabet size 2, 3, …, 52
  string length 100, 200, …, 5000

Slide31

Result 2

Slide32

Result 2

Slide33

Experiment 3

Time comparison with naïve algorithm
Corpus
  English (reuters news)
    costs: Insertion = 137, Deletion = 116, Substitution = 242
    string length: 1000, 2000, 3000, 4000, 5000
  Protein data (canterbury corpus: E.coli)
    costs proposed in [Kurtz 1996]
    string length: 1000, 2000, 3000, 4000, 5000

δ   ε   A   C   G   T
ε   0   3   3   3   3
A   3   0   2   1   3
C   3   2   0   2   1
G   3   1   2   0   2
T   3   2   1   2   0

Slide34

Result 3

English News
length   Time (seconds)
         Our algorithm   Naïve algorithm
1000     0.04            1.50
2000     0.27            12.0
3000     0.71            40.4
4000     1.36            97.1
5000     2.29            189

Protein Data
length   Time (seconds)
         Our algorithm   Naïve algorithm
1000     0.01            1.43
2000     0.09            11.5
3000     0.23            38.8
4000     0.43            92.8
5000     0.70            181

Slide35

Summary

Algorithm for Left I/D problem
  nonnegative integer costs
  O(min{c(n+m), nm})
    c is the maximum cost
  experimentally fast

                    Our Algorithm          Naïve Algorithm   Kim & Park Algorithm
Costs               Nonnegative integer    Real number       1
Tables to compute   DR                     D                 DR and Ch
Time Complexity     O(min{c(n+m), nm})     O(nm)             O(n+m)
Source code         Simple                 Simple            Cumbersome
Speed               Fast                   Very Slow         Slower

Slide36
Slide37
Slide38

Related Work

naïve method
  compute D from scratch
  O(nm) time
Kim & Park algorithm [2004]
  each operation has cost 1
  compute the difference representation DR → DR’
  using the change table Ch
  O(n+m) time

[Diagram: D, D’, "DR, Ch", "DR’, Ch", and Edit Distance, with cost labels O(nm), O(nm), O(1), O(n+m), O(n+m) for the naïve and Kim & Park methods]