Efficient Approximation - PowerPoint Presentation

Uploaded On 2015-11-06

Efficient Approximation of Edit Distance. Robert Krauthgamer, Weizmann Institute of Science. SPIRE 2013.


Presentation Transcript

Slide1

Efficient Approximation of Edit Distance

Robert Krauthgamer, Weizmann Institute of Science

SPIRE 2013

Slide2

Edit Distance (Levenshtein distance)

Given two strings x ∈ Σ^n, y ∈ Σ^m:

ed(x,y) = minimum number of character operations (insertion/deletion/substitution) that transform x to y.

Applications:

Computational Biology

Text processing

Web search

Examples:

ed(banana, ananas) = 2

ed(00000, 1111) = 5

For simplicity: m = n.

Slide3

Dynamic Programming Algorithm

Compute ed(x,y) for input x,y ∈ Σ^n

O(n^2) time by dynamic programming [WF’74]

O(n^2/log^2 n) time when |Σ| ≤ O(1) [MP’80]

[Figure: dynamic-programming table D for x = banana, y = ananas]

D(i,j) = min of:

D(i-1, j-1), if x[i]=y[j]

D(i-1, j-1) + 1, if x[i]≠y[j]

D(i, j-1) + 1

D(i-1, j) + 1

where D(i,j) = ed( x[1:i], y[1:j] )
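The quadratic-time recurrence can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `edit_distance` is an assumption, and the substitution case (cost 1 when x[i]≠y[j]) is written out explicitly because the definition of ed allows substitutions. Only two rows of the table are kept, so memory is O(m) while time stays O(nm).

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the standard O(n*m) dynamic program."""
    n, m = len(x), len(y)
    prev = list(range(m + 1))          # row D(0, j) = j (j insertions)
    for i in range(1, n + 1):
        curr = [i] + [0] * m           # D(i, 0) = i (i deletions)
        for j in range(1, m + 1):
            # Match costs 0, substitution costs 1.
            diagonal = prev[j - 1] + (x[i - 1] != y[j - 1])
            curr[j] = min(diagonal, curr[j - 1] + 1, prev[j] + 1)
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(edit_distance("banana", "ananas"))
    print(edit_distance("00000", "1111"))
```

The two calls reproduce the examples from the definition slide.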

Faster algorithms?

Slide4

Focus of This Talk

Approximating edit distance

Multiplicatively: ed(x,y) ≤ output ≤ A·ed(x,y)

Decision version: ed(x,y) ≤ r or ed(x,y) > A·r

Different computational models

RAM, Sampling and query complexity, Sketching, (Streaming)

Interactions (is it surprising?), Techniques

Variants of the problem

Slide5

RAM Model: Sampling

Idea 1: quickly estimate ed(x,y) by sampling a few positions

Intuition: If ed(x,y) is small, then “many” large blocks should “match”

“Test” this by reading few (randomly chosen) blocks

Apply this idea recursively (inside blocks)

Theorem [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]: Factor n^c “weak” approximation in sublinear time.

Obstacles:

“Block match” means both “similar pattern” and “similar location”

Argue that if and only if ed(x,y) is small then …

Can only distinguish ed(x,y) ≤ n/(8A) from ed(x,y) > n/8.

Best approximation in (near) linear time?

Slide6

Learn from Past Success

Suppose x,y are permutations

Every symbol of Σ appears exactly once

Consider transpositions = block moves (“block edit distance”)

No insert/delete (unreasonable), no substitution (not needed)

Example: bed(0123456789, 0457689123) = 2

An easy estimate (based on breakpoints)

Compute S_x = {all length 2 substrings of x} = {x[i:i+1] | i=1,…,n-1}

Lemma: bed(x,y) ≤ ½ |S_x Δ S_y| ≤ 3 bed(x,y)

Proof idea: Fix x (wlog the identity) and let y be the other permutation

Each block move “creates” at most 3 new breakpoints

Break y at breakpoints, and move (rearrange) the blocks to get x

Can compute |S_x Δ S_y| in linear time!!

Best approximation known in poly-time is 1.375 [Elias-Hartman’06]
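The breakpoint estimate is simple enough to sketch directly. The helper names `adjacent_pairs` and `breakpoint_estimate` below are illustrative assumptions, not names from the talk. Each permutation is turned into its set of ordered adjacent pairs, and the estimate is half the size of the symmetric difference, which takes expected linear time with hash sets.

```python
def adjacent_pairs(x):
    """S_x: the set of all length-2 substrings of x, as ordered pairs."""
    return {(x[i], x[i + 1]) for i in range(len(x) - 1)}


def breakpoint_estimate(x, y):
    """Return (1/2) * |S_x symmetric-difference S_y|."""
    return len(adjacent_pairs(x) ^ adjacent_pairs(y)) / 2


if __name__ == "__main__":
    x, y = "0123456789", "0457689123"
    print(breakpoint_estimate(x, y))
```

On the example pair from the slide, where bed = 2, the estimate is 5, which indeed lies between bed and 3·bed.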

[Figure: string split into blocks A B C D]

Open: better approximation in linear-time?

Slide7

Reduction to Hamming Distance

|S_x Δ S_y| = Hamming distance between their characteristic vectors

In fact, each vector has |Σ|^2 = n^2 coordinates, but only n-1 are non-zero

We thus obtain f : Permutations → {0,1}^(n^2) such that

∀x,y, bed(x,y) ≤ ½ ||f(x)-f(y)||_1 ≤ 3 bed(x,y).

Such a reduction from one metric space (BED on permutations) to another (L1) is called an embedding. This one has distortion D=3.

Known lower bound: distortion into L1 must be ≥ 4/3 [Polak-K.’12]
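To make the vector view concrete, here is a small sketch that builds the characteristic vector explicitly. It assumes the symbols are the integers 0,…,n-1, and the names `embed` and `l1_distance` are hypothetical. Coordinate a·n+b is 1 exactly when symbol a is immediately followed by symbol b.

```python
def embed(perm):
    """Map a permutation of 0..n-1 to a 0/1 vector with n*n coordinates."""
    n = len(perm)
    vec = [0] * (n * n)
    for a, b in zip(perm, perm[1:]):
        vec[a * n + b] = 1             # (a, b) is an adjacent pair in perm
    return vec


def l1_distance(u, v):
    """L1 distance, which equals Hamming distance on 0/1 vectors."""
    return sum(abs(p - q) for p, q in zip(u, v))


if __name__ == "__main__":
    x = list(range(10))
    y = [0, 4, 5, 7, 6, 8, 9, 1, 2, 3]
    print(sum(embed(x)), l1_distance(embed(x), embed(y)))
```

A practical implementation would store only the n-1 non-zero coordinates; the dense vector is used here purely to mirror the definition of f.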

More benefits of “good” embeddings?

A sweet spot of fruitful interaction between

Math/Geometry (“comparing” metric spaces using embeddings) and

CS/Algorithms (solving new problems by “reducing” to old ones)

Slide8

Sketching Model

Idea: “summarize” each string separately, then estimate ed(x,y) only from the short sketches s(x), s(y).

Possible at all??

YES for Hamming distance, and even L1/L2 [Indyk-Motwani’98, Kushilevitz-Ostrovsky-Rabani’00]

Approximation factor A=1+ε using sketch size O(ε^-2) bits

It’s essentially a “dimension reduction” [Johnson-Lindenstrauss’86]

Achieved by projection on (inner product with) random direction in space

Consequently, YES also for block edit distance on permutations:

Applies whenever there is an embedding into L1 !!

[Figure: BED on perm. → (embedding f, distortion D=3) → Hamming → (sketch s, approx. 1+ε) → O(ε^-2) bits sketch]
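The “inner product with a random direction” idea can be illustrated with a random-sign projection. This is a hedged sketch and not the bit-sized construction of the cited papers: each sketch entry here is an integer rather than a single bit, and the names `sketch`, `estimate_hamming`, `k` and `seed` are assumptions. Both parties derive the same random signs from a shared seed, sketch their vectors separately, and the mean squared difference of the sketches is an unbiased estimate of the Hamming distance between 0/1 vectors.

```python
import random


def sketch(vec, k, seed):
    """k inner products of vec with random +/-1 directions (shared seed)."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        signs = [rng.choice((-1, 1)) for _ in vec]
        out.append(sum(s * v for s, v in zip(signs, vec)))
    return out


def estimate_hamming(sx, sy):
    """Mean squared difference; its expectation is ||x - y||_2^2,
    which equals the Hamming distance when x, y are 0/1 vectors."""
    return sum((a - b) ** 2 for a, b in zip(sx, sy)) / len(sx)


if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 0, 0, 1]
    y = [0, 1, 0, 0, 1, 0, 0, 1]   # differs from x in one coordinate
    print(estimate_hamming(sketch(x, 64, seed=7), sketch(y, 64, seed=7)))
```

The variance of the estimate shrinks as k grows, which is the role played by the ε^-2 term in the sketch size.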

Slide9

Applications of Sketching

Input: large database M, with |M| strings of length n each.

Output: all pairwise distances or closest pair (BED on perm)

Naively: in time O(|M|^2 n)

Sketching [3+ε approx., decision version]: sketch each string, then estimate all pairs in time O(|M|n + |M|^2/ε^2)

Practical viewpoint: filtering, i.e., fast pruning of “bad” pairs

Works similarly for Nearest Neighbor Search (NNS): Reduce NNS for permutations under BED, to NNS for Hamming (L1)

Q1. More embeddings?

Q2. Sketching directly?

Q3. Lower bounds?

Slide10

Embedding ED on Permutations

Theorem [Charikar-K.’06]: Edit distance on permutations of length n embeds into L1 with distortion O(log n).

Proof. Define f(P) with one coordinate f_{a,b}(P) for every pair of symbols {a,b}.

Lemma 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ed(P,Q)

Suppose Q is obtained from P by moving one symbol, say ‘s’

General case then follows by applying triangle inequality on P,P’,P’’,…,Q

Total contribution of

coordinates with s ∈ {a,b} is 2 Σ_k (1/k) ≤ O(log n)

other coordinates is Σ_k k(1/k – 1/(k+1)) ≤ O(log n)

Intuition:

sign(f_{a,b}(P)) is an indicator for “a appears before b” in P

Thus, |f_{a,b}(P)-f_{a,b}(Q)| “measures” if {a,b} is an inversion in P vs. Q
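The defining formula of f did not survive in this transcript, so the following sketch rests on an explicit assumption: f_{a,b}(P) = 1/(P^-1[b] - P^-1[a]), the reciprocal of the signed gap between the positions of a and b. That choice is consistent with the stated intuition (the sign records whether a appears before b) and with the 1/k terms in Lemma 1, but it should be read as a plausible reconstruction rather than the talk's exact definition. The names `embed_permutation` and `l1_distance` are likewise hypothetical.

```python
from itertools import combinations


def embed_permutation(perm):
    """Assumed embedding: one coordinate per pair a < b, equal to
    1 / (position of b - position of a). Positive iff a precedes b."""
    pos = {symbol: index for index, symbol in enumerate(perm)}
    return {
        (a, b): 1.0 / (pos[b] - pos[a])
        for a, b in combinations(sorted(perm), 2)
    }


def l1_distance(fp, fq):
    """L1 distance between two embeddings over the same symbol set."""
    return sum(abs(fp[key] - fq[key]) for key in fp)


if __name__ == "__main__":
    p, q = [0, 1, 2], [1, 0, 2]      # q is p with one symbol moved
    print(l1_distance(embed_permutation(p), embed_permutation(q)))
```

In this tiny example ed(P,Q) = 2 (one deletion plus one insertion) and the L1 distance is 3, so the value sits above ½·ed(P,Q) as Lemma 2 on the next slide requires.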

Slide11

Embedding ED on Permutations (2)

Recall the embedding f, with coordinates f_{a,b}.

Lemma 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ed(P,Q)

Lemma 2: ||f(P)-f(Q)||_1 ≥ ½ ed(P,Q)

Assume wlog that P=identity

Edit Q into an increasing sequence (thus into P) using quicksort:

Choose a random pivot,

Delete all characters inverted wrt the pivot

Repeat recursively on left and right portions

Now argue ||f(P)-f(Q)||_1 ≥ E[ #quicksort deletions ] ≥ ½ ed(P,Q)
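The quicksort-style editing procedure used in this argument can be sketched as follows. The function name `quicksort_deletions` and the explicit `rng` parameter are assumptions made for illustration. A character is “inverted with respect to the pivot” when it lies left of the pivot but is larger, or right of the pivot but is smaller; such characters are deleted and the procedure recurses on both sides.

```python
import random


def quicksort_deletions(q, rng):
    """Return (surviving subsequence, number of deleted characters)."""
    if not q:
        return [], 0
    p = rng.randrange(len(q))          # position of the random pivot
    pivot = q[p]
    # Keep only characters that are NOT inverted with respect to the pivot.
    left = [c for c in q[:p] if c < pivot]
    right = [c for c in q[p + 1:] if c > pivot]
    deleted = (p - len(left)) + (len(q) - p - 1 - len(right))
    kept_left, del_left = quicksort_deletions(left, rng)
    kept_right, del_right = quicksort_deletions(right, rng)
    return kept_left + [pivot] + kept_right, deleted + del_left + del_right


if __name__ == "__main__":
    kept, deletions = quicksort_deletions([3, 0, 4, 1, 2], random.Random(0))
    print(kept, deletions)
```

Whatever the random choices are, the surviving subsequence is increasing, and re-inserting the deleted characters in their correct places turns it into the identity, which is where the bound ed(P,Q) ≤ 2·#deletions comes from.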

QED

Surviving subsequence is increasing, so ed(P,Q) ≤ 2·#deletions

For every inversion (a,b) in Q: Pr[a deleted “by” pivot b] ≤ 1/|Q^-1[a]-Q^-1[b]+1| ≤ 2 |f_{a,b}(P) – f_{a,b}(Q)|

Slide12

Embedding Edit Distance

Theorem [Ostrovsky-Rabani’05]: Edit distance on all strings (not only permutations) embeds into L1 with distortion 2^Õ(√log n).

Previously, distortion n^c was known [Bar-Yossef-Jayram-K.-Kumar’04, Batu-Ergun-Sahinalp’06]

Clever recursive method to match blocks much more accurately

Penalizes both pattern and location errors

Not very fast (quadratic time), but influenced later work on near-linear time algorithms [Andoni-Onak’09, Andoni-Onak-K.’10]

Immediate consequences:

NNS algorithms for edit distance

Sketching

Slide13

Lower Bounds

Theorem [Khot-Naor’05, K.-Rabani’06]: Embedding edit distance into L1 requires distortion Ω(log n)

Main technique: Fourier analysis [Kahn-Kalai-Linial’88]

L1 embedding ↔ sparsest-cuts ↔ Boolean functions f:{0,1}^n → {0,1}

Stronger assertion: O(1)-size sketches for edit distance require Ω̃(log n) approximation, even only for permutations [Andoni-K.’06]

Actually tradeoff between approximation and sketch-size

Techniques: communication complexity and Fourier analysis reduce the problem to sketches that are linear functions (of their input x)

Q2’. Sketching vs embedding?

Slide14

RAM Model: Asymmetric Sampling

Idea 1’: Read all of y, and sampled positions of x

Motivations:

Better chances to “obtain” information

Which y’s are easier/harder?

Sampling issues:

Focus on query complexity bounds (tight?)

Adaptive vs non-adaptive queries

Queries depend on y?

Use dynamic programming in time O(n^(1+ε))?

[Figure: strings x and y]

Slide15

Asymmetric Sampling Results [Andoni-Onak-K.’10]

Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A)

Complexity = #queries into x (unlimited access to y)

[Figure: trade-off between approximation A and #queries]

Approximation A: Θ(log n), Θ(log^2 n), Θ(log^3 n), Θ(log^t n)

#queries: n^(1-ε), n^(1/2-ε), n^(1/2), n^(1/3), n^(1/4), n^(1/t-ε), n^(1/(t+1))

Approximation A = (log n)^O(1/ε): # Queries O(n^ε), Ω(n^(ε/loglog n))

# Queries in [n^(1/(t+1)), n^(1/t-ε)]: Approximation O(log^t n), Ω(log^t n)

Slide16

Overview of Upper Bound

Theorem 1: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A) for A=(log n)^O(1/ε) approximation with n^ε queries into x (for any ε>0).

Proof structure:

1. Characterize edit distance by “tree-distance” T_xy

Parameter b≥2 (degree)

T_xy ≈ ed(x,y) up to 6b·log n factor

2. Prune the tree to subsample x

[Figure: tree of degree b over x_1, x_2, …, x_n, with sampled positions in x]

Slide17

Step 1: Tree Distance

Partition x into b blocks, recursively, for h=log_b n levels

[Figure: hierarchy x[1:n] → x[1:⅓n], x[⅓n:⅔n], x[⅔n:n] → … → x[1], x[2], x[3], with a block of x matched against y[u:u+⅓n] inside y[1:n]]

T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block-length at level i

Slide18

Tree Distance: Recursive Definition

Recall T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i]

Base case: T_h(s,u) = Hamming(x[s], y[u])

Output: T_xy = T_0(s=1, u=1)

[Figure: block x[s:s+ℓ_i] of x matched against y[u:u+ℓ_i] of y, with shift r]
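The recursive step itself is not legible in this transcript, so the sketch below assumes one: T_i(s,u) is the sum, over the b child blocks, of the cheapest child distance after shifting the child's match in y by r, plus a displacement cost |r|. The base case and the output follow the slide. Treat the recursive rule, the 0-based indexing, the handling of out-of-range positions in y (counted as a mismatch), and the name `tree_distance` as assumptions. Shifts are limited to |r| ≤ child length, which loses nothing because a child block of length L never costs more than L at shift 0.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """Assumed tree distance T_xy for strings of equal length n = b**h."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have equal length")
    h, size = 0, 1
    while size < n:
        size *= b
        h += 1
    if size != n:
        raise ValueError("length must be a power of b")

    @lru_cache(maxsize=None)
    def T(level: int, s: int, u: int) -> int:
        if level == h:
            # Base case: Hamming distance of single characters.
            return 0 if 0 <= u < n and x[s] == y[u] else 1
        child = (n // b ** level) // b      # block length one level down
        total = 0
        for j in range(b):
            total += min(
                T(level + 1, s + j * child, u + j * child + r) + abs(r)
                for r in range(-child, child + 1)
            )
        return total

    return T(0, 0, 0)


if __name__ == "__main__":
    print(tree_distance("ab", "ba"), tree_distance("abcd", "abcd"))
```

This direct evaluation is slow and is meant only to make the definition tangible; the talk's point is that the tree can be pruned and sampled instead of being evaluated in full.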

Slide19

Tree Approximates Edit Distance

Lemma: T_xy ≈ ed(x,y) up to 6b·log_b n factor.

Hierarchical decomposition inspired by earlier approaches [BEKMRRS’03, OR’05]

All had approximation recurrence of the type A(n) = c·A(n/b) + b for c≥2

Solves to A(n) ≥ 2^√log n factor for every choice of b

Our characterization has no multiplicative loss (c=1): A(n) = A(n/b) + b

Analysis inspired by algorithms for smoothed instances [Andoni-K.’08]

Slide20

Step 2: Compute the Tree Distance

For b=2, tree-distance gives O(log n) approximation!

BUT know only how to compute T-distance in Õ(n^2) time

Instead, for b=(log n)^(1/ε), can prune the tree to n^O(ε) nodes, and approximate T-distance within factor 1+ε

Pruning: subsample (log n)^O(1) children out of each node

Works only when ed(x,y) ≥ Ω(n)

Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma

[Figure: tree of degree b with sampled positions in x]

Slide21

Key tool: non-uniform sampling

Goal:

For unknown a_1, a_2, …, a_n ∈ [0,1]

Estimate their sum, up to an additive constant error

Using only “weak” estimates ã_1, ã_2, …, ã_n

Game between Sum Estimator and Adversary:

0. Sum Estimator: fix distribution U

1. Adversary: fix a_1, a_2, …, a_n (unknown)

2. Sum Estimator: pick “precisions” u_i (our algorithm: u_i ~ U[0,1] i.i.d.)

3. Adversary: provide ã_1, ã_2, …, ã_n s.t. |a_i - ã_i| < 1/u_i

4. Sum Estimator: report S̃ = S̃(ã_1,…,u_1,…) with |S̃ – Σ a_i| < 1.

Slide22

Precision Sampling

Goal: estimate Σ a_i from {ã_i} s.t. |a_i - ã_i| < 1/u_i.

Precision Sampling Lemma: Can achieve WHP additive error 1 and multiplicative error 1.5 with expected precision E_{u_i~U}[u_i] = O(log n).

Inspired by a technique from [IW’05] for streaming (F_k moments)

In fact, PSL gives simple & improved algorithms for F_k moments, cascaded (mixed) norms, ℓ_p-sampling problems [AKO’11]

Also distant relative of Priority Sampling [DLT’07]

Slide23

Precision Sampling for Edit Distance

Apply Precision Sampling to the tree from the characterization, recursively at each node

If a node has very weak precision, can trim the entire sub-tree

Slide24

Fast Approximation Algorithm

Theorem [Andoni-Onak-K.’10]: Can approximate ed(x,y) within factor (log n)^O(1/ε) using n^ε queries to x and in time n^(1+ε) (for any ε>0).

Exponential improvement over previous factor 2^Õ(√log n) [Andoni-Onak’09]

Asymmetric sampling approach, implemented faster by data structure tricks

Sampling is non-adaptive, independent of y

Slide25

Smoothed Instances

Smooth Instance (x,y) constructed by:

Start with arbitrary x*,y* ∈ {0,1}^n and their optimal alignment A*

Replace each position w/probability p by random bit, but respect A*

Theorem [Andoni-K.’08]: Can approximate ed(x,y) within constant factor, in smoothed runtime that is (whp) near-linear n^(1+ε).

Some extensions to sublinear time

Techniques:

Match blocks of length L=O((1/p)·log n) that have edit distance ≤ εL.

A known heuristic technique (e.g. PatternHunter)

To find block matches quickly, we use naive NNS algorithm

Because of smoothing, blocks are likely to be distinct (and even far), so modulo overlaps between blocks, we “effectively” have permutations

Open: Better time n·polylog(n)? Approximation independent of p?

Slide26

Variants of Edit Distance

Edit distance with block operations

Admits O(log n·log* n) approximation in near-linear time, via embedding into L1 [Muthukrishnan-Sahinalp’00, Cormode-Muthukrishnan’02]

Open: Distortion lower bounds? Better approximation in polytime?

Edit distance between trees (generalizes strings)

Basic operations: insert/delete/relabel vertex

Can be computed in O(n^3) time [Demaine-Mozes-Rossman-Weimann’07]

Open: Embedding?

Edit distance with “rich” alphabet

Can model shape matching [Klein-Tirthapura-Sharvit-Kimia’00]

Challenge: Cost of basic operation varies with symbols

Slide27

Conclusion

Having multiple computational models is fruitful

New ideas, techniques, viewpoints, applications can come full circle

Lower bounds, in certain models, highlight limitations of methods

Explore which instances are easy/hard

“Asymmetric algorithms” can work well for symmetric problems

Connections to other fields (sampling, embeddings, communication complexity, Fourier analysis) and computational problems (NNS)

Much progress so far, but still many gaps, and much more to go

Thank You!