
Dong Deng, Guoliang Li, - PowerPoint Presentation

Uploaded On 2018-02-19

Presentation Transcript

Slide1

A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Dong Deng, Guoliang Li, Jianhua Feng
Database Group, Tsinghua University
Presented by Dong Deng

Slide2

Search is Important

[Figure: Google Searches per Year]

Source: http://www.internetlivestats.com/google-search-statistics/

Slide3

Speed Matters

Source:

Slide4

Data is Dirty

Typos (DBLP Complete Search)

Typo in “title”: relaxed / related

Argyrios Zymnis / Argyris Zymnis

Slide5

Similarity Search

Query + String Dataset → All the strings similar to the query

Slide6

Edit Distance

ED(r, s): The minimum number of edit operations (insertion/deletion/substitution) needed to transform r to s.

For example: ED(sigcom, sigmod) = 2
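The definition of ED(r, s) can be sketched with the standard dynamic program over two rolling rows. This is a minimal illustration, not code from the talk; the function name `edit_distance` is an assumption.

```python
def edit_distance(r, s):
    """Minimum number of insertions, deletions and substitutions turning r into s."""
    # prev[j] holds the distance between the first i-1 characters of r
    # and the first j characters of s.
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]  # turning r[:i] into the empty string takes i deletions
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1,                # delete cr
                           cur[j - 1] + 1,             # insert cs
                           prev[j - 1] + (cr != cs)))  # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("sigcom", "sigmod"))
```

The call reproduces the slide's example, where two substitutions are needed.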

sigcom → sigmom (substitute c with m)
sigmom → sigmod (substitute m with d)

Slide7

Problem Definition

Query string s = “yotubecom” and τ = 2

String dataset R

ED(s, r4) <= 2: output r4 as a result

Slide8

Application

Spell Checking
Copy Detection
Entity Linking
Bioinformatics
…

Slide9

Challenge

Naïve Method

Time complexity: […] for each query
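As a rough sketch of the naïve method, assume it means verifying the query against every string in the dataset. The helper names and the three-string dataset below are hypothetical; the early exit relies on the fact that row minima of the edit-distance table never decrease.

```python
def within_threshold(r, s, tau):
    """Return True if ED(r, s) <= tau, stopping early when that is impossible."""
    if abs(len(r) - len(s)) > tau:      # length difference is a lower bound on ED
        return False
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        if min(cur) > tau:              # every later row can only be larger
            return False
        prev = cur
    return prev[-1] <= tau


def naive_search(dataset, query, tau):
    """Scan the whole dataset; no index, no filter."""
    return [r for r in dataset if within_threshold(r, query, tau)]


# Hypothetical dataset, used only for illustration.
print(naive_search(["youtubecom", "facebookcom", "ubuntucom"], "yotubecom", 2))
```

Every query touches every data string, which is the cost the filter-and-verification framework is meant to avoid.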

 Slide10

Filter-and-Verification Framework

Inputs: Dataset R (indexed), Threshold τ, Query string s

Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r goes on to verification.

Verify: ED(r, s) ≤ τ? If yes, r is added to the Results.

Slide11

Preliminary: q-gram

q-gram: a substring of length q

Example: the 2-grams of “youtbecom” are yo, ou, ut, tb, be, ec, co, om

Slide12

Preliminary: q-gram

One edit operation destroys at most q grams.
τ edit operations destroy at most qτ grams.
If r and s have more than qτ mismatched grams, ED(r, s) > τ.
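The bound above can be turned into a simple check. The sketch below assumes "mismatched grams" means q-gram occurrences of r that have no counterpart in s, counted as a multiset difference; the function names are illustrative.

```python
from collections import Counter


def qgrams(s, q):
    """All substrings of length q, in order of start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def mismatched_grams(r, s, q):
    """Number of q-gram occurrences of r that are not matched by a q-gram of s."""
    return sum((Counter(qgrams(r, q)) - Counter(qgrams(s, q))).values())


def count_filter_prunes(r, s, q, tau):
    """True if the pair can be discarded: more than q*tau grams of r are missing in s."""
    return mismatched_grams(r, s, q) > q * tau


print(mismatched_grams("youtbecom", "youtecom", 2))        # grams tb and be are lost
print(count_filter_prunes("youtbecom", "youtecom", 2, 1))
print(count_filter_prunes("youtbecom", "facebook", 2, 1))
```

With q = 2, deleting the single character b destroys two grams, which matches the "at most q grams per edit" bound.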

Example: the 2-grams of “youtecom” are yo, ou, ut, te, ec, co, om

Slide13

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Pre(·) is the prefix of q(·), with |Pre(·)| = qτ+1

Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
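A minimal sketch of the prefix filter follows. It assumes the global ordering is ascending gram frequency over a dataset, with ties broken alphabetically, as a stand-in for idf; `make_rank`, `prefix` and the two-string dataset are hypothetical names and data. Strings with fewer than qτ+1 grams have no full prefix, so the sketch never prunes them.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def make_rank(dataset, q):
    """Global order: rarer grams first, ties broken alphabetically."""
    freq = Counter(g for r in dataset for g in qgrams(r, q))
    return lambda g: (freq[g], g)   # grams unseen in the dataset get frequency 0


def prefix(s, q, tau, rank):
    """First q*tau+1 grams of s under the global order, or None if s is too short."""
    grams = sorted(qgrams(s, q), key=rank)
    k = q * tau + 1
    return set(grams[:k]) if len(grams) >= k else None


def prefix_filter_prunes(r, s, q, tau, rank):
    pre_r, pre_s = prefix(r, q, tau, rank), prefix(s, q, tau, rank)
    if pre_r is None or pre_s is None:
        return False                # cannot prune without a full prefix
    return pre_r.isdisjoint(pre_s)


dataset = ["youtubecom", "facebookcom"]   # illustrative data only
rank = make_rank(dataset, 2)
for r in dataset:
    print(r, prefix_filter_prunes(r, "yotubecom", 2, 1, rank))
```

Only strings whose prefixes share a gram with the query's prefix survive as candidates, so in an index only prefix grams need inverted lists.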

[Figure labels: Pre(r), Pre(s), suffix(r)]

Slide14

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Pre(·) is the prefix of q(·), with |Pre(·)| = qτ+1

Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure: example gram sets drawn from g1 to g13, showing Pre(r), Pre(s) and suffix(r); the suffix grams are marked “>g10”]

Slide15

Preliminary: disjoint q-gram

One edit operation destroys at most 1 disjoint gram.
τ edit operations destroy at most τ disjoint grams.
If r and s have more than τ mismatched disjoint grams, ED(r, s) > τ.

[Figure: disjoint q-grams of the example string “youtecom”]

Slide16

Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Piv(·) is the pivotal prefix of q(·): |Piv(·)| = τ+1, and the q-grams in Piv(·) are disjoint

If Piv(s) ∩ Pre(r) = ϕ and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure labels: Pre(r), Pre(s), Piv(r), Piv(s), suffix(r)]
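The condition above can be sketched as follows. Two simplifications are assumptions of this sketch rather than the talk's method: the global order is plain alphabetical order (any fixed total order keeps the filter safe; idf only improves pruning), and the pivotal grams are picked greedily by start position instead of optimally. All function names are illustrative.

```python
def positional_qgrams(s, q):
    """(gram, start position) pairs of s."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def prefix_and_pivotal(s, q, tau, rank=lambda g: g):
    """Return (Pre(s), Piv(s)) as sets of grams, or None if s has fewer than q*tau+1 grams."""
    grams = sorted(positional_qgrams(s, q), key=lambda g: (rank(g[0]), g[1]))
    k = q * tau + 1
    if len(grams) < k:
        return None
    pre = grams[:k]
    # Greedy choice: scan the prefix grams left to right and keep a gram
    # whenever it does not overlap the previously kept one.
    piv, next_free = [], 0
    for gram, pos in sorted(pre, key=lambda g: g[1]):
        if pos >= next_free and len(piv) < tau + 1:
            piv.append(gram)
            next_free = pos + q
    return {g for g, _ in pre}, set(piv)


def pivotal_filter_prunes(r, s, q, tau):
    sig_r, sig_s = prefix_and_pivotal(r, q, tau), prefix_and_pivotal(s, q, tau)
    if sig_r is None or sig_s is None:
        return False                      # too short to prune safely
    pre_r, piv_r = sig_r
    pre_s, piv_s = sig_s
    return not (piv_s & pre_r) and not (piv_r & pre_s)


print(prefix_and_pivotal("youtubecom", 2, 1))
print(pivotal_filter_prunes("youtubecom", "yotubecom", 2, 1))
print(pivotal_filter_prunes("youtubecom", "facebooks", 2, 1))
```

The greedy scan always finds τ+1 disjoint grams among qτ+1 prefix grams, because each kept gram can overlap at most q-1 of the grams that start after it.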

Slide17

Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Piv(·) is the pivotal prefix of q(·): |Piv(·)| = τ+1, and the q-grams in Piv(·) are disjoint

Pivotal Prefix Filter: If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure: example gram sets drawn from g1 to g13, showing Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s) and suffix(r); the suffix grams are marked “>g10”]

Slide18

Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Piv(·) is the pivotal prefix of q(·): |Piv(·)| = τ+1, and the q-grams in Piv(·) are disjoint

Pivotal Prefix Filter: If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ

[Figure: example gram sets drawn from g1 to g13, showing Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s) and suffix(r); the suffix grams are marked “>g10”]

Slide19

Pivotal Prefix Filter

If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ
If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

Existence: There must exist τ+1 disjoint grams in the prefix
The pivotal prefix is a subset of the prefix
The pivotal prefix filter dominates the prefix filter
Signature sizes are O(τ) and O(qτ), respectively

Slide20

Related Work

Method                  |Sig(r)|   |Sig(s)|
Prefix Filter           O(qτ)      O(qτ)
Mismatch Filter         O(qτ)      O(qτ)
Qchunk Filter           O(τ)       O(l)
Pivotal Prefix Filter   O(τ)       O(qτ)

Mismatch Filter [Xiao VLDB08]: Shorten prefix length, but still O(qτ)
Qchunk Filter [Qin SIGMOD11]: Shorten one to O(τ) but increased the other one to O(l)
Adaptive Prefix [Wang SIGMOD12]: Increase prefix length to reduce candidate number. Orthogonal and can be integrated into our method
Flamingo [Li ICDE08]: Based on count filter. Accelerating counting process. Orthogonal and can be integrated into our method

Slide21

Pivotal Search Algorithm

Indexing
Build inverted indexes for both the prefix and the pivotal prefix of the data strings

Querying
Generate prefix and pivotal prefix for the query string
Probe the prefix index with the pivotal prefix of the query
Probe the pivotal prefix index with the prefix of the query
Verify the candidates and output results

Slide22

Pivotal Prefix Selection

Evaluating Different Pivotal Prefixes: The longer the inverted lists we probe, the more candidates we may have.

For query string: […]

For data string: […]

Slide23

Optimal Pivotal Prefix Selection

Objective: Select m = τ+1 optimal pivotal q-grams from the first n = qτ+1 grams in the prefix

Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first n-1 q-grams in the prefix
Select […] as the last pivotal q-gram

Slide24

Optimal Pivotal Prefix Selection

Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first n-2 q-grams
Select […] as the last pivotal q-gram

Slide25

Optimal Pivotal Prefix Selection

Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first m-1 q-grams
Select […] as the last pivotal q-gram

Recursive formula: […]

Slide26

Filter-and-Verification Framework

Inputs: Dataset R (indexed), Threshold τ, Query string s

Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r goes on to verification.

Verify: alignment filter? If yes, ED(r, s) ≤ τ? If yes, r is added to the Results.

Complexity Improvement: Improved from […] to […]

 Slide27

Alignment Filter

Intuition of Alignment Filter: suppose in the best case we need err_i edit operations to transform […] to a substring of r, then

If […]

 Slide28

Alignment Filter

Substring edit distance (sed): […] is the minimum edit distance between […] and any substring of r.

Alignment filter: If […]
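The substring edit distance defined on this slide can be computed with a small change to the usual edit-distance table: the first row is all zeros, so a match may start anywhere in r, and the answer is the minimum of the last row, so it may end anywhere. The sketch below is a generic implementation of that idea; the parameter name `g` for the pattern is an assumption, since the symbol is missing from the transcript.

```python
def substring_edit_distance(g, r):
    """Minimum edit distance between g and any substring of r."""
    prev = [0] * (len(r) + 1)       # the empty pattern matches at every position for free
    for i, cg in enumerate(g, 1):
        cur = [i]                   # matching g[:i] against the empty substring
        for j, cr in enumerate(r, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cg != cr)))
        prev = cur
    return min(prev)                # best end position in r


print(substring_edit_distance("ube", "youtubecom"))
print(substring_edit_distance("upe", "youtubecom"))
```

A value of 0 means the pattern occurs in r unchanged; larger values give a lower bound on the edits needed around that pattern.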

 Slide29

Alignment Filter

Accelerating Calculation: The computation complexity of sed([…], r) is O([…]). By the position filter, […] can only align to a substring x_i of r where |x_i| < […].

Thus if […], ED(r, s) […]

The complexity is reduced to […]

Complexity Improvement: Improved from […] to […]

 Slide30

Experiments

Settings:
C++, g++ 4.8.2 with -O3 flags
64bit Ubuntu Server 12.04 LTS version
Intel Xeon E5-2650 2.00GHz processor and 16GB memory.

Slide31

Evaluating Pivotal Prefix Filter

Average Search Time

Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: Pivotal Filter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Slide32

Evaluating Pivotal Prefix Filter

Candidate Number

Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: Pivotal Filter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Slide33

Evaluating Alignment Filter

Average Search Time

NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter

Slide34

Evaluating Alignment Filter

Candidate Number

NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Real: Number of results

Slide35

Comparison with State-of-the-arts

PivotalSearch: Our method
Adaptive: [Wang2012]
Flamingo: [Li2008]
Qchunk: [Qin 2011]

Slide36

Scalability

Slide37

Conclusion

Pivotal prefix filter
Pivotal search algorithm
Optimal pivotal prefix selection
Alignment filter

Slide38

Thank you

Q & A

Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html

Slide39

Outline

Problem Definition
Pivotal Prefix Filter
The Similarity Search Algorithm
Alignment Filter
Experiment
Conclusion

Slide40

Outline

Motivation and Problem Definition
Pivotal Prefix Filter
The Similarity Search Algorithm
Alignment Filter
Experiment
Conclusion

Slide41

Outline (same items as Slide 39)

Slide42

Outline (same items as Slide 39)

Slide43

Outline (same items as Slide 39)

Slide44

Complexity

Space Complexity: […]
Time Complexity: […]

Slide45

Pivotal Prefix Selection

Evaluating Different Pivotal Prefixes: The longer the inverted lists we scan, the larger the filtering cost is and the smaller the pruning power is.

For query string: […]

For data string: […]

Existence of Pivotal Prefix: There must exist at least τ+1 disjoint q-grams in the prefix Pre(r) for any string r

Slide46

Complexity

Space Complexity:
Prefix Inverted Index Size: […]
Pivotal Prefix Inverted Index Size: […]

Query Time Complexity:
Preprocess Query s: […]
Probing Inverted Indexes: […], where […] is the average length of the probed prefix inverted lists
Verification Complexity: […], where c is the number of candidates and l is the average string length

 Slide47


 Slide48

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Pre(·) is the prefix of q(·), with |Pre(·)| = qτ+1

Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure: example gram sets drawn from g1 to g13, showing Pre(r) and Pre(s); the grams beyond the prefix are marked “>g10”]

Slide49

Alignment Filter

Non-consecutive errors: youtubecom → yoytupecxm
q=3, the 3 non-consecutive errors destroy 8 q-grams

Consecutive errors: youtubecom → youtzpxcom
q=3, the 3 consecutive errors only destroy 5 q-grams
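The two counts on this slide can be checked with a few lines. The helper name `destroyed_qgrams` is illustrative; it counts the q-gram occurrences of the original string that no longer appear in the edited one.

```python
from collections import Counter


def destroyed_qgrams(original, edited, q):
    """Number of q-gram occurrences of `original` missing from `edited`."""
    before = Counter(original[i:i + q] for i in range(len(original) - q + 1))
    after = Counter(edited[i:i + q] for i in range(len(edited) - q + 1))
    return sum((before - after).values())


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # scattered substitutions
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # adjacent substitutions
```

Scattered errors each wipe out their own window of grams, while adjacent errors share one window, which is why the same number of edits destroys fewer grams in the second case.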

Slide50

Indexing

Fix a global gram order. We use gram frequency ascending order.

Global gram order

Gram:       im  my  te  bu  un  nt  uc  bb  tb  oy  yt  ca  om  yo  ou  ut  ub  co  tu  be  ec
Frequency:   1   1   1   1   1   1   1   1   1   1   1   2   2   3   3   3   3   3   3   3   4

Slide51

Indexing

Build inverted indexes for prefix and pivotal prefix

Global gram order

Gram:       im  my  te  bu  un  nt  uc  bb  tb  oy  yt  ca  om  yo  ou  ut  ub  co  tu  be  ec
Frequency:   1   1   1   1   1   1   1   1   1   1   1   2   2   3   3   3   3   3   3   3   4

[Figure label: Piv(r_i)]

Slide52

Indexing

Build inverted indexes for prefix and pivotal prefix

[Figure: Pivotal Prefix Index and Prefix Index, with Piv(r_i) marked]

Slide53

Querying

Generate prefix and pivotal prefix for the query string

Global gram order

Gram:       im  my  te  bu  un  nt  uc  bb  tb  oy  yt  ca  om  yo  ou  ut  ub  co  tu  be  ec
Frequency:   1   1   1   1   1   1   1   1   1   1   1   2   2   3   3   3   3   3   3   3   4

Slide54

Querying

Probe the prefix index with the pivotal prefix of the query
Probe the pivotal prefix index with the prefix of the query

Slide55

Querying

Verify the candidates and output results

Slide56

Related Work

EDJoin [Xiao VLDB08]: Shorten prefix length, but still O(qτ)
Qchunk [Qin SIGMOD11]: Shorten one to O(τ) but increased the other one to O(l)
Adaptive Prefix [Wang SIGMOD12]: Increase prefix length to reduce candidate number. Orthogonal and can be integrated into our method
Flamingo [Li ICDE08]: Based on count filter. Accelerating counting process. Orthogonal and can be integrated into our method

Slide57

Optimal Pivotal Prefix Selection

Dynamic Programming:

1. First sort all the q-grams in the prefix by their start positions and denote the k-th q-gram as g_k

2. Let f(m, n) denote the optimal sum of inverted list lengths to select n disjoint grams from the first m grams in the prefix.

Recursive formula: […]
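Since the recursive formula itself did not survive in this transcript, the following is only one plausible way to realise the two steps above, not necessarily the authors' recurrence. It assumes each prefix gram is given as a (start position, inverted list length) pair, that two grams are disjoint when their start positions differ by at least q, and that the goal is the smallest total list length. The name `select_pivotal` and the table layout are hypothetical.

```python
import math


def select_pivotal(prefix, q, tau):
    """prefix: list of (start_position, inverted_list_length) for the prefix grams.

    Returns (total_length, chosen_start_positions) for tau+1 pairwise disjoint
    grams with minimum total list length, or None if no such selection exists.
    """
    grams = sorted(prefix)                      # step 1: order by start position
    m, need = len(grams), tau + 1
    # f[i][j]: best total length when choosing j disjoint grams among the first i
    f = [[math.inf] * (need + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        f[i][0] = 0
    take = [[False] * (need + 1) for _ in range(m + 1)]
    before = [0] * (m + 1)                      # grams that end before gram i starts
    for i in range(1, m + 1):
        pos, cost = grams[i - 1]
        p = i - 1
        while p > 0 and grams[p - 1][0] + q > pos:
            p -= 1
        before[i] = p
        for j in range(1, need + 1):
            f[i][j] = f[i - 1][j]               # option 1: skip gram i
            if f[p][j - 1] + cost < f[i][j]:    # option 2: gram i is the last pivotal gram
                f[i][j] = f[p][j - 1] + cost
                take[i][j] = True
    if f[m][need] == math.inf:
        return None
    chosen, i, j = [], m, need                  # walk back through the decisions
    while j > 0:
        if take[i][j]:
            chosen.append(grams[i - 1][0])
            i, j = before[i], j - 1
        else:
            i -= 1
    return f[m][need], chosen[::-1]


# q = 2, tau = 1: pick 2 disjoint grams out of q*tau+1 = 3 prefix grams.
print(select_pivotal([(0, 5), (1, 1), (3, 2)], q=2, tau=1))
```

In the example the grams at positions 1 and 3 are chosen because they do not overlap and their lists are the shortest, so probing them is expected to produce the fewest candidates.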