Slide1
A Pivotal Prefix Based Filtering Algorithm for String Similarity Search
Dong Deng, Guoliang Li, Jianhua Feng
Database Group, Tsinghua University
Presented by Dong Deng
Slide2
Search is Important
Google Searches per Year
Source: http://www.internetlivestats.com/google-search-statistics/
Slide3
Speed Matters
Source:
Slide4
Data is Dirty
Typos
Typo in “title”: relaxed vs. related
Typo in name: Argyrios Zymnis vs. Argyris Zymnis
DBLP Complete Search
Slide5
Similarity Search
Input: a query and a string dataset
Output: all the strings similar to the query
Slide6
Edit Distance
ED(r, s): The minimum number of edit operations (insertion/deletion/substitution) needed to transform r into s.
For example: ED(sigcom, sigmod) = 2
sigcom → sigmom (substitute c with m)
sigmom → sigmod (substitute m with d)
Slide7
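The edit distance definition can be made concrete with the textbook dynamic program. This is a minimal sketch and not the authors' implementation; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(r: str, s: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning r into s."""
    # prev[j] holds the distance between the processed part of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        curr = [i]  # turning r[:i] into the empty string takes i deletions
        for j, cs in enumerate(s, start=1):
            cost = 0 if cr == cs else 1
            curr.append(min(prev[j] + 1,          # delete cr
                            curr[j - 1] + 1,      # insert cs
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("sigcom", "sigmod"))
```

The table costs O(|r|·|s|) time per pair, which is why the rest of the talk focuses on filters that avoid running it on most of the dataset.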
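Problem Definition
Given a string dataset R, a query string s, and a threshold τ, find all strings r in R with ED(s, r) ≤ τ.
Example: query string s = “yotubecom” and τ = 2, with string dataset R.
ED(s, r4) ≤ 2, so output r4 as a result.
Slide8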
Applications
Spell Checking
Copy Detection
Entity Linking
Bioinformatics
….
Slide9
Challenge
Naïve Method
Time complexity:
for each query
Slide10
Filter-and-Verification Framework
Inputs: Dataset R (indexed), Threshold τ, Query string s
Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r is passed on to verification.
Verify: ED(r, s) ≤ τ? If yes, r is added to the results.
Output: Results
Slide11
Preliminary: q-gram
q-gram: a substring with length q
Example: the 2-grams of youtbecom are yo, ou, ut, tb, be, ec, co, om
Slide12
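As a small illustration of the q-gram definition, the sketch below slides a window of length q over a string and records each gram with its start position. The helper name `qgrams` is an assumption made for this example.

```python
def qgrams(s: str, q: int) -> list[tuple[str, int]]:
    """Return every q-gram of s together with its start position."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


print([gram for gram, _ in qgrams("youtbecom", 2)])
```

A string of length l has l - q + 1 q-grams, so strings shorter than q have none.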
Preliminary: q-gram
1 edit operation destroys at most q grams.
τ edit operations destroy at most qτ grams.
If r and s have more than qτ mismatched grams, ED(r, s) > τ.
Example: the 2-grams of youtecom are yo, ou, ut, te, ec, co, om
Slide13
Preliminary: Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ + 1
Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, ED(r, s) > τ
(Figure: q(r) split into Pre(r) and suffix(r), shown next to Pre(s).)
Slide14
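The prefix filter can be sketched in a few lines. The version below is a simplified illustration and not the authors' code: it assumes the global order is ascending gram frequency with ties broken by gram text, takes the frequency table `freq` as a plain dictionary, and never prunes when a string has fewer than qτ + 1 grams.

```python
def prefix(s: str, q: int, tau: int, freq: dict) -> list[str]:
    """First q*tau + 1 q-grams of s under the global order (frequency, then text)."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    grams.sort(key=lambda g: (freq.get(g, 0), g))
    return grams[:q * tau + 1]


def prefix_filter_prunes(r: str, s: str, q: int, tau: int, freq: dict) -> bool:
    """True when the prefix filter shows ED(r, s) > tau, so r can be skipped."""
    pre_r, pre_s = prefix(r, q, tau, freq), prefix(s, q, tau, freq)
    if min(len(pre_r), len(pre_s)) < q * tau + 1:
        return False  # too few grams for the filter to apply: keep the pair
    return not (set(pre_r) & set(pre_s))


# Hypothetical strings; an empty frequency table makes the order alphabetical.
print(prefix_filter_prunes("youtubecom", "yotubecom", q=2, tau=2, freq={}))
print(prefix_filter_prunes("aaaaaaaaaa", "zzzzzzzzzz", q=2, tau=2, freq={}))
```

A pair that survives the filter is only a candidate; it still has to be verified with the edit distance.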
Preliminary: Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ + 1
Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, ED(r, s) > τ
(Figure: example q-gram sets over g1 … g13 showing Pre(r), Pre(s), and suffix(r); the suffix grams are marked > g10.)
Slide15
Preliminary: disjoint q-gram
One edit operation destroys at most 1 disjoint gram.
τ edit operations destroy at most τ disjoint grams.
If r and s have more than τ mismatched disjoint grams, ED(r, s) > τ.
Example (figure): youtecom with the disjoint 2-grams yo, ut, om
Slide16
Pivotal Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ + 1 and the q-grams in Piv(•) are disjoint
If Piv(s) ∩ Pre(r) = ϕ and Piv(r) ∩ Pre(s) = ϕ, ED(r, s) > τ
(Figure: Pre(r), Pre(s), Piv(r), Piv(s), and suffix(r).)
Slide17
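The two-sided condition "Piv(s) ∩ Pre(r) = ϕ and Piv(r) ∩ Pre(s) = ϕ" can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the global order is ascending frequency with ties broken by gram text and position, and the pivotal prefix is picked greedily from left to right, which is just one valid way to obtain τ + 1 disjoint grams.

```python
def prefix(s, q, tau, freq):
    """First q*tau + 1 positional q-grams (gram, start) of s under the global order."""
    grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
    grams.sort(key=lambda gp: (freq.get(gp[0], 0), gp[0], gp[1]))
    return grams[:q * tau + 1]


def pivotal_prefix(pre, q, tau):
    """Greedily pick tau + 1 non-overlapping grams from the prefix, left to right."""
    piv, next_free = [], 0
    for gram, start in sorted(pre, key=lambda gp: gp[1]):
        if start >= next_free and len(piv) < tau + 1:
            piv.append((gram, start))
            next_free = start + q  # the next pivotal gram must start after this one ends
    return piv


def pivotal_filter_prunes(r, s, q, tau, freq):
    """True when both cross intersections are empty, which implies ED(r, s) > tau."""
    pre_r, pre_s = prefix(r, q, tau, freq), prefix(s, q, tau, freq)
    if min(len(pre_r), len(pre_s)) < q * tau + 1:
        return False  # too few grams for the filter to apply: keep the pair
    text = lambda grams: {g for g, _ in grams}
    piv_r, piv_s = pivotal_prefix(pre_r, q, tau), pivotal_prefix(pre_s, q, tau)
    return not (text(piv_s) & text(pre_r)) and not (text(piv_r) & text(pre_s))


# Hypothetical strings; an empty frequency table makes the order alphabetical.
print(pivotal_filter_prunes("youtubecom", "yotubecom", q=2, tau=2, freq={}))
print(pivotal_filter_prunes("aaaaaaaaaa", "zzzzzzzzzz", q=2, tau=2, freq={}))
```

The greedy pass always finds τ + 1 grams in a full prefix, because each chosen gram can block at most q of the qτ + 1 distinct start positions.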
Pivotal Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ + 1 and the q-grams in Piv(•) are disjoint
Pivotal Prefix Filter: If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, ED(r, s) > τ
(Figure: example q-gram sets showing Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s), and suffix(r); the suffix grams are marked > g10.)
Slide18
Pivotal Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ + 1 and the q-grams in Piv(•) are disjoint
Pivotal Prefix Filter: If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, ED(r, s) > τ
(Figure: example q-gram sets showing Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s), and suffix(r); the suffix grams are marked > g10.)
Slide19
Pivotal Prefix Filter
If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, ED(r, s) > τ
If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, ED(r, s) > τ
Existence: There must exist τ + 1 disjoint grams in the prefix
The Pivotal Prefix is a subset of the Prefix
The pivotal prefix filter dominates the prefix filter
Signature sizes are O(τ) and O(qτ) respectively
Slide20
Related Work
Method                   |Sig(r)|   |Sig(s)|
Prefix Filter            O(qτ)      O(qτ)
Mismatch Filter          O(qτ)      O(qτ)
Qchunk Filter            O(τ)       O(l)
Pivotal Prefix Filter    O(τ)       O(qτ)
Mismatch Filter [Xiao VLDB08]: Shortens the prefix length, but it is still O(qτ)
Qchunk Filter [Qin SIGMOD11]: Shortens one signature to O(τ) but increases the other one to O(l)
Adaptive Prefix [Wang SIGMOD12]: Increases the prefix length to reduce the candidate number. Orthogonal and can be integrated into our method
Flamingo [Li ICDE08]: Based on the count filter; accelerates the counting process. Orthogonal and can be integrated into our method
Slide21
Pivotal Search Algorithm
Indexing
Build inverted indexes for both the prefix and the pivotal prefix of the data strings
Querying
Generate prefix and pivotal prefix for the query string
Probe the prefix index with the pivotal prefix of the query
Probe the pivotal prefix index with the prefix of the query
Verify the candidates and output results
Slide22
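The indexing and querying steps of the pivotal search algorithm can be sketched as one small class. Everything here is an illustrative assumption rather than the authors' implementation: the class name `PivotalIndex`, the greedy choice of pivotal grams, the frequency-based global order, and the fallback that sends strings with fewer than qτ + 1 grams straight to verification.

```python
from collections import Counter, defaultdict


def edit_distance(r, s):
    """Plain dynamic-programming edit distance, used for verification."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        curr = [i]
        for j, cs in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = curr
    return prev[-1]


class PivotalIndex:
    def __init__(self, strings, q, tau):
        self.strings, self.q, self.tau = list(strings), q, tau
        # Global gram order: ascending frequency over the dataset, ties by gram text.
        self.freq = Counter(r[i:i + q] for r in self.strings
                            for i in range(len(r) - q + 1))
        self.prefix_index = defaultdict(set)   # gram -> ids with gram in Pre(r)
        self.pivotal_index = defaultdict(set)  # gram -> ids with gram in Piv(r)
        self.unfiltered = set()                # ids too short to have a full prefix
        for rid, r in enumerate(self.strings):
            pre, piv = self.signatures(r)
            if pre is None:
                self.unfiltered.add(rid)
                continue
            for gram, _ in pre:
                self.prefix_index[gram].add(rid)
            for gram, _ in piv:
                self.pivotal_index[gram].add(rid)

    def signatures(self, s):
        """Return (prefix, pivotal prefix) as lists of (gram, start), or (None, None)."""
        q, tau = self.q, self.tau
        grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
        if len(grams) < q * tau + 1:
            return None, None
        grams.sort(key=lambda gp: (self.freq.get(gp[0], 0), gp[0], gp[1]))
        pre = grams[:q * tau + 1]
        piv, next_free = [], 0
        for gram, start in sorted(pre, key=lambda gp: gp[1]):
            if start >= next_free and len(piv) < tau + 1:
                piv.append((gram, start))
                next_free = start + q
        return pre, piv

    def search(self, s):
        pre, piv = self.signatures(s)
        if pre is None:
            candidates = set(range(len(self.strings)))  # query too short to filter
        else:
            candidates = set(self.unfiltered)
            for gram, _ in piv:   # pivotal prefix of the query against the prefix index
                candidates |= self.prefix_index.get(gram, set())
            for gram, _ in pre:   # prefix of the query against the pivotal prefix index
                candidates |= self.pivotal_index.get(gram, set())
        return sorted(self.strings[i] for i in candidates
                      if edit_distance(self.strings[i], s) <= self.tau)


# Hypothetical dataset, chosen only for illustration.
index = PivotalIndex(["youtubecom", "youtbecom", "ubuntucom", "facebookcom"], q=2, tau=2)
print(index.search("yotubecom"))
```

Taking the union of the two probes mirrors the two-sided filter: a data string is skipped only when both cross intersections are empty.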
Pivotal Prefix Selection
Evaluating Different Pivotal Prefixes: The longer the inverted lists we probe, the more candidates we may have.
For query string:
For data string:
Slide23
Optimal Pivotal Prefix Selection
Objective: Select m = τ + 1 optimal pivotal q-grams from the first n = qτ + 1 grams in the prefix
Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first n-1 q-grams in prefix
Select as last pivotal q-gram
Slide24
Optimal Pivotal Prefix Selection
Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first n-2 q-grams
Select as last pivotal q-gram
Slide25
Optimal Pivotal Prefix Selection
Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first m-1 q-grams
Select as last pivotal q-gram
Recursive formula:
Slide26
Filter-and-Verification Framework
Inputs: Dataset R (indexed), Threshold τ, Query string s
Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r is passed on to verification.
Verify: alignment filter? If yes, ED(r, s) ≤ τ? If yes, r is added to the results.
Output: Results
Complexity Improvement:
Improved from
to
Slide27
Alignment Filter
Intuition of Alignment Filter: suppose in the best case we need err_i edit operations to transform to a substring of r, then
If
Slide28
Alignment Filter
Substring edit distance (sed): the minimum edit distance between and any substring of r.
Alignment filter: If
Slide29
Alignment Filter
Accelerating Calculation: The computation complexity of sed(, r) is O(). By position filter, can only align to a substring xi of r where |xi| < .
Thus if , ED(r, s)
The complexity is reduced to
Complexity Improvement: Improved from to
Slide30
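The substring edit distance behind the alignment filter can be computed with a variant of the edit-distance table in which a match may start and end anywhere in r. The sketch below rests on assumptions the slide text does not spell out: the pieces being aligned are taken to be disjoint q-grams of the query (such as its pivotal q-grams), a pair is pruned when the summed substring edit distances exceed τ, and the position-based restriction to a short window of r is left out.

```python
def substring_edit_distance(g: str, r: str) -> int:
    """Minimum edit distance between g and any substring of r."""
    prev = [0] * (len(r) + 1)  # row 0 is all zeros: a match may start anywhere in r
    for i, cg in enumerate(g, start=1):
        curr = [i]
        for j, cr in enumerate(r, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cg != cr)))
        prev = curr
    return min(prev)           # a match may end anywhere in r


def alignment_filter_prunes(disjoint_grams, r: str, tau: int) -> bool:
    """True when the summed substring edit distances already exceed tau."""
    return sum(substring_edit_distance(g, r) for g in disjoint_grams) > tau


# Hypothetical example: three disjoint 3-grams of the query yoytupecxm.
print(alignment_filter_prunes(["yoy", "tup", "ecx"], "youtubecom", tau=2))
```

Because the grams do not overlap, each one has to be repaired by its own edit operations, so the sum is a lower bound on ED(r, s).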
Experiments
Settings:
C++, g++ 4.8.2 with -O3 flags
64bit Ubuntu Server 12.04 LTS version
Intel Xeon E5-2650 2.00GHz processor and 16GB memory.
Slide31
Evaluating Pivotal Prefix Filter
Average Search Time
Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: Pivotal Filter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection
Slide32
Evaluating Pivotal Prefix Filter
Candidate Number
Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: Pivotal Filter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection
Slide33
Evaluating Alignment Filter
Average Search Time
NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Slide34
Evaluating Alignment Filter
Candidate Number
NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Real: Number of results
Slide35
Comparison with State-of-the-arts
PivotalSearch: Our method
Adaptive: [Wang2012]
Flamingo: [Li2008]
Qchunk: [Qin 2011]
Slide36
Scalability
Slide37
Conclusion
Pivotal prefix filter
Pivotal search algorithm
Optimal pivotal prefix selection
Alignment filter
Slide38
Thank you
Q & A
Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html
Slide39
Outline
Problem Definition
Pivotal Prefix Filter
The Similarity Search Algorithm
Alignment Filter
Experiment
Conclusion
Slide44
Complexity
Space Complexity:
Time Complexity:
Slide45
Pivotal Prefix Selection
Evaluating Different Pivotal Prefixes: The longer the inverted lists we scan, the larger the filtering cost is and the smaller the pruning power is.
For query string:
For data string:
Existence of Pivotal Prefix: There must exist at least τ + 1 disjoint q-grams in the prefix pre(r) for any string r
Slide46
Complexity
Space Complexity:
Prefix Inverted Index Size:
Pivotal Prefix Inverted Index Size:
Query Time Complexity:
Preprocess Query s:
Probing Inverted Indexes: where is the average length of probed prefix inverted lists
Verification Complexity: where c is the number of candidates and l is average string length
Slide48
Preliminary: Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ + 1
Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, ED(r, s) > τ
(Figure: example q-gram sets over g1 … g13 showing Pre(r) and Pre(s); the remaining grams are marked > g10.)
Slide49
Alignment Filter
Non-consecutive errors: youtubecom vs. yoytupecxm. With q=3, the 3 non-consecutive errors destroy 8 q-grams
Consecutive errors: youtubecom vs. youtzpxcom. With q=3, the 3 consecutive errors only destroy 5 q-grams
Slide50
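The two counts on the consecutive versus non-consecutive errors slide can be checked mechanically. The sketch below only covers the substitution-only case shown there, where both strings have the same length, and the helper name `destroyed_qgrams` is an illustrative choice.

```python
def destroyed_qgrams(original: str, edited: str, q: int) -> int:
    """Count the q-grams of `original` that differ from the q-gram at the same
    position in `edited` (substitution-only edits, equal-length strings)."""
    if len(original) != len(edited):
        raise ValueError("this sketch only handles equal-length strings")
    return sum(original[i:i + q] != edited[i:i + q]
               for i in range(len(original) - q + 1))


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # spread-out errors
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # adjacent errors
```

Adjacent errors share the q-grams they damage, so a filter that only counts mismatched grams underestimates how far apart the two strings are; this is the gap the alignment filter targets.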
Indexing
Fix a global gram order. We use gram frequency ascending order.
Global gram order (grams grouped by frequency):
Frequency 1: im, my, te, bu, un, nt, uc, bb, tb, oy, yt
Frequency 2: ca, om
Frequency 3: yo, ou, ut, ub, co, tu, be
Frequency 4: ec
Slide51
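Indexing
Build inverted indexes for prefix and pivotal prefix
Global gram order (grams grouped by frequency):
Frequency 1: im, my, te, bu, un, nt, uc, bb, tb, oy, yt
Frequency 2: ca, om
Frequency 3: yo, ou, ut, ub, co, tu, be
Frequency 4: ec
(Figure: Piv(ri) for each data string ri.)
Slide52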
Indexing
Build inverted indexes for prefix and pivotal prefix
(Figure: the Pivotal Prefix Index and the Prefix Index, with Piv(ri) for each data string ri.)
Slide53
Querying
Generate prefix and pivotal prefix for the query string
Global gram order (grams grouped by frequency):
Frequency 1: im, my, te, bu, un, nt, uc, bb, tb, oy, yt
Frequency 2: ca, om
Frequency 3: yo, ou, ut, ub, co, tu, be
Frequency 4: ec
Slide54
Querying
Probe the prefix index with the pivotal prefix of the query
Probe the pivotal prefix index with the prefix of the query
Slide55
Querying
Verify the candidates and output results
Slide56
Related Work
EDJoin [Xiao VLDB08]: Shortens the prefix length, but it is still O(qτ)
Qchunk [Qin SIGMOD11]: Shortens one signature to O(τ) but increases the other one to O(l)
Adaptive Prefix [Wang SIGMOD12]: Increases the prefix length to reduce the candidate number. Orthogonal and can be integrated into our method
Flamingo [Li ICDE08]: Based on the count filter; accelerates the counting process. Orthogonal and can be integrated into our method
Slide57
Optimal Pivotal Prefix Selection
Recursive formula:
Dynamic Programming:
1. First sort all the q-grams in prefix by their start positions and denote the k-th q-gram as g_k
2. Let f(m, n) denote the optimal sum of inverted list lengths to select n disjoint grams from the first m grams in the prefix.
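The selection of disjoint pivotal q-grams can be sketched with a small table. The recurrence used here is one plausible reading and is labelled as an assumption, not the slide's own formula: with the prefix grams sorted by start position, `f[i][j]` is the smallest total inverted-list length for choosing j disjoint grams among the first i, and gram i is either skipped or taken together with the best choice among the grams that end before it starts. The tuple layout `(gram, start, list_length)` and the function name are also illustrative.

```python
import math
from bisect import bisect_right


def optimal_pivotal_prefix(grams, q, tau):
    """grams: (gram, start, list_length) tuples for the prefix.
    Returns (total_length, chosen grams) for tau + 1 disjoint grams."""
    grams = sorted(grams, key=lambda g: g[1])
    starts = [g[1] for g in grams]
    n, m = len(grams), tau + 1
    # compat[i]: how many grams end at or before the start of gram i (a prefix of the list)
    compat = [bisect_right(starts, starts[i] - q) for i in range(n)]
    f = [[0] + [math.inf] * m for _ in range(n + 1)]
    take = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        weight = grams[i - 1][2]
        for j in range(1, m + 1):
            skip = f[i - 1][j]
            pick = f[compat[i - 1]][j - 1] + weight
            f[i][j], take[i][j] = (pick, True) if pick < skip else (skip, False)
    if f[n][m] == math.inf:
        return math.inf, []  # fewer than tau + 1 disjoint grams available
    chosen, i, j = [], n, m
    while j > 0:             # walk the table backwards to recover the choice
        if take[i][j]:
            chosen.append(grams[i - 1])
            i, j = compat[i - 1], j - 1
        else:
            i -= 1
    return f[n][m], chosen[::-1]


# Hypothetical prefix with q = 2 and tau = 1: two disjoint grams are needed.
print(optimal_pivotal_prefix([("ab", 0, 5), ("bc", 1, 1), ("de", 3, 2)], q=2, tau=1))
```

The table has (qτ + 1)(τ + 1) entries when it is run on a full prefix, so the selection adds little to the cost of preprocessing a string.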