Slide1
Improving Search for Emerging Applications
* Some techniques currently being licensed to Bimaple
Chen Li, UC Irvine
Slide2
Overview of my UC Irvine Research
Text research (main focus of this talk)
Data-intensive computing: ASTERIX project
Slide3
Vertical Search
Search on a specific segment of online content
Different from general Web search engines
Slide4
Approach 1: Database built-in full-text search
Example (Oracle):
SELECT SCORE(1) score, id, name FROM Products WHERE CONTAINS(doc, 'iphone', 1) > 0 ORDER BY SCORE(1) DESC;
Limitations
Speed
Ranking
Slide5
Approach 2: Open-source packages
Slide6
Approach 3: Home-grown search engines
Slide7
Recent Trend 1: More Mobile Applications
Fat fingers …
Slide8
Recent Trend 2: More Location-Based Services
Slide9
So: New requirements for Vertical Search
Find results faster: Instant search
Deal with errors: Fuzzy search
Be aware of the location: Location-based search
Slide10
Demos
http://psearch.ics.uci.edu: Search on UCI directory
http://ipubmed.ics.uci.edu: Search on more than 21 million MEDLINE publications
http://www.omniplaces.com/: Location-based search on 17 million geospatial objects
Slide11
Search on People Directories
psearch.ics.uci.edu
Slide12
Search on Publications
ipubmed.ics.uci.edu
Slide13
Search on Business Listings
www.omniplaces.com
Slide14
Our Focus: Instant Search in Vertical Domains
Server applications (enterprises)
E.g., e-commerce systems
Powerful features
Efficient
Full text
Fuzzy search
Location-based search
…
Slide15
“Instant Search”
Search as you type
Type-ahead search
Autocompletion
…
Benefits:
Save user time
Suggestions
Save 2-3 seconds (Google Instant)
Mobile devices
Slide16
Google Instant
Slide17
Instant Search Classification
Query Prediction
Example: Google Instant
Rely on query logs and user profiles
“Fire” the most likely prediction
Searching directly on the data
Example: PSearch@UCI
Not relying on query logs
Slide18
Challenges
Performance: < 100 ms (server processing, network, JavaScript, etc.)
Requirement for high query throughput
20 queries per second (QPS): 50 ms/query (at most)
100 QPS: 10 ms/query
Other challenges:
Ranking
Space requirements
…
Slide19
Next: two features
Fuzzy Search: finding results with approximate keywords
Full-text: find results with query keywords (not necessarily adjacently)
Slide20
Edit Distance
ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
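A minimal Python sketch of this definition, using the standard dynamic-programming recurrence. The function name edit_distance is chosen here for illustration and does not come from the talk.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to change s1 into s2."""
    # prev[j] is the distance between the prefix of s1 processed so far and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]  # changing s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(
                prev[j] + 1,               # delete c1
                cur[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),  # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("exampl", "exempt"))  # 2
```

Only one row of the table is kept at a time, so memory is proportional to the length of s2.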
s1: venkatsubramanian
s2: wenkatsubramanian
ed(s1, s2) = 1
Slide21
Problem Setting
Data
R: a set of records
W: a set of distinct words
Query
Q = {p1, p2, …, pl}: a set of prefixes
δ: edit-distance threshold
Query result
RQ: a set of records such that each record has all query prefixes or their similar forms
Slide22
Feature 1: Fuzzy Search
Slide23
Formulation
Find strings with a prefix similar to a query keyword
Do it incrementally!
Record strings: venkatasubramanian, carey, jain, nicolau, smith
Query: wenkatsubra
Slide24
Fuzzy search using grams
Example string: universal (decomposed into 2-grams)
[Figure: inverted lists of the 2-grams, each in ascending order, are merged to find elements whose occurrences ≥ T]
Slide25
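The gram-based filtering outlined above (merge the lists of a string's 2-grams and keep elements whose occurrences ≥ T) can be sketched as follows. This is an illustration under stated assumptions, not the Flamingo implementation: the function names are hypothetical, and the threshold T = (number of query grams) - δ·n is the usual count-filter lower bound, which the slides do not spell out. The IDs returned are only candidates and still need to be verified with a real edit-distance computation.

```python
from collections import Counter, defaultdict


def grams(s: str, n: int = 2) -> list[str]:
    """All overlapping n-grams of s, duplicates included."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_gram_index(strings: list[str], n: int = 2) -> dict[str, list[int]]:
    """Map each n-gram to the ascending list of IDs of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, n)):
            index[g].append(sid)  # IDs are visited in ascending order
    return index


def candidates(query: str, index, num_strings: int, delta: int, n: int = 2) -> list[int]:
    """IDs that occur on at least T of the query's gram lists."""
    q_grams = grams(query, n)
    T = len(q_grams) - delta * n  # assumed count-filter bound, see lead-in
    if T <= 0:
        return list(range(num_strings))  # filter has no pruning power
    counts = Counter()
    for g in q_grams:
        counts.update(index.get(g, ()))  # "merge" step: count occurrences per ID
    return sorted(sid for sid, c in counts.items() if c >= T)


strings = ["universal", "university", "unusual", "reversal"]
index = build_gram_index(strings)
print(candidates("univesal", index, len(strings), delta=1))  # [0]
```

Counting with a hash map is the simplest way to merge the lists; it does not exploit the ascending order that faster merge algorithms rely on.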
The Flamingo Package
http://flamingo.ics.uci.edu/
Slide26
Observation
Strings = {exam, example, exemplar, exempt, sample}
Edit-distance threshold δ = 2
Q’ = exampl
Prefix    Distance
exam      2
examp     1
exampl    0
example   1
exemp     2
exempt    2
exempl    1
exempla   2
sampl     2
Q = example
Prefix    Distance   Derived from
examp     2          examp: delete e
exampl    1          exampl: delete e
example   0          exampl: match e
exempl    2          exempl: delete e
exempla   2          exempl: replace e with a
sample    2          sampl: match e
Slide27
Trie Indexing
Computing set of active nodes ΦQ
Initialization
Incremental step
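The two steps can be sketched in Python as below. This is one straightforward way to maintain the active-node set, written under our own assumptions: the class and function names are hypothetical, and the update is expressed as "delete, match/substitute, then close under insertions", which may differ in detail from the algorithm used in the talk. An active node is a trie node whose prefix is within edit distance δ of the query typed so far.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix   # string spelled by the path from the root
        self.children = {}     # character -> TrieNode
        self.is_word = False   # True if a data string ends at this node


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
        node.is_word = True
    return root


def _relax(active, node, dist, delta):
    """Store dist for node if it is within delta and better than the stored value."""
    if dist <= delta and dist < active.get(node, delta + 1):
        active[node] = dist
        return True
    return False


def _close_insertions(active, delta):
    """Push distances down the trie: each extra trie character costs one insertion."""
    stack = list(active)
    while stack:
        node = stack.pop()
        for child in node.children.values():
            if _relax(active, child, active[node] + 1, delta):
                stack.append(child)


def initial_active_nodes(root, delta):
    """Initialization for the empty query: all nodes within depth delta."""
    active = {root: 0}
    _close_insertions(active, delta)
    return active


def extend_active_nodes(active, ch, delta):
    """Incremental step: active nodes after the user types one more character ch."""
    new_active = {}
    for node, dist in active.items():
        _relax(new_active, node, dist + 1, delta)  # delete ch from the query
        for label, child in node.children.items():
            # match costs 0, substitution costs 1
            _relax(new_active, child, dist + (label != ch), delta)
    _close_insertions(new_active, delta)
    return new_active


def active_prefixes(words, query, delta):
    """Prefix -> distance for every active node after typing the whole query."""
    active = initial_active_nodes(build_trie(words), delta)
    for ch in query:
        active = extend_active_nodes(active, ch, delta)
    return {node.prefix: dist for node, dist in active.items()}


words = ["exam", "example", "exemplar", "exempt", "sample"]
print(active_prefixes(words, "example", delta=2))
```

For the five strings and δ = 2 used on the slides, the query example yields the prefixes examp, exampl, example, exempl, exempla and sample with distances 2, 1, 0, 2, 2 and 2. Only the previous active-node set is consulted at each keystroke, which is what makes the computation incremental.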
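[Figure: trie of the strings exam, example, exemplar, exempt, sample, with each word terminated by $; the active nodes for Q = example are annotated with their distances]
Active nodes for Q = example
Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2
Slide28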
Initialization
Q = ε
[Figure: the same trie, with the nodes near the root annotated with their distances]
Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2
Initializing Φε with all nodes within a depth of δ
Slide29
Incremental Algorithm: Overview
Access the leaf nodes of the active nodes as answers.
Slide30
Feature 2: Full-text search
Find answers with query keywords
Not necessarily adjacently
Slide31
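Multi-Prefix Intersection
Q = vldb li
ID   Record
1    Li data…
2    data…
3    data Lin…
4    Lu Lin Luis…
5    Liu…
6    VLDB Lin data…
7    VLDB…
8    Li VLDB…
Matching records:
6    VLDB Lin data…
8    Li VLDB…
[Figure: trie of the keywords data, li, lin, liu, lu, luis, vldb; the leaf node of each keyword stores the IDs of the records containing it: data 1 2 3 6, li 1 8, lin 3 4 6, liu 5, lu 4, luis 4, vldb 6 7 8]
Slide32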
Multi-Prefix Intersection: Method 1
ID   Record
1    Li data…
2    data…
3    data Lin…
4    Lu Lin Luis…
5    Liu…
6    VLDB Lin data…
7    VLDB…
8    Li VLDB…
[Figure: keyword trie with the record-ID list of each keyword at its leaf node]
Q = vldb li
li:      1 3 4 5 6 8
vldb:    6 7 8
Result:  6 8
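Method 1 can be sketched as follows: for each query prefix, take the union of the inverted lists of all keywords that start with it, then intersect the unions. This is a minimal illustration with hypothetical function names; a linear scan over the keywords stands in for the trie lookup, and only the words visible on the slide are used for each record (the elided "…" text is left out).

```python
from collections import defaultdict

RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_inverted_index(records):
    """Map each lower-cased keyword to the set of record IDs containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for word in text.lower().split():
            index[word].add(rid)
    return index


def prefix_union(index, prefix):
    """Union of the inverted lists of all keywords starting with prefix."""
    ids = set()
    for keyword, rids in index.items():  # stand-in for walking the trie subtree
        if keyword.startswith(prefix):
            ids |= rids
    return ids


def multi_prefix_intersection(index, query):
    """Records that contain, for every query prefix, some keyword with that prefix."""
    result = None
    for prefix in query.lower().split():
        ids = prefix_union(index, prefix)
        result = ids if result is None else result & ids
    return sorted(result or [])


index = build_inverted_index(RECORDS)
print(multi_prefix_intersection(index, "vldb li"))  # [6, 8]
```

The union for a short prefix such as li can be long, which is the time cost listed for this method.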
Space cost: inverted index
Time cost: union + intersection
More efficient intersection approaches…
Slide33
Multi-Prefix Intersection: Method 2
ID   Record            Forward list
1    Li data…          1 2
2    data…             1
3    data Lin…         1 3
4    Lu Lin Luis…      3 5 6
5    Liu…              4
6    VLDB Lin data…    1 3 7
7    VLDB…             7
8    Li VLDB…          2 7
[Figure: keyword trie in which the keywords carry IDs 1 to 7 and every node carries the range of keyword IDs below it, e.g. [1, 7] at the root, [2, 6] at l, [2, 4] at li, [5, 6] at lu, [7, 7] at vldb]
Q = vldb li
Read each record on the list of vldb: 6, 7, 8
Verify/Probe each record's forward list for a keyword ID in the range [2, 4] of li:
6 (VLDB Lin data…, forward list 1 3 7) and 8 (Li VLDB…, forward list 2 7) pass; 7 does not
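The read-and-probe procedure above can be sketched as follows. The sketch assumes, as the figure suggests, that keyword IDs are assigned in alphabetical order so that every prefix maps to a contiguous ID range; the function names are hypothetical, and a binary search over the sorted vocabulary stands in for the trie.

```python
from bisect import bisect_left

RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_indexes(records):
    """Return (sorted vocabulary, forward lists, inverted lists)."""
    vocab = sorted({w for text in records.values() for w in text.lower().split()})
    kw_id = {w: i for i, w in enumerate(vocab, start=1)}  # alphabetical IDs 1..n
    forward = {rid: sorted({kw_id[w] for w in text.lower().split()})
               for rid, text in records.items()}
    inverted = {}
    for rid, ids in forward.items():
        for k in ids:
            inverted.setdefault(k, []).append(rid)
    return vocab, forward, inverted


def prefix_range(vocab, prefix):
    """Inclusive keyword-ID range [lo, hi] of the keywords starting with prefix.

    Assumes no keyword contains the character U+FFFF."""
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\uffff")
    return (lo + 1, hi) if lo < hi else None


def has_keyword_in_range(forward_list, lo, hi):
    """Binary-search a sorted forward list for any keyword ID in [lo, hi]."""
    pos = bisect_left(forward_list, lo)
    return pos < len(forward_list) and forward_list[pos] <= hi


def search(vocab, forward, inverted, query):
    ranges = [prefix_range(vocab, p) for p in query.lower().split()]
    if not ranges or any(r is None for r in ranges):
        return []
    # Read the records on the inverted lists of the first prefix ...
    first_lo, first_hi = ranges[0]
    cands = sorted({rid for k in range(first_lo, first_hi + 1) for rid in inverted[k]})
    # ... then verify the remaining prefixes by probing each forward list.
    return [rid for rid in cands
            if all(has_keyword_in_range(forward[rid], lo, hi) for lo, hi in ranges[1:])]


vocab, forward, inverted = build_indexes(RECORDS)
print(search(vocab, forward, inverted, "vldb li"))  # [6, 8]
```

Only one prefix needs its inverted lists read; every other prefix costs one binary search per candidate record, at the price of storing the forward lists.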
Space cost: inverted + forward index
Time cost: probing forward lists
Slide34
Experimental Results
Computing similar prefixes
Slide35
Multi-prefix intersection
Slide36
Time Scalability
Slide37
Index scalability
Slide38
Research on data-intensive computing
http://asterix.ics.uci.edu
http://cherry.ics.uci.edu/
Slide39
Thank you!