Slide1
Top-K Query Evaluation on Probabilistic Data
Christopher Ré, Nilesh Dalvi, and Dan Suciu
University of Washington
Slide2
Evaluating Complex SQL on PDBs
2
12/8/2006
High Level Overview
DBMS: Precise answers over clean data
Data are often imprecise
Information Integration
Information Extraction
Probabilistic DBs (PDBs) handle imprecision
Many low quality answers
Top-K ranked by probability
This talk: Compute Top-K efficiently
Slide3
Overview
Motivating Example
Query Processing Background
Multisimulation
Experimental Results
Slide4
Overview
Motivating Example
Query Processing Background
Multisimulation
Experimental Results
Slide5
Example Application
IMDB
Lots of interesting data about movies (e.g. actors, directors)
Well maintained and clean
But no reviews!
On the web there are lots of reviews
How will I know which movie they are about?
Alice needs to do information extraction and object reconciliation.
Is a movie good or bad?
Alice wants to do sentiment analysis.
A probabilistic database can help Alice store and query her uncertain data.
Find all years where ‘Anthony Hopkins’ starred in a good movie
Slide6
Imprecision is out there…
Object Reconciliation
Data extracted from Reviews
RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

Clean IMDB Data
MID    Title
m232   12 Monkeys
m143   Monkey Love

Fellegi-Sunter Approach: Score (s) each (RID,MID) pair
[Figure: score axis with thresholds t’ and t marking the Match and No Match regions]
Output: (RID,MID) pairs
Our Approach: Convert scores to probabilities
Slide7
Imprecision is out there…
Object Reconciliation
RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

MID    Title
m232   12 Monkeys
m143   Monkey Love

RID    MID    Prob
r175   m232   0.8
r175   m143   0.2

Fellegi-Sunter Approach: Score (s) each (RID,MID) pair
[Figure: score axis with thresholds t’ and t marking the Match and No Match regions]
Slide8
Overview
Motivating Example
Query Processing Background
Multisimulation
Experimental Results
Slide9
Query Processing Background
RID    MID    Prob
r175   m232   0.8
r175   m143   0.2
Query processing builds an event expression
Intensional Query Processing [FR97]
Associate to each tuple an event
Probability that the event is satisfied = query value
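As a minimal sketch of the intensional idea above, the snippet below treats an output tuple's event expression as a disjunction of conjunctions over tuple events and computes the probability that it is satisfied by enumerating all truth assignments. The event names (`e1`, `e2`, `e3`), their probabilities, and the assumption that the events are independent are illustrative choices, not taken from the talk. Exact enumeration is exponential in the number of events, so it is only practical for tiny expressions.

```python
from itertools import product


def dnf_probability(clauses, prob):
    """Exact probability that a DNF over independent events is satisfied.

    clauses: list of tuples of event names; each tuple is a conjunction.
    prob: dict mapping each event name to its marginal probability.
    """
    events = sorted({e for clause in clauses for e in clause})
    total = 0.0
    # Enumerate every possible world (truth assignment) over the events.
    for values in product([False, True], repeat=len(events)):
        world = dict(zip(events, values))
        weight = 1.0
        for e in events:
            weight *= prob[e] if world[e] else 1.0 - prob[e]
        # The DNF holds if at least one conjunction is fully true.
        if any(all(world[e] for e in clause) for clause in clauses):
            total += weight
    return total


# Hypothetical event expression: (e1 AND e2) OR e3
prob = {"e1": 0.8, "e2": 0.5, "e3": 0.4}
clauses = [("e1", "e2"), ("e3",)]
print(dnf_probability(clauses, prob))
```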
Technical Point: Projection as last operator implies result is a DNF
Slide10
DNF Sampling at a High Level
Estimate p(t), the probability that the DNF is satisfied
Do for each output tuple, t
#P-Hard [Valiant79] even if only conjunctive queries [RDS06,DS04]
Randomized Approximation [LK84]
Simulation reduces uncertainty
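The randomized approximation cited above can be sketched with a Karp-Luby style coverage estimator. This is a sketch under assumptions: events are independent, every literal is positive, and the function name, event names, and probabilities are hypothetical. It is not necessarily how the system in the talk implements its simulation step.

```python
import random


def karp_luby(clauses, prob, trials, rng):
    """Monte Carlo estimate of P(DNF) for independent, positive events."""
    # Weight of a clause = probability that all of its events are true.
    weights = []
    for clause in clauses:
        w = 1.0
        for e in clause:
            w *= prob[e]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    events = sorted({e for clause in clauses for e in clause})
    hits = 0
    for _ in range(trials):
        # Pick a clause proportionally to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = {e: rng.random() < prob[e] for e in events}
        for e in clauses[i]:
            world[e] = True
        # Count the sample only if i is the first satisfied clause,
        # so each satisfying world is credited exactly once.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[e] for e in c))
        if first == i:
            hits += 1
    return total * hits / trials


prob = {"e1": 0.8, "e2": 0.5, "e3": 0.4}
clauses = [("e1", "e2"), ("e3",)]
print(karp_luby(clauses, prob, trials=20000, rng=random.Random(0)))
```

Each additional trial narrows the confidence interval around the estimate, which is the sense in which simulation reduces uncertainty about p(t).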
[Figure: an interval on the 0.0 to 1.0 axis showing the uncertainty about p(t)]
Slide11
Naïve Query Processing
Naïve algorithm (PTIME): Simulate until all intervals are small (“Epsilon”-small)
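The naïve strategy can be sketched as a loop that keeps simulating every candidate until its confidence interval is no wider than epsilon. In this sketch a Bernoulli draw with a Hoeffding-style bound stands in for one DNF simulation step; that substitution, the function names, and the probabilities in the usage example are all assumptions made for illustration.

```python
import math
import random


def hoeffding_half_width(n, delta):
    """Half-width of a (1 - delta) Hoeffding interval after n samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def naive_simulate(true_probs, eps, delta, rng):
    """Simulate every candidate until its interval width is at most eps.

    true_probs: dict of hypothetical candidate name -> success probability.
    Returns the final intervals and the total number of simulation steps.
    """
    intervals = {}
    steps = 0
    for name, p in true_probs.items():
        n = 0
        successes = 0
        lo, hi = 0.0, 1.0
        while hi - lo > eps:
            successes += rng.random() < p  # one simulation step
            n += 1
            steps += 1
            half = hoeffding_half_width(n, delta)
            mean = successes / n
            lo, hi = max(0.0, mean - half), min(1.0, mean + half)
        intervals[name] = (lo, hi)
    return intervals, steps


intervals, steps = naive_simulate(
    {"t1": 0.9, "t2": 0.6, "t3": 0.3}, eps=0.1, delta=0.05,
    rng=random.Random(1))
print(intervals, steps)
```

The cost is paid for every candidate, including those that are clearly outside the top k, which is what motivates the question of doing better.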
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, with labels 1 to 4]
Can we do better?
Slide12
Overview
Motivating Example
Query Processing Background
Multisimulation
Experimental Results
Slide13
A Better Method: Multisimulation
Separate Top-K with few simulations
Concentrate on intervals in Top-K
Asymptotically, confidence intervals are nested
Compare against OPT, which “knows” which intervals to simulate
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, with labels 1 to 4]
Slide14
The Critical Region
The critical region is the interval (kth-highest min, (k+1)st-highest max)
Example shown for k = 2
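The definition above translates directly into code. The sketch below computes the critical region for a list of confidence intervals; the function name and the example intervals are hypothetical. The note that an empty region (c >= d) means the top k are separated is my reading of the definition rather than a statement on the slide.

```python
def critical_region(intervals, k):
    """Return (c, d) for a list of (lo, hi) confidence intervals.

    c is the k-th highest lower bound ("min"),
    d is the (k+1)-st highest upper bound ("max").
    """
    if not 0 < k < len(intervals):
        raise ValueError("need 0 < k < number of intervals")
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


# Hypothetical intervals, k = 2: the region is non-empty, so more
# simulation is needed before the top 2 can be told apart.
overlapping = [(0.6, 0.9), (0.5, 0.8), (0.3, 0.7), (0.1, 0.4)]
print(critical_region(overlapping, 2))

# Here c >= d: the two best intervals lie entirely above the others.
separated = [(0.8, 0.9), (0.7, 0.75), (0.3, 0.5), (0.1, 0.4)]
print(critical_region(separated, 2))
```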
[Figure: intervals on the 0.0 to 1.0 axis with the critical region marked]
Slide15
Three Simple Rules: Rule 1
Pick a “Double Crosser”
OPT must pick this too
[Figure: intervals on the 0.0 to 1.0 axis]
Slide16
Three Simple Rules: Rule 2
If all crossers are lower crossers, or all are upper crossers, then pick a maximal one
OPT must pick this too
[Figure: intervals on the 0.0 to 1.0 axis]
Slide17
Three Simple Rules: Rule 3
Pick an upper and a lower crosser
OPT may pick only one of these two
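One way to read the three rules as a selection step is sketched below. The slides do not define the terms precisely, so the definitions here are assumptions: a lower crosser strictly contains the lower end c of the critical region, an upper crosser strictly contains the upper end d, a double crosser contains both, and "maximal" is taken to mean the widest interval. The function names are hypothetical as well.

```python
def critical_region(intervals, k):
    """(k-th highest lower bound, (k+1)-st highest upper bound)."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


def choose_to_simulate(intervals, k):
    """Return indices of the intervals to simulate next (empty when done)."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []  # critical region is empty: top k are separated

    def width(i):
        return intervals[i][1] - intervals[i][0]

    # Assumed definitions of the crosser types (see lead-in).
    lower = [i for i, (lo, hi) in enumerate(intervals) if lo < c < hi]
    upper = [i for i, (lo, hi) in enumerate(intervals) if lo < d < hi]
    double = [i for i in lower if i in upper]

    if double:                      # Rule 1: pick a double crosser
        return [max(double, key=width)]
    if lower and upper:             # Rule 3: pick an upper and a lower crosser
        return [max(upper, key=width), max(lower, key=width)]
    crossers = lower or upper
    if crossers:                    # Rule 2: only one kind, pick a maximal one
        return [max(crossers, key=width)]
    # Fallback for boundary cases not covered by the strict definitions.
    return [max(range(len(intervals)), key=width)]


# Hypothetical intervals with k = 2.
print(choose_to_simulate([(0.2, 0.95), (0.6, 0.9), (0.3, 0.5), (0.1, 0.4)], 2))
print(choose_to_simulate([(0.7, 0.9), (0.4, 0.8), (0.2, 0.6), (0.0, 0.1)], 2))
print(choose_to_simulate([(0.7, 0.9), (0.5, 0.6), (0.2, 0.6), (0.0, 0.1)], 2))
```

Under this reading, Rules 1 and 2 never spend a simulation that OPT could avoid, while Rule 3 spends two where OPT may need only one.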
[Figure: intervals on the 0.0 to 1.0 axis]
Slide18
Multisimulation is a 2-Approx
Thm: Multisimulation performs at most twice as many simulations as OPT
And, no deterministic algorithm can do better on every instance.
Extensions
Top-K Set (shown)
Anytime (produce from 1 to k)
Rank (produce top k ranked)
All (rank all intervals)
Slide19
Overview
Motivating Example
Query Processing Background
Multisimulation
Experimental Results
Slide20
Experiment Details:
Uncertain tuples
Table           # Tuples
StringMatch     339k
ActorMatch      6,758k
DirectorMatch   18k

Table           # Tuples
Reviews         292k
Slide21
Running Time
Slide22
Running Time
“Find all years in which Anthony Hopkins was in a highly rated movie” (SS)
Small Number of Tuples Output (33)
Small DNFs per Output (Avg. 20.4, Max 63)
Slide23
Running Time
“Find all directors who have a highly rated drama but low rated comedy” (LL)
Large #Tuples Output (1415)
Large DNFs per Output (Avg. 234.8, Max. 9088)
Slide24
Conclusions
Mystiq is a general purpose probabilistic database
Multisimulation and Logical Optimization key to performance on large data sets
Advert: Demo on my laptop
Slide25
Running Time
“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)
Small Number of Tuples Output (33)
Large DNFs per Output (Avg. 117.7, Max 685)
Slide26
Running Time
“Find all directors in the 80s who had a highly rated movie” (LS)
Large #Tuples Output (3259)
Small DNFs per Output (Avg 3.03, Max 30)
Slide27
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis]