Slide1
System Aspects of Probabilistic Data Management
Magdalena Balazinska, Christopher Ré and Dan Suciu
University of Washington
Slide2
One slide overview of motivation
Data are uncertain in many applications
  Business: Dedup, Info. Extraction
  Data from physical-world: RFID
Probabilistic DBs (pDBs) manage uncertainty
  Integrate, Query, and Build Applications on uncertain data
Value: Higher recall, without loss of precision
DB Niche: Community that knows scale
Slide3
Overview of tutorial
Part I: Basic Query Processing (Today)
  Two Scenarios for pDBs
  A Basic Query & Data Model
  Basic Query Processing Techniques
Highlights:
  The intuition behind safe plans and how to compile them
  Process any SELECT-FROM-WHERE (SFW) query
  Process top-k queries
  Aggregation: Top-k + measures, OLAP, HAVING
Slide4
Overview of tutorial
Part II: Advanced Techniques (Tomorrow)
  Correlations
  Advanced Representation & QP
  Discussion and Open Problems
Highlights:
  Lineage and View Processing (GBs of data)
  Events on Correlated Streams (GBs of Streams)
  Sophisticated Factor Evaluation (Highly Correlated)
  Continuous DBs
Slide5
Hasn’t this been solved? (an analogy to keep in mind)
              | AI                      | Databases
Deterministic | Theorem prover          | Query processing
Probabilistic | Probabilistic inference | [this talk]
Impact: Fortune 500 companies rely on DBs, but how many have theorem provers?
SCALE
Slide6
Ancillary Material
pDBs have a long history
  Cavallo & Pittarelli ’87
  ProbView [Lakshmanan et al ’97]
Many active projects today: Mystiq, Lahar, Trio, MayBMS, Maryland, Orion, MCDB, Wisconsin, IBM, BayesStore, UMass, Waterloo, SFU and more
Many important topics omitted
  Query languages
  XML
Slide7
Overview of tutorial
Part I: Basic Query Processing (Today)
  Two Scenarios for pDBs
  A Basic Query & Data Model
  Basic Query Processing Techniques
Highlights:
  The intuition behind safe plans and how to compile them
  Process any SELECT-FROM-WHERE (SFW) query
  Process top-k queries
  Aggregation: Top-k + measures, OLAP, HAVING
Slide8
Example 1: Querying RFID
[Figure with labels C, B, A, D, E]
Apps: UbiComp, Diary, Social Applications, ...
In general, Event queries [Cayuga, Sase]
Joe entered office 422 at t=8
Query: “Alert when Joe enters 422”, i.e. Joe is outside 422, then inside 422
[R,Letchner,B&S’07] [http://rfid.cs.washington.edu]
Slide9
Challenges: Tracking Joe’s Location
[Figure: 6th Floor in CS building; blue ring is Joe’s location; antennas marked]
[RFID Ecosystem @ UW]
Slide10
Challenges: Tracking Joe’s Location
[Figure: 6th Floor in CS building; blue ring is Joe’s location; antennas marked]
Two Problems:
  Missed Readings
  Granularity Mismatch
Model Based View (Probabilistic)
[Deshpande et al ’04, Kanagal & Deshpande ’08]
[Re et al ’08, Kanagal & Deshpande ’08]
Slide11
Probabilities via particle filter
[Figure: 6th Floor in CS building; antennas marked]
Each orange particle is a guess of Joe’s location
Blue ring is ground truth
Particles guess many locations per timestep, so data are uncertain
[Doucet et al ’01]
Slide12
Probabilities via particle filter
6th Floor in CS building
[R et al ’08] [Kanagal & Deshpande ’08]
Tag | t | Loc   | P
Joe | 7 | 422   | 0.4
Joe | 7 | Hall3 | 0.4
Joe | 7 | Hall4 | 0.2
Joe | 8 | 422   | 0.6
Joe | 8 | Hall3 | 0.2
Joe | 8 | Hall4 | 0.2
Sue | 7 | …     | …
“Joe entered 422 at t=8 with probability 0.36”
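The figure 0.36 is what one gets by multiplying the probability that Joe is outside 422 at t=7 by the probability that he is inside at t=8. A minimal sketch of that arithmetic follows; it assumes the two timesteps are independent, which the slide does not state (correlations between timesteps are a Day 2 topic), and the names `at` and `prob_entered` are illustrative only.

```python
# Marginals copied from the table above: P(Joe is at loc at time t).
at = {
    7: {"422": 0.4, "Hall3": 0.4, "Hall4": 0.2},
    8: {"422": 0.6, "Hall3": 0.2, "Hall4": 0.2},
}


def prob_entered(at, room, t):
    """P(outside `room` at t-1 and inside `room` at t).

    Assumption of this sketch: the two timesteps are independent.
    """
    p_outside_before = 1.0 - at[t - 1].get(room, 0.0)
    p_inside_now = at[t].get(room, 0.0)
    return p_outside_before * p_inside_now


p_enter = prob_entered(at, "422", 8)
print(round(p_enter, 2))
```

With the table's values this is (1 - 0.4) * 0.6.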
Shameless Ad: Markov Correlations on Day 2
Query Particle Filter output via At, a model based view
At(tag, loc)
Slide13
Example 2: Alice Looks for Movies
“I’d like to know which movies are really good…”
IMDB:
  Lots of data!
  Well maintained and clean
  But no reviews!
[R,Dalvi&S’07]
Slide14
IMDB
On the web there are lots of reviews…
  Which movie is the review about?
  …is the review positive or negative?
  …should I trust the reviewer?
Alice needs:
  Information Extraction
  Fuzzy joins
  Sentiment analysis
  Social networks
Forced to deal with uncertainty
Slide15
IMDB
  Find actors in Pulp Fiction who appeared in two bad movies five years earlier
  Find years when ‘Anthony Hopkins’ starred in a good movie
A probabilistic database can help Alice store and query her uncertain data
Alice’s workflow:
  Download reviews
  Information Extraction
  Fuzzy Joins
  Query
Workflow stages: IE, FJ, pDB
Slide16
Alice needs Information Extraction
Source text: ...52 A Goregaon West Mumbai ...
Addressp
ID | House-No | Street        | City        | P
1  | 52       | Goregaon West | Mumbai      | 0.1
1  | 52-A     | Goregaon West | Mumbai      | 0.4
1  | 52       | Goregaon      | West Mumbai | 0.2
1  | 52-A     | Goregaon      | West Mumbai | 0.2
2  | . . . .  | . . . .       | . . . .     | . . . .
2  | . . . .
Here probabilities are meaningful
[Gupta&Sarawagi’2006]
Slide17
Queries on IE
Find people living in ‘West Mumbai’
SELECT DISTINCT x.name
FROM Person x, Addressp y
WHERE x.ID = y.ID and y.city = 'West Mumbai'
Addressp
ID | House-No | Street        | City        | P
1  | 52       | Goregaon West | Mumbai      | 0.1
1  | 52-A     | Goregaon West | Mumbai      | 0.4
1  | 52       | Goregaon      | West Mumbai | 0.2
1  | 52-A     | Goregaon      | West Mumbai | 0.2
Answer:
Name | P
Joe  | 0.4
If only the most likely extraction were kept, the query would return the empty set
Slide18
Queries on IE
Find people living in ‘West Mumbai’
SELECT DISTINCT x.name
FROM Person x, Addressp y
WHERE x.ID = y.ID and y.city = 'West Mumbai'
Today: keep only the most likely extraction: low recall.
pDBs keep all extractions: higher recall.
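The recall difference can be sketched in a few lines. The rows are the four alternative extractions shown for ID 1; treating them as mutually exclusive alternatives (so their probabilities add) is an assumption of this sketch, as are the `person` mapping and the function name `lives_in`.

```python
# Alternative extractions for address ID 1, as listed on the slides.
address_p = [
    (1, "52", "Goregaon West", "Mumbai", 0.1),
    (1, "52-A", "Goregaon West", "Mumbai", 0.4),
    (1, "52", "Goregaon", "West Mumbai", 0.2),
    (1, "52-A", "Goregaon", "West Mumbai", 0.2),
]
person = {1: "Joe"}  # hypothetical Person table; the slides show only the answer "Joe"


def lives_in(city, rows, keep_only_most_likely=False):
    """Return {name: probability} for people whose address is in `city`."""
    if keep_only_most_likely:
        best = {}
        for row in rows:
            if row[0] not in best or row[4] > best[row[0]][4]:
                best[row[0]] = row
        rows = list(best.values())
    answer = {}
    for pid, _house, _street, row_city, p in rows:
        if row_city == city:
            # Alternatives of one ID are disjoint events, so probabilities add.
            answer[person[pid]] = answer.get(person[pid], 0.0) + p
    return answer


with_all = lives_in("West Mumbai", address_p)
with_best_only = lives_in("West Mumbai", address_p, keep_only_most_likely=True)
print(with_all, with_best_only)
```

Keeping all extractions yields Joe with probability 0.2 + 0.2, while keeping only the single most likely extraction (city Mumbai, 0.4) yields no answer.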
Find people of the same age, living in the same city
SELECT DISTINCT x.name, u.name
FROM Person x, Addressp y, Person u, Addressp v
WHERE x.ID = y.ID and y.city = v.city and u.ID = v.ID
Slide19
Alice needs Fuzzy Joins
IMDB
Title              | Year
Twelve Monkeys     | 1995
Monkey Love 1997   | 1997
Monkey Love 1935   | 1935
Monkey Love Planet | 2005
Reviews
Review      | By  | Rating
12 Monkeys  | Joe | 4
Monkey Boy  | Jim | 2
Monkey Love | Joe | 2
Titles don’t match
Slide20
Result of a Fuzzy Join
TitleReviewMatchp
Movie              | Review      | P
Twelve Monkeys     | 12 Monkeys  | 0.7
Monkey Love 1997   | 12 Monkeys  | 0.45
Monkey Love 1935   | Monkey Love | 0.82
Monkey Love 1935   | Monkey Boy  | 0.68
Monkey Love Planet | Monkey Love | 0.8
Higher scores, more likely to match
[Gravano et al ’01, Arasu ’06]
Slide21
Queries over Fuzzy Joins
IMDB
Title          | Year
Twelve Monkeys | 1995
Monkey Love 97 | 1997
Monkey Love 35 | 1935
Monkey Love PL | 2005
Reviews
Review      | By  | Rating
12 Monkeys  | Joe | 4
Monkey Boy  | Jim | 2
Monkey Love | Joe | 2
TitleReviewMatchp
Movie              | Review      | P
Twelve Monkeys     | 12 Monkeys  | 0.7
Monkey Love 97     | 12 Monkeys  | 0.45
Monkey Love 35     | Monkey Love | 0.82
Monkey Love 35     | Monkey Boy  | 0.68
Monkey Love Planet | Monkey Love | 0.8
Who reviewed movies made in 1935?
SELECT DISTINCT z.By
FROM IMDB x, TitleReviewMatchp y, Amazon z
WHERE x.title=y.title and x.year=1935 and y.review=z.review
Answer (ranked!):
By    | P
Joe   | 0.73
Fred  | 0.68
Jim   | 0.43
. . . | 0.12
Find movies reviewed by Jim and Joe
SELECT DISTINCT x.Title
FROM IMDB x, TitleReviewMatchp y1, Amazon z1, TitleReviewMatchp y2, Amazon z2
WHERE . . . z1.By='Joe' . . . z2.By='Jim' . . .
Answer:
Title      | P
Gone with… | 0.73
Amadeus    | 0.68
. . .      | 0.43
Slide22
Application Summary
pDBs can manage outputs of great techniques
  RFID: Particle Filters, HMMs
  Alice needs: Fuzzy Joins, IE, Sentiment Analysis
Value over standard RDBMSs: Recall
To keep precision high, need ranking (by prob)
Major Theme: Get high quality efficiently!
Slide23
Overview of tutorial
Part I: Basic Query Processing
  Two Scenarios for pDBs
  A Basic Query & Data Model
  Basic Query Processing Techniques
Slide24
Simple Probabilistic DB (pDB)
HasObjectp
Object   | Time | Person | P
Laptop77 | 9:07 | John   | 0.62
Laptop77 | 9:07 | Jim    | 0.34
Book302  | 9:18 | Mary   | 0.45
Book302  | 9:18 | John   | 0.33
Book302  | 9:18 | Fred   | 0.11
Keys: Object, Time. Non-keys: Person. Probability: P.
What does it mean?
[Barbara et al. ‘92]
Slide25
Possible Worlds Semantics
PDB: HasObjectp
Object   | Time | Person | P
Laptop77 | 9:07 | John   | p1
Laptop77 | 9:07 | Jim    | p2
Book302  | 9:18 | Mary   | p3
Book302  | 9:18 | John   | p4
Book302  | 9:18 | Fred   | p5
Possible worlds: instances of HasObject(Object, Time, Person)
  {(Laptop77, 9:07, John), (Book302, 9:18, Mary)}, probability $p_1 p_3$
  {(Laptop77, 9:07, John), (Book302, 9:18, John)}, probability $p_1 p_4$
  {(Laptop77, 9:07, John), (Book302, 9:18, Fred)}
  {(Laptop77, 9:07, Jim), (Book302, 9:18, Mary)}
  {(Laptop77, 9:07, Jim), (Book302, 9:18, John)}
  {(Laptop77, 9:07, Jim), (Book302, 9:18, Fred)}
  {(Laptop77, 9:07, John)}, probability $p_1 (1 - p_3 - p_4 - p_5)$
  {(Laptop77, 9:07, Jim)}
  {(Book302, 9:18, Mary)}
  {(Book302, 9:18, John)}
  {(Book302, 9:18, Fred)}
  {} (empty instance)
Distribution over possible worlds
[Fagin,Halpern,Megiddo’90]
Slide26
Two Approaches to Queries
Standard queries, probabilistic answers (this tutorial)
  Query: “find all movies with rating > 4”
  Answers: list of tuples with probabilities
Queries with explicit probabilities (MayBMS [Koch ’08])
  Query: find all Movie-review matches with probability in [0.3, 0.8]
  Answer: …
Slide27
Possible Worlds Query Semantics
PDB: HasObjectp
Object   | Time | Person | P
Laptop77 | 9:07 | John   | p1
Laptop77 | 9:07 | Jim    | p2
Book302  | 9:18 | Mary   | p3
Book302  | 9:18 | John   | p4
Book302  | 9:18 | Fred   | p5
Possible worlds: the same 12 instances of HasObject as on the previous slide
Query: “John has laptop77 and doesn’t have book302”
Worlds that satisfy the query:
  {(Laptop77, 9:07, John), (Book302, 9:18, Mary)}, probability $p_1 p_3$
  {(Laptop77, 9:07, John), (Book302, 9:18, Fred)}, probability $p_1 p_5$
  {(Laptop77, 9:07, John)}, probability $p_1 (1 - p_3 - p_4 - p_5)$
$p_1 p_3 + p_1 p_5 + p_1 (1 - p_3 - p_4 - p_5) = p_1 (1 - p_4)$
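The possible-worlds definition can be executed literally on a table this small. The sketch below enumerates every world of HasObjectp, using the numeric probabilities from the "Simple Probabilistic DB" slide in place of p1 to p5, and sums the worlds that satisfy the query. The dictionary layout and function names are assumptions made for illustration; a real system would avoid this enumeration.

```python
from itertools import product

# Each (Object, Time) key has mutually exclusive alternatives;
# the leftover probability mass means "no tuple for this key".
has_object = {
    ("Laptop77", "9:07"): {"John": 0.62, "Jim": 0.34},
    ("Book302", "9:18"): {"Mary": 0.45, "John": 0.33, "Fred": 0.11},
}


def possible_worlds(table):
    """Yield (world, probability); a world maps each present key to a person."""
    keys = list(table)
    choices = []
    for key in keys:
        alternatives = list(table[key].items())
        alternatives.append((None, 1.0 - sum(table[key].values())))
        choices.append(alternatives)
    for combo in product(*choices):
        world = {k: person for k, (person, _p) in zip(keys, combo)
                 if person is not None}
        prob = 1.0
        for _person, p in combo:
            prob *= p
        yield world, prob


def query_prob(table, predicate):
    """Sum the probabilities of the worlds in which `predicate` holds."""
    return sum(p for world, p in possible_worlds(table) if predicate(world))


def john_has_laptop_not_book(world):
    return (world.get(("Laptop77", "9:07")) == "John"
            and world.get(("Book302", "9:18")) != "John")


worlds = list(possible_worlds(has_object))
answer = query_prob(has_object, john_has_laptop_not_book)
print(len(worlds), round(answer, 4))
```

The enumeration produces 12 worlds, and the query probability agrees with the closed form p1 (1 - p4) = 0.62 * (1 - 0.33).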
QP Goal: Compute cleverly, directly
Slide28
Overview of Part I
Part I: Basic Query Processing (TODAY)
  Motivating Applications
  A Simple Data Model (Representation)
  Basic Query Processing Techniques
Slide29
Basic Query Processing Outline
SELECT-FROM-WHERE Queries (natural start, workhorse RDBMS queries)
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K
Aggregation Queries + Probabilities (we believe these are very important for applications)
  Top-K + Measures
  OLAP Queries
  HAVING Queries
Slide30
Extensional Query Evaluation
Goal: Make relational ops compute probabilities
  Selection (σ): tuple (v, p) yields (v, p)
  JOIN: tuples (v1, p1) and (v2, p2) yield (v1 v2, $p_1 p_2$)
  Projection (Π), removes duplicates: tuples (v, p1), (v, p2), … yield (v, $1-(1-p_1)(1-p_2)\cdots$), i.e. “not all are false”
Why? It’s SQL-scale and SQL-fast
[Fuhr&Roellke’97, Dalvi & S ‘04]
Slide31
Extensional Plan to SQL
HomeOffice
Person | Loc | P
Bob    | SEA | p1
Joe    | NYC | p2
Jon    | SEA | p3
Jeff   | SEA | p4
SELECT DISTINCT loc
FROM HomeOffice
Plan: Π{-person} (NB: remove attribute person)
Loc | P
SEA | $1-(1-p_1)(1-p_3)(1-p_4)$
NYC | $p_2$
Important point: Extensional Evaluation is SQL, so SQL-fast
So pDBs are just SQL, but…
[Fuhr&Roellke’97, Dalvi & S ‘04]
Translation:
SELECT loc, 1 - PRODUCT(1-p) as p
FROM HomeOffice
GROUP BY loc
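SQLite has no built-in PRODUCT aggregate, so one way to try the translated query is to register a user-defined aggregate through Python's `sqlite3` module. This is only a sketch: the aggregate name `one_minus_product` and the numeric probabilities standing in for p1 to p4 are made up for illustration.

```python
import sqlite3


class OneMinusProduct:
    """User-defined aggregate computing 1 - PRODUCT(1 - p) over a group."""

    def __init__(self):
        self.prob_none = 1.0

    def step(self, p):
        self.prob_none *= 1.0 - p

    def finalize(self):
        return 1.0 - self.prob_none


conn = sqlite3.connect(":memory:")
conn.create_aggregate("one_minus_product", 1, OneMinusProduct)
conn.execute("CREATE TABLE HomeOffice (person TEXT, loc TEXT, p REAL)")
conn.executemany(
    "INSERT INTO HomeOffice VALUES (?, ?, ?)",
    [("Bob", "SEA", 0.5), ("Joe", "NYC", 0.3),
     ("Jon", "SEA", 0.2), ("Jeff", "SEA", 0.1)],  # made-up values for p1..p4
)
rows = conn.execute(
    "SELECT loc, one_minus_product(p) AS p "
    "FROM HomeOffice GROUP BY loc ORDER BY loc"
).fetchall()
result = dict(rows)
print(result)
```

For SEA the aggregate computes 1 - (1 - 0.5)(1 - 0.2)(1 - 0.1); for NYC it reduces to the single tuple's probability.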
Slide32
SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
Personp: (Jon, Sea) with probability p1
Purchasep: three ‘Gadget’ purchases by Jon with probabilities q1, q2, q3
Plan 1: JOIN, then Π. Wrong!
  JOIN yields (Jon, Sea) three times, with probabilities $p_1 q_1$, $p_1 q_2$, $p_1 q_3$
  Π yields Sea with $1-(1-p_1 q_1)(1-p_1 q_2)(1-p_1 q_3)$
  Not independent!
Plan 2: Π on Purchasep, then JOIN, then Π. Correct
  Π yields Jon with $1-(1-q_1)(1-q_2)(1-q_3)$
  JOIN and Π yield Sea with $p_1 (1-(1-q_1)(1-q_2)(1-q_3))$
Depends on plan!!!
[Dalvi&S’04]
Slide33
Safe Plans
A plan that correctly computes probabilities is called a safe plan
Query Compilation = finding this condition
Q: When are projected tuples independent?
Intuition: A plan is safe if it only multiplies independent probabilities.
[Dalvi&S’04]
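To see why multiplying dependent probabilities goes wrong, the sketch below compares two plans for the Person/Purchase query against a brute-force sum over possible worlds. It assumes one Person tuple with probability p1 and three independent Purchase tuples with probabilities q1, q2, q3; the function names and the numeric values are illustrative only.

```python
from itertools import product


def brute_force(p1, qs):
    """Exact answer: Pr[person tuple present and at least one purchase present]."""
    total = 0.0
    for person_present, *purchases in product([0, 1], repeat=1 + len(qs)):
        weight = p1 if person_present else 1 - p1
        for present, q in zip(purchases, qs):
            weight *= q if present else 1 - q
        if person_present and any(purchases):
            total += weight
    return total


def join_then_project(p1, qs):
    """Unsafe plan: the joined tuples all share the person tuple."""
    prob_none = 1.0
    for q in qs:
        prob_none *= 1 - p1 * q
    return 1 - prob_none


def project_then_join(p1, qs):
    """Safe plan: project the purchases first, then join with the person."""
    prob_none = 1.0
    for q in qs:
        prob_none *= 1 - q
    return p1 * (1 - prob_none)


p1, qs = 0.5, [0.5, 0.5, 0.5]
exact = brute_force(p1, qs)
print(exact, project_then_join(p1, qs), join_then_project(p1, qs))
```

With these values the safe plan matches the exact answer, while the join-then-project plan overestimates it.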
Slide34
A Definition of Independence
Query q is independent on variable x if q{x ← `a’} and q{x ← `b’} are independent events for any distinct constants a, b
  i.e. no tuple is used by both qa and qb.
Example:
  q = R(x,y), S(x,y), T(z,x)
  q{x ← `a’} = R(`a’,y), S(`a’,y), T(z,`a’)
  q{x ← `b’} = R(`b’,y), S(`b’,y), T(z,`b’)
If x is shared in all subgoals of q, and q has no self-joins, then q is independent on x.
Safe Plans: reduce the problem of evaluating q to evaluating q{x ← a} for some a.
Fundamental judgment for large scale QP (GB, TB)
[Dalvi&S’04][R,Dalvi,S’06][R&S’07a][R&S’07b][R,Letchner,B&S’08]
Slide35
Compiling Safe Plans (Top-Down)
Assuming no self-joins, tuple independence. Example coming…
Compile[Query q] returns a plan
  If q is a single subgoal R with no variables then
    return R
  If there exists x s.t. q is independent on x then
    return Π-{x}( Compile[ q{x ← FreshConst()} ] )
  ElsIf q = q1 q2 so that q1 and q2 do not share variables then
    return Join(Compile[q1], Compile[q2])
  Else return “No Safe Plan”
[Dalvi&S’04]
Slide36
Compiling Safe Plans (Top-Down)
Compile[ R(x), S(x,y) ]
  Compile[ R(`a’), S(`a’,y) ]
    Compile( R(`a’) )
    Compile( S(`a’,y) )
      Compile( S(`a’,`b’) )
A safe plan!
  Π-{x}( JOIN( R, Π-{y}( S ) ) )
[Dalvi&S’04]
Assuming no self-joins, tuple independence.
Slide37
Compiling Safe Plans (Top-Down)
Compile(R(x),S(x,y),T(y))
No Safe Plan!
Does our algorithm miss some plans?
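The top-down procedure is short enough to sketch directly. The version below is an illustration, not the authors' implementation: it assumes no self-joins, uses the sufficient test "x occurs in every subgoal" for independence, represents a query as a list of (relation, args) pairs with variables as plain strings, and returns the plan as a string (`None` stands for "No Safe Plan"). All names are hypothetical.

```python
from itertools import count

_fresh_ids = count()


class Const:
    """A constant term; variables are plain strings."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def variables(subgoal):
    _relation, args = subgoal
    return {term for term in args if isinstance(term, str)}


def substitute(query, var, const):
    return [(rel, tuple(const if term == var else term for term in args))
            for rel, args in query]


def split_components(query):
    """Group subgoals that are linked by shared variables."""
    components = []
    for goal in query:
        linked = [c for c in components
                  if any(variables(goal) & variables(g) for g in c)]
        merged = [goal]
        for c in linked:
            merged.extend(c)
            components.remove(c)
        components.append(merged)
    return components


def compile_plan(query):
    if len(query) == 1 and not variables(query[0]):
        return query[0][0]
    # Sufficient test: a variable shared by all subgoals (no self-joins).
    shared = set.intersection(*(variables(g) for g in query))
    if shared:
        x = sorted(shared)[0]
        fresh = Const(f"c{next(_fresh_ids)}")
        sub = compile_plan(substitute(query, x, fresh))
        return None if sub is None else f"Project-{{{x}}}({sub})"
    components = split_components(query)
    if len(components) > 1:
        plans = [compile_plan(c) for c in components]
        if any(p is None for p in plans):
            return None
        return "Join(" + ", ".join(plans) + ")"
    return None  # "No Safe Plan"


safe = compile_plan([("R", ("x",)), ("S", ("x", "y"))])
unsafe = compile_plan([("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))])
print(safe, unsafe)
```

On R(x), S(x,y) the sketch returns a projection on x over a join of R with S projected on y; on R(x), S(x,y), T(y) no variable occurs in every subgoal and the subgoals cannot be split, so it reports no safe plan.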
Assuming no self-joins, tuple independence.
Slide38
Thm: The algorithm is Complete
Qbad :- R(x), S(x,y), T(y)
  Data complexity is #P-complete
Theorem. The following are equivalent:
  Q has PTIME data complexity
  Q admits an extensional plan (and one finds it in PTIME)
  Q does not have Qbad as a subquery
Bottom line: If there is a plan, we find it. If we don’t find a plan, it’s provably hard
NB: the algorithm never looks at the data, so this is query compilation
[Dalvi&S’04]
Slide39
Basic Query Processing Techniques
SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K
Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries
Slide40
Intensional Query Evaluation
Goal: Make relational ops compute Boolean expression f
  Selection (σ): tuple (v, f) yields (v, f)
  JOIN: tuples (v1, f1) and (v2, f2) yield (v1 v2, f1 ˄ f2)
  Projection (Π): tuples (v, f1), (v, f2), … yield (v, f1 ˅ f2 ˅ …)
Tuples = variables in expression
f is a small DNF
NB: f is also known as lineage
Pr[q] reduced to Pr[f is SAT].
Idea: Approximate Pr[f is SAT]
[Fuhr&Roellke’97, Graedel et al. ’98, Dalvi & S ‘04]
Slide41
Monte Carlo Simulation
  Cnt = 0
  repeat N times
    randomly choose X1, X2, X3 in {0,1}
    if E(X1, X2, X3) = 1
      then Cnt = Cnt+1
  P = Cnt/N
  return P  /* estimate of Pr(E) */
E = X1X2 ˅ X1X3 ˅ X2X3
(0/1)-Estimator Theorem. If … then … [Karp,Luby&Madras’89]
  N may be very big (Pr(E) very small)
Naïve:
  Good: Works for any E (not just DNF)
  Bad: Many samples (N) until we get a satisfying assignment
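The naive simulation translates almost line for line into code. In the sketch below, E is the three-monomial DNF from the slide; letting each variable have its own probability (rather than a fair coin) and passing in a seeded random generator are choices of this sketch.

```python
import random


def naive_monte_carlo(event, probs, n_samples, rng):
    """Estimate Pr(event) by sampling each variable independently."""
    count = 0
    for _ in range(n_samples):
        assignment = [rng.random() < p for p in probs]
        if event(*assignment):
            count += 1
    return count / n_samples


def example_dnf(x1, x2, x3):
    # E = X1X2 v X1X3 v X2X3
    return (x1 and x2) or (x1 and x3) or (x2 and x3)


estimate = naive_monte_carlo(example_dnf, [0.5, 0.5, 0.5], 20000, random.Random(0))
print(estimate)
```

With fair coins, 4 of the 8 assignments satisfy E, so the estimate should settle near 0.5. When Pr(E) is tiny, most samples miss E and many more samples are needed, which is the weakness noted above.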
sample
Estimate Pr[E] = 1/6
Slide42
Monte Carlo Simulation
E = X1X2 ˅ X1X3 ˅ X2X3
Improved:
Luby-Karp Theorem. If … then … [Karp,Luby&Madras’89]
  Better now!
Key idea: Estimate overlap of SAT assignments
[Figure: overlapping sets of assignments satisfying X1X2, X1X3, X2X3; samples come from here]
Bottom Line: if E is from an SFW query, this is an efficient technique
1. Pick a monomial (randomly) and satisfy it
2. Pick other vars randomly
3. Count overlap (e.g. an assignment in 2 sets contributes ½)
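The three steps can be sketched as a small estimator for a monotone DNF. The details here are assumptions of the sketch rather than something stated on the slide: monomials are sets of variable indices (positive literals only), a monomial is picked with probability proportional to the probability that it holds, and each sample contributes 1/(number of monomials it satisfies), scaled by the total monomial weight.

```python
import random


def karp_luby(clauses, probs, n_samples, rng):
    """Estimate Pr[DNF is true] for independent variables.

    clauses: list of sets of variable indices (positive literals only).
    probs:   probs[v] is the probability that variable v is true.
    """
    weights = []
    for clause in clauses:
        w = 1.0
        for v in clause:
            w *= probs[v]
        weights.append(w)
    total_weight = sum(weights)

    acc = 0.0
    for _ in range(n_samples):
        # 1. Pick a monomial at random and satisfy it.
        chosen = rng.choices(range(len(clauses)), weights=weights)[0]
        # 2. Pick the other variables randomly.
        assignment = [rng.random() < p for p in probs]
        for v in clauses[chosen]:
            assignment[v] = True
        # 3. Count the overlap: how many monomials this assignment satisfies.
        covered = sum(all(assignment[v] for v in c) for c in clauses)
        acc += 1.0 / covered
    return total_weight * acc / n_samples


dnf = [{0, 1}, {0, 2}, {1, 2}]  # X1X2 v X1X3 v X2X3, zero-based indices
kl_estimate = karp_luby(dnf, [0.5, 0.5, 0.5], 20000, random.Random(1))
print(kl_estimate)
```

Every sample satisfies the DNF by construction, so no sample is wasted; for the example formula with fair coins the estimate should be close to 0.5.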
NB: Because E is a DNF, the assignment still satisfies E
Slide43
Basic Query Processing Techniques
SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K
Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries
Slide44
Motivation for Top-K for SFW queries
Find the top actor in Pulp Fiction who appeared in two bad movies five years earlier
LK is fast in theory…
Naïve: simulate until all intervals are small
[Figure: confidence intervals on a 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, Bruce Willis, with labels 1, 3, 4, 2]
“Confidence intervals” contain the true probability
Can we do better?
[R,Dalvi&S’07]
Slide45
A Better Method: Multisimulation
Separate Top-K with few simulations
  Concentrate on intervals in Top-K
  Asymptotically, confidence intervals are nested
Compare against OPT: “knows” intervals to simulate
[Figure: confidence intervals on a 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, Bruce Willis, with labels 1, 3, 4, 2]
[R,Dalvi&S’07]
Slide46
Key Idea: Critical Region
The critical region is the interval (kth-highest min, (k+1)st-highest max)
For k = 2 [Figure: intervals on a 0.0 to 1.0 axis]
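The definition is easy to compute from a set of confidence intervals. In the sketch below, the interval values and names are invented, and treating the top k as separated once the region is empty (its lower end is at or above its upper end) is an assumption about how the definition is used.

```python
def critical_region(intervals, k):
    """Return (kth-highest min, (k+1)st-highest max).

    intervals: dict name -> (low, high); needs more than k entries.
    """
    lows = sorted((low for low, _high in intervals.values()), reverse=True)
    highs = sorted((high for _low, high in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]


def top_k_separated(intervals, k):
    """Assumption of this sketch: separated once the critical region is empty."""
    lower_end, upper_end = critical_region(intervals, k)
    return lower_end >= upper_end


# Hypothetical confidence intervals for four candidates.
before = {"a": (0.7, 0.9), "b": (0.5, 0.8), "c": (0.2, 0.6), "d": (0.1, 0.3)}
after = {"a": (0.7, 0.9), "b": (0.5, 0.8), "c": (0.3, 0.45), "d": (0.1, 0.3)}

print(critical_region(before, 2), top_k_separated(before, 2))
print(critical_region(after, 2), top_k_separated(after, 2))
```

In the first set the region is (0.5, 0.6), so candidate c could still overtake b; after c's interval shrinks, the region is empty and the top 2 are fixed.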
[R,Dalvi&S’07]
Slide47
Key Idea: Critical Region
The critical region is the interval (kth-highest min, (k+1)st-highest max)
For k = 2 [Figure: intervals on a 0.0 to 1.0 axis]
Separated the top 2
[R,Dalvi&S’07]
Slide48
Three Simple Rules: Rule 1
Pick a “Double Crosser”
OPT must pick this too
[Figure: intervals on a 0.0 to 1.0 axis]
Slide49
Three Simple Rules: Rule 2
If all are lower/upper crossers, then pick the maximal one
OPT must pick this too
[Figure: intervals on a 0.0 to 1.0 axis]
Slide50
Three Simple Rules: Rule 3
Pick an upper and a lower crosser
OPT may only pick 1 of these two
[Figure: intervals on a 0.0 to 1.0 axis]
Slide51
Multisimulation Performance
Thm: Multisimulation performs at most twice as many simulations as OPT
  And, no deterministic algorithm can do better on every instance.
Practice: very slow without low-level optimization
  Still slow with current techniques (slow vs. SQL, not vs. inference)
  Open question!
[R,Dalvi&S’07]
Slide52
SELECT-FROM-WHERE QueriesCompiling Safe QueriesUnsafe Queries (Sampling)Top-K
Aggregation Queries + Probabilities
Top-K + Measures
OLAP Queries
HAVING Queries
52Slide53
3 Semantics for Top-K + Measures
The worst speeder? 2 speeders?
Combine prob + measure
All 3 semantics:
  Create single score
  Return ranked by score
  Differ in score def
License Plate | Speed | P
A-123         | 200   | 0.2
A-123         | 50    | 0.8
B-456         | 75    | 0.9
B-456         | 70    | 0.1
C-789         | 74    | 1
A-123 is either 200 or 50
[Soliman et al ’07] [Zhang&Chomicki’08]
Slide54
Semantic 1: Expectation
The worst speeder? 2 speeders?
Expectation: Score = Expected Speed
License Plate | E[Speed]
A-123         | 80
B-456         | 74.5
C-789         | 74
Top1 = {A-123}
Top2 = {A-123, B-456}
Linear apx, so fast to compute!
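The expectation score is a weighted sum over each plate's alternative speeds. A minimal sketch with the slide's numbers follows; the dictionary layout and function names are illustrative.

```python
# plate -> list of (speed, probability) alternatives, from the slide's table
speeds = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}


def expected_score(alternatives):
    return sum(speed * p for speed, p in alternatives)


def top_k_by_expectation(table, k):
    ranked = sorted(table, key=lambda plate: expected_score(table[plate]),
                    reverse=True)
    return ranked[:k]


scores = {plate: expected_score(alts) for plate, alts in speeds.items()}
top2 = top_k_by_expectation(speeds, 2)
print(scores, top2)
```

One pass over the table suffices, which fits the slide's remark that this semantics is fast to compute. Note that A-123 ranks first even though it is most likely driving at 50.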
License Plate | Speed | Conf
A-123         | 200   | 0.2
A-123         | 50    | 0.8
B-456         | 75    | 0.9
B-456         | 70    | 0.1
C-789         | 74    | 1
E[Speed] for A-123: 200 * 0.2 + 50 * 0.8
Slide55
Semantic 2: U-kRanks
The worst speeder? 2 speeders?
U-kRank: Score(t) = Pr[t at rank k]
License Plate | Rank 1 | Rank 2
A-123         | 0.2    | 0.0
B-456         | 0.72   | 0.14
C-789         | 0.08   | 0.496
Rank 1 for B-456: 0.8 * 0.9
Top1 = {B-456}
Top2 = {B-456, C-789}
NB: Soliman et al consider correlations
[Soliman et al ’07]
License Plate | Speed | Conf
A-123         | 200   | 0.2
A-123         | 50    | 0.8
B-456         | 75    | 0.9
B-456         | 70    | 0.1
C-789         | 74    | 1
Slide56
Semantic 3: Global-Top-K
The worst speeder? 2 speeders?
Global-Top-K: Score(t) = Pr[t in top-k]
[Zhang&Chomicki’08]
License Plate | Top-1 | Top-2
A-123         | 0.2   | 0.2
B-456         | 0.72  | 0.98
C-789         | 0.08  | 0.8
Top1 = {B-456}
Top2 = {B-456, C-789}
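Rank-based scores can be checked on this small table by enumerating possible worlds. The sketch below assumes each plate takes exactly one of its listed speeds, independently of the other plates; under a different model (for example with correlated tuples) the numbers would differ. Function names are illustrative.

```python
from itertools import product

# plate -> list of (speed, probability) alternatives, from the slide's table
speeds = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}


def rank_probabilities(table):
    """at_rank[plate][j] = Pr[plate is at rank j+1], by world enumeration."""
    plates = list(table)
    at_rank = {plate: [0.0] * len(plates) for plate in plates}
    for combo in product(*(table[plate] for plate in plates)):
        prob = 1.0
        for _speed, p in combo:
            prob *= p
        ranking = sorted(zip(plates, combo),
                         key=lambda item: item[1][0], reverse=True)
        for position, (plate, _alt) in enumerate(ranking):
            at_rank[plate][position] += prob
    return at_rank


def global_top_k(table, k):
    """Pr[plate is among the k fastest]."""
    return {plate: sum(ranks[:k])
            for plate, ranks in rank_probabilities(table).items()}


at_rank = rank_probabilities(speeds)
top1 = global_top_k(speeds, 1)
top2 = global_top_k(speeds, 2)
print(top1, top2)
```

Under these assumptions the sketch reproduces the Top-1 column (0.2, 0.72, 0.08) and the 0.98 for B-456 in the top 2; for C-789 in the top 2 it gives 0.82, close to the slide's 0.8.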
License Plate | Speed | Conf
A-123         | 200   | 0.2
A-123         | 50    | 0.8
B-456         | 75    | 0.9
B-456         | 70    | 0.1
C-789         | 74    | 1
Slide57
Comparing the semantics
Z&C’s three properties for top-k [Zhang&Chomicki’08]
Exact k: If the cardinality of the db is large enough, then the top-k has exactly k distinct values
Faithful: If the probability and score of t are higher than those of u, then u in top-k implies t in top-k
Stability: Raising the score/probability of a tuple in the top-k will not remove it from the top-k.
THM [Z&C’08]: Global-top-k has these properties.
Expectation also has these properties
Slide58
Basic Query Processing Outline
SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K
Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries
Slide59
Motivation for OLAP
Customer Relationship Management App
Data is dirty:
  Extracted/Classified from text (e.g. Color, Brake)
  Attributes are non-leaf/ambiguous (e.g. EAST)
Do we need probabilities?
Auto  | Loc  | Cost | Color        | Brake?
F-150 | NY   | $200 | R:1, B:0     | 0.8
F-150 | EAST | $140 | R:0.5, B:0.5 | 1.0
Truck | MA   | $500 | R:1, B:0     | 0.9
Sources of uncertainty: Is it a brake repair? East = NY? East = MA?
[Burdick et al ’05]
Slide60
OLAP Data & Query Model
Auto  | Loc  | Cost | Color        | Brake?
F-150 | NY   | $200 | R:1, B:0     | 0.8
F-150 | EAST | $140 | R:0.5, B:0.5 | 1.0
Truck | MA   | $500 | R:1, B:0     | 0.9
[Figure: grid over Loc (NY, MA, grouped as EAST) and Auto (F-150, RAM, grouped as TRUCKS) with tuples T1, T2, T3; size is not significant]
Query Regions:
  “Cost of F-150 brake repairs in NY”
  “Cost of F-150 brake repairs in EAST”
[Burdick et al ’05]
Slide61
3 Semantics for OLAP
[Table and region figure as on the OLAP Data & Query Model slide; size is not significant]
Sem 1, None. Any uncertainty, ignore tuple.
  Not faithful: Color uncertainty, breaks report!
[Burdick et al ’05]
Slide62
3 Semantics for OLAP
[Table and region figure as on the OLAP Data & Query Model slide; size is not significant]
Sem 2: Contains. Contained in query’s region.
  Not Consistent. NY + MA != East, i.e. Blue + Yellow ≠ Green (t2 not in either.)
[Burdick et al ’05]
Slide63
3 Semantics for OLAP
[Table and region figure as on the OLAP Data & Query Model slide; size is not significant]
Sem 3: Overlaps. Probability in each region
  Consistent for Sum
  Motivation for pDB approach
[Burdick et al ’05]
Slide64
OLAP Algorithms
Answer semantics: expectations
  SUM
  AVG
[Burdick et al ’05]
Tuple contributes to Q
When COUNT big, good approximation [Jayram et al ‘07]
Important, well-studied problem: I/O optimizations, constraints [Burdick et al ’06,07]
Faithful, consistent and efficient!
Difficult to implement!
Slide65
Motivation for HAVING
Profit
Item    | Forecaster | Amount | P
Widget  | Alice      | $-99k  | 0.99
Widget  | Bob        | $100M  | 0.01
Whatsit | Alice      | $1M    | 1
Expectation Style [OLAP Style]:
SELECT SUM(Amount)
FROM Profit
WHERE item='Widget'
Ans: -99k * 0.99 + 100M * 0.01 ~ 900K
HAVING style:
SELECT item FROM Profit
WHERE item='Widget'
GROUP BY item
HAVING SUM(Amount) > 0
Ans: 0.01
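The gap between the two answers is easy to reproduce. The sketch below treats the two Widget forecasts as mutually exclusive alternatives, which is an assumption (their probabilities sum to 1 on the slide); the names are illustrative.

```python
# item -> list of (amount, probability) alternatives, from the Profit table
profit = {
    "Widget": [(-99_000, 0.99), (100_000_000, 0.01)],
    "Whatsit": [(1_000_000, 1.0)],
}


def expected_sum(alternatives):
    """Expectation style: E[SUM(Amount)]."""
    return sum(amount * p for amount, p in alternatives)


def prob_sum_positive(alternatives):
    """HAVING style: Pr[SUM(Amount) > 0] under disjoint alternatives."""
    return sum(p for amount, p in alternatives if amount > 0)


widget_expectation = expected_sum(profit["Widget"])
widget_having = prob_sum_positive(profit["Widget"])
print(widget_expectation, widget_having)
```

The expectation is about 900K and looks like a healthy profit, yet the probability that the Widget makes any profit at all is only 0.01, which is what the HAVING-style query reports.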
[R&S’07]
Slide66
Summary of HAVING results
Safety uses the independence test
  Twist: Safety depends on the aggregate
  If the “plan is safe” then so is COUNT, MIN, MAX
  Not true for SUM and AVG!
Theoretical Algorithms
  Require innovation to make SQL efficient
  Native operators, sort based algorithm, etc.
[R&S’07]
Slide67
Top-K & Aggregation Summary
Diverse semantics driven by applications
  Top K: U-kRanks and Global-top-k
  OLAP & HAVING
  Skylines too! [Pei et al ‘08]
Lots of interest in the community
Conjecture: Aggregation and Top-k are more important for probabilistic databases than RDBMS
  Tuple carries less information
  Many prob tuples not as valuable as 1 correct tuple
Slide68
Take-home messages of Day 1
pDBs used in diverse application domains
  RFID, Information Extraction, Sentiment Analysis
  Value: Higher Recall, without loss of precision
The fundamentals of QP in pDBs
  Compile a safe query to SQL
  Evaluate an unsafe plan (Monte Carlo)
Top-K Semantics for pDBs
OLAP on pDBs
Slide69
Advertisement for Day Two
Applications
  RFID with movies, Smoothed data
Advanced representations
  Lineage, Markov Models, Graphical Models, World Sets, Continuous Function.
Advanced QP
  Lazy Evaluation in Trio, Probabilistic Automaton, Probabilistic Inference, Sampling Technique.
And More!
All sales final. Offer not valid in Alaska, or where prohibited by law.
Slide70
Thank you