System Aspects of Probabilistic Data Management

Uploaded On 2018-12-06

Presentation Transcript

Slide1

System Aspects of Probabilistic Data Management

Magdalena Balazinska, Christopher Ré and Dan Suciu
University of Washington

Slide2

One slide overview of motivation

Data are uncertain in many applications
  Business: Dedup, Info. Extraction
  Data from physical-world: RFID

Probabilistic DBs (pDBs) manage uncertainty
  Integrate, Query, and Build Applications on uncertain data

Value: Higher recall, without loss of precision

DB Niche: Community that knows scale

Slide3

Overview of tutorial

Part I: Basic Query Processing (Today)
  Two Scenarios for pDBs
  A Basic Query & Data Model
  Basic Query Processing Techniques

Highlights:
  The intuition behind safe plans and how to compile them
  Process any SELECT-FROM-WHERE (SFW) query
  Process top-k queries
  Aggregation: Top-k + measures, OLAP, HAVING

Slide4

Overview of tutorial

Part II: Advanced Techniques (Tomorrow)
  Correlations
  Advanced Representation & QP
  Discussion and Open Problems

Highlights:
  Lineage and View Processing (GBs of data)
  Events on Correlated Streams (GBs of Streams)
  Sophisticated Factor Evaluation (Highly Correlated)
  Continuous DBs

Slide5

Hasn’t this been solved?

(an analogy to keep in mind)

                AI                        Databases
Deterministic   Theorem prover            Query processing
Probabilistic   Probabilistic inference   [this talk]

Impact: Fortune 500 companies rely on DBs, but how many have theorem provers?

SCALE

Slide6

Ancillary Material

pDBs have a long history
  Cavallo&Pittarelli ’87
  ProbView [Lakshmanan et al’97]

Many active projects today: Mystiq, Lahar, Trio, MayBMS, Maryland, Orion, MCDB, Wisconsin, IBM, BayesStore, UMass, Waterloo, SFU and more

Many important topics omitted
  Query languages
  XML

Slide7

Overview of tutorial

Part I: Basic Query Processing (Today)
  Two Scenarios for pDBs
  A Basic Query & Data Model
  Basic Query Processing Techniques

Highlights:
  The intuition behind safe plans and how to compile them
  Process any SELECT-FROM-WHERE (SFW) query
  Process top-k queries
  Aggregation: Top-k + measures, OLAP, HAVING

Slide8

Example 1: Querying RFID

[Figure labels: A, B, C, D, E]

Apps: UbiComp, Diary, Social Applications, ...
In general, Event queries [Cayuga, Sase]

Joe entered office 422 at t=8

Query: “Alert when Joe enters 422”
i.e. Joe outside 422, then inside 422

[R,Letchner,B&S’07] [http://rfid.cs.washington.edu]

Slide9

Challenges: Tracking Joe’s Location

6th Floor in CS building
Blue ring is Joe’s Location
Antennas

[RFID Ecosystem @ UW]

Slide10

Challenges: Tracking Joe’s Location

6th Floor in CS building
Blue ring is Joe’s Location
Antennas

Two Problems:
  Missed Readings
  Granularity Mismatch

Model Based View (Probabilistic)

[Deshpande et al 04, Kanagal & Deshpande’08]
[Re et al ‘08, Kanagal & Deshpande’08]

Slide11

Probabilities via particle filter

6th Floor in CS building
Each orange particle is a guess of Joe’s location
Blue ring is ground truth
Antennas

Particles guess many locations per timestep, so data are uncertain

[Doucet et al’01]

Slide12

Probabilities via particle filter

6th Floor in CS building

[R et al ’08] [Kanagal & Deshpande’08]

Tag   t   Loc     P
Joe   7   422     0.4
          Hall3   0.4
          Hall4   0.2
Joe   8   422     0.6
          Hall3   0.2
          Hall4   0.2
Sue   7

“Joe entered 422 at t=8 with probability 0.36”
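
The 0.36 can be reproduced from the table under one simplifying assumption: Joe's locations at t=7 and t=8 are treated as independent (the slides defer Markov correlations between timesteps to Day 2). A minimal sketch; the dictionary `at` and the function name `prob_entered` are illustrative and not part of the tutorial:

```python
# Marginals for tag Joe taken from the table: at[t][loc] = P(Joe is at loc at time t).
at = {
    7: {"422": 0.4, "Hall3": 0.4, "Hall4": 0.2},
    8: {"422": 0.6, "Hall3": 0.2, "Hall4": 0.2},
}

def prob_entered(at, room, t):
    """P(outside `room` at t-1 and inside `room` at t).

    Assumption: the two timesteps are independent, so the joint
    probability is the product of the two marginals.
    """
    p_outside_before = 1.0 - at[t - 1].get(room, 0.0)
    p_inside_now = at[t].get(room, 0.0)
    return p_outside_before * p_inside_now

print(round(prob_entered(at, "422", 8), 2))
```

With correlated timesteps the product rule no longer applies, which is why the deck treats Markov correlations separately.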

Shameless Ad: Markov Correlations on Day 2

Query Particle Filter output via At, a model based view

At(tag, loc)

Slide13

Example 2: Alice Looks for Movies

“I’d like to know which movies are really good…”

IMDB:
  Lots of data!
  Well maintained and clean
  But no reviews!

[R,Dalvi&S’07]

Slide14

[Figure label: IMDB]

On the web there are lots of reviews…
  Which movie is the review about?
  …is the review positive or negative?
  …should I trust the reviewer?

Alice needs:
  Information Extraction
  Fuzzy joins
  Sentiment analysis
  Social networks

Forced to deal with uncertainty

Slide15

Find actors in Pulp Fiction who appeared in two bad movies five years earlier

Find years when ‘Anthony Hopkins’ starred in a good movie

A probabilistic database can help Alice store and query her uncertain data

Alice’s workflow:
  Download reviews
  Information Extraction
  Fuzzy Joins
  Query

[Workflow diagram labels: IMDB, IE, FJ, pDB]

Slide16

Alice needs Information Extraction

...52 A Goregaon West Mumbai ...

Address^p
ID   House-No   Street          City          P
1    52         Goregaon West   Mumbai        0.1
1    52-A       Goregaon West   Mumbai        0.4
1    52         Goregaon        West Mumbai   0.2
1    52-A       Goregaon        West Mumbai   0.2
2    . . . .    . . . .         . . . .       . . . .
2    . . . .

Here probabilities are meaningful

[Gupta&Sarawagi’2006]

Slide17

Queries on IE

Find people living in ‘West Mumbai’

SELECT DISTINCT x.name
FROM Person x, Address^p y
WHERE x.ID = y.ID and y.city = ‘West Mumbai’

Address^p
ID   House-No   Street          City          P
1    52         Goregaon West   Mumbai        0.1
1    52-A       Goregaon West   Mumbai        0.4
1    52         Goregaon        West Mumbai   0.2
1    52-A       Goregaon        West Mumbai   0.2

By    P
Joe   0.4
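
The answer probability 0.4 is the total mass of the extractions whose city is ‘West Mumbai’, because the alternative extractions for one ID are mutually exclusive. A small sketch of that computation, assuming (as the answer table implies) that ID 1 belongs to Joe; all Python names here are illustrative:

```python
# Alternative extractions for ID 1, mutually exclusive: (house_no, street, city, p).
address_p = {
    1: [
        ("52", "Goregaon West", "Mumbai", 0.1),
        ("52-A", "Goregaon West", "Mumbai", 0.4),
        ("52", "Goregaon", "West Mumbai", 0.2),
        ("52-A", "Goregaon", "West Mumbai", 0.2),
    ],
}
person = {1: "Joe"}  # assumption: the Person table maps ID 1 to Joe

def lives_in(city):
    """Probability, per person, that the extracted city equals `city`."""
    answer = {}
    for pid, alternatives in address_p.items():
        p = sum(alt[3] for alt in alternatives if alt[2] == city)
        if p > 0:
            answer[person[pid]] = p
    return answer

def lives_in_most_likely_only(city):
    """Same query after keeping only the single most likely extraction."""
    answer = []
    for pid, alternatives in address_p.items():
        best = max(alternatives, key=lambda alt: alt[3])
        if best[2] == city:
            answer.append(person[pid])
    return answer

print(lives_in("West Mumbai"))
print(lives_in_most_likely_only("West Mumbai"))
```

The second function shows the recall problem: the most likely extraction places Joe in ‘Mumbai’, so a deterministic pipeline misses him.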

If we kept only the most likely extraction, the query would return the empty set

Slide18

Queries on IE

Find people living in ‘West Mumbai’

SELECT DISTINCT x.name
FROM Person x, Address^p y
WHERE x.ID = y.ID and y.city = ‘West Mumbai’

Today: keep only the most likely extraction: low recall.
pDBs keep all extractions: higher recall.

Find people of the same age, living in the same city

SELECT DISTINCT x.name, u.name
FROM Person x, Address^p y, Person u, Address^p v
WHERE x.ID = y.ID and y.city = v.city and u.ID = v.ID

Slide19

Alice needs Fuzzy Joins

IMDB
Title                Year
Twelve Monkeys       1995
Monkey Love 1997     1997
Monkey Love 1935     1935
Monkey Love Planet   2005

Reviews
Review        By    Rating
12 Monkeys    Joe   4
Monkey Boy    Jim   2
Monkey Love   Joe   2

titles don’t match

Slide20

Result of a Fuzzy Join

TitleReviewMatch^p
Movie                Review        P
Twelve Monkeys       12 Monkeys    0.7
Monkey Love 1997     12 Monkeys    0.45
Monkey Love 1935     Monkey Love   0.82
Monkey Love 1935     Monkey Boy    0.68
Monkey Love Planet   Monkey Love   0.8

Higher scores, more likely to match

[Gravano et al ’01, Arasu’06]

Slide21

Queries over Fuzzy Joins

IMDB
Title            Year
Twelve Monkeys   1995
Monkey Love 97   1997
Monkey Love 35   1935
Monkey Love PL   2005

Reviews
Review        By    Rating
12 Monkeys    Joe   4
Monkey Boy    Jim   2
Monkey Love   Joe   2

TitleReviewMatch^p
Movie                Review        P
Twelve Monkeys       12 Monkeys    0.7
Monkey Love 97       12 Monkeys    0.45
Monkey Love 35       Monkey Love   0.82
Monkey Love 35       Monkey Boy    0.68
Monkey Love Planet   Monkey Love   0.8

Who reviewed movies made in 1935?

SELECT DISTINCT z.By
FROM IMDB x, TitleReviewMatch^p y, Amazon z
WHERE x.title=y.title and x.year=1935 and y.review=z.review

Answer (Ranked!):
By      P
Joe     0.73
Fred    0.68
Jim     0.43
. . .   0.12

Find movies reviewed by Jim and Joe

SELECT DISTINCT x.Title
FROM IMDB x, TitleReviewMatch^p y1, Amazon z1, TitleReviewMatch^p y2, Amazon z2
WHERE . . . z1.By=‘Joe’ . . . z2.By=‘Jim’ . . .

Answer:
Title        P
Gone with…   0.73
Amadeus      0.68
. . .        0.43

Slide22

Application Summary

pDBs can manage outputs of great techniques
  RFID: Particle Filters, HMMs
  Alice needs: Fuzzy Joins, IE, Sentiment Analysis

Value over standard RDBMSs: Recall
To keep precision high, need ranking (by prob)

Major Theme: Get high quality efficiently!

Slide23

Overview of tutorial

Part I: Basic Query Processing
  Two Scenarios for pDBs
  A Basic Query & Data Model
  Basic Query Processing Techniques

Slide24

Simple Probabilistic DB (pDB)

HasObject^p
Object     Time   Person   P
Laptop77   9:07   John     0.62
                  Jim      0.34
Book302    9:18   Mary     0.45
                  John     0.33
                  Fred     0.11

Column roles: Keys (Object, Time), Non-keys (Person), Probability (P)

What does it mean?

[Barbara et al. ‘92]

Slide25

Possible Worlds Semantics

PDB: HasObject^p
Object     Time   Person   P
Laptop77   9:07   John     p1
                  Jim      p2
Book302    9:18   Mary     p3
                  John     p4
                  Fred     p5

Possible worlds: instances of HasObject(Object, Time, Person)
World   Laptop77 9:07   Book302 9:18   Probability
1       John            Mary           p1 p3
2       John            John           p1 p4
3       John            Fred
4       Jim             Mary
5       Jim             John
6       Jim             Fred
7       John            (absent)       p1 (1 - p3 - p4 - p5)
8       Jim             (absent)
9       (absent)        Mary
10      (absent)        John
11      (absent)        Fred
12      (absent)        (absent)

Distribution over possible worlds

[Fagin,Halpern,Megiddo’90]

Slide26

Two Approaches to Queries

Standard queries, probabilistic answers (This tutorial)
  Query: “find all movies with rating > 4”
  Answers: list of tuples with probabilities

Queries with explicit probabilities (MayBMS [Koch ’08])
  Query: find all Movie-review matches with probability in [0.3, 0.8]
  Answer: …

Slide27

Possible Worlds Query Semantics

PDB: HasObject^p
Object     Time   Person   P
Laptop77   9:07   John     p1
                  Jim      p2
Book302    9:18   Mary     p3
                  John     p4
                  Fred     p5

[Figure: the twelve possible worlds of HasObject]

Query: “John has laptop77 and doesn’t have book302”

Worlds that satisfy the query:
Laptop77 9:07   Book302 9:18   Probability
John            Mary           p1 p3
John            Fred           p1 p5
John            (absent)       p1 (1 - p3 - p4 - p5)

p1 p3 + p1 p5 + p1 (1 - p3 - p4 - p5) = p1 (1 - p4)
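
The identity p1 (1 - p4) can be checked by brute force: enumerate every possible world, keep those where the query holds, and add up their probabilities. The sketch below plugs in the numbers from the Simple Probabilistic DB table (p1 = 0.62, p4 = 0.33, and so on); the function names are illustrative. This is the naive evaluation that query processing tries to avoid.

```python
from itertools import product

# Each key (object, time) has mutually exclusive alternatives; keys are independent.
has_object_p = {
    ("Laptop77", "9:07"): {"John": 0.62, "Jim": 0.34},
    ("Book302", "9:18"): {"Mary": 0.45, "John": 0.33, "Fred": 0.11},
}

def possible_worlds(pdb):
    """Yield (world, probability); world maps each key to a person or None."""
    keys = list(pdb)
    options = []
    for key in keys:
        alts = list(pdb[key].items())
        alts.append((None, 1.0 - sum(pdb[key].values())))  # the key is absent
        options.append(alts)
    for choice in product(*options):
        world = {k: who for k, (who, _) in zip(keys, choice)}
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

def query(world):
    """John has Laptop77 and does not have Book302."""
    return (world[("Laptop77", "9:07")] == "John"
            and world[("Book302", "9:18")] != "John")

worlds = list(possible_worlds(has_object_p))
answer = sum(p for w, p in worlds if query(w))
print(len(worlds), round(answer, 4))
```

The number of worlds grows exponentially with the number of keys, so enumeration is only useful as a correctness check on small examples.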

QP Goal: Compute cleverly, directly

Slide28

Overview of Part I

Part I: Basic Query Processing (TODAY)
  Motivating Applications
  A Simple Data Model (Representation)
  Basic Query Processing Techniques

Slide29

Basic Query Processing Outline

SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K

Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries

Natural start, workhorse RDBMS queries.
Believe these are very important for applications

Slide30

Extensional Query Evaluation

Goal: Make relational ops compute probabilities

σ:     (v, p)  →  (v, p)
JOIN:  (v1, p1), (v2, p2)  →  (v1 v2, p1 p2)
Π:     (v, p1), (v, p2), …  →  (v, 1-(1-p1)(1-p2)…)
       Removes Duplicates: “Not all are false”

Why? It’s SQL-scale and SQL-fast

[Fuhr&Roellke’97, Dalvi & S ‘04]

Slide31

Extensional Plan to SQL

HomeOffice
Person   Loc   p
Bob      SEA   p1
Joe      NYC   p2
Jon      SEA   p3
Jeff     SEA   p4

SELECT DISTINCT loc
FROM HomeOffice

Plan: Π{-person}  (NB: Remove attribute person)

Loc   P
SEA   1-(1-p1)(1-p3)(1-p4)
NYC   p2

Translation:
SELECT loc, 1 - PRODUCT(1-p) as p
FROM HomeOffice
GROUP BY loc

Important point: Extensional Evaluation is SQL, so SQL fast

So pDBs are just SQL, but…

[Fuhr&Roellke’97, Dalvi & S ‘04]
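
The translated query can be tried out with Python's `sqlite3` module. SQLite has no built-in PRODUCT aggregate, so this sketch registers one with `create_aggregate`; the numeric probabilities are made-up stand-ins for p1..p4, and the tuples are assumed independent.

```python
import sqlite3

class Product:
    """User-defined aggregate that multiplies its inputs."""
    def __init__(self):
        self.value = 1.0

    def step(self, x):
        self.value *= x

    def finalize(self):
        return self.value

conn = sqlite3.connect(":memory:")
conn.create_aggregate("PRODUCT", 1, Product)
conn.execute("CREATE TABLE HomeOffice (person TEXT, loc TEXT, p REAL)")
conn.executemany(
    "INSERT INTO HomeOffice VALUES (?, ?, ?)",
    # Made-up probabilities standing in for p1, p2, p3, p4.
    [("Bob", "SEA", 0.5), ("Joe", "NYC", 0.9),
     ("Jon", "SEA", 0.3), ("Jeff", "SEA", 0.2)],
)
rows = conn.execute(
    "SELECT loc, 1 - PRODUCT(1 - p) AS p "
    "FROM HomeOffice GROUP BY loc ORDER BY loc"
).fetchall()
result = dict(rows)
print(result)
```

The whole probability computation runs inside the database engine as an ordinary GROUP BY, which is the point the slide makes about extensional evaluation.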

Slide32

SELECT DISTINCT x.City
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Cust and y.Product = ‘Gadget’

Person^p:   (Jon, Sea) with probability p1
Purchase^p: three tuples for Jon with probabilities q1, q2, q3

Plan 1: JOIN, then Π
  After JOIN:  (Jon, Sea) p1 q1;  (Jon, Sea) p1 q2;  (Jon, Sea) p1 q3
  After Π:     Sea   1-(1-p1 q1)(1-p1 q2)(1-p1 q3)
  Wrong! Not independent!

Plan 2: Π on Purchase^p, then JOIN
  After Π:     Jon   1-(1-q1)(1-q2)(1-q3)
  After JOIN:  Sea   p1 (1-(1-q1)(1-q2)(1-q3))
  Correct

Depends on plan !!!

[Dalvi&S’04]

Slide33

Safe Plans

A plan that correctly computes probabilities is called a safe plan

Intuition: A plan is safe if it only multiplies independent probabilities.

Q: When are projected tuples independent?

Query Compilation = finding this condition

[Dalvi&S’04]
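
To see the intuition numerically, the sketch below revisits the Person/Purchase example from the previous slide with made-up values for p1 and q1..q3. It evaluates both plans and compares them with the true probability obtained by enumerating all possible worlds. Variable names are illustrative.

```python
from itertools import product

p1 = 0.5                # Person^p tuple (Jon, Sea); made-up value
q = [0.5, 0.4, 0.2]     # Purchase^p tuples for Jon; made-up values

# Plan 1: join first, then project, treating the joined tuples as independent.
none_joined = 1.0
for qi in q:
    none_joined *= 1 - p1 * qi
join_then_project = 1 - none_joined

# Plan 2: project Purchase^p first, then join with Person^p.
none_bought = 1.0
for qi in q:
    none_bought *= 1 - qi
project_then_join = p1 * (1 - none_bought)

# Ground truth: enumerate the presence/absence of the four independent tuples.
truth = 0.0
for bits in product([0, 1], repeat=1 + len(q)):
    prob = 1.0
    for present, pr in zip(bits, [p1] + q):
        prob *= pr if present else 1 - pr
    if bits[0] and any(bits[1:]):   # Jon's tuple exists and some purchase exists
        truth += prob

print(join_then_project, project_then_join, truth)
```

Plan 1 multiplies the factors (1 - p1 qi), which all depend on the same Person^p tuple, so its result drifts away from the true value; Plan 2 only combines independent events and matches it.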

Slide34

A Definition of Independence

Query q is independent on variable x if q{x ← ‘a’} and q{x ← ‘b’} are independent events for any distinct constants a, b
(No tuple used by both qa and qb.)

If x is shared in all subgoals of q, then q is independent on x.
(And no Self-Joins)

q = R(x,y), S(x,y), T(z,x)
q{x ← ‘a’} = R(‘a’,y), S(‘a’,y), T(z,‘a’)
q{x ← ‘b’} = R(‘b’,y), S(‘b’,y), T(z,‘b’)

Safe Plans: reduce the problem of evaluating q to q{x ← a} for some a.

Fundamental judgment for large scale QP (GB, TB)

[Dalvi&S’04][R,Dalvi,S’06][R&S’07a][R&S’07b][R,Letchner,B&S’08]

Slide35

Compiling Safe Plans (Top-Down)

Compile[Query q] returns A plan
  If single subgoal R with no variables then
    return R
  If exists x s.t. q is independent on x then
    return Π-{x}( Compile[ q{x ← FreshConst()} ] )
  ElsIf q = q1 q2 so that qi do not share variables then
    return Join(Compile[q1], Compile[q2])
  Else return “No Safe Plan”

Assuming no self-joins, tuple indep.

Example coming…

[Dalvi&S’04]

Slide36

Compiling Safe Plans (Top-Down)

Compile[Query q] returns A plan
  If single subgoal R with no variables then
    return R
  If exists x s.t. q is independent on x then
    return Π-{x}( Compile[ q{x ← FreshConst()} ] )
  ElsIf q = q1 q2 so that qi do not share variables then
    return Join(Compile[q1], Compile[q2])
  Else return “No Safe Plan”

Compile[ R(x), S(x,y) ]
  Compile[ R(‘a’), S(‘a’,y) ]
    Compile( R(‘a’) )
    Compile( S(‘a’,y) )
      Compile( S(‘a’,‘b’) )

A safe plan!
  Π-{x}( JOIN( R, Π-{y}( S ) ) )

Assuming no self-joins, tuple indep.

[Dalvi&S’04]

Slide37

Compiling Safe Plans (Top-Down)

Compile( R(x), S(x,y), T(y) )
No Safe Plan!

Does our algorithm miss some plans?

Compile[Query q] returns A plan
  If single subgoal R with no variables then
    return R
  If exists x s.t. q is independent on x then
    return Π-{x}( Compile[ q{x ← FreshConst()} ] )
  ElsIf q = q1 q2 so that qi do not share variables then
    return Join(Compile[q1], Compile[q2])
  Else return “No Safe Plan”
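
The Compile procedure can be sketched in a few lines of Python. The encoding is an assumption made for this illustration only: a query is a list of subgoals `(relation, args)`, variables are plain names such as `"x"`, and constants start with `"#"`. The independence test used is the sufficient condition from the independence slide (x occurs in every subgoal, no self-joins).

```python
from itertools import count

_fresh = count()

def is_var(term):
    # Convention of this sketch: constants start with "#", everything else is a variable.
    return not term.startswith("#")

def variables(subgoal):
    return {t for t in subgoal[1] if is_var(t)}

def substitute(query, x, const):
    return [(rel, tuple(const if t == x else t for t in args)) for rel, args in query]

def components(query):
    """Group subgoals into connected components linked by shared variables."""
    comps = []
    for sg in query:
        linked = [c for c in comps if any(variables(sg) & variables(o) for o in c)]
        merged = [sg]
        for c in linked:
            merged.extend(c)
            comps.remove(c)
        comps.append(merged)
    return comps

def compile_plan(query):
    """Return a plan as a nested tuple, or None when there is no safe plan."""
    if len(query) == 1 and not variables(query[0]):
        return query[0][0]                        # base relation R
    shared = set.intersection(*(variables(sg) for sg in query))
    if shared:                                    # q is independent on x
        x = sorted(shared)[0]
        sub = compile_plan(substitute(query, x, f"#c{next(_fresh)}"))
        return None if sub is None else ("project_out", x, sub)
    comps = components(query)
    if len(comps) > 1:                            # q = q1 q2, no shared variables
        q1 = comps[0]
        q2 = [sg for c in comps[1:] for sg in c]
        left, right = compile_plan(q1), compile_plan(q2)
        return None if left is None or right is None else ("join", left, right)
    return None                                   # "No Safe Plan"

print(compile_plan([("R", ("x",)), ("S", ("x", "y"))]))
print(compile_plan([("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]))
```

The first call reproduces the plan from the worked example, a projection on x over a join of R with S projected on y; the second call returns `None` for R(x), S(x,y), T(y).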

Assuming no self-joins, tuple indep.

Slide38

Thm: The algorithm is Complete

Qbad :- R(x), S(x,y), T(y)
Data complexity is #P-complete

Theorem. The following are equivalent:
  Q has PTIME data complexity
  Q admits an extensional plan (and one finds it in PTIME)
  Q does not have Qbad as a subquery

Bottom line: If there is a plan, we find it. If we don’t find a plan, it’s provably hard

NB: never looked at the data, so this is query compilation

[Dalvi&S’04]

Slide39

Basic Query Processing Techniques

SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K

Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries

Slide40

Intensional Query Evaluation

Goal: Make relational ops compute Boolean expression f

σ:     (v, f)  →  (v, f)
JOIN:  (v1, f1), (v2, f2)  →  (v1 v2, f1 ˄ f2)
Π:     (v, f1), (v, f2)  →  (v, f1 ˅ f2)

Tuples = variables in expression
f is a small DNF
Pr[q] reduced to Pr[f is SAT].
NB: f is also known as lineage

Idea: Approximate Pr[f is SAT]

[Fuhr&Roellke’97, Graedel et al. ’98, Dalvi & S ‘04]

Slide41

Monte Carlo Simulation

Set Cnt = 0
repeat N times
  randomly choose X1, X2, X3 in {0,1}
  if E(X1, X2, X3) = 1
    then Cnt = Cnt+1
P = Cnt/N
return P   /* estimate of Pr(E) */
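
A runnable version of this loop is sketched below for the three-variable DNF E = X1X2 ˅ X1X3 ˅ X2X3 used on these slides. One generalization is an assumption of this sketch: each variable is drawn with its own probability (`probs`) rather than uniformly from {0,1}. Function names are illustrative.

```python
import random

def naive_monte_carlo(event, probs, n, rng):
    """Estimate Pr(event) by sampling every variable independently n times."""
    count = 0
    for _ in range(n):
        xs = [1 if rng.random() < p else 0 for p in probs]
        if event(xs):
            count += 1
    return count / n

def E(xs):
    x1, x2, x3 = xs
    return bool((x1 and x2) or (x1 and x3) or (x2 and x3))

rng = random.Random(0)   # fixed seed so that runs are repeatable
estimate = naive_monte_carlo(E, [0.5, 0.5, 0.5], 20000, rng)
print(estimate)
```

With all probabilities at 0.5 the exact value is 0.5 (four of the eight assignments satisfy E). When Pr(E) is tiny, almost every sample misses E, which is the weakness the next slide addresses.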

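Naïve: E = X1X2 ˅ X1X3 ˅ X2X3

(0/1)-Estimator Theorem. If [condition on N not recoverable from the transcript] then [guarantee not recoverable from the transcript]
N may be very big (Pr(E) very small)

Good: Works for any E (not just DNF)
Bad: Many samples (N) until get a sat assignment

[Figure: samples; Estimate Pr[E] = 1/6]

[Karp,Luby&Madras’89]

Slide42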

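Monte Carlo Simulation

Improved: E = X1X2 ˅ X1X3 ˅ X2X3

Luby-Karp Theorem. If [condition on N not recoverable from the transcript] then [guarantee not recoverable from the transcript]
Better now!

[Figure: overlapping sets of satisfying assignments for X1X2, X1X3, X2X3; samples are drawn from these sets]

Key idea: Estimate overlap of SAT assignments

Bottom Line: if E comes from an SFW query, this is an efficient technique

[Karp,Luby&Madras’89]

1. Pick a monomial (randomly) and satisfy it
2. Pick other vars randomly
3. Count overlap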
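
The three steps can be sketched as follows. This is one common variant of the Karp-Luby estimator (averaging 1/overlap over the samples), written under the assumptions that all literals are positive and all variables are independent; it is not claimed to be the exact formulation used in the tutorial. Names are illustrative.

```python
import random

def karp_luby(clauses, probs, n, rng):
    """Estimate Pr(DNF); each clause is a list of (positive) variable indices."""
    weights = []
    for clause in clauses:
        w = 1.0
        for v in clause:
            w *= probs[v]
        weights.append(w)          # probability that this monomial is true
    total = sum(weights)
    acc = 0.0
    for _ in range(n):
        # 1. Pick a monomial with probability proportional to its weight.
        chosen = rng.choices(range(len(clauses)), weights=weights)[0]
        # 2. Pick all variables at random, then force the chosen monomial true.
        xs = [1 if rng.random() < p else 0 for p in probs]
        for v in clauses[chosen]:
            xs[v] = 1
        # 3. Count the overlap: how many monomials this assignment satisfies.
        overlap = sum(all(xs[v] for v in clause) for clause in clauses)
        acc += 1.0 / overlap
    return total * acc / n

rng = random.Random(0)
clauses = [[0, 1], [0, 2], [1, 2]]      # X1X2 v X1X3 v X2X3 (0-based indices)
estimate = karp_luby(clauses, [0.5, 0.5, 0.5], 20000, rng)
print(estimate)
```

Every sample satisfies E by construction, so no sample is wasted; an assignment that lies in several monomials' sets is down-weighted by its overlap count.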

In 2 sets, so contributes ½

NB: Because E is a DNF, the sample still satisfies E

Slide43

Basic Query Processing Techniques

SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K

Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries

Slide44

Motivation for Top-K for SFW queries

LK is fast in theory…

Find the top actor in Pulp Fiction who appeared in two bad movies five years earlier

[Figure: confidence intervals on a 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, Bruce Willis (labels 1, 3, 4, 2)]

“Confidence intervals” contain true probability

Naïve: Sim until all small

Can we do better?

[R,Dalvi&S’07]

Slide45

A Better Method: Multisimulation

Separate Top-K with few simulations
Concentrate on intervals in Top-K
Asymptotically, confidence intervals are nested
Compare against OPT: “knows” intervals to simulate

[Figure: confidence intervals on a 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, Bruce Willis (labels 1, 3, 4, 2)]

[Slide footer: Evaluating Complex SQL on PDBs, 12/8/2006]

[R,Dalvi&S’07]

Slide46

Key Idea: Critical Region

The critical region is the interval (kth-highest min, (k+1)st-highest max)

For k = 2

[Figure: confidence intervals on a 0.0 to 1.0 axis]
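
A small sketch of the definition, assuming each candidate answer carries a confidence interval `(low, high)`. The stopping test (the top k are separated once the critical region is empty) is my reading of the multisimulation idea; the interval values are made up.

```python
def critical_region(intervals, k):
    """Return (kth-highest lower bound, (k+1)st-highest upper bound).

    intervals: list of (low, high) pairs, one per candidate; needs more than k entries.
    """
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def top_k_separated(intervals, k):
    c, d = critical_region(intervals, k)
    return c >= d      # empty critical region: the top k are determined

before = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.55), (0.0, 0.3)]
after = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.45), (0.0, 0.3)]   # third interval shrunk

print(critical_region(before, 2), top_k_separated(before, 2))
print(critical_region(after, 2), top_k_separated(after, 2))
```

In the `before` case the third candidate's upper bound still reaches above the second candidate's lower bound, so more simulation steps are needed; after shrinking that interval the top 2 are separated.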

[R,Dalvi&S’07]

Slide47

Key Idea: Critical Region

The critical region is the interval (kth-highest min, (k+1)st-highest max)

For k = 2

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Separated the top 2

[R,Dalvi&S’07]

Slide48

Three Simple Rules: Rule 1

Pick a “Double Crosser”
OPT must pick this too

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Slide49

Three Simple Rules: Rule 2

If all crossers are lower crossers, or all are upper crossers, pick a maximal one
OPT must pick this too

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Slide50

Three Simple Rules: Rule 3

Pick an upper and a lower crosser
OPT may only pick 1 of these two

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Slide51

Multisimulation Performance

Thm: Multisimulation performs at most twice as many simulations as OPT
And no deterministic algorithm can do better on every instance.

Practice: very slow w/o low-level optimization
Still slow with current techniques. Open question!
Slow v. SQL, not inference

[R,Dalvi&S’07]

Slide52

Basic Query Processing Outline

SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K

Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries

Slide53

3 Semantics for Top-K + Measures

The worst speeder? 2 speeders?
Combine prob + measure

All 3 semantics:
  Create single score
  Return ranked by score
  Differ in score def

License Plate   Speed   P
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

A-123 either 200 or 50

[Soliman et al’07][Zhang&Chomicki’08]

Slide54

Semantic 1: Expectation

The worst speeder? 2 speeders?

Expectation: Score = Expected Speed

License Plate   E[Speed]
A-123           80
B-456           74.5
C-789           74

Top1 = {A-123}
Top2 = {A-123, B-456}

Linear apx, so fast to compute!
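
The expected-speed scores follow directly from the speed table of this example. A minimal sketch; the dictionary layout and names are illustrative:

```python
# Alternative speed readings per license plate: (speed, probability).
readings = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}

def expected_speed(alternatives):
    return sum(speed * p for speed, p in alternatives)

scores = {plate: expected_speed(alts) for plate, alts in readings.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranking[:1], ranking[:2])
```

Each score needs a single pass over that plate's alternatives, which is why this semantics is cheap; the price is that A-123 ranks first even though it is most likely driving at 50.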

License Plate   Speed   Conf
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

E[Speed] for A-123: 200 * .2 + 50 * .8

Slide55

Semantic 2: U-kRanks

The worst speeder? 2 speeders?

U-kRanks: Score(t) = Pr[t at rank k]

License Plate   Rank 1   Rank 2
A-123           0.2      0.0
B-456           0.72     0.14
C-789           0.08     0.496

Top1 = {B-456}
Top2 = {B-456, C-789}

Pr[B-456 at rank 1] = 0.8 * 0.9

License Plate   Speed   Conf
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

NB: Soliman et al consider correlations

[Soliman et al’07]

Slide56

Semantic 3: Global-Top-K

The worst speeder? 2 speeders?

Global-Top-K: Score(t) = Pr[t in top-k]

[Zhang&Chomicki’08]

License Plate   Top-1   Top-2
A-123           0.2     0.2
B-456           0.72    0.98
C-789           0.08    0.8

Top1 = {B-456}
Top2 = {B-456, C-789}
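
These scores can be recomputed by enumerating possible worlds. The sketch assumes that each plate takes exactly one of its listed speeds per world, that plates are independent, and that there are no ties; names are illustrative. It reproduces the Top-1 column and the Top-2 entries for A-123 and B-456, while for C-789 the enumeration gives 0.82 where the table lists 0.8.

```python
from itertools import product

readings = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}

def top_k_probabilities(readings, k):
    """Pr[plate is among the k fastest], summed over all possible worlds."""
    plates = list(readings)
    score = {plate: 0.0 for plate in plates}
    for choice in product(*(readings[p] for p in plates)):
        prob = 1.0
        for _, p in choice:
            prob *= p
        speeds = {plate: speed for plate, (speed, _) in zip(plates, choice)}
        ranked = sorted(plates, key=speeds.get, reverse=True)
        for plate in ranked[:k]:
            score[plate] += prob
    return score

top1 = top_k_probabilities(readings, 1)
top2 = top_k_probabilities(readings, 2)
print(top1)
print(top2)
```

Ranking the plates by `top2` and keeping the two largest scores yields {B-456, C-789}, matching the Top2 answer above.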

License Plate   Speed   Conf
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

Slide57

Comparing the semantics

Z&C’s three properties for top-k [Zhang&Chomicki’08]

Exact k: If the cardinality of the db is large, then the top-k has exactly k distinct values
Faithful: If the probability and score of t are higher than those of u, then u in top-k implies t in top-k
Stability: Raising the score/probability of a tuple in the top-k will not remove it from the top-k.

THM [Z&C’08]: Global-top-k has these properties.

Expectation also has these properties

Slide58

Basic Query Processing Outline

SELECT-FROM-WHERE Queries
  Compiling Safe Queries
  Unsafe Queries (Sampling)
  Top-K

Aggregation Queries + Probabilities
  Top-K + Measures
  OLAP Queries
  HAVING Queries

Slide59

Motivation for OLAP

Customer Relationship Management App

Data is dirty:
  Extracted/Classified from text (e.g. Color, Brake)
  Attributes are non-leaf/ambiguous (e.g. EAST)

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

Sources of uncertainty:
  Is it a brake repair?
  East = NY? East = MA?

Do we need probabilities?

[Burdick et al’05]

Slide60

OLAP Data & Query Model

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

[Figure: query-region grid with location labels NY, MA, EAST, auto labels F-150, RAM, TRUCKS, and tuples T1, T2, T3; size is not significant]

Query Regions:
  “Cost of F-150 brake repairs in NY”
  “Cost of F-150 brake repairs in EAST”

[Burdick et al’05]

Slide61

3 Semantics for OLAP

[Table and query-region figure as on the OLAP Data & Query Model slide; size is not significant]

Sem 1, None. Any uncertainty, ignore tuple.

Not faithful: Color uncertainty, breaks report!

[Burdick et al’05]

Slide62

3 Semantics for OLAP

[Table and query-region figure as on the OLAP Data & Query Model slide; size is not significant]

Sem 2: Contains. Contained in query’s region.

Not Consistent. NY + MA != East
i.e. Blue + Yellow ≠ Green (t2 not in either.)

[Burdick et al’05]

Slide63

3 Semantics for OLAP

[Table and query-region figure as on the OLAP Data & Query Model slide; size is not significant]

Sem 3: Overlaps. Probability in each region

Consistent for Sum

Motivation for pDB approach

[Burdick et al’05]

Slide64

OLAP Algorithms

Answer semantics: expectations
  SUM [Burdick et al ’05]
  AVG

Tuple contributes to Q
When COUNT big, good approximation [Jayram et al ‘07]

Important, well-studied problem: I/O optimizations, constraints [Burdick et al ’06,07]

Faithful, consistent and efficient!

Difficult to implement!

Slide65

Motivation for HAVING

Item      Forecaster   Amount   P
Widget    Alice        $-99k    0.99
          Bob          $100M    0.01
Whatsit   Alice        $1M      1

Expectation Style [OLAP Style]

SELECT SUM(Amount)
FROM Profit
WHERE item=‘Widget’

Ans: -99k * .99 + 100M * 0.01 ~ 900K

HAVING style

SELECT item
FROM Profit
WHERE item=‘Widget’
GROUP BY item
HAVING SUM(Amount) > 0

Ans: 0.01
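
The two answers can be reproduced in a few lines. The sketch assumes the two Widget forecasts are mutually exclusive alternatives (their probabilities sum to 1); variable names are illustrative.

```python
# Alternative forecasts for Widget: (amount in dollars, probability).
widget = [(-99_000, 0.99), (100_000_000, 0.01)]

# Expectation style: E[SUM(Amount)].
expected_sum = sum(amount * p for amount, p in widget)

# HAVING style: Pr[SUM(Amount) > 0], summing the worlds where the predicate holds.
prob_sum_positive = sum(p for amount, p in widget if amount > 0)

print(expected_sum, prob_sum_positive)
```

The expectation looks like a healthy profit of roughly 900K, yet the probability of any profit at all is only 0.01, which is the contrast the two query styles are meant to show.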

(Table name: Profit)

[R&S’07]

Slide66

Summary of HAVING results

Safety uses the independence test
  Twist: Safety depends on the aggregate
  If the “plan is safe” then so is COUNT, MIN, MAX
  Not true for SUM and AVG!

Theoretical Algorithms
  Require innovation to make SQL efficient
  Native operators, sort based algorithm, etc.

[R&S’07]

Slide67

Top-K & Aggregation Summary

Diverse semantics driven by applications
  Top K: U-kRanks and Global-top-k
  OLAP & HAVING
  Skylines too! [Pei et al ‘08]

Lots of interest in the community

Conjecture: Aggregation and Top-k are more important for probabilistic databases than RDBMS
  Tuple carries less information
  Many prob tuples not as valuable as 1 correct tuple

Slide68

Take-home messages of Day 1

pDBs used in diverse application domains
  RFID, Information Extraction, Sentiment Analysis
  Value: Higher Recall, without loss of precision

The fundamentals of QP in pDBs
  Compile a safe query to SQL
  Evaluate an unsafe plan (Monte Carlo)
  Top-K Semantics for pDBs
  OLAP on pDBs

Slide69

Advertisement for Day Two

Applications
  RFID with movies, Smoothed data

Advanced representations
  Lineage, Markov Models, Graphical Models, World Sets, Continuous Function.

Advanced QP
  Lazy Evaluation in Trio, Probabilistic Automaton, Probabilistic Inference, Sampling Technique.

And More!

All sales final. Offer not valid in Alaska, or where prohibited by law.

Slide70

Thank you
