
Presentation Transcript

Slide1

Top-K Query Evaluation on Probabilistic Data

Christopher Ré, Nilesh Dalvi, and Dan Suciu

University of Washington

Slide2

Evaluating Complex SQL on PDBs

12/8/2006

High Level Overview

DBMS: Precise answers over clean data

Data are often imprecise

Information Integration

Information Extraction

Probabilistic DBs (PDBs) handle imprecision

Many low quality answers

Top-K ranked by probability

This talk: Compute Top-K efficiently

Slide3

Overview

Motivating Example

Query Processing Background

Multisimulation

Experimental Results

Slide4

Overview

Motivating Example

Query Processing Background

Multisimulation

Experimental Results

Slide5

Example Application

IMDB

Lots of interesting data about movies (e.g. actors, directors)

Well maintained and clean

But no reviews!

On the web there are lots of reviews

How will I know which movie they are about?

Alice needs to do information extraction and object reconciliation.

Is a movie good or bad?

Alice wants to do sentiment analysis.

A probabilistic database can help Alice store and query her uncertain data.

Find all years where ‘Anthony Hopkins’ starred in a good movie

Slide6

Imprecision is out there…

Object Reconciliation

Data extracted from Reviews

RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

Clean IMDB Data

MID    Title
m232   12 Monkeys
m143   Monkey Love

Fellegi-Sunter Approach: Score (s) each (RID,MID) pair

[Figure: score axis with thresholds t’ and t separating "No Match" from "Match"]

Our Approach: Convert scores to probabilities

Output: (RID,MID) pairs

Slide7

Imprecision is out there…

Object Reconciliation

RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

MID    Title
m232   12 Monkeys
m143   Monkey Love

RID    MID    Prob
r175   m232   0.8
r175   m143   0.2

Fellegi-Sunter Approach: Score (s) each (RID,MID) pair

[Figure: score axis with thresholds t’ and t separating "No Match" from "Match"]

Slide8

Overview

Motivating Example

Query Processing Background

Multisimulation

Experimental Results

Slide9

Query Processing Background

RID    MID    Prob
r175   m232   0.8
r175   m143   0.2

Query Processing builds event expression

Intensional Query Processing [FR97]

Associate to each tuple an event

Probability event is satisfied = query value

Technical Point: Projection as last operator implies result is a DNF

Slide10

DNF Sampling at a High Level

Estimate p(t), the probability that the DNF is satisfied

Do for each output tuple t

#P-Hard [Valiant79] even if only conjunctive queries [RDS06,DS04]

Randomized Approximation [LK84]

Simulation reduces uncertainty
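The randomized approximation cited as [LK84] appears to refer to the Karp-Luby Monte Carlo estimator for the probability of a DNF formula. The following is a minimal Python sketch of that estimator, not the authors' implementation. It assumes that every clause is a conjunction of positive literals and that the tuple events are independent; the slides state neither. The names `dnf`, `p`, `exact_prob`, and `karp_luby`, and the example probabilities, are hypothetical.

```python
import math
import random
from itertools import product


def exact_prob(dnf, p):
    """Exact P(DNF is true) by enumerating all worlds (exponential; for checking only)."""
    variables = sorted(p)
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        if any(all(world[v] for v in clause) for clause in dnf):
            total += math.prod(p[v] if world[v] else 1.0 - p[v] for v in variables)
    return total


def karp_luby(dnf, p, n_samples, rng):
    """Karp-Luby style Monte Carlo estimate of P(DNF is true)."""
    # Probability of each clause on its own (assumes independent positive literals).
    weights = [math.prod(p[v] for v in clause) for clause in dnf]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(n_samples):
        # Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(dnf)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = {v: (v in dnf[i]) or (rng.random() < p[v]) for v in p}
        # Count the sample only if clause i is the first satisfied clause,
        # so every satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(dnf) if all(world[v] for v in c))
        hits += int(first == i)
    return total * hits / n_samples


if __name__ == "__main__":
    # Hypothetical event expression (x AND y) OR (y AND z); probabilities are made up.
    dnf = [("x", "y"), ("y", "z")]
    p = {"x": 0.5, "y": 0.8, "z": 0.5}
    print(exact_prob(dnf, p), karp_luby(dnf, p, 20_000, random.Random(0)))
```

Each additional sample tightens the estimate, which is the sense in which simulation reduces the uncertainty about p(t).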

[Figure: an interval on an axis from 0.0 to 1.0, labeled "Uncertain about p(t)"]

Slide11

Naïve Query Processing

Naïve algorithm (PTIME): Simulate until all intervals are small (“epsilon”-small)
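The naïve strategy can be sketched in Python as follows. This is an illustration under stated assumptions, not the method from the talk: each output tuple is modeled by a hypothetical zero-argument `simulate` callable that returns one Bernoulli sample, and the confidence interval is a Hoeffding bound chosen for simplicity, standing in for the DNF sampling the slides describe. The names `naive_top_k`, `hoeffding_interval`, `eps`, and `delta` are invented for this example.

```python
import math
import random


def hoeffding_interval(successes, trials, delta):
    """Two-sided Hoeffding interval for a Bernoulli mean, clipped to [0, 1]."""
    half = math.sqrt(math.log(2.0 / delta) / (2.0 * trials))
    mean = successes / trials
    return max(0.0, mean - half), min(1.0, mean + half)


def naive_top_k(simulators, k, eps, delta=0.01):
    """Simulate every tuple until all intervals are narrower than eps, then rank."""
    counts = {t: [0, 0] for t in simulators}  # tuple -> [successes, trials]
    while True:
        # One more simulation step for every tuple, whether or not it matters.
        for t, simulate in simulators.items():
            counts[t][0] += bool(simulate())
            counts[t][1] += 1
        intervals = {t: hoeffding_interval(s, n, delta) for t, (s, n) in counts.items()}
        if all(hi - lo < eps for lo, hi in intervals.values()):
            break
    # Rank by the point estimate and keep the k best.
    ranked = sorted(intervals, key=lambda t: counts[t][0] / counts[t][1], reverse=True)
    return ranked[:k], intervals


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up probabilities for four hypothetical output tuples.
    true_p = {"t1": 0.9, "t2": 0.6, "t3": 0.35, "t4": 0.1}
    sims = {t: (lambda q=q: rng.random() < q) for t, q in true_p.items()}
    print(naive_top_k(sims, k=2, eps=0.1)[0])
```

The sketch spends the same effort on every tuple, including those whose intervals already lie far from the top-k boundary.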

[Figure: intervals on an axis from 0.0 to 1.0 for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, with the labels 1, 3, 4, 2]

Can we do better?

Slide12

Overview

Motivating Example

Query Processing Background

Multisimulation

Experimental Results

Slide13

A Better Method: Multisimulation

Separate Top-K with few simulations

Concentrate on intervals in Top-K

Asymptotically, confidence intervals are nested

Compare against OPT, which “knows” which intervals to simulate

[Figure: intervals on an axis from 0.0 to 1.0 for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, with the labels 1, 3, 4, 2]

Slide14

The Critical Region

The critical region is the interval (k-th highest min, (k+1)-st highest max)

For k = 2
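The definition can be written out directly. The following Python sketch uses k = 2 in its example, as on the slide. It assumes each interval is a `(min, max)` pair; the function names and the example intervals are hypothetical. Treating an empty critical region as the signal that the top k are separated is an interpretation of the slide, not a statement taken from it.

```python
def critical_region(intervals, k):
    """Return (c, d): c is the k-th highest min, d is the (k+1)-st highest max."""
    if not 0 < k < len(intervals):
        raise ValueError("need 0 < k < number of intervals")
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


def top_k_separated(intervals, k):
    """True when the critical region is empty (assumed stopping condition)."""
    c, d = critical_region(intervals, k)
    return c >= d


if __name__ == "__main__":
    # Made-up intervals; with k = 2 the critical region is (0.2, 0.7).
    overlapping = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
    print(critical_region(overlapping, 2), top_k_separated(overlapping, 2))
```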

[Figure: intervals on an axis from 0.0 to 1.0 with the critical region marked]

Slide15

Three Simple Rules: Rule 1

Pick a “Double Crosser”

OPT must pick this too

[Figure: intervals on an axis from 0.0 to 1.0]

Slide16

Three Simple Rules: Rule 2

If all crossers are lower crossers (or all are upper crossers), then pick a maximal one

OPT must pick this too

[Figure: intervals on an axis from 0.0 to 1.0]

Slide17

Three Simple Rules: Rule 3

Pick an upper and a lower crosser

OPT may only pick 1 of these two
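The three rules can be combined into one selection step. The following Python sketch reflects one reading of the slides and should not be taken as the authors' algorithm. It assumes that a lower crosser is an interval strictly containing the lower end c of the critical region, an upper crosser strictly contains the upper end d, a double crosser contains both, and "maximal" means widest. The function names, the tie-breaking by name, and the fallback for the case with no crosser are all assumptions of this example.

```python
def critical_region(intervals, k):
    """(k-th highest min, (k+1)-st highest max) over a dict name -> (min, max)."""
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]


def pick_to_simulate(intervals, k):
    """Return the names of the intervals to simulate next, or [] when done."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []  # empty critical region: top-k is separated

    def width(name):
        lo, hi = intervals[name]
        return hi - lo

    def widest(names):
        # Sorting first makes tie-breaking deterministic.
        return max(sorted(names), key=width)

    lower = {t for t, (lo, hi) in intervals.items() if lo < c < hi}
    upper = {t for t, (lo, hi) in intervals.items() if lo < d < hi}
    double = lower & upper
    if double:
        return [widest(double)]  # Rule 1: a double crosser
    if lower and upper:
        return [widest(upper), widest(lower)]  # Rule 3: one upper and one lower
    if lower or upper:
        return [widest(lower or upper)]  # Rule 2: only one kind, take a maximal one
    return [widest(intervals)]  # fallback for ties; not part of the slides


if __name__ == "__main__":
    # Made-up intervals: "a" contains the whole critical region (0.3, 0.8).
    print(pick_to_simulate({"a": (0.1, 0.9), "b": (0.2, 0.8), "c": (0.3, 0.7)}, k=1))
```

A full multisimulation loop would call `pick_to_simulate`, run one simulation step on each returned interval, update the intervals, and repeat until the list is empty.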

[Figure: intervals on an axis from 0.0 to 1.0]

Slide18

Multisimulation is a 2-Approx

Thm: Multisimulation performs at most twice as many simulations as OPT

And, no deterministic algorithm can do better on every instance.

Extensions

Top-K Set (shown)

Anytime (produce from 1 to k)

Rank (produce top k ranked)

All (rank all intervals)

Slide19

Overview

Motivating Example

Query Processing Background

Multisimulation

Experimental Results

Slide20

Experiment Details:

Uncertain tuples

Table           # Tuples
StringMatch     339k
ActorMatch      6,758k
DirectorMatch   18k

Table           # Tuples
Reviews         292k

Slide21

Running Time

Slide22

Running Time

“Find all years in which Anthony Hopkins was in a highly rated movie” (SS)

Small Number of Tuples Output (33)

Small DNFs per Output (Avg. 20.4, Max. 63)

Slide23

Running Time

“Find all directors who have a highly rated drama but low rated comedy” (LL)

Large Number of Tuples Output (1415)

Large DNFs per Output (Avg. 234.8, Max. 9088)

Slide24

Conclusions

Mystiq is a general purpose probabilistic database

Multisimulation and Logical Optimization key to performance on large data sets

Advert: Demo on my laptop

Slide25

Running Time

“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)

Small Number of Tuples Output (33)

Large DNFs per Output (Avg. 117.7, Max. 685)

Slide26

Running Time

“Find all directors in the 80s who had a highly rated movie” (LS)

Large Number of Tuples Output (3259)

Small DNFs per Output (Avg. 3.03, Max. 30)

Slide27

[Figure: intervals on an axis from 0.0 to 1.0 for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis]

Slide28

[Figure: intervals on an axis from 0.0 to 1.0 for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, with the labels 1, 3, 4, 2]

Slide29

[Figure: axis from 0.0 to 1.0]

Slide30

[Figure: axis from 0.0 to 1.0]

Slide31

[Figure: axis from 0.0 to 1.0]

Slide32

[Figure: axis from 0.0 to 1.0]

Slide33

[Figure: axis from 0.0 to 1.0]