/
Christopher Re and Dan Christopher Re and Dan

Christopher Re and Dan - PowerPoint Presentation

marina-yarberry
marina-yarberry . @marina-yarberry
Follow
375 views
Uploaded On 2015-09-20

Christopher Re and Dan - PPT Presentation

Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database High level Overview Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs ID: 135117

sum tuple present safe tuple sum safe present item plans monoid convolution random profit query count marginal monoids semantics

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Christopher Re and Dan" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Christopher Re and Dan SuciuUniversity of Washington

Efficient Evaluation of HAVING Queries on a Probabilistic DatabaseSlide2

High level Overview

Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs:

HAVING in SQL,

e.g. is the SUM(profit) > 100k?Looking for optimal algorithms (dichotomies): For all queries q with aggregate A wantP time algorithm, call this A-Safe [DS04,DS07]Some instance s.t. q is hard (#P).Technique: In safe plans, use multiplicationIn A-safe plans, use convolution (on monoids)

2Slide3

Motivation

Item

Forecaster

AmountPWidgetAlice$-99k

0.99

Bob

$100M0.01WhatsitAlice$1M1

SELECT SUM(Amount)FROM ProfitWHERE item=‘Widget’

SELECT item FROM ProfitWHERE item =‘Widget’GROUP BY itemHAVING SUM(Amount) > 0

Expectation Style [Prior Art]

HAVING style

Ans: -99k *.99 +100M*0.01 ~900K

Ans: 0.01

Profit

3Slide4

OverviewPreliminaries

Formal Problem Description

Query plans and

DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions4Slide5

SELECT ITEM FROM PROFIT

WHERE ITEM=‘Widget’

GROUP BY ITEM

HAVING SUM(PROFIT) > 0HAVING Query semantics

NB

: Assume SQL-like semantics

Conjunctive rule: No repeated symbolsAggregatesComparision: k, is a constant

5Slide6

Probabilistic Semantics

NB

: In paper, allow

disjoint tuples

Possible worlds

, model

Query SemanticsIn talk, restrict to tuple independence6Slide7

Complexity and formal problem

Data complexity: Fix Query. Instance grows.

In practice, query is small.

Consider k,

i.e.

1000, as part of the input

Skeleton,

7Slide8

OverviewPreliminaries

Formal Problem Description

Query plans and

DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions8Slide9

Monoids and Semirings

NB

: n=1 is logical OR

A

monoid

is a triple where M is a set and + is associative with identity 0.e.g.Commutative Semiring isBoth are commutative monoids* distributes over +

e.g. a Boolean algebra9Slide10

Fix a Semiring S.Annotation

is a function to S with finite support

Plans defined inductively:

[GKT07] : Datalog + Semirings

10Slide11

Goal: define value of tuple t in a plan P,

support

, i.e. tuples contributing to a valueValue of a plan, i.e, the annotation computes[GKT07] Inductive definition

11Slide12

Annotations and HAVING

X

Y

A10B100

C

1

t(Y)112

Monoid

sum is 1 iff all values are bigger than 3

0.20.4

0.1

probabilities

0 is tuple

not present1 is tuple present, y > 3

2 is

tuple

present,

Monoids

and Aggregates

How can we deal with probabilities?

12Slide13

OverviewPreliminaries

Formal Problem Description

Query plans and

DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions13Slide14

An M-random variable (rv) is

Correlations

r,s

are independent if for any m,m’ in MExtended to sets via total independenceMonoid Random Variables

14Slide15

Monoid

Convolutions

Let r be an

rv

. A

marginal vector is

The

monoid convolution * (depending on +) is

15Slide16

Convolutions

Convolutions are efficient, if M is not too big

If

r,s

monoid

rvs then r+s is an rv defined as PROP: If r,s are independent then the distribution of r + s is given by convolution:PROP: The convolution of n r.v.s can be computed in Single convolution in time

Convolution is associative.16Slide17

OverviewPreliminaries

Formal Problem Description

Query plans and

DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions17Slide18

Annotations and HAVING

X

Y

A10B100

C

1

t(Y)112

Monoid

sum is 1 iff all values are bigger than 3

0.20.4

0.1

probabilities

(0.8,0.2,0)

(0.6,0.4,0)

(0.9,0,0.1)

Marginal of 1 after convolution = value of query

0 is

tuple

not present

1 is

tuple

present, y > 3

marginal vectors

2 is

tuple

present,

Monoids

and Aggregates

18Slide19

Compute value of “Safe Plans”:

Plan is

safe

[DS04], if all projects and joins are independent tuples, else #PTHM: value is correct if the plan is safe. “Safe plans” for semirings

Only

efficient

if the

semiring is “small”

Gives dicohotomy for MIN,MAX,COUNT – not the others19Slide20

Additional ResultsDichotomy for SUM,AVG,COUNT DISTINCT

Not all safe plans allowed!

e.g.

cannot have independent projections “on top”Disjoint tuples in the paperNeed a “disjoint projection” operation More work for dichotomiesAlgorithms for finding safe plans (P time)20Slide21

ConclusionSemantic for aggregation queries on

prob

DBs

Similar to HAVING in SQLProposed a complexity measure for such queriesCentral technique was marginal vectors and convolutionsDichotomy for HAVING queries w.o. self-joins21Slide22

22Slide23

Conjunctive rule: No repeated subgoals

Aggregates

Comparision

: k, is a constantSELECT ITEM FROM PROFITWHERE ITEM=‘Widget’GROUP BY ITEMHAVING SUM(PROFIT) > 0

HAVING Query semantics

NB

: Assume SQL-like semantics

23Slide24

Annotations and HAVING

X

Y

A10B100

C

1

t(Y)112

Monoid

sum is 1 iff all values are bigger than 3

0.20.4

0.1

probabilities

(0.8,0.2,0)

(0.6,0.4,0)

(0.9,0,0.1)

Marginal of 1 after convolution = value of query

0 is

tuple

not present

1 is

tuple

present, y > 3

marginal vectors

2 is

tuple

present,

Monoids

and Aggregates

24