Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database High level Overview Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs ID: 135117
Download Presentation The PPT/PDF document "Christopher Re and Dan" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Christopher Re and Dan SuciuUniversity of Washington
Efficient Evaluation of HAVING Queries on a Probabilistic DatabaseSlide2
High level Overview
Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs:
HAVING in SQL,
e.g. is the SUM(profit) > 100k?Looking for optimal algorithms (dichotomies): For all queries q with aggregate A wantP time algorithm, call this A-Safe [DS04,DS07]Some instance s.t. q is hard (#P).Technique: In safe plans, use multiplicationIn A-safe plans, use convolution (on monoids)
2Slide3
Motivation
Item
Forecaster
AmountPWidgetAlice$-99k
0.99
Bob
$100M0.01WhatsitAlice$1M1
SELECT SUM(Amount)FROM ProfitWHERE item=‘Widget’
SELECT item FROM ProfitWHERE item =‘Widget’GROUP BY itemHAVING SUM(Amount) > 0
Expectation Style [Prior Art]
HAVING style
Ans: -99k *.99 +100M*0.01 ~900K
Ans: 0.01
Profit
3Slide4
OverviewPreliminaries
Formal Problem Description
Query plans and
DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions4Slide5
SELECT ITEM FROM PROFIT
WHERE ITEM=‘Widget’
GROUP BY ITEM
HAVING SUM(PROFIT) > 0HAVING Query semantics
NB
: Assume SQL-like semantics
Conjunctive rule: No repeated symbolsAggregatesComparision: k, is a constant
5Slide6
Probabilistic Semantics
NB
: In paper, allow
disjoint tuples
Possible worlds
, model
Query SemanticsIn talk, restrict to tuple independence6Slide7
Complexity and formal problem
Data complexity: Fix Query. Instance grows.
In practice, query is small.
Consider k,
i.e.
1000, as part of the input
Skeleton,
7Slide8
OverviewPreliminaries
Formal Problem Description
Query plans and
DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions8Slide9
Monoids and Semirings
NB
: n=1 is logical OR
A
monoid
is a triple where M is a set and + is associative with identity 0.e.g.Commutative Semiring isBoth are commutative monoids* distributes over +
e.g. a Boolean algebra9Slide10
Fix a Semiring S.Annotation
is a function to S with finite support
Plans defined inductively:
[GKT07] : Datalog + Semirings
10Slide11
Goal: define value of tuple t in a plan P,
support
, i.e. tuples contributing to a valueValue of a plan, i.e, the annotation computes[GKT07] Inductive definition
11Slide12
Annotations and HAVING
X
Y
A10B100
C
1
t(Y)112
Monoid
sum is 1 iff all values are bigger than 3
0.20.4
0.1
probabilities
0 is tuple
not present1 is tuple present, y > 3
2 is
tuple
present,
Monoids
and Aggregates
How can we deal with probabilities?
12Slide13
OverviewPreliminaries
Formal Problem Description
Query plans and
DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions13Slide14
An M-random variable (rv) is
Correlations
r,s
are independent if for any m,m’ in MExtended to sets via total independenceMonoid Random Variables
14Slide15
Monoid
Convolutions
Let r be an
rv
. A
marginal vector is
The
monoid convolution * (depending on +) is
15Slide16
Convolutions
Convolutions are efficient, if M is not too big
If
r,s
monoid
rvs then r+s is an rv defined as PROP: If r,s are independent then the distribution of r + s is given by convolution:PROP: The convolution of n r.v.s can be computed in Single convolution in time
Convolution is associative.16Slide17
OverviewPreliminaries
Formal Problem Description
Query plans and
DatalogMonoid Random Variables and ConvolutionsMax,Min,Count and hints for othersConclusions17Slide18
Annotations and HAVING
X
Y
A10B100
C
1
t(Y)112
Monoid
sum is 1 iff all values are bigger than 3
0.20.4
0.1
probabilities
(0.8,0.2,0)
(0.6,0.4,0)
(0.9,0,0.1)
Marginal of 1 after convolution = value of query
0 is
tuple
not present
1 is
tuple
present, y > 3
marginal vectors
2 is
tuple
present,
Monoids
and Aggregates
18Slide19
Compute value of “Safe Plans”:
Plan is
safe
[DS04], if all projects and joins are independent tuples, else #PTHM: value is correct if the plan is safe. “Safe plans” for semirings
Only
efficient
if the
semiring is “small”
Gives dicohotomy for MIN,MAX,COUNT – not the others19Slide20
Additional ResultsDichotomy for SUM,AVG,COUNT DISTINCT
Not all safe plans allowed!
e.g.
cannot have independent projections “on top”Disjoint tuples in the paperNeed a “disjoint projection” operation More work for dichotomiesAlgorithms for finding safe plans (P time)20Slide21
ConclusionSemantic for aggregation queries on
prob
DBs
Similar to HAVING in SQLProposed a complexity measure for such queriesCentral technique was marginal vectors and convolutionsDichotomy for HAVING queries w.o. self-joins21Slide22
22Slide23
Conjunctive rule: No repeated subgoals
Aggregates
Comparision
: k, is a constantSELECT ITEM FROM PROFITWHERE ITEM=‘Widget’GROUP BY ITEMHAVING SUM(PROFIT) > 0
HAVING Query semantics
NB
: Assume SQL-like semantics
23Slide24
Annotations and HAVING
X
Y
A10B100
C
1
t(Y)112
Monoid
sum is 1 iff all values are bigger than 3
0.20.4
0.1
probabilities
(0.8,0.2,0)
(0.6,0.4,0)
(0.9,0,0.1)
Marginal of 1 after convolution = value of query
0 is
tuple
not present
1 is
tuple
present, y > 3
marginal vectors
2 is
tuple
present,
Monoids
and Aggregates
24