Slide 1
Lineage Processing over Correlated Probabilistic Databases
Bhargav Kanagal, Amol Deshpande
University of Maryland
Slide 2
Motivation: Information Extraction/Integration [Gupta & Sarawagi 2006, Jayram et al. 2006]
Structured entities are extracted from text on the internet:
- SENTIMENT ANALYSIS produces the Reputed table
- ADDRESS SEGMENTATION (e.g., "...located at 52 A Goregaon West Mumbai ...") produces the Location table
- INFORMATION EXTRACTION over car ads produces the CarAds table
The extraction process introduces CORRELATIONS among the extracted tuples.
Slide 3
Why Lineage Processing? [Das Sarma et al. 2006]
Query: List all "reputed" car sellers in "Mumbai" who offer Honda cars, over the Reputed, Location, and CarAds tables:

SELECT SellerId
FROM Location, CarAds, Reputed
WHERE reputation = 'good'
  AND city = 'Mumbai'
  AND Location.SellerId = CarAds.SellerId
  AND CarAds.SellerId = Reputed.SellerId

Each result tuple carries a boolean lineage formula over the base tuples; we need to compute the probability of that boolean formula.
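To make "compute the probability of the boolean formula" concrete, here is a minimal brute-force sketch. The tuple names and probabilities are invented for illustration, and the example treats the tuples as independent, which the correlated setting of this talk deliberately does not assume:

```python
from itertools import product

# Hypothetical marginal probabilities for three base tuples (illustration only;
# PrDB handles correlated tuples, which this independent example ignores).
p = {"loc_mumbai": 0.9, "ad_honda": 0.8, "rep_good": 0.7}

def lineage_prob(formula, probs):
    """Probability that a boolean lineage formula is true, by brute-force
    enumeration of all possible worlds (exponential, so only viable for
    tiny formulas)."""
    vars_ = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(vars_)):
        world = dict(zip(vars_, bits))
        weight = 1.0
        for v, b in world.items():
            weight *= probs[v] if b else 1 - probs[v]
        if formula(world):
            total += weight
    return total

# Lineage of a result tuple of the 3-way join: all three base tuples must exist.
conj = lambda w: w["loc_mumbai"] and w["ad_honda"] and w["rep_good"]
print(lineage_prob(conj, p))  # ≈ 0.504 = 0.9 * 0.8 * 0.7
```

The rest of the talk is about doing this computation scalably when the variables are correlated, where such enumeration is hopeless.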
Slide 4
Motivation: RFID-based Event Monitoring [RFID Ecosystem UW, Diao et al. 2009, Letchner et al. 2009, KD 2008]
A building is instrumented with RFID readers to track assets/personnel, producing events such as found(PC, X, 2pm) with prob = 0.9.
RFID readings are noisy: readers miss readings and add spurious readings, so the readings are subjected to probabilistic modeling. Probabilities are associated with events, with spatial and temporal correlations.
Example: Was the PC correctly transferred from room A to the conference room?
found(x, PC) ∧ found(z, PC) ∧ [found(y1, PC) ∨ found(y2, PC)]
Slide 5
PrDB System Overview [Kanagal & Deshpande SIGMOD 2009, SDG08, www.cs.umd.edu/~amol/PrDB/]
A relational DBMS storing data tables, uncertainty parameters, and INDSEP indexes, with a query processor consisting of a PARSER and an INDSEP manager.
Users insert data + correlations:

insert into reputation values ('z1', 219, uncertain('Good 0.5; Bad 0.5'));
insert factor '0 0 1; 1 1 1' in address on 'y1.e', 'y2.e';

and issue SPJ queries, inference queries, and aggregation queries.
Slide 6
Outline
- Motivation & Problem definition [done]
- Background
  - Probabilistic Databases as Junction trees
  - Query processing over Junction trees
  - INDSEP
- Lineage Processing over Junction trees
- Lineage Processing using INDSEP
- Results
Slide 7
Background: ProbDBs as Junction trees
Tuple uncertainty: each tuple's existence is a random variable (1 if the tuple exists, 0 otherwise); attribute uncertainty is converted to tuple uncertainty. The "?" in the Exists column becomes a named random variable:

id | Y  | Exists?
1  | 34 | a
2  | 33 | b
3  | 25 | c
.. | .. | ..
5  | 11 | q

Correlations among these variables are captured by a concise encoding of the joint probability distribution: a forest of junction trees. Query evaluation is performed directly over the junction trees.
Slide 8
Background: Junction trees
- Each clique and separator stores a joint pdf (its POTENTIAL): e.g., cliques p(a,b,c) and p(b,c,d) connected by the separator p(b,c).
- The tree structure reflects the Markov property: given b, c, the variable a is independent of d.
- The joint distribution is the product of the clique potentials divided by the separator potentials: p(a,b,c,d) = p(a,b,c) p(b,c,d) / p(b,c).
- Example marginal to compute: p(a,d).
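The factorization above can be checked mechanically on a toy example. The sketch below builds the slide's two-clique tree from a made-up joint distribution (the numbers are ours, chosen so that a is independent of d given {b, c}) and verifies the junction-tree identity before computing the marginal p(a,d):

```python
from itertools import product

# Toy junction tree matching the slide: cliques {a,b,c} and {b,c,d} joined by
# separator {b,c}. The numbers are made up; we build the potentials from an
# explicit joint so the tree is calibrated by construction.
joint = {}
for a, b, c, d in product([0, 1], repeat=4):
    joint[(a, b, c, d)] = (1 + a + 2 * b) * (1 + c + d)  # a independent of d given {b,c}
z = sum(joint.values())
joint = {k: v / z for k, v in joint.items()}

def marg(dist, keep):
    """Sum a dict-based distribution down to the variable positions in `keep`."""
    out = {}
    for assign, pr in dist.items():
        key = tuple(assign[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

p_abc = marg(joint, (0, 1, 2))  # clique potential p(a,b,c)
p_bcd = marg(joint, (1, 2, 3))  # clique potential p(b,c,d)
p_bc = marg(joint, (1, 2))      # separator potential p(b,c)

# Junction-tree identity: joint = product of cliques / product of separators.
for a, b, c, d in product([0, 1], repeat=4):
    recon = p_abc[(a, b, c)] * p_bcd[(b, c, d)] / p_bc[(b, c)]
    assert abs(recon - joint[(a, b, c, d)]) < 1e-12

p_ad = marg(joint, (0, 3))      # the marginal p(a,d) asked for on the slide
print(p_ad)
```

Real systems never materialize the full joint; they work on the (much smaller) clique and separator potentials directly, which is exactly what the next slides exploit.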
Slide 9
Marginal Computation, e.g., of {b, c, n}
Keep the query variables and their correlations; remove all others: build the Steiner tree connecting them and send messages toward a given PIVOT node.
For ProbDBs with ≈ 1 million tuples this is not scalable:
- The span of the query can be very large: almost the complete database is accessed even for a 3-variable query.
- Searching for cliques is expensive: a linear scan over all the nodes is inefficient.
Slide 10
Shortcut Potentials
How can we make marginal computation scalable? (e.g., cut 100 ops down to 50 ops)
A shortcut potential for a partition is a junction tree on the set of variables in its boundary separators, e.g., {c, f, g, j, k, l, m}: the distribution required to completely shortcut the partition.
Which shortcut potentials should we build?
Slide 11
INDSEP - Overview
INDSEP is obtained by hierarchical partitioning of the junction tree: a Root with index nodes I1, I2, I3 over the leaf partitions P1, ..., P6.
Each index node stores:
- Variables of its children: {a, b, ...}, {c, f, ...}, {j, n, ..., q}
- Child separators: p(c), p(j)
- The tree induced on the children
- Shortcut potentials of the children: {p(c), p(c,j), p(j)}
Actual construction: [Kanagal & Deshpande SIGMOD 2009]
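As a reading aid, the per-node contents listed above can be sketched as a small data structure. The field names and the `route` helper below are our assumptions for illustration, not the paper's actual structures:

```python
from dataclasses import dataclass, field

# Illustrative sketch of an INDSEP index node mirroring the contents listed on
# this slide. Field names and the `route` helper are assumptions, not the
# paper's actual data structures.
@dataclass
class IndsepNode:
    variables: set                                           # all variables under this subtree
    child_separators: list = field(default_factory=list)     # e.g. separators over ("c",), ("j",)
    induced_tree: dict = field(default_factory=dict)         # tree induced on the children
    shortcut_potentials: list = field(default_factory=list)  # children's shortcut pdfs
    children: list = field(default_factory=list)

def route(node, query_vars):
    """Return the children a marginal query must descend into: exactly those
    whose variable sets intersect the query."""
    return [c for c in node.children if c.variables & set(query_vars)]

# Toy two-level index: I1 covers {a, b, c}, I3 covers {j, n, ..., q}.
i1 = IndsepNode(variables={"a", "b", "c"})
i3 = IndsepNode(variables={"j", "n", "o", "p", "q"})
root = IndsepNode(variables=i1.variables | i3.variables, children=[i1, i3])
print([sorted(c.variables) for c in route(root, {"b", "n"})])  # both children touched
```

The next slide's recursion is exactly this routing applied level by level, stitching the children's answers together with the stored shortcut potentials.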
Slide 12
Computing Marginals using INDSEP [Kanagal & Deshpande SIGMOD 2009]
Recursion on INDSEP for the query {b, c, n}: the Root splits the query among its children, sending {b, c} to I1 and {n} to I3; the children return distributions over {b, c} and {j, n}; together with the shortcut potential over {c, j}, these form an intermediate junction tree from which p(b, c, n) is computed.
Slide 13
Outline
- Motivation & Problem definition [done]
- Background [done]
  - Junction trees & Query processing over junction trees
  - INDSEP
- Lineage Processing over Junction trees
- Lineage Processing using INDSEP
- Results
Slide 14
Lineage Processing
Lineages are typically classified into 2 types:
- Read-Once: (a ∧ b) ∨ (c ∧ d)
- Non-Read-Once: (a ∧ b) ∨ (b ∧ c) ∨ (c ∧ d)
The problem of lineage processing is #P-complete in general for correlated probabilistic databases, even for read-once lineages (by reduction from #DNF).
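The hardness result above is notable because for *independent* base tuples a read-once formula is easy: since every variable appears exactly once, its probability can be computed bottom-up in linear time. A minimal sketch (the probabilities are invented):

```python
# For independent variables, read-once formulas evaluate bottom-up:
#   p(x AND y) = p(x) * p(y)
#   p(x OR y)  = 1 - (1 - p(x)) * (1 - p(y))
# Correlations break this; that is why even read-once lineages are #P-hard
# in the correlated setting discussed on this slide.

def p_and(*ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

def p_or(*ps):
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

# (a AND b) OR (c AND d) with independent a, b, c, d, each with probability 0.5:
pa, pb, pc, pd = 0.5, 0.5, 0.5, 0.5
print(p_or(p_and(pa, pb), p_and(pc, pd)))  # 1 - (1 - 0.25)^2 = 0.4375
```

Note these identities are only sound because each variable occurs once; applying them to the non-read-once formula above would double-count b and c.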
Slide 15
Lineage Processing on Junction trees
Naïve approach for (a ∧ b) ∨ (c ∧ d): evaluate a marginal query over the variables in the formula, then alternately multiply and eliminate:
p(a, b, c, d)
→ multiply with p(a∧b | a, b): p(a, b, a∧b, c, d)
→ eliminate a, b: p(a∧b, c, d)
→ multiply with p(c∧d | c, d): p(a∧b, c, d, c∧d)
→ eliminate c, d: p(a∧b, c∧d)
→ p((a∧b) ∨ (c∧d))
COMPLEXITY: this process (called simplification) depends on the size of the intermediate pdf. Here it is at least (n+1), where n = #terms in the formula. Not scalable to large formulae.
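The multiply/eliminate loop above can be sketched on a dict-based pdf. For readability the input joint here is built from independent variables with made-up probabilities; in PrDB it would come from the junction tree, correlations and all:

```python
from itertools import product

# Sketch of the naive multiply/eliminate "simplification": introduce a
# deterministic variable for each term (multiply with p(term | vars)), then
# sum out the term's variables. Input probabilities are illustrative.
probs = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.2}
names = list(probs)
pdf = {}
for bits in product([0, 1], repeat=4):
    w = 1.0
    for n, x in zip(names, bits):
        w *= probs[n] if x else 1 - probs[n]
    pdf[bits] = w  # pdf maps assignments of `names` to probabilities

def introduce_and_eliminate(pdf, names, term, op):
    """Multiply in a new variable op(term) and eliminate the term's variables."""
    keep = [i for i, n in enumerate(names) if n not in term]
    out = {}
    for assign, w in pdf.items():
        val = op(assign[names.index(v)] for v in term)
        key = tuple(assign[i] for i in keep) + (int(val),)
        out[key] = out.get(key, 0.0) + w
    return out, [names[i] for i in keep] + ["∧".join(term)]

pdf, names = introduce_and_eliminate(pdf, names, ("a", "b"), all)  # p(a∧b, c, d)
pdf, names = introduce_and_eliminate(pdf, names, ("c", "d"), all)  # p(a∧b, c∧d)
p_true = sum(w for assign, w in pdf.items() if assign[0] or assign[1])
print(round(p_true, 4))  # (a∧b)∨(c∧d): 1 - (1 - 0.2)(1 - 0.06) = 0.248
```

The scalability problem the slide points at is visible here: the first intermediate pdf ranges over all formula variables plus the new term variable, which is exactly what EAGER and EAGER+ORDER on the next slides avoid.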
Slide 16
Lineage Processing [Optimization opportunities] [Kanagal & Deshpande SIGMOD 2010]
1. EAGER: Exploit conditional independence & simplify early.
Query: (a ∧ b) ∨ (c ∧ d). As messages are sent toward the PIVOT, terms are simplified as soon as their variables are available, keeping intermediate pdfs such as p(a, d), p(a, c, d), and p(a, c∧d) small.
Slide 17
Lineage Processing [Optimization opportunities] [Kanagal & Deshpande SIGMOD 2010]
2. EAGER+ORDER: Distribute simplification into the product. For (c ∧ h) ∨ (m ∧ n):
- One multiplication order passes through the intermediate pdfs p(c, f, g), p(c, f, g, h), p(c, f, g, h, m∧n), p(c, h, m∧n), p((c∧h) ∨ (m∧n)): max pdf size 5.
- A better order passes through p(f, h), p(g, m∧n), p(g, c∧h), p(c∧h, m∧n): max pdf size 4.
How do we compute a good ordering?
Slide 18
Lineage Processing [Pivot Selection]
The choice of pivot also influences the intermediate pdf size: for (b ∧ c) ∨ g, the two candidate pivots (ab) and (cfg) yield different max pdf sizes (3 vs. 4).
Optimal pivot: there are only n possible choices, so estimate the pdf size for each pivot location.
Slide 19
Outline
- Motivation & Problem definition [done]
- Background [done]
  - Junction trees & Query processing over junction trees
  - INDSEP
- Lineage Processing over Junction trees [done]
- Lineage Processing using INDSEP
- Results
Slide 20
Lineage Processing using INDSEP
Example: (b ∧ c) ∨ ((d ∨ e) ∧ (n ∨ o)).
Recursion on INDSEP: the partitions return partial results over compound variables, e.g., {b∧c, d∨e, c} from the subtree over {b, c, d, e} and {j, n∨o} from the subtree over {n, o}; these are combined via the shortcut potential over {c, j}. The recursion is bottomed out using EAGER+ORDER.
But what is the running time?
Slide 21
Lineage Planning Phase
For (b ∧ c) ∨ ((d ∨ e) ∧ (n ∨ o)), build a QUERY PLAN by estimating the maximum intermediate pdf size at each node (e.g., estimates of 4, 5, 6, 4, 4, 7, 4 across the nodes in the example).
If a node's estimate exceeds a threshold, use approximations to estimate the probability.
In addition, modify the query plan for:
- Multiple lineages that share variables
- Exploiting disconnections
Slide 22
Results: Query Processing Times for Different Heuristics
Datasets: D1 (fully independent), D2 (correlated), D3 (highly correlated, long chains).
Comparison systems: NAIVE, EAGER, EAGER+ORDER. (NOTE: plots are on a LOG scale.)
EAGER+ORDER is much more efficient than the others.
Slide 23
Results: Query Processing Time vs. Lineage Size; Ratio vs. Sharing Factor (NOTE: LOG scale)
Multiquery processing exploits sharing; processing time is highly dependent on the size of the lineage.
Slide 24
Conclusions
Proposed a scalable system for evaluating boolean formula queries over correlated probabilistic databases.
Future work: further develop the approximation approaches; use envelopes of boolean formulas for upper and lower bounds.
Thank you
Slide 25
Lineage Processing (contd.)
To choose a multiplication order, estimate the amount of simplification possible when nodes are multiplied:
- Construct a complete graph on the factors to be multiplied, e.g., p(g, c∧h), p(f, h), p(c, f, g), with edge weights measuring the simplification (e.g., 4 - 2).
- Pick the biggest edge, merge/simplify the two nodes together, and recompute the new edge weights.
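The greedy loop above can be sketched over factor scopes alone. The edge-weight definition below (how much the merged scope shrinks once variables private to the pair are summed out) and the chain example are our assumptions; the paper's cost estimate may differ:

```python
# Sketch of the greedy ordering heuristic from this slide: treat each factor's
# variable scope as a node of a complete graph, weight each pair by how much
# the merged scope shrinks, and repeatedly merge the heaviest pair. The exact
# edge-weight definition and the example scopes are assumptions.

def merge_order(scopes, needed):
    """Greedily merge factor scopes; `needed` variables must survive."""
    scopes = [frozenset(s) for s in scopes]
    order = []
    while len(scopes) > 1:
        best = None
        for i in range(len(scopes)):
            for j in range(i + 1, len(scopes)):
                union = scopes[i] | scopes[j]
                others = frozenset().union(*(scopes[k] for k in range(len(scopes))
                                             if k not in (i, j)))
                result = union & (others | frozenset(needed))  # drop private vars
                gain = len(scopes[i]) + len(scopes[j]) - len(result)
                if best is None or gain > best[0]:
                    best = (gain, i, j, result)
        _, i, j, result = best
        order.append((scopes[i], scopes[j], result))
        scopes = [s for k, s in enumerate(scopes) if k not in (i, j)] + [result]
    return order

# Chain of factors p(a,b), p(b,c), p(c,d); the query needs a and d to survive.
steps = merge_order([{"a", "b"}, {"b", "c"}, {"c", "d"}], {"a", "d"})
for left, right, merged in steps:
    print(sorted(left), "*", sorted(right), "->", sorted(merged))
```

This is the same flavor of greedy pairwise merging used in variable-elimination ordering heuristics: locally optimal, not guaranteed globally optimal.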
Slide 26
Lineage Processing via INDSEP [Improvement 1]
Multiple Lineage Processing: exploit the possibility of sharing. For the lineages (m ∧ c) ∨ g and (n ∧ c) ∨ g, intermediate results such as {c, g, j} can be shared across multiple levels of the index, alongside the per-lineage results {j, m} and {j, n}. The lineages need not even share variables, just paths through the index.
Slide 27
Lineage Processing via INDSEP [Improvement 2]
Extend to a forest of junction trees: real-world data sets may have independences. The index is constructed to minimize disk wastage, combining forests together.
For the lineage (a ∧ o), the recursion produces partial results such as {a, c}, {j, o}, {c, j}, {j}, {o}; but j and o are disconnected, and a and o are disconnected!
Preprocess the formula: keep variables in connected components together.
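The preprocessing named above can be sketched as a grouping step. The component labels and probabilities below are invented; the sketch also shows the payoff for a conjunction like (a ∧ o), whose probability factorizes across disconnected components:

```python
# Sketch of the preprocessing from this slide: partition the formula's
# variables by connected component of the junction-tree forest, so each
# component is processed independently. Component ids are illustrative.

def group_by_component(formula_vars, component_of):
    """Map each connected-component id to the formula variables living in it."""
    groups = {}
    for v in formula_vars:
        groups.setdefault(component_of[v], []).append(v)
    return groups

component_of = {"a": 0, "c": 0, "j": 0, "o": 1}  # made-up component labels
groups = group_by_component({"a", "o"}, component_of)
print(groups)  # a and o fall in different components, so they are independent

# For a conjunction across disconnected components, probabilities multiply:
p_a, p_o = 0.6, 0.5  # hypothetical per-component results
print(p_a * p_o)     # p(a AND o) = 0.3, with no cross-component inference needed
```

For general formulas the combination across components is more involved than a single product, but the grouping step itself is the same.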
Slide 28
Lineage Processing via INDSEP [Improvement 3]
What about complexity? It is not evident from the algorithm. For (b ∧ c) ∨ ((d ∨ e) ∧ (n ∨ o)), "predict" how large the intermediate cliques will be: compute the lwidth at the nodes of the intermediate junction tree, whose cliques include {b∧c, d∨e, c}, {j, n∨o}, {d∨e}, {b∧c, c}, {j, n}, and {o}, connected via the separator {c, j}.
Approximate all portions whose estimate is more than a threshold, e.g., 10.