Lineage Processing over Correlated Probabilistic Databases - PowerPoint Presentation

Uploaded On 2020-06-23



Presentation Transcript

Slide1

Lineage Processing over Correlated Probabilistic Databases

Bhargav Kanagal, Amol Deshpande

University of Maryland

Slide2

Motivation: Information Extraction/Integration

[Gupta & Sarawagi 2006, Jayram et al. 2006]

Structured entities extracted from text on the Internet

Reputed: SENTIMENT ANALYSIS

Location: ADDRESS SEGMENTATION ("...located at 52 A Goregaon West Mumbai ...")

CarAds: INFORMATION EXTRACTION

CORRELATIONS

Slide3

Why Lineage Processing?

[Das Sarma et al. 2006]

List all “reputed” car sellers in “Mumbai” who offer Honda cars

SELECT SellerId
FROM Location, CarAds, Reputed
WHERE reputation = 'good' AND city = 'Mumbai' AND Location.SellerId = CarAds.SellerId AND CarAds.SellerId = Reputed.SellerId

Tables: Location, CarAds, Reputed

We need to compute the probability of the boolean formula (the lineage) of each result tuple

Slide4

Motivation: RFID-based Event Monitoring

[RFID Ecosystem UW, Diao et al. 2009, Letchner et al. 2009, KD 2008]

A building instrumented with RFID readers to track assets / personnel

RFID readings are noisy

Miss readings

Add spurious readings

Subjected to probabilistic modeling

found(PC, X, 2pm), prob = 0.9

Probabilities associated with events

Spatial and temporal correlations

Was the PC correctly transferred from room A to the conference room?

found(x,PC) ∧ found(z,PC) ∧ [found(y1,PC) ∨ found(y2,PC)]

Slide5

PrDB System Overview

[Kanagal & Deshpande SIGMOD 2009, SDG08, www.cs.umd.edu/~amol/PrDB/]

User: Insert data + correlations

insert into reputation values ('z1', 219, uncertain('Good 0.5; Bad 0.5'));

insert factor '0 0 1; 1 1 1' in address on 'y1.e', 'y2.e';

User: Issue queries

SPJ queries

Inference queries

Aggregation queries

System components: PARSER, Query Processor, INDSEP Manager

A Relational DBMS stores: Data tables, Uncertainty Parameters, INDSEP Indexes

Slide6

Outline

Motivation & Problem definition [done]

Background

Probabilistic Databases as Junction trees

Query processing over Junction trees

INDSEP

Lineage Processing over Junction trees

Lineage Processing using INDSEP

Results

Slide7

Background: ProbDBs as Junction trees

Tuple Uncertainty

id   Y    Exists?
1    34   ?
2    33   ?
3    25   ?
..   ..   ..
5    11   ?

Attribute Uncertainty: converted to Tuple Uncertainty

Random variable per tuple: 1 if the tuple exists, 0 otherwise

id   Y    Exists?
1    34   a
2    33   b
3    25   c
..   ..   ..
5    11   q

Correlations

Concise encoding of the joint probability distribution

Forest of junction trees

Query evaluation is performed directly over Junction Trees

Slide8

Background: Junction trees

Each clique and separator stores joint pdf (POTENTIAL)

Tree structure reflects Markov property

Given b, c: a independent of d

Cliques: p(a,b,c), p(b,c,d)

Separator: p(b,c)

Joint distribution

Marginal: p(a,d)
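As a minimal sketch of the two-clique example, the joint can be recovered as p(a,b,c) · p(b,c,d) / p(b,c), and the marginal p(a,d) follows by summing out b and c. The probability values and the helper names (`bern`, `joint`, `marginal_ad`) are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical numbers: p(b,c), p(a=1 | b,c) and p(d=1 | b,c).
p_bc = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_a1 = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.7}
p_d1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.8, (1, 1): 0.3}


def bern(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one


# Clique potentials p(a,b,c) and p(b,c,d); the separator potential is p(b,c).
clique_abc = {(a, b, c): p_bc[b, c] * bern(p_a1[b, c], a)
              for a, b, c in product((0, 1), repeat=3)}
clique_bcd = {(b, c, d): p_bc[b, c] * bern(p_d1[b, c], d)
              for b, c, d in product((0, 1), repeat=3)}


def joint(a, b, c, d):
    """Joint distribution: product of cliques divided by the separator."""
    return clique_abc[a, b, c] * clique_bcd[b, c, d] / p_bc[b, c]


def marginal_ad():
    """Marginal p(a,d): sum the joint over b and c."""
    out = {}
    for a, b, c, d in product((0, 1), repeat=4):
        out[a, d] = out.get((a, d), 0.0) + joint(a, b, c, d)
    return out


marg = marginal_ad()
```

The dictionary `marg` holds the four entries of p(a,d). On a real database the joint is never materialized like this; the sketch only shows what the clique and separator potentials encode.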

Slide9

Marginal Computation

Query: {b, c, n}

Steiner tree + Send messages toward a given pivot node (PIVOT)

Keep query variables

Keep correlations

Remove others

For ProbDBs ≈ 1 million tuples, not scalable

Span of the query can be very large: almost the complete database is accessed even for a 3-variable query

Searching for cliques is expensive: a linear scan over all the nodes is inefficient
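The "keep query variables, remove others" step can be sketched as leaf pruning on the junction tree: a leaf is dropped when every query variable it holds also appears in its only neighbour. The tree below (clique names `C1`..`C6` and their contents) is a hypothetical example, and `steiner_subtree` is an illustrative helper, not the function used in PrDB.

```python
def steiner_subtree(cliques, edges, query):
    """Return the names of the cliques that remain after pruning.

    cliques: dict name -> set of variables
    edges:   list of (name, name) pairs forming a tree
    query:   set of query variables
    """
    adj = {name: set() for name in cliques}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    changed = True
    while changed:
        changed = False
        for node in list(adj):
            if len(adj[node]) == 1:          # leaf of the current tree
                (nbr,) = adj[node]
                # Safe to drop if the neighbour covers its query variables.
                if (cliques[node] & query) <= cliques[nbr]:
                    adj[nbr].discard(node)
                    del adj[node]
                    changed = True
    return set(adj)


# Hypothetical junction tree (satisfies the running intersection property).
cliques = {
    "C1": {"a", "b"}, "C2": {"b", "c"}, "C3": {"c", "d"},
    "C4": {"d", "n"}, "C5": {"c", "e"}, "C6": {"n", "o"},
}
edges = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C2", "C5"), ("C4", "C6")]
kept = steiner_subtree(cliques, edges, {"b", "c", "n"})
```

Here `kept` contains C2, C3 and C4: the path that connects the cliques holding b, c and n. Note that the loop visits every node, which is exactly the linear scan the slide calls inefficient for large databases.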

Slide10

Shortcut Potentials

How can we make marginal computation scalable?

Junction tree on set variables {c, f, g, j, k, l, m}

Boundary separators

Shortcut Potential: distribution required to completely shortcut the partition

Figure labels: 50 ops, 100 ops

Which to build?

Slide11

INDSEP - Overview

Obtained by hierarchical partitioning of the junction tree

Index nodes: Root; I1, I2, I3; P1, P2, P3, P4, P5, P6

Variables: {a, b, ..} {c, f, ..} {j, n..q}

Child Separators: p(c), p(j)

Tree induced on the children

Shortcut potentials of children: {p(c), p(c,j), p(j)}

Actual Construction: [Kanagal & Deshpande SIGMOD 2009]

Slide12

Computing Marginals using INDSEP

[Kanagal & Deshpande SIGMOD 2009]

Recursion on INDSEP

{b, c, n}

{b, c}   {n}

{b, c}   {c, j}   {j, n}

Intermediate Junction tree: {b, c, n}

Slide13

Outline

Motivation & Problem definition [done]

Background [done]

Junction trees & Query processing over junction trees

INDSEP

Lineage Processing over Junction trees

Lineage Processing using INDSEP

Results

Slide14

Lineage Processing

Typically classified into 2 types

Read-Once: (a ∧ b) ∨ (c ∧ d)

Non-Read-Once: (a ∧ b) ∨ (b ∧ c) ∨ (c ∧ d)

The problem of lineage processing is #P-complete in general for correlated probabilistic databases, even for read-once lineages

Reduction from #DNF
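A small sketch of the read-once distinction, assuming the two example formulas are (a∧b)∨(c∧d) and (a∧b)∨(b∧c)∨(c∧d). It only checks whether a formula, as written, uses every variable once; deciding whether some equivalent rewriting is read-once is a harder problem that this sketch does not attempt. The tuple encoding and function names are hypothetical.

```python
from collections import Counter


def variable_counts(expr):
    """Count variable occurrences in a formula given as nested tuples.

    A formula is either a variable name (str) or a tuple
    ("and" | "or", child, child, ...).
    """
    if isinstance(expr, str):
        return Counter([expr])
    _, *children = expr
    total = Counter()
    for child in children:
        total += variable_counts(child)
    return total


def is_written_read_once(expr):
    """True if every variable appears exactly once in the formula as written."""
    return all(n == 1 for n in variable_counts(expr).values())


read_once = ("or", ("and", "a", "b"), ("and", "c", "d"))
non_read_once = ("or", ("and", "a", "b"), ("and", "b", "c"), ("and", "c", "d"))
```

In the second formula b and c each occur twice, which is what rules out evaluating the sub-formulas separately even when the base tuples are independent.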

Slide15

Lineage Processing on Junction trees

Naïve: Evaluate marginal query over variables in formula, then Multiply / Eliminate

Query: (a ∧ b) ∨ (c ∧ d)

p(a, b, c, d)

Multiply with p(a∧b | a, b): p(a, b, a∧b, c, d)

Eliminate a, b: p(a∧b, c, d)

Multiply: p(a∧b, c, d, c∧d)

Eliminate: p(a∧b, c∧d)

Multiply / Eliminate: p((a∧b) ∨ (c∧d))

Simplification (name of the above process)

COMPLEXITY

Dependent on the size of the intermediate pdf

Here, it is at least (n+1), where n = #terms in the formula

Not scalable to large formulae
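The multiply / eliminate steps of the naive approach can be sketched over an explicit joint table. The joint below (a and b correlated, c and d independent fair coins) is a hypothetical example, and `multiply_in`, `eliminate` and `naive_probability` are illustrative names rather than PrDB functions.

```python
from itertools import product
from operator import and_, or_


def multiply_in(pdf, names, new_name, fn, args):
    """Multiply with the deterministic factor p(new | args).

    Each row gains one column holding fn(args); rows with probability 0
    are simply not stored.
    """
    idx = [names.index(a) for a in args]
    out = {}
    for assignment, p in pdf.items():
        value = fn(*(assignment[i] for i in idx))
        out[assignment + (value,)] = p
    return out, names + [new_name]


def eliminate(pdf, names, drop):
    """Sum out the variables in `drop`."""
    keep = [i for i, n in enumerate(names) if n not in drop]
    out = {}
    for assignment, p in pdf.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out, [names[i] for i in keep]


def naive_probability(pdf, names):
    """Evaluate (a AND b) OR (c AND d); also report the widest intermediate pdf."""
    widest = len(names)
    steps = [("a&b", and_, ["a", "b"]),
             ("c&d", and_, ["c", "d"]),
             ("result", or_, ["a&b", "c&d"])]
    for new_name, fn, args in steps:
        pdf, names = multiply_in(pdf, names, new_name, fn, args)
        widest = max(widest, len(names))
        pdf, names = eliminate(pdf, names, set(args))
    return pdf.get((1,), 0.0), widest


# Hypothetical joint p(a,b,c,d): b copies a with probability 0.9.
names = ["a", "b", "c", "d"]
pdf = {(a, b, c, d): 0.5 * (0.9 if a == b else 0.1) * 0.25
       for a, b, c, d in product((0, 1), repeat=4)}
prob, widest = naive_probability(pdf, names)
```

With four variables the widest intermediate table has five columns, matching the n+1 bound on the slide; the table itself has up to 2^5 rows, which is why the approach does not scale to large formulae.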

Slide16

Lineage Processing [Optimization opportunities]

1. EAGER

Exploit conditional independence & simplify early

[Kanagal & Deshpande SIGMOD 2010]

Query: (a ∧ b) ∨ (c ∧ d)

PIVOT

p(a, c, d) simplifies to p(a, c∧d)

p(a, d)

Slide17

Lineage Processing [Optimization opportunities]

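[Kanagal & Deshpande SIGMOD 2010]

2. EAGER+ORDER

Distribute simplification into the product

Query: (c ∧ h) ∨ (m ∧ n)

Factors: p(f, h), p(g, m∧n), p(c, f, g)

Multiply everything, then simplify: p(c, f, g, h, m∧n), then p(c, h, m∧n), then p((c∧h) ∨ (m∧n)). Max pdf: 5

Ordered: p(c, f, g) and p(f, h) give p(c, f, g, h), simplified to p(g, c∧h); multiplying with p(g, m∧n) gives p(c∧h, m∧n). Max pdf: 4

How to compute a good ordering?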

Slide18

Lineage Processing [Pivot Selection]

Also influences the intermediate pdf size

Optimal Pivot: Only n possible choices; estimate pdf size for each pivot location

Pivot = (ab)

Pivot = (cfg)

Query: (b c) g

Max pdf: 3

Max pdf: 4

Slide19

Outline

Motivation & Problem definition [done]

Background [done]

Junction trees & Query processing over junction trees

INDSEP

Lineage Processing over Junction trees [done]

Lineage Processing using INDSEP

Results

Slide20

Lineage Processing using INDSEP

Query: (b c) ∨ ((d ∨ e) ∧ (n ∨ o))

{b, c, d, e}   {n, o}

{b c, d ∨ e, c}   {c, j}   {j, n ∨ o}

Recursion bottomed out using EAGER+ORDER

But what is the running time?

Slide21

Lineage Planning Phase

Query: (b c) ∨ ((d ∨ e) ∧ (n ∨ o))

QUERY PLAN

Estimate maximum intermediate pdf size at each node

Estimated sizes at the plan nodes: 4, 5, 6, 4, 4, 7, 4

If a node exceeds a threshold, do approximations to estimate probability

In addition, modify query plan for:

Multiple lineages that share variables

Exploiting disconnections

Slide22

Results

Query Processing times for different heuristics (NOTE: LOG scale)

Datasets

D1: Fully independent

D2: Correlated

D3: Highly Correlated (long chains)

Comparison Systems

NAIVE

EAGER

EAGER + ORDER

EAGER+ORDER is much more efficient than the others

Slide23

Results

Query Processing time vs Lineage size (NOTE: LOG scale)

Highly dependent on size of lineage

Ratio vs Sharing factor

Multiquery processing exploits sharing

Slide24

Conclusions

Proposed a scalable system for evaluating boolean formula queries over correlated probabilistic databases

Future

Plan to further the approximation approaches

Envelopes of boolean formulas for upper and lower bounds

Thank you

Slide25

Lineage Processing (contd.)

Construct complete graph on factors to be multiplied

Factors: p(g, c∧h), p(f, h), p(c, f, g)

Edge weight: amount of simplification possible when nodes are multiplied (e.g. 4 - 2)

Pick the biggest edge

Merge / Simplify nodes together

Recompute new edge weights
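The greedy heuristic above can be sketched on sets of variable names. This is an illustration under stated assumptions, not the authors' implementation: factors are reduced to their variable sets, the edge weight is taken to be (size of the product) minus (size after simplification), and the term combining c and h is written `c&h` (the operator is assumed). `simplify` and `greedy_order` are hypothetical names.

```python
from itertools import combinations


def simplify(prod_vars, others, terms, query_vars):
    """Variables left after eagerly simplifying a product of factors."""
    elsewhere = set().union(*others)
    result = set(prod_vars)
    # Collapse a lineage term once all its variables are local to this factor.
    for name, members in terms.items():
        if members <= result and not (members & elsewhere):
            result -= members
            result.add(name)
    # Drop variables needed neither by another factor nor by the lineage.
    needed = elsewhere | set(query_vars) | set(terms)
    return frozenset(v for v in result if v in needed)


def greedy_order(factors, terms, query_vars):
    """Repeatedly merge the pair of factors with the largest edge weight."""
    factors = [frozenset(f) for f in factors]
    max_size = max(len(f) for f in factors)
    steps = []
    while len(factors) > 1:
        best = None
        for i, j in combinations(range(len(factors)), 2):
            prod_vars = factors[i] | factors[j]
            others = [f for k, f in enumerate(factors) if k not in (i, j)]
            reduced = simplify(prod_vars, others, terms, query_vars)
            weight = len(prod_vars) - len(reduced)
            if best is None or weight > best[0]:
                best = (weight, i, j, prod_vars, reduced)
        weight, i, j, prod_vars, reduced = best
        max_size = max(max_size, len(prod_vars))
        steps.append((sorted(factors[i]), sorted(factors[j]), sorted(reduced)))
        # Recompute edge weights on the next pass with the merged factor.
        factors = [f for k, f in enumerate(factors) if k not in (i, j)]
        factors.append(reduced)
    return steps, max_size


factors = [{"c", "f", "g"}, {"f", "h"}, {"g", "m&n"}]
terms = {"c&h": {"c", "h"}}
steps, max_size = greedy_order(factors, terms, {"c", "h", "m&n"})
```

On this input the first merge joins p(c, f, g) with p(f, h), whose four-variable product simplifies to two variables (weight 4 - 2), and the largest intermediate product has four variables, compared with five when all three factors are multiplied at once.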

Slide26

Lineage Processing via INDSEP [Improvement 1]

Multiple Lineage Processing: Exploit possibility of sharing

Lineage 1: (m c) g, with sub-queries {c, g, j} and {j, m}

Lineage 2: (n c) g, with sub-queries {c, g, j} and {j, n}

Sharing across multiple levels

Need not even share variables, just paths

Slide27

Lineage Processing via INDSEP [Improvement 2]

Extend to forest of junction trees: Real world data sets may have independences

Index constructed to minimize disk wastage, combining forests together

Query: (a o)

{a, c}   {j, o}

{a, c}   {c, j}   {j}   {o}

j and o are disconnected!

a and o are disconnected!

Preprocess formula, keep variables in connected components together
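The preprocessing step can be sketched as grouping the formula's variables by the junction tree (connected component) they belong to; variables in different components are independent, so each group can be processed on its own. The component assignment below is a hypothetical example, and `group_by_component` is an illustrative helper.

```python
from collections import defaultdict


def group_by_component(variables, component_of):
    """Group formula variables by the junction tree they belong to."""
    groups = defaultdict(list)
    for v in variables:
        groups[component_of[v]].append(v)
    return dict(groups)


# Hypothetical forest: a, c, j live in tree 0; o lives in tree 1.
component_of = {"a": 0, "c": 0, "j": 0, "o": 1}
groups = group_by_component(["a", "o"], component_of)
```

Because a and o fall into different groups, a sub-formula over a and another over o can be evaluated separately and their probabilities combined as for independent events.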

Slide28

Lineage Processing via INDSEP [Improvement 3]

What about complexity?

Complexity not evident from the algorithm

{b, c, d, e}   {n, o}

{b c, d ∨ e, c}   {c, j}   {j, n ∨ o}

{b c, c}   {d ∨ e}   {j, n}   {o}

Compute lwidth here (marked at two nodes)

Intermediate junction tree

“Predict” how large the intermediate cliques will be

Approximate for all portions whose estimate is more than a threshold, e.g., 10