
An Investigation of the cost and accuracy tradeoffs of Supp - PowerPoint Presentation

pamella-moone
Uploaded On 2017-10-31




Presentation Transcript

Slide1

An Investigation of the cost and accuracy tradeoffs of Supplanting AFDs with Bayes Network in Query Processing in the Presence of Incompleteness in Autonomous Databases

MS Thesis Defense
Rohit Raghunathan
August 19th, 2011

Committee Members
Dr. Subbarao Kambhampati (Chair)
Dr. Joohyung Lee
Dr. Huan Liu

Slide2

Overview of the talk

Introduction to Incomplete Autonomous Databases
Overview of QPIAD and shortcomings of AFD-based approaches
Our approach: Bayes network based imputation and query rewriting

Slide3

Overview of the talk

Slide4

Introduction to Web databases

Many websites let users query through a form-based interface and are supported by backend databases
Consider used-car sales websites such as Cars.com, Yahoo! Autos, etc.

Slide5

Incompleteness in Web databases

Web databases are often populated by lay individuals without any curation, e.g., Cars.com, Yahoo! Autos
Web databases are also populated using automated information extraction techniques, which are inherently imperfect
Incomplete/uncertain tuple: a tuple in which one or more attributes have a missing value

| Website | # of attributes | # of tuples | Incomplete tuples |
|---|---|---|---|
| Autotrader.com | 13 | 25127 | 33.67% |
| Carsdirect.com | 14 | 32564 | 98.74% |

Slide6

Problem Statement

Many entities corresponding to tuples with missing values might be relevant to the user query
Traditional query processing does not retrieve such tuples

Q: Make = Honda
Example tuple: Make = Null, Model = Accord, Year = 2003, Body = Sedan
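The gap described on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the `cars` list, the helper names, and the use of `None` for a missing value are assumptions made for the example, not part of the thesis code.

```python
# Hypothetical car listings; None marks a missing (null) value.
cars = [
    {"Make": "Honda", "Model": "Civic", "Year": 2004, "Body": "Sedan"},
    {"Make": None, "Model": "Accord", "Year": 2003, "Body": "Sedan"},
    {"Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan"},
]


def certain_answers(rows, attr, value):
    """Exact-match selection: tuples with a null on attr never qualify."""
    return [row for row in rows if row[attr] == value]


def possibly_relevant(rows, attr):
    """Tuples a traditional query misses because attr is missing."""
    return [row for row in rows if row[attr] is None]


honda = certain_answers(cars, "Make", "Honda")
missed = possibly_relevant(cars, "Make")
```

Here `honda` holds only the Civic, while the Accord tuple lands in `missed` even though it may well be relevant to the query.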

Slide7

Dimensions of the problem

Single vs. multiple missing values
  Multiple missing values require capturing the correlations between them
Imputation vs. query rewriting
  Imputation can look at all available evidence
  Query rewriting requires finding the smallest set of evidence
    Looking at all the evidence reduces throughput
    Looking at very little evidence reduces precision
    Need to find a middle ground

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | Audi | | | Sedan | 20000 |
| 2 | Audi | A8 | | Sedan | 15000 |
| 3 | Audi | | 2005 | Sedan | 23000 |

User Q: Model = A8
Rewritten query: Make = Audi ^ Body = Sedan

Slide8

Overview of the talk

Slide9

Approximate Functional Dependencies (AFDs)

AFDs are functional dependencies that hold on all but a small fraction of the database

| Make | Model | Body |
|---|---|---|
| Honda | Civic | Sedan |
| Honda | Civic | Coupe |
| Honda | Civic | Sedan |
| Honda | Civic | Sedan |
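As a rough illustration of how a confidence value for an AFD can be obtained from a table like this one, the sketch below assumes that confidence is the largest fraction of tuples on which the dependency holds exactly. The function name `afd_confidence` and the row layout are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

# The four tuples from the slide's example table.
rows = [
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
]


def afd_confidence(rows, determining_set, dependent):
    """Fraction of tuples kept when each determining-set value keeps
    only its most frequent dependent value (assumed definition)."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in determining_set)
        groups[key][row[dependent]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


model_body = afd_confidence(rows, ["Model"], "Body")
model_make = afd_confidence(rows, ["Model"], "Make")
make_body = afd_confidence(rows, ["Make"], "Body")
```

Under that assumption, three of the four Civic tuples agree on Sedan, so Model and Make each determine Body with confidence 0.75, while Model determines Make with confidence 1.0.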

Model → Body : 0.75

An AFD is of the form X → A, where X is a set of attributes and A is a single attribute

An attribute can have multiple rules
  Model → Make : 1.0
  Make → Body : 0.75

Slide10

Overview of QPIAD

QPIAD uses AFDs and Naïve Bayes Classifiers to retrieve relevant uncertain answers
When the mediator has access privileges to modify the underlying data source
  Missing values can be completed by a simple classification task (imputation), after which traditional query processing will suffice
When the mediator does not have such privileges
  Generate a set of rewritten queries and issue them to the autonomous database (query rewriting)
  Issuing Q1: Model = Tl and Q2: Model = 745 will retrieve the relevant incomplete answers T2 and T4
QPIAD uses only the highest-confidence AFD of each attribute for imputation and query rewriting
Techniques for combining multiple AFDs have been shown to be ineffective

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 2 | Acura | Tl | 2003 | | 35000 |
| 3 | BMW | 645 | 2002 | Convt | 45000 |
| 4 | BMW | 745 | 2001 | | 35000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Q: Body = Sedan
Relevant incomplete answers: T2, T4
Model → Body : 0.75

Slide11

Shortcomings of AFD-based approaches

Principles of locality and detachment do not hold for uncertain reasoning
Model → Body (0.7)
  Intuitively, this means that the model of a car determines its body with a probability of 0.7 when no other evidence is available
  When other evidence is present, there is no easy way to combine the probabilities

Slide12

Shortcomings of AFD-based approaches

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | Audi | | | Sedan | 20000 |
| 2 | Audi | A8 | | Sedan | 15000 |
| 3 | BMW | 745 | 2002 | Sedan | 40000 |
| 4 | Audi | | 2005 | Sedan | 20000 |
| 5 | Audi | A8 | 2005 | Sedan | 20000 |
| 6 | | | 1999 | Convt | 25000 |

Imputing the missing value in T2 uses a single AFD and ignores the influence of the other attributes
Imputing the missing values in T1 ignores the correlations between the attributes Model and Year
Imputing the missing values in T6 gets the AFDs into cycles: Model → Make, Make → Model

Slide13

Overview of the talk

Introduction to Incomplete Autonomous Databases
Overview of QPIAD and shortcomings of AFD-based approaches
Our approach: Bayes network based imputation and query rewriting
  Introduction
  Learning Bayes network models from data
  Imputation
    Single and multiple missing values
    Varying levels of incompleteness in test data
  Query Rewriting
    Bayes network based rewriting
    Comparison of Bayes network based rewriting and AFDs

Slide14

Overview of the talk

Slide15

Bayes network

A Bayes network is a DAG representing the probabilistic dependencies between attributes
It is a compact representation of the full joint distribution
  Therefore, the influence of all variables is accounted for
It represents the generative model of the autonomous database

Network nodes (diagram): Year, Model, Make, Body, Mileage

Example CPD entry (columns Model, Make): Civic, Honda, 0.8

CPDs model the strength of the probabilistic dependencies
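The "compact representation" claim rests on the standard factorization of a Bayes network, written out below. The symbols $X_i$ for the attributes and $\mathrm{Pa}(X_i)$ for the parents of $X_i$ in the DAG are notation assumed here; the slides do not give the edges of the learned network, so no specific parent sets are implied.

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each factor is one CPD, so the network stores one small table per attribute instead of a single table over every combination of attribute values.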

Slide16

Challenges in using Bayes networks for handling incompleteness in Autonomous databases

Learning and inference with Bayes networks are computationally harder than with AFDs
  Learning the topology and parameters from data involves searching over the space of topologies
    But this can be done offline
  Inference in a general Bayes network is intractable
    But approximate inference can be used
Question: Can we get the benefits of exact inference while containing costs?

Slide17

Overview of the talk

Slide18

Learning a Bayes network model

Structure and parameter learning from data
  Challenge: involves searching over topologies
  Use the Banjo software package as a black box
Experiments show the learned topology is robust w.r.t.
  Sample size (5-20%): same topology
  Search time (5-30 minutes): same topology
  Max parent count (2-4): same topology; significantly more networks are examined in the case of 2

Slide19

Inference in Bayes networks

Exact techniques
  NP-hard in the general case; therefore, they do not scale well as incompleteness increases
  Junction Tree (fastest, but inapplicable when query variables do not form a clique)
  Variable Elimination
Approximate techniques (scale well while retaining the accuracy of exact methods)
  Gibbs Sampling
Using the Infer.NET package allows us to use Expectation Propagation inference

Slide20

Overview of the talk

Slide21

Imputation

Experimental setup
  Test databases: Cars.com database containing 8K tuples and Adult database from the UCI repository containing 15K tuples
  Bayes net inference
    Exact inference: Junction Tree, Variable Elimination
    Approximate inference: Gibbs Sampling

Slide22

Imputation

Remove all the values for the attribute being predicted
Substitute each missing value with the most likely value
AFD approach
  Use only the highest-confidence AFD (use all attributes if the confidence is low, e.g., Mileage in Cars); called Hybrid-One by the authors of QPIAD
Bayes net
  Infer the posterior distribution of the missing attribute, given the evidence from the other attributes in the tuple

Slide23

Overview of the talk

Slide24

Imputation: single missing attribute

Significant difference for the attributes Model and Year
  AFDs use only the highest-confidence rule and ignore the others
  Attempts at combining evidence from multiple rules have been ineffective
  Bayes nets systematically combine all the evidence

| ID | Make | Model | Year | Body |
|---|---|---|---|---|
| 1 | Audi | A8 | | Sedan |
| 2 | BMW | 745 | 2002 | Sedan |
| 3 | Audi | | 2005 | Sedan |
| 4 | Audi | A8 | 2005 | Sedan |

Slide25

Imputation: multiple missing attributes

AFD approach
  Predicts each missing value independently
  Can get into cycles: Make → Model, Model → Make
Bayes net
  Computes the joint distribution over the missing attributes

| Make | Model | Year | Body |
|---|---|---|---|
| BMW | | | Sedan |
| BMW | | 2003 | |
| BMW | 745 | 2004 | Sedan |

Slide26

Imputation: multiple missing attributes

When missing attributes are correlated, AFDs often get into cycles
  Only 9 out of 20 combinations could be predicted when 3 attributes are missing
AFD accuracies are lower because each prediction uses a single rule independently
BNs systematically combine evidence from multiple sources and capture correlations by finding the joint distribution
When the attributes are d-separated and have similar prediction accuracies under both methods, there is no difference in accuracy

Network nodes (diagram): Year, Model, Make, Body, Mileage, Price
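The difference between predicting each missing value independently and predicting them jointly can be seen on a toy example. The sketch below is an assumption-laden illustration: the `rows` data are invented, and plain empirical counts stand in for the joint posterior that a Bayes network would compute.

```python
from collections import Counter

# Hypothetical complete tuples: (Make, Model, Year).
rows = [
    ("BMW", "745", 2001),
    ("BMW", "745", 2002),
    ("BMW", "745", 2003),
    ("BMW", "645", 2004),
    ("BMW", "645", 2004),
]


def impute_independently(rows, make):
    """Pick the most frequent Model and the most frequent Year separately."""
    models = Counter(model for mk, model, _ in rows if mk == make)
    years = Counter(year for mk, _, year in rows if mk == make)
    return models.most_common(1)[0][0], years.most_common(1)[0][0]


def impute_jointly(rows, make):
    """Pick the most frequent (Model, Year) pair, i.e. the joint mode."""
    pairs = Counter((model, year) for mk, model, year in rows if mk == make)
    return pairs.most_common(1)[0][0]


independent = impute_independently(rows, "BMW")
joint = impute_jointly(rows, "BMW")
```

On this data the independent prediction is ("745", 2004), a combination that never occurs in the sample, whereas the joint prediction is the observed pair ("645", 2004).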

Slide27

Overview of the talk

Slide28

Imputation: increase in incompleteness in test data

Evidence for predicting missing values decreases as incompleteness increases
AFD approach
  Chain the missing values in the determining set of the AFD
Bayes net
  No change; just compute the posterior distribution of the attributes to be imputed given the evidence

Q: Model = 745
AFDs: Make, Body → Model
      Year → Body

| Make | Model | Year | Body |
|---|---|---|---|
| BMW | | | Sedan |
| BMW | | 2003 | |
| BMW | 745 | 2004 | Sedan |

Slide29

Imputation: increase in incompleteness in test data

Slide30

Time Taken For Imputation

| % incompleteness | AFD (sec.) | BN-Gibbs (sec.) (250 samples) | BN-Exact (sec.) |
|---|---|---|---|
| 0 | 0.271 | 44.46 | 16.23 |
| 10 | 0.267 | 47.15 | 44.88 |
| 20 | 0.205 | 52.02 | 82.52 |
| 30 | 0.232 | 54.86 | 128.26 |
| 40 | 0.231 | 56.19 | 182.33 |
| 50 | 0.234 | 58.12 | 248.75 |
| 60 | 0.232 | 60.09 | 323.78 |
| 70 | 0.235 | 61.52 | 402.13 |
| 80 | 0.262 | 63.69 | 490.31 |
| 90 | 0.219 | 66.19 | 609.65 |

BN-Gibbs retains the accuracy edge of BN-Exact while containing costs

Slide31

Overview of the talk

Slide32

Query Rewriting

When mediators do not have access privileges, missing values cannot be substituted as in the case of imputation
Need to generate and send "rewritten" queries to retrieve relevant uncertain answers

Slide33

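Query Rewriting: single-attribute queries

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 2 | Acura | Tl | 2003 | | 35000 |
| 3 | BMW | 645 | 2002 | Convt | 45000 |
| 4 | BMW | 745 | 2001 | | 35000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Q: Body = Sedan

Certain answers (base result set)

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Relevant incomplete answers: T2 can be retrieved with Q'1: Model = Tl, and T4 with Q'2: Model = 745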
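The rewriting step on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the table is held in memory as `ROWS`, `None` marks a missing value, the determining set is simply {Model}, and the helper names `select`, `rewrite`, and `uncertain_answers` are hypothetical rather than taken from QPIAD or the thesis code.

```python
# In-memory copy of the slide's table; None marks a missing value.
ROWS = [
    {"ID": 1, "Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan", "Mileage": 20000},
    {"ID": 2, "Make": "Acura", "Model": "Tl", "Year": 2003, "Body": None, "Mileage": 35000},
    {"ID": 3, "Make": "BMW", "Model": "645", "Year": 2002, "Body": "Convt", "Mileage": 45000},
    {"ID": 4, "Make": "BMW", "Model": "745", "Year": 2001, "Body": None, "Mileage": 35000},
    {"ID": 5, "Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan", "Mileage": 24000},
]


def select(rows, conditions):
    """Exact-match selection, as a source without null handling would do it."""
    return [r for r in rows if all(r[a] == v for a, v in conditions.items())]


def rewrite(rows, attr, value, determining_set):
    """Project the base result set of attr = value onto the determining set."""
    base = select(rows, {attr: value})
    rewritten = []
    for row in base:
        query = {a: row[a] for a in determining_set}
        if None not in query.values() and query not in rewritten:
            rewritten.append(query)
    return base, rewritten


def uncertain_answers(rows, attr, rewritten):
    """Tuples matched by a rewritten query whose queried attribute is missing."""
    found = []
    for query in rewritten:
        found += [r for r in select(rows, query) if r[attr] is None]
    return found


base, queries = rewrite(ROWS, "Body", "Sedan", ["Model"])
answers = uncertain_answers(ROWS, "Body", queries)
```

With the slide's data, `base` contains T1 and T5, `queries` contains Model = 745 and Model = Tl, and `answers` contains T4 and T2.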

Slide34

Generating Rewritten Queries

Q: Body = Sedan

Certain answers (base result set)

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Bayes networks
  Attributes: all attributes in the Markov blanket (BN-All-MB)
  Given evidence for all attributes in its Markov blanket, an attribute is independent of all other attributes
  Network nodes (diagram): Year, Model, Make, Body, Mileage
  Q'1: Model = 745
  Q'2: Model = Tl

AFDs
  Attributes: determining set of the AFD
  Model → Body : 0.9
  Q'1: Model = 745
  Q'2: Model = Tl

Slide35

Ranking Rewritten queries

Not all queries are equally good at retrieving relevant answers
  Cars with model "tl" are more likely to be sedans than cars with model "745"
Rank queries based on their expected precision (ExpPrec)
  Bayes networks: inference in the Bayes network
  AFDs: use Naïve Bayes Classifiers

$\mathrm{ExpPrec}(Q) = P(A_m = v_m \mid t_i)$
  where $t_i \in \Pi_{MB(A_m)}(RS(Q))$ for Bayes nets
  where $t_i \in \Pi_{dtrSet(A_m)}(RS(Q))$ for AFDs

Q1': Model = 'tl'. ExpPrec(Q1') = P(Body = Sedan | Model = tl) = 1
Q2': Model = '745'. ExpPrec(Q2') = P(Body = Sedan | Model = 745) = 0.6

Slide36

Ranking Rewritten Queries: only K queries

When database or network resources are limited, the mediator can choose to issue only the top-K queries to get the most relevant uncertain answers
It is important to carefully trade precision against throughput
  Use the F-measure metric (idea borrowed from QPIAD)

P: expected precision (e.g., P(Model = 745 | Make = BMW))
R: expected recall
  R = expected precision * expected selectivity
  Expected selectivity = sample selectivity * sample ratio
  Sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
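The F-measure formula itself did not survive in this transcript. The sketch below assumes the common weighted form F = (1 + α)·P·R / (α·P + R), which reduces to P when α = 0 and to the harmonic mean of P and R when α = 1. The function names and the recall values in `QUERIES` are hypothetical; the precision values 1.0 and 0.6 are the ones quoted for the two example queries on the ranking slide.

```python
def expected_recall(expected_precision, sample_selectivity, sample_ratio):
    """R = expected precision * expected selectivity, as defined on the slide."""
    return expected_precision * sample_selectivity * sample_ratio


def f_measure(p, r, alpha):
    """Assumed weighted F-measure; alpha = 0 ranks on precision alone."""
    if p == 0 or r == 0:
        return 0.0
    return (1 + alpha) * p * r / (alpha * p + r)


def top_k(queries, k, alpha):
    """Keep the k rewritten queries with the highest F-measure."""
    ranked = sorted(queries, key=lambda q: f_measure(q["P"], q["R"], alpha), reverse=True)
    return ranked[:k]


# Recall values below are made up for illustration.
QUERIES = [
    {"query": "Model = Tl", "P": 1.0, "R": 0.01},
    {"query": "Model = 745", "P": 0.6, "R": 0.30},
]

precision_only = top_k(QUERIES, 1, alpha=0)
balanced = top_k(QUERIES, 1, alpha=1)
```

With α = 0 the high-precision, low-throughput query wins; with α = 1 the query that retrieves more tuples is ranked first.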

 

α = 0: only precision

Slide37

Experimental Setup

Test databases: Cars database consisting of 55K tuples and Adult database consisting of 15K tuples
Training set: 15% of the database
Test data is split into two halves
  One half contains no incompleteness and is used to return the base result set
  In the other half, all query-constrained attributes are made null
A copy of the test data is used as the ground truth to compute precision and recall
This is an aggressive setup, since most databases have <50% incompleteness

Slide38

BN-All-MB vs AFD

Q: Make

BN-All-MB: P(Make = bmw | Model = 330)
AFD: P(Make = bmw | Model = 330)

When the size of the determining set is > 1, the expected precision values of AFDs (computed by NBCs) are inaccurate

Actual precision is lower for AFDs because their expected precisions are inaccurate

Slide39

Shortcoming of BN-All-MB

Throughput of queries reduces drastically as the Markov blanket size increases
Use F-measure based ranking to increase recall
When almost all queries have very low throughput, there is simply no way to increase recall

Network nodes (diagram): Year, Model, Make, Body, Mileage

Q: Model = 745
Q'1: Make ^ Body ^ Year
Q'2: Make ^ Body ^ Year
Q'3: Make ^ Body ^ Year

Slide40

BN-Beam (Single-attribute queries)

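Q: Model = 745

Network nodes (diagram): Year, Model, Make, Body, Mileage

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 2 | BMW | | 2005 | Sedan | 35000 |
| 3 | BMW | 645 | 2002 | Convt | 45000 |
| 4 | BMW | 745 | 2001 | | 35000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |
| 6 | BMW | | 2001 | Sedan | 20000 |

Candidate attribute set = {Year, Make, Body}

Slide41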

BN-Beam

Level 1
  Make = BMW
  Year = 2001
  Body = Sedan

Pick the top-K queries at each level based on the F-measure metric
  P: expected precision (e.g., P(Model = 745 | Make = BMW))
  R: expected recall
    R = expected precision * expected selectivity
    Expected selectivity = sample selectivity * sample ratio
    Sample ratio is estimated from the cardinalities of the result sets from the sample and the original database

Level 2
  Make = BMW ^ Year = 2001
  Make = BMW ^ Year = 2005
  Body = Sedan

Level L
  Q'1, Q'2, Q'3
  Issue to database in the increasing order of expected precision

At level L, all (partial) queries have ≤ L attributes constrained

Diagram labels: Year, Body; best rewritten queries of size 1
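The level-by-level search on this slide can be sketched as a small beam search. Everything below is an illustration under assumptions: `SAMPLE` is invented data, empirical counts on it stand in for Bayes network inference, the sample ratio is taken to be 1, the F-measure is assumed to have the form (1 + α)·P·R / (α·P + R), and the names `bn_beam`, `top_k`, and `CANDIDATES` are hypothetical.

```python
# Hypothetical complete sample; counts on it stand in for Bayes network inference.
SAMPLE = [
    {"Make": "BMW", "Model": "745", "Year": 2001, "Body": "Sedan"},
    {"Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan"},
    {"Make": "BMW", "Model": "645", "Year": 2002, "Body": "Convt"},
    {"Make": "BMW", "Model": "745", "Year": 2001, "Body": "Sedan"},
    {"Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan"},
]


def matches(row, query):
    return all(row[attr] == value for attr, value in query)


def f_measure(query, target, alpha):
    """Score a rewritten query (a tuple of constraints) for the target."""
    hits = [row for row in SAMPLE if matches(row, query)]
    attr, value = target
    relevant = sum(1 for row in hits if row[attr] == value)
    if relevant == 0:
        return 0.0
    p = relevant / len(hits)
    r = p * len(hits) / len(SAMPLE)  # sample ratio assumed to be 1
    return (1 + alpha) * p * r / (alpha * p + r)


def top_k(queries, target, k, alpha):
    # repr() breaks score ties so the result is deterministic
    ranked = sorted(queries, key=lambda q: (-f_measure(q, target, alpha), repr(q)))
    return ranked[:k]


def bn_beam(target, candidates, k, levels, alpha):
    """Keep the best k queries per level; level L allows up to L constraints."""
    singles = [((attr, value),) for attr, values in candidates.items() for value in values]
    beam = top_k(singles, target, k, alpha)
    for _ in range(levels - 1):
        pool = set(beam)
        for query in beam:
            used = {attr for attr, _ in query}
            for (constraint,) in singles:
                if constraint[0] not in used:
                    pool.add(tuple(sorted(query + (constraint,), key=lambda c: c[0])))
        beam = top_k(pool, target, k, alpha)
    return beam


CANDIDATES = {"Make": ["BMW"], "Year": [2001, 2005], "Body": ["Sedan"]}
beam = bn_beam(("Model", "745"), CANDIDATES, k=2, levels=2, alpha=1.0)
```

Because a level keeps shorter queries alongside their extensions, a well-scoring single-attribute query such as Body = Sedan can survive to later levels, which matches the "≤ L attributes constrained" note above.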

Slide42

BN-Beam vs BN-All-MB

Increasing α does not increase the recall of BN-All-MB
BN-Beam increases recall without a catastrophic reduction in precision
Results for top-10 queries for the user query Year = 2002 (figures: recall plot, precision plot)

Slide43

Multi-attribute queries

Contribution to QPIAD
Aim: to retrieve relevant uncertain answers with multiple missing values on query-constrained attributes

Slide44

Multi-attribute queries

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | | 645 | 2002 | Coupe | 40000 |
| 2 | BMW | 645 | 2002 | Convt | |
| 3 | | 745 | 2001 | Sedan | |
| 4 | | 645 | 2002 | Coupe | |
| 5 | BMW | 745 | 2001 | Coupe | 40000 |
| 6 | BMW | 645 | 2002 | Convt | 40000 |

Q: Make = BMW ^ Mileage = 40000
Base result set = T5, T6
QPIAD retrieves T1 and T2
BN-Beam can also retrieve T3 and T4

Candidate attribute set: union of the attributes in the Markov blankets of all constrained attributes
All other steps are the same as in the single-attribute query case

Slide45

Comparison over multi-attribute queries

Two AFD approaches
AFD-All-Attributes: creates a conjunctive query by joining all attributes in the determining sets of the AFDs of the constrained attributes

Consider the AFDs Model → Make and Year → Mileage
Q: Make = BMW ^ Mileage = 40000

Make = BMW: Model = 745, Model = 645
Mileage = 40000: Year = 2001, Year = 2002

Q'1: Model = 745 ^ Year = 2001
Q'2: Model = 645 ^ Year = 2001
Q'3: Model = 745 ^ Year = 2002
Q'4: Model = 645 ^ Year = 2002

Expected precision = product of the individual queries' expected precisions

Slide46

BN-Beam vs AFD-All-Attributes

Precision of BN-Beam is competitive with AFD-All-Attributes
Recall of BN-Beam is higher
  AFD-All-Attributes does not consider the joint distribution between the query-constrained attributes
  This leads to low-throughput or even empty queries

Results for top-10 queries, Q: Make ^ Mileage

Slide47

Comparison of multi-attribute queries

AFD-Highest-Confidence: uses only the AFD of the highest-confidence constrained attribute for rewriting

Q: Make = Dodge ^ Year = 2004
Ignore all attributes other than Make
AFD: Model → Make
Q'1: Model = ram
Q'2: Model = intrepid

Slide48

BN-Beam vs AFD-Highest-Confidence

Results for top-10 queries, Q: Make ^ Year (Car database)
AFD-Highest-Confidence increases recall, but not without a catastrophic drop in precision

Slide49

Summary

A comparison of the cost and accuracy tradeoffs of using Bayes network models and AFDs for handling incompleteness in autonomous databases
Bayes nets have a significant edge over AFDs when missing values are on highly correlated attributes and at higher levels of incompleteness in test data
Presented two approaches, BN-All-MB and BN-Beam, for generating rewritten queries using Bayes networks
We showed that BN-Beam is able to retrieve tuples with higher recall than BN-All-MB
We compared Bayes network based rewriting with AFD-based rewriting and found the former to retrieve results with higher precision and recall

Slide50

Deviations From the Thesis Draft

CAVEAT: I found two bugs in my code (Query Rewriting section)
  Corrected one bug (related to BN-based rewriting)
  Will correct the other one (related to AFD-based rewriting) after the defense

THANK YOU
QUESTIONS?