MS Thesis Defense, Rohit Raghunathan, August 19th, 2011. Committee Members: Dr. Subbarao Kambhampati (Chair), Dr. Joohyung Lee, Dr. Huan Liu.
Slide1
An Investigation of the cost and accuracy tradeoffs of Supplanting AFDs with Bayes Network in Query Processing in the Presence of Incompleteness in Autonomous Databases
MS Thesis Defense
Rohit Raghunathan
August 19th, 2011
Committee Members
Dr. Subbarao Kambhampati (Chair)
Dr. Joohyung Lee
Dr. Huan Liu
Slide2
Overview of the talk
- Introduction to Incomplete Autonomous Databases
- Overview of QPIAD and shortcomings of AFD-based approaches
- Our approach: Bayes network based imputation and query rewriting
Slide4
Introduction to Web databases
- Many websites allow user queries through a form-based interface and are supported by backend databases
- Consider used-car websites such as Cars.com, Yahoo! Autos, etc.
Slide5
Incompleteness in Web databases
- Web databases are often populated by lay individuals without any curation, e.g., Cars.com, Yahoo! Autos
- Web databases are also populated using automated information extraction techniques, which are inherently imperfect
- Incomplete/uncertain tuple: a tuple in which one or more attributes have a missing value

| Website | # of attributes | # of tuples | Incomplete tuples |
|---|---|---|---|
| Autotrader.com | 13 | 25127 | 33.67% |
| Carsdirect.com | 14 | 32564 | 98.74% |
Slide6
Problem Statement
- Many entities corresponding to tuples with missing values might be relevant to the user query
- Traditional query processing does not retrieve such tuples
Q: Make = Honda
Example tuple (first attribute is missing): Null | Accord | 2003 | Sedan
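A minimal sketch of why the Accord tuple is lost under traditional selection. The dictionary layout, the attribute names, the `select` helper, and the first (complete) tuple are assumptions made for illustration; only the Accord tuple and the query come from the slide.

```python
# None marks a missing (null) value. The Civic row is a hypothetical complete tuple.
cars = [
    {"Make": "Honda", "Model": "Civic", "Year": 2005, "Body": "Sedan"},
    {"Make": None, "Model": "Accord", "Year": 2003, "Body": "Sedan"},
]


def select(rows, attr, value):
    """Traditional selection: a tuple matches only if the attribute equals the value."""
    return [r for r in rows if r[attr] == value]


# Q: Make = Honda returns only tuples whose Make is known to be Honda.
certain = select(cars, "Make", "Honda")

# The Accord tuple is plausibly relevant, but its Make is null, so it is never returned.
missed = [r for r in cars if r["Make"] is None]
```

The tuples in `missed` are the ones that imputation or query rewriting tries to recover.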
Slide7
Dimensions of the problem
- Single vs. multiple missing values
  - Multiple missing values require capturing the correlations between them
- Imputation vs. query rewriting
  - Imputation can look at all available evidence
  - Query rewriting requires finding the smallest usable set of evidence
    - Looking at all the evidence -> reduced throughput
    - Looking at very little evidence -> reduced precision
    - Need to find a middle ground

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | Audi | | | Sedan | 20000 |
| 2 | Audi | A8 | | Sedan | 15000 |
| 3 | Audi | | 2005 | Sedan | 23000 |

User Q: Model = A8
Rewritten query: Make = Audi ^ Body = Sedan
Slide9
Approximate Functional Dependencies (AFDs)
AFDs are functional dependencies that hold on all but a small fraction of the database

| Make | Model | Body |
|---|---|---|
| Honda | Civic | Sedan |
| Honda | Civic | Coupe |
| Honda | Civic | Sedan |
| Honda | Civic | Sedan |

Model → Body : 0.75
- An AFD is of the form X → A, where X is a set of attributes and A is a single attribute
- An attribute can have multiple rules
  - Model → Make : 1.0
  - Make → Body : 0.75
Slide10
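The confidences quoted for the Honda Civic table on the AFD slide (Model → Body : 0.75, Model → Make : 1.0, Make → Body : 0.75) can be reproduced with a short sketch. It assumes that confidence is the largest fraction of tuples on which the dependency holds exactly, which is a common definition but is not spelled out in the slides; the function name `afd_confidence` is hypothetical.

```python
from collections import Counter, defaultdict

# The four tuples from the AFD slide.
rows = [
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Coupe"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
    {"Make": "Honda", "Model": "Civic", "Body": "Sedan"},
]


def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples kept if, for each lhs value, only the most common rhs value is retained."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)


model_body = afd_confidence(rows, ["Model"], "Body")
model_make = afd_confidence(rows, ["Model"], "Make")
make_body = afd_confidence(rows, ["Make"], "Body")
```

Here three of the four Civic tuples agree on Sedan, so Model → Body gets 3/4, while every Civic is a Honda, so Model → Make gets 1.0.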
Overview of QPIAD
- QPIAD uses AFDs and Naïve Bayes Classifiers to retrieve relevant uncertain answers
- When the mediator has access privileges to modify the underlying data source
  - Missing values can be completed by a simple classification task (Imputation), after which traditional query processing will suffice
- When mediators do not have such privileges
  - Generate a set of rewritten queries and issue them to the autonomous database (Query Rewriting)
  - Issuing Q1: Model = Tl and Q2: Model = 745 will retrieve the relevant incomplete answers T2 and T4
- QPIAD uses only the highest-confidence AFD of each attribute for imputation and query rewriting
  - Techniques for combining multiple AFDs were shown to be ineffective

Q: Body = Sedan
AFD: Model → Body : 0.75

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 2 | Acura | Tl | 2003 | | 35000 |
| 3 | BMW | 645 | 2002 | Convt | 45000 |
| 4 | BMW | 745 | 2001 | | 35000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Relevant incomplete answers: T2 and T4
Slide11
Shortcomings of AFD-based approaches
- Principles of locality and detachment do not hold for uncertain reasoning
- Model → Body (0.7)
  - Intuitively, this means that the model of a car determines its body style with probability 0.7 when no other evidence is available
  - When other evidence is present, there is no easy way to combine the probabilities
Slide12
Shortcomings of AFD-based approaches
| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | Audi | | | Sedan | 20000 |
| 2 | Audi | A8 | | Sedan | 15000 |
| 3 | BMW | 745 | 2002 | Sedan | 40000 |
| 4 | Audi | | 2005 | Sedan | 20000 |
| 5 | Audi | A8 | 2005 | Sedan | 20000 |
| 6 | | | 1999 | Convt | 25000 |

- Imputing the missing value in T2 uses a single AFD and ignores the influence of the other attributes
- Imputing the missing values in T1 ignores the correlation between the attributes Model and Year
- Imputing the missing values in T6 gets the AFDs into a cycle
  - Model → Make, Make → Model
Slide13
Overview of the talk
- Introduction to Incomplete Autonomous Databases
- Overview of QPIAD and shortcomings of AFD-based approaches
- Our approach: Bayes network based imputation and query rewriting
  - Introduction
  - Learning Bayes network models from data
  - Imputation
    - Single and multiple missing values
    - Varying levels of incompleteness in test data
  - Query Rewriting
    - Bayes network based rewriting
    - Comparison of Bayes network based rewriting and AFDs
Slide15
Bayes network
- A Bayes network is a DAG representing the probabilistic dependencies between attributes
- It is a compact representation of the full joint distribution
  - Therefore the influence of all variables is accounted for
- It represents the generative model of the autonomous database

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage, with a CPD table relating Model and Make (entry for Civic and Honda: 0.8)]
CPDs model the strength of the probabilistic dependencies
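The sense in which the network is a compact representation of the full joint distribution can be stated with the standard Bayes network factorization. The symbols $X_1, \dots, X_n$ (the attributes) and $\mathrm{Pa}(X_i)$ (the parents of $X_i$ in the DAG) are notation introduced here, not taken from the slides.

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each factor on the right is one CPD, so the number of parameters grows with the size of the largest parent set rather than with the full joint table.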
Slide16
Challenges in using Bayes networks for handling incompleteness in Autonomous databases
- Learning and inference with Bayes networks are computationally harder than with AFDs
  - Learning the topology and parameters from data involves searching over the space of topologies
    - But this can be done offline
  - Inference in a general Bayes network is intractable
    - But approximate inference can be used
- Question: Can we get the benefits of exact inference while containing costs?
Slide18
Learning a Bayes network model
- Structure and parameter learning from data
  - Challenge: involves searching over topologies
  - Use the Banjo software package as a black box
- Experiments show the learned topology is robust w.r.t.
  - Sample size (5-20%): same topology
  - Search time (5-30 minutes): same topology
  - Max parent count (2-4): same topology; significantly more networks are examined when the maximum is 2
Slide19
Inference in Bayes networks
- Exact techniques: NP-hard in the general case, and therefore do not scale well as incompleteness increases
  - Junction Tree (fastest, but inapplicable when the query variables do not form a clique)
  - Variable Elimination
- Approximate techniques (scale well while retaining the accuracy of exact methods)
  - Gibbs Sampling
- Using the Infer.NET package allows us to use Expectation Propagation inference
Slide21
Imputation
Experimental Setup
- Test databases: Cars.com database containing 8K tuples and Adult database from the UCI repository containing 15K tuples
- Bayes net inference
  - Exact inference: Junction Tree, Variable Elimination
  - Approximate inference: Gibbs Sampling
Slide22
Imputation
- Remove all the values of the attribute being predicted
- Substitute each missing value with the most likely value
- AFD approach
  - Use only the highest-confidence AFD (use all attributes if the confidence is low, e.g., Mileage in Cars). Called Hybrid-One by the authors of QPIAD.
- Bayes net
  - Infer the posterior distribution of the missing attribute, given the other attributes in the tuple as evidence
Slide24
Imputation- single missing attribute
- Significant difference for the attributes Model and Year
- AFDs use only the highest-confidence rule and ignore the others
  - Attempts at combining evidence from multiple rules have been ineffective
- Bayes nets systematically combine all the evidence

| ID | Make | Model | Year | Body |
|---|---|---|---|---|
| 1 | Audi | A8 | | Sedan |
| 2 | BMW | 745 | 2002 | Sedan |
| 3 | Audi | | 2005 | Sedan |
| 4 | Audi | A8 | 2005 | Sedan |
Slide25
Imputation- multiple missing attributes
- AFD approach
  - Predict each missing value independently
  - Can get into cycles (Make → Model, Model → Make)
- Bayes net
  - Computes the joint distribution over the missing attributes

| Make | Model | Year | Body |
|---|---|---|---|
| BMW | | | Sedan |
| BMW | | 2003 | |
| BMW | 745 | 2004 | Sedan |
Slide26
Imputation- multiple missing attributes
- When the missing attributes are correlated, AFDs often get into cycles
  - Only 9 out of 20 combinations could be predicted when 3 attributes are missing
- AFD accuracies are lower because each prediction uses a single rule independently
- BNs systematically combine evidence from multiple sources and capture correlations by finding the joint distribution
- When attributes are D-separated and have similar prediction accuracies under both methods, there is no difference in accuracy

[Figure: Bayes network over the attributes Year, Model, Make, Body, Mileage, and Price]
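A minimal sketch of why predicting correlated missing attributes jointly can differ from predicting each one independently. Everything here is an assumption for illustration: the chain Make -> Model -> Body, the CPD numbers, and the model "530" are hypothetical and are not the network or data from the thesis, and the independent baseline is a generic per-attribute predictor rather than the AFD procedure itself.

```python
# Hypothetical CPDs for a chain Make -> Model -> Body (illustrative numbers only).
p_model_given_make = {"BMW": {"645": 0.40, "745": 0.35, "530": 0.25}}
p_body_given_model = {
    "645": {"Coupe": 0.5, "Convt": 0.5},
    "745": {"Sedan": 1.0},
    "530": {"Sedan": 1.0},
}


def joint_posterior(make):
    """P(Model, Body | Make) obtained by enumerating the chain."""
    joint = {}
    for model, pm in p_model_given_make[make].items():
        for body, pb in p_body_given_model[model].items():
            joint[(model, body)] = pm * pb
    return joint


def marginal(joint, index):
    """Sum the joint over the other attribute."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out


joint = joint_posterior("BMW")

# Joint imputation: the most probable (Model, Body) pair.
joint_map = max(joint, key=joint.get)

# Independent imputation: the most probable value of each attribute on its own.
model_marginal = marginal(joint, 0)
body_marginal = marginal(joint, 1)
independent = (
    max(model_marginal, key=model_marginal.get),
    max(body_marginal, key=body_marginal.get),
)
```

With these numbers the independent choice pairs the most likely model with the most likely body style, a combination that has zero probability under the joint, while the joint choice stays consistent.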
Slide28
Imputation- Increase in incompleteness in test data
- The evidence available for predicting missing values shrinks as incompleteness increases
- AFD approach
  - Chain the missing values in the determining set of the AFD
- Bayes net
  - No change: just compute the posterior distribution of the attributes to be imputed, given the evidence

Q: Model = 745
AFDs: Make, Body → Model; Year → Body

| Make | Model | Year | Body |
|---|---|---|---|
| BMW | | | Sedan |
| BMW | | 2003 | |
| BMW | 745 | 2004 | Sedan |
Slide29
Imputation- Increase in incompleteness in test data
Slide30
Time Taken For Imputation
| % incompleteness | AFD (sec.) | BN-Gibbs (sec.) (250 samples) | BN-Exact (sec.) |
|---|---|---|---|
| 0 | 0.271 | 44.46 | 16.23 |
| 10 | 0.267 | 47.15 | 44.88 |
| 20 | 0.205 | 52.02 | 82.52 |
| 30 | 0.232 | 54.86 | 128.26 |
| 40 | 0.231 | 56.19 | 182.33 |
| 50 | 0.234 | 58.12 | 248.75 |
| 60 | 0.232 | 60.09 | 323.78 |
| 70 | 0.235 | 61.52 | 402.13 |
| 80 | 0.262 | 63.69 | 490.31 |
| 90 | 0.219 | 66.19 | 609.65 |

BN-Gibbs retains the accuracy edge of BN-Exact while containing costs
Slide32
Query Rewriting
- When mediators do not have access privileges, missing values cannot be substituted as in the case of imputation
- Need to generate and send "rewritten" queries to retrieve relevant uncertain answers
Slide33
Query Rewriting– Single-attribute queries
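| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 2 | Acura | Tl | 2003 | | 35000 |
| 3 | BMW | 645 | 2002 | Convt | 45000 |
| 4 | BMW | 745 | 2001 | | 35000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Q: Body = Sedan

Certain answers (base result set):

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Relevant incomplete answers: T2 and T4
- Can retrieve T2 with Q'1: Model = Tl
- Can retrieve T4 with Q'2: Model = 745
Slide34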
Generating Rewritten Queries
Q: Body = Sedan

Certain answers (base result set):

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |

Bayes networks
- Attributes: all attributes in the Markov blanket (BN-All-MB)
- Given evidence for all attributes in its Markov blanket, an attribute is independent of all other attributes
- Q'1: Model = 745
- Q'2: Model = Tl

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

AFDs
- Attributes: determining set of the AFD
- Model → Body : 0.9
- Q'1: Model = 745
- Q'2: Model = Tl
Slide35
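The generation step described for Q: Body = Sedan (project the base result set onto the chosen attributes and turn each distinct value combination into a rewritten query) can be sketched as follows. The function name `rewritten_queries` and the dictionary representation are assumptions; the attribute list passed in stands for either the Markov blanket (BN-All-MB) or the determining set of the AFD, which in this example is just Model.

```python
# Base result set for Q: Body = Sedan, as given in the slides.
base_result_set = [
    {"ID": 1, "Make": "BMW", "Model": "745", "Year": 2005, "Body": "Sedan", "Mileage": 20000},
    {"ID": 5, "Make": "Acura", "Model": "Tl", "Year": 2002, "Body": "Sedan", "Mileage": 24000},
]


def rewritten_queries(result_set, attributes):
    """Project the result set onto `attributes`; each distinct combination becomes one conjunctive query."""
    seen = set()
    queries = []
    for t in result_set:
        key = tuple((a, t[a]) for a in attributes)
        if key not in seen:
            seen.add(key)
            queries.append(dict(key))
    return queries


queries = rewritten_queries(base_result_set, ["Model"])
```

Passing a larger attribute list, such as `["Make", "Model", "Year"]`, yields more specific queries, which is where the throughput concern for large Markov blankets comes from.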
Ranking Rewritten queries
- Not all queries are equally good at retrieving relevant answers
  - Cars with model "tl" are more likely to be sedans than cars with model "745"
- Rank queries based on their expected precision (ExpPrec)
  - Bayes networks: inference in the Bayes network
  - AFDs: use Naïve Bayes Classifiers

$\mathrm{ExpPrec}(Q) = P(A_m = v_m \mid t_i)$
where $t_i \in \Pi_{MB(A_m)}(RS(Q))$ for Bayes nets
where $t_i \in \Pi_{dtrSet(A_m)}(RS(Q))$ for AFDs

Q1': Model = 'tl'. ExpPrec(Q1') = P(Body = Sedan | Model = tl) = 1
Q2': Model = '745'. ExpPrec(Q2') = P(Body = Sedan | Model = 745) = 0.6
Slide36
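A minimal sketch of ranking the two rewritten queries for Q: Body = Sedan by expected precision. As a stand-in for Bayes network inference or a Naïve Bayes Classifier, it estimates P(Body = Sedan | Model) by simple counting over a sample; the sample itself is hypothetical and was chosen so that the estimates match the values quoted in the slides (1 for "Tl" and 0.6 for "745").

```python
# Hypothetical training sample of (Model, Body) pairs.
sample = [("Tl", "Sedan")] * 2 + [("745", "Sedan")] * 3 + [("745", "Coupe")] * 2


def expected_precision(sample, model, body):
    """Frequency estimate of P(Body = body | Model = model)."""
    bodies = [b for m, b in sample if m == model]
    return sum(b == body for b in bodies) / len(bodies)


candidates = ["745", "Tl"]
scores = {m: expected_precision(sample, m, "Sedan") for m in candidates}

# Highest expected precision first.
ranked = sorted(candidates, key=scores.get, reverse=True)
```

The query Model = Tl is ranked ahead of Model = 745 because every "Tl" in the sample is a sedan.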
Ranking Rewritten Queries- only K queries
- When database or network resources are limited, the mediator can choose to issue only the top-K queries to get the most relevant uncertain answers
- It is important to carefully trade off precision against throughput
- Use the F-measure metric (idea borrowed from QPIAD)
  - P: expected precision (e.g., P(Model = 745 | Make = BMW))
  - R: expected recall
  - R = expected precision * expected selectivity
  - Expected selectivity = sample selectivity * sample ratio
  - Sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
  - α = 0: only precision
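The F-measure formula itself did not survive in this transcript. The sketch below assumes the usual weighted form F = (1 + α)·P·R / (α·P + R), which is consistent with the note that α = 0 leaves only precision; treat the exact form as an assumption rather than the definition used in the thesis. The function names and the example numbers are hypothetical.

```python
def expected_recall(exp_precision, sample_selectivity, sample_ratio):
    """R = expected precision * expected selectivity, with selectivity = sample selectivity * sample ratio."""
    return exp_precision * sample_selectivity * sample_ratio


def f_measure(precision, recall, alpha):
    """Assumed weighted F-measure: (1 + alpha) * P * R / (alpha * P + R)."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + alpha) * precision * recall / denominator


p = 0.6
r = expected_recall(p, sample_selectivity=0.01, sample_ratio=20)

precision_only = f_measure(p, r, alpha=0)  # reduces to P
balanced = f_measure(p, r, alpha=1)        # harmonic mean of P and R
```

Larger values of α give more weight to recall, so queries with higher throughput move up the ranking.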
Slide37
Experimental Setup
- Test databases: Cars database consisting of 55K tuples and Adult database consisting of 15K tuples
- Training set: 15% of the database
- Test data is split into two halves
  - One half contains no incompleteness and is used to return the base result set
  - In the other half, all query-constrained attributes are made null
- A copy of the test data is used as the ground truth to compute precision and recall
- This is an aggressive setup, since most databases have <50% incompleteness
Slide38
BN-All-MB vs AFD
Q: Make
BN-All-MB: P(Make = bmw | Model = 330)
AFD: P(Make = bmw | Model = 330)
- When the size of the determining set is > 1, the expected precision values of AFDs (estimated by NBCs) are inaccurate
- Actual precision is lower for AFDs because their expected precisions are inaccurate
Slide39
Shortcoming of BN-All-MB
- Throughput of queries drops drastically as the Markov blanket size increases
- Use F-measure based ranking to increase recall
- When almost all queries have very low throughput, there is simply no way to increase recall

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

Q: Model = 745
Q'1: Make ^ Body ^ Year
Q'2: Make ^ Body ^ Year
Q'3: Make ^ Body ^ Year
Slide40
BN-Beam (Single-attribute queries)
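Q: Model = 745

[Figure: Bayes network over the attributes Year, Model, Make, Body, and Mileage]

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | BMW | 745 | 2005 | Sedan | 20000 |
| 2 | BMW | | 2005 | Sedan | 35000 |
| 3 | BMW | 645 | 2002 | Convt | 45000 |
| 4 | BMW | 745 | 2001 | | 35000 |
| 5 | Acura | Tl | 2002 | Sedan | 24000 |
| 6 | BMW | | 2001 | Sedan | 20000 |

Candidate attribute set = {Year, Make, Body}
Slide41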
BN-Beam
Level 1 (best rewritten queries of size 1)
- Make = BMW
- Year = 2001
- Body = Sedan
Level 2
- Make = BMW ^ Year = 2001
- Make = BMW ^ Year = 2005
- Body = Sedan
Level L
- Q'1, Q'2, Q'3
- Issue to database in the increasing order of expected precision

- Pick the top-K queries at each level based on the F-measure metric
  - P: expected precision (e.g., P(Model = 745 | Make = BMW))
  - R: expected recall
  - R = expected precision * expected selectivity
  - Expected selectivity = sample selectivity * sample ratio
  - Sample ratio is estimated from the cardinalities of the result sets from the sample and the original database
- At level L, all (partial) queries have ≤ L attributes constrained
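The level-by-level selection above can be read as a beam search over conjunctive queries. The sketch below is one possible reading, not the thesis implementation: the function `beam_search`, the representation of a query as a frozenset of (attribute, value) pairs, and the score table are all hypothetical. The scores stand in for F-measure values and were picked so that the result reproduces the Level 1 and Level 2 lists above.

```python
def beam_search(candidates, score, k, max_level):
    """Keep the k best queries per level; at level L each query constrains at most L attributes."""
    beam = sorted((frozenset([c]) for c in candidates), key=score, reverse=True)[:k]
    for _ in range(2, max_level + 1):
        pool = set(beam)  # queries from the previous level stay eligible
        for query in beam:
            used = {attr for attr, _ in query}
            for attr, value in candidates:
                if attr not in used:
                    pool.add(query | {(attr, value)})
        beam = sorted(pool, key=score, reverse=True)[:k]
    return beam


# Candidate (attribute, value) pairs drawn from the candidate attribute set {Year, Make, Body}.
candidates = [("Make", "BMW"), ("Year", 2001), ("Year", 2005), ("Body", "Sedan")]

# Hypothetical F-measure values; any query not listed scores 0.
f_scores = {
    frozenset({("Body", "Sedan")}): 0.40,
    frozenset({("Make", "BMW")}): 0.35,
    frozenset({("Year", 2001)}): 0.30,
    frozenset({("Year", 2005)}): 0.20,
    frozenset({("Make", "BMW"), ("Year", 2001)}): 0.70,
    frozenset({("Make", "BMW"), ("Year", 2005)}): 0.60,
}


def score(query):
    return f_scores.get(query, 0.0)


level_1 = beam_search(candidates, score, k=3, max_level=1)
level_2 = beam_search(candidates, score, k=3, max_level=2)
```

Keeping the previous level's queries in the pool is what allows a short query such as Body = Sedan to survive into Level 2 when it scores better than its extensions.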
Slide42
BN-Beam vs BN-All-MB
- Increasing α does not increase the recall of BN-All-MB
- BN-Beam increases recall without a catastrophic reduction in precision
- Results for top-10 queries for the user query Year = 2002

[Figures: recall plot and precision plot]
Slide43
Multi-attribute queries
- Contribution to QPIAD
- Aim: to retrieve relevant uncertain answers with multiple missing values on query-constrained attributes
Slide44
Multi-attribute queries
Q: Make = BMW ^ Mileage = 40000

| ID | Make | Model | Year | Body | Mileage |
|---|---|---|---|---|---|
| 1 | | 645 | 2002 | Coupe | 40000 |
| 2 | BMW | 645 | 2002 | Convt | |
| 3 | | 745 | 2001 | Sedan | |
| 4 | | 645 | 2002 | Coupe | |
| 5 | BMW | 745 | 2001 | Coupe | 40000 |
| 6 | BMW | 645 | 2002 | Convt | 40000 |

- Base result set = T5, T6
- QPIAD retrieves T1 and T2
- BN-Beam can also retrieve T3 and T4
- Candidate attribute set: union of the attributes in the Markov blankets of all constrained attributes
- All other steps are the same as in the single-attribute query case
Slide45
Comparison over multi-attribute queries
Two AFD approaches
- AFD-All-Attributes: creates a conjunctive query by joining all attributes in the determining sets of the AFDs of the constrained attributes
- Consider the AFDs Model → Make and Year → Mileage

Q: Make = BMW ^ Mileage = 40000
- Make = BMW: Model = 745, Model = 645
- Mileage = 40000: Year = 2001, Year = 2002

Q'1: Model = 745 ^ Year = 2001
Q'2: Model = 645 ^ Year = 2001
Q'3: Model = 745 ^ Year = 2002
Q'4: Model = 645 ^ Year = 2002

Expected precision = product of the individual queries' expected precisions
Slide46
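The AFD-All-Attributes construction for Q: Make = BMW ^ Mileage = 40000 can be sketched as a Cartesian product over the determining-set values, with the expected precision of each conjunctive query taken as the product of its parts. The per-value precisions below are hypothetical, and the function name `all_attribute_queries` is an assumption; the enumeration order of the sketch differs from the Q'1..Q'4 numbering above.

```python
from itertools import product
from math import prod

# Values from the determining sets, each with a hypothetical expected precision.
model_values = {"745": 0.9, "645": 0.8}  # Model -> Make, for Make = BMW
year_values = {2001: 0.5, 2002: 0.4}     # Year -> Mileage, for Mileage = 40000


def all_attribute_queries(*value_sets):
    """Combine one value per determining attribute; precision is the product of the parts."""
    queries = []
    for combo in product(*(vs.items() for vs in value_sets)):
        values = tuple(v for v, _ in combo)
        precision = prod(p for _, p in combo)
        queries.append((values, precision))
    return queries


queries = all_attribute_queries(model_values, year_values)
```

Multiplying the precisions treats the constrained attributes as independent, which matches the later observation that this approach does not consider their joint distribution.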
BN-Beam vs AFD-All-Attributes
- Precision of BN-Beam is competitive with AFD-All-Attributes
- Recall of BN-Beam is higher
- AFD-All-Attributes does not consider the joint distribution between the query-constrained attributes
  - This leads to low-throughput or even empty queries
- Results for top-10 queries, Q: Make ^ Mileage
Slide47
Comparison of multi-attribute queries
- AFD-Highest-Confidence: uses only the AFD of the highest-confidence constrained attribute for rewriting

Q: Make = Dodge ^ Year = 2004
- Ignore all attributes other than Make
- AFD: Model → Make
- Q'1: Model = ram
- Q'2: Model = intrepid
Slide48
BN-Beam vs AFD-Highest-Confidence
- Results for top-10 queries, Q: Make ^ Year (Car database)
- AFD-Highest-Confidence increases recall, but not without a catastrophic drop in precision
Slide49
Summary
- A comparison of the cost and accuracy tradeoffs of using Bayes network models and AFDs for handling incompleteness in autonomous databases
- Bayes nets have a significant edge over AFDs when missing values are on highly correlated attributes and at higher levels of incompleteness in test data
- Presented two approaches, BN-All-MB and BN-Beam, for generating rewritten queries using Bayes networks
- We showed that BN-Beam is able to retrieve tuples with higher recall than BN-All-MB
- We compared Bayes network based rewriting with AFD-based rewriting and found the former to retrieve results with higher precision and recall
Slide50
Deviations From the Thesis Draft
- CAVEAT: I found two bugs in my code (Query Rewriting section)
  - Corrected one bug (related to BN-based rewriting)
  - Will correct the other one (related to AFD-based rewriting) after the defense

THANK YOU
QUESTIONS?