Slide1
Determining the Relative Accuracy of Attributes
Yang Cao 1,2, Wenfei Fan 1,2, Wenyuan Yu 1
1 University of Edinburgh, 2 Beihang University
Slide2
Slide3
Data Accuracy
FD: [FN, LN, team, height date of birth]
CFD: [team = “Chicago Bulls” → arena = “United Center”]
Instance D:
FN | LN | height | date of birth | team | arena
M | J | 7ft | 1963 | Chicago Bulls | United Center
Michael | Jordan | 198cm | 17/02/1963 | Chicago | Chicago Stadium
Find the most accurate values for Jordan within D:
Michael | Jordan | 198cm | 17/02/1963 | Chicago Bulls | United Center
An instance may be consistent, but its values may still be inaccurate.
Data accuracy: a central problem that has not been formally studied.
Applications: data fusion for big data, decision making, information systems, …
Slide4
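Accuracy Rules
Using accuracy rules (ARs) to capture data semantics
Instance D:
tuple | FN | LN | height | unit | date of birth | team | arena
t1 | M | J | 7 | ft | 1963 | Chicago Bulls | United Center
t2 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium
Master data:
tuple | FN | LN | team | season
tm | Michael | Jordan | Chicago Bulls | 94-95
… | … | … | … | …
ARs come in two forms: Form (1) and Form (2).
Values highlighted on the slide: Michael, Jordan, 17/02/1963, Chicago Bulls, United Center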
Slide5
Inferring Relative Accuracy with ARs
A chase-like procedure with ARs to deduce relative accuracy
Instance D:
tuple | FN | LN | height | unit | date of birth | team | arena
t1 | M. | J. | 7 | ft | 1963 | Chicago Bulls | United Center
t2 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium
Master data Dm:
tuple | FN | LN | team | season
tm | Michael | Jordan | Chicago Bulls | 94-95
ARs: φ1, φ2, φ3, φ4 (rule bodies shown as figures)
A chasing sequence: φ1, φ4, φ3
Deduced values: Michael, Jordan, 17/02/1963, Chicago Bulls, United Center
Slide6
Termination Problem
Instance D (t1, t2), master data Dm (tm) and ARs φ1 to φ4 as on Slide5.
Is every chasing sequence (φ1, φ4, φ3, .., φj, ..) finite? Yes.
Every chasing sequence terminates.
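To make the chase-like procedure and its termination concrete, here is a minimal Python sketch. It is not the authors' algorithm: the four rules (`r_name`, `r_dob`, `r_team`, `r_arena`) are hypothetical stand-ins, since the bodies of φ1 to φ4 are not recoverable from the slides, and the real chase also maintains relative-accuracy orders that are omitted here. The sketch only illustrates why such a procedure stops: every productive step fills at least one null attribute of the target tuple, so there can be at most as many productive steps as attributes.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
# A rule inspects D, Dm and the current target, and proposes attribute values.
Rule = Callable[[List[Row], List[Row], Row], Dict[str, str]]

ATTRS = ["FN", "LN", "height", "unit", "dob", "team", "arena"]
D: List[Row] = [
    dict(zip(ATTRS, ["M.", "J.", "7", "ft", "1963", "Chicago Bulls", "United Center"])),
    dict(zip(ATTRS, ["Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium"])),
]
Dm: List[Row] = [{"FN": "Michael", "LN": "Jordan", "team": "Chicago Bulls", "season": "94-95"}]

# Hypothetical rules (assumptions, not the slides' phi1..phi4).
def r_name(D, Dm, te):   # prefer the non-abbreviated name
    best = max(D, key=lambda t: len(t["FN"]))
    return {"FN": best["FN"], "LN": best["LN"]}

def r_dob(D, Dm, te):    # prefer the more detailed date of birth
    return {"dob": max((t["dob"] for t in D), key=len)}

def r_team(D, Dm, te):   # copy the team from a matching master tuple
    for tm in Dm:
        if (tm["FN"], tm["LN"]) == (te["FN"], te["LN"]):
            return {"team": tm["team"]}
    return {}

def r_arena(D, Dm, te):  # constant pattern, in the spirit of the CFD on Slide3
    return {"arena": "United Center"} if te["team"] == "Chicago Bulls" else {}

def chase(D: List[Row], Dm: List[Row], te: Row,
          rules: List[Tuple[str, Rule]]) -> Tuple[Row, List[str]]:
    """Apply rules until none fills a new attribute; return (target, applied rule names)."""
    te = dict(te)
    applied: List[str] = []
    progress = True
    while progress:
        progress = False
        for name, rule in rules:
            proposed = rule(D, Dm, te)
            for attr, val in proposed.items():
                if te[attr] is not None and te[attr] != val:
                    raise ValueError(f"{name} conflicts on {attr}")
            new = {a: v for a, v in proposed.items() if te[a] is None}
            if new:  # a productive step fills at least one null attribute
                te.update(new)
                applied.append(name)
                progress = True
    return te, applied

rules = [("r_name", r_name), ("r_dob", r_dob), ("r_team", r_team), ("r_arena", r_arena)]
target, applied = chase(D, Dm, {a: None for a in ATTRS}, rules)
```

With these assumed rules the target ends up with name, date of birth, team and arena filled in, while `height` and `unit` stay null, which mirrors the incomplete target tuple discussed on the following slides.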
Slide7
Church-Rosser Problem
Instance D (t1, t2), master data Dm (tm) and ARs φ1 to φ4 as on Slide5, plus an additional AR φ5.
Chase steps shown: φ1, φ4, φ3; φ5
Do different chasing sequences always coincide? Not always.
The Church-Rosser property is not guaranteed, but it can be checked in cubic time.
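The order dependence behind the Church-Rosser problem can be illustrated with a toy check. This is a brute-force sketch over rule orderings, not the cubic-time test the slides refer to, and it assumes a simplified chase in which a rule may only fill attributes that are still null. The rules `prefer_cm`, `prefer_ft` and `unit_from_height` are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, Optional, Sequence

Row = Dict[str, Optional[str]]
Rule = Callable[[Row], Dict[str, str]]

def run_chase(te: Row, rules: Sequence[Rule]) -> Row:
    """Apply rules round-robin in the given order; a rule only fills null attributes."""
    te = dict(te)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for attr, val in rule(te).items():
                if te[attr] is None:
                    te[attr] = val
                    changed = True
    return te

def same_result_for_all_orders(te: Row, rules: Sequence[Rule]) -> bool:
    """Brute force: do all rule orderings reach the same final target?"""
    outcomes = {frozenset(run_chase(te, order).items()) for order in permutations(rules)}
    return len(outcomes) == 1

# Hypothetical rules: two that disagree, and one that depends on another.
prefer_cm: Rule = lambda te: {"height": "198", "unit": "cm"}
prefer_ft: Rule = lambda te: {"height": "7", "unit": "ft"}
unit_from_height: Rule = lambda te: {"unit": "cm"} if te["height"] == "198" else {}

empty = {"height": None, "unit": None}
order_dependent = not same_result_for_all_orders(empty, [prefer_cm, prefer_ft])
order_independent = same_result_for_all_orders(empty, [prefer_cm, unit_from_height])
```

In the first rule set whichever rule fires first wins, so the result depends on the order. In the second, both orders converge to the same target.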
Slide8
Fundamental Problems: Deducing candidate targets
Instance D (t1, t2), master data Dm (tm) and ARs φ1 to φ4 as on Slide5. Chasing sequence: φ1, φ4, φ3
Deduced target tuple:
tuple | FN | LN | height | unit | date of birth | team | arena
te | Michael | Jordan | ? | ? | 17/02/1963 | Chicago Bulls | United Center
The target tuple may be incomplete: the need to find candidate targets
Candidate targets:
te'1 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium
te'2 | Michael | Jordan | 7 | ft | 17/02/1963 | Chicago Bulls | United Center
Do candidate targets always exist? Not always.
Slide9
Fundamental Problems: Deducing candidate targets
Instance D (t1, t2), master data Dm (tm) and ARs φ1 to φ4 as on Slide5, plus an additional AR φ6.
Deduced target tuple:
tuple | FN | LN | height | unit | date of birth | team | arena
te | Michael | Jordan | ? | ? | 17/02/1963 | Chicago Bulls | United Center
Candidate targets:
te'1 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium
te'2 | Michael | Jordan | 7 | ft | 17/02/1963 | Chicago Bulls | United Center
ARs highlighted: (φ2, φ6)
It is NP-complete to determine whether candidate targets exist.
There can be exponentially many, or even infinitely many, candidate targets.
Slide10
Fundamental Problems: Top-k candidate targets
Instance D (t1, t2), master data Dm (tm) and ARs φ1 to φ4 as on Slide5.
Deduced target tuple: te = (Michael, Jordan, ?, ?, 17/02/1963, Chicago Bulls, United Center)
Preference model (k, p(.)): p(.) is any monotone scoring function (e.g., occurrences)
Top-k candidate targets problem: whether there exists a k-set Te such that p(Te) > C
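As an illustration of an occurrence-based scoring function, the sketch below counts, for every attribute value of a candidate target, how often that value occurs in the same attribute of D, and sums the counts over a set of candidates. This is one possible reading of "occurrences", assumed here rather than taken from the slides. Under this reading the two candidate targets te'1 and te'2 shown in the deck score 7 each, which agrees with the value p(Te) = 14 that the slide reports for k = 2.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]
ATTRS = ["FN", "LN", "height", "unit", "dob", "team", "arena"]

def row(*values: str) -> Row:
    """Build a tuple over ATTRS from positional values."""
    return dict(zip(ATTRS, values))

# Instance D (tuples t1 and t2) as shown on the slides.
D: List[Row] = [
    row("M.", "J.", "7", "ft", "1963", "Chicago Bulls", "United Center"),
    row("Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium"),
]

def occurrence_score(candidates: List[Row], D: List[Row]) -> int:
    """Assumed p(.): sum, over candidates and attributes, of value occurrences in D."""
    counts = Counter((a, t[a]) for t in D for a in ATTRS)
    return sum(counts[(a, c[a])] for c in candidates for a in ATTRS)

# Candidate targets te'1 and te'2 as shown on the slides.
te1 = row("Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium")
te2 = row("Michael", "Jordan", "7", "ft", "17/02/1963", "Chicago Bulls", "United Center")

score = occurrence_score([te1, te2], D)
```

The function is monotone in the sense that raising the occurrence count of any value never lowers the score.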
The top-k candidate targets problem is NP-complete.
Example candidate targets:
te'1 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium
te'2 | Michael | Jordan | 7 | ft | 17/02/1963 | Chicago Bulls | United Center
k = 2, p(Te) = 14
Slide11
A framework for deducing target tuples
Is S Church-Rosser? (Yes / No)
Complete te derived? Yes: return te. No: compute top-k candidate targets Te under the preference model (k, p(.)).
User feedback on Te: t'e
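The control flow of the framework can be sketched as a small driver loop. Everything here is an assumption made for illustration: the callables `is_church_rosser`, `chase`, `top_k` and `ask_user` are hypothetical placeholders for the framework's components, and returning `None` when the specification is not Church-Rosser is a guess, since the flowchart text does not say what happens on that branch.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]

def is_complete(te: Row) -> bool:
    return all(v is not None for v in te.values())

def deduce_target(
    te: Row,
    is_church_rosser: Callable[[], bool],
    chase: Callable[[Row], Row],
    top_k: Callable[[Row], List[Row]],
    ask_user: Callable[[Row, List[Row]], Row],
    max_rounds: int = 5,
) -> Optional[Row]:
    """Hypothetical driver: check, chase, suggest candidates, take feedback, repeat."""
    if not is_church_rosser():
        return None  # assumption: the slides do not state what happens on "No"
    for _ in range(max_rounds):
        te = chase(te)
        if is_complete(te):
            return te
        candidates = top_k(te)             # top-k candidate targets Te
        feedback = ask_user(te, candidates)  # values the user confirms (t'e)
        te = {**te, **feedback}
    te = chase(te)
    return te if is_complete(te) else None

# Stub components, for illustration only.
start: Row = {"FN": "Michael", "LN": "Jordan", "height": None, "unit": None}
result = deduce_target(
    start,
    is_church_rosser=lambda: True,
    chase=lambda te: te,  # stub: deduces nothing new
    top_k=lambda te: [{"height": "198", "unit": "cm"}, {"height": "7", "unit": "ft"}],
    ask_user=lambda te, cands: cands[0],  # stub: user accepts the top candidate
)
```

With the stubs above, one round of feedback is enough to complete the target.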
Algorithms in the framework: IsCR, RankJoinCT, TopKCT, TopKCTh
Slide12
Algorithms
Checking Church-Rosser property
IsCR (time bound shown as a formula on the slide)
Top-k candidate targets
RankJoinCT: rank join based top-k algorithm
TopKCT: priority queue based
TopKCTh: heuristic
Slide13
TopKCT: Brodal Queue based Top-k algorithm
Input: a Church-Rosser specification S; a preference model (k, p(.)); a heap for each attribute A with null values in te
Output: the set Te of top-k scored candidate targets w.r.t. (k, p(.))
Instance D:
tuple | FN | LN | height | unit | date of birth | team | arena
t1 | M. | J. | 7 | ft | 1963 | Chicago Bulls | United Center
t2 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium
t3 | Michael | Jordan | 200 | cm | 17/02/1963 | Chicago | Chicago Stadium
t4 | Michael | Jordan | 198 | m | 17/02/1963 | Chicago Bulls | United Center
Target tuple:
te | Michael | Jordan | ? | ? | 17/02/1963 | Chicago Bulls | United Center
Heaps: H_height = {7, 198, 200}, H_unit = {ft, cm, m}
Top-1: te[height, unit] = (198, cm)
Slide14
TopKCT: Brodal Queue based Top-k algorithm
Input: a Church-Rosser specification S; a preference model (k, p(.)); a heap for each attribute A with null values in te
Output: the set Te of top-k scored candidate targets w.r.t. (k, p(.))
TopKCT has the early termination property and is instance optimal.
Early termination: it stops as soon as the top-k candidate targets are found.
Instance optimal: w.r.t. the number of visits of each heap, with optimality ratio 1.
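The idea of keeping one ranked structure per null attribute and stopping as soon as k candidates are found can be sketched as a best-first enumeration. This is an illustrative stand-in, not TopKCT itself: it uses Python's binary heap (`heapq`) instead of a Brodal queue, scores values by their occurrence counts in D (an assumed p(.)), and takes a hypothetical `is_valid` callback in place of the check that a candidate is consistent with the specification. Only the `height` and `unit` columns of t1 to t4 are reproduced.

```python
import heapq
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, str]

# height and unit of t1..t4 from the slide
D: List[Row] = [
    {"height": "7", "unit": "ft"},
    {"height": "198", "unit": "cm"},
    {"height": "200", "unit": "cm"},
    {"height": "198", "unit": "m"},
]

def ranked_values(D: List[Row], attr: str) -> List[Tuple[str, int]]:
    """Values of attr with their occurrence counts, best first."""
    counts = Counter(t[attr] for t in D)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def top_k_candidates(
    D: List[Row],
    null_attrs: Sequence[str],
    k: int,
    is_valid: Callable[[Row], bool] = lambda cand: True,  # hypothetical check
) -> List[Tuple[Row, int]]:
    lists = [ranked_values(D, a) for a in null_attrs]

    def score(idx: Tuple[int, ...]) -> int:
        return sum(lists[i][j][1] for i, j in enumerate(idx))

    start = (0,) * len(lists)
    frontier = [(-score(start), start)]  # max-heap via negated scores
    seen = {start}
    result: List[Tuple[Row, int]] = []
    while frontier and len(result) < k:
        neg, idx = heapq.heappop(frontier)
        cand = {a: lists[i][idx[i]][0] for i, a in enumerate(null_attrs)}
        if is_valid(cand):
            result.append((cand, -neg))
            if len(result) == k:
                break  # early termination: the top k have been found
        # Successors advance one list by one position; their score cannot be higher.
        for i in range(len(idx)):
            if idx[i] + 1 < len(lists[i]):
                nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(frontier, (-score(nxt), nxt))
    return result

top1 = top_k_candidates(D, ["height", "unit"], k=1)
top2 = top_k_candidates(D, ["height", "unit"], k=2)
```

Because each successor scores no higher than the combination it was derived from, candidates come off the heap in non-increasing score order, so the first k valid ones are the top k. On the slide's data the top-1 result is (198, cm), matching the slide.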
An algorithm A is said to be instance optimal if there exist constants c1 and c2 such that cost(A, I) ≤ c1 · cost(A', I) + c2 for all instances I and all algorithms A' in the same setting as A. The constant c1 is the optimality ratio.
Slide15
Experimental Study: Settings
Datasets
Med: sale records of medicines from various stores. 10K tuples for 2.7K entries; 2.4K tuples as master data; 105 ARs
CFP: call for papers/participation found by Google. 503 tuples for 100 entries; 55 tuples as master data; 43 ARs
Rest: Restaurant data*. 246 tuples 5149 entries; 131 ARs
Syn: Synthetic data generator. 20 attributes; ARs: 75% of form (1), 25% of form (2)
Implementation
64 bit Linux Amazon EC2 High-CPU Extra Large Instance
7GB of memory and 20 EC2 Compute Units
Slide16
Experimental Study: IsCR
Effectiveness of IsCR
Complete target tuples: could be deduced for over 2/3 of the entries without user interaction
Non-null values: over 70% when ARs of both form (1) and form (2) are used
Slide17
Experimental Study: candidate targets
Computing top-k candidates
k doesn’t have to be large: k = 15 suffices for over 85% of the entries
Master data does help, but even when it is not available, TopKCT still works well
Slide18
Experimental Study: user interaction
User interactions
Few rounds of interaction are needed to deduce the targets for all the entries: at most 3 for Med and 4 for CFP
Slide19
Experimental Study: efficiency
Efficiency
Slide20
Experimental Study: efficiency
Efficiency
For Syn with ||Ie|| = 1500, ||Im|| = 300 and =50, TopKCTh, TopKCT and RankJoinCT took 159 ms, 271 ms and 1983 ms, respectively.
Slide21
Conclusion
Summary
A model for determining relative accuracy
Fundamental problems
A framework for deducing relative accuracy
Algorithms underlying the framework
Outlook
Discovery of ARs
Improve the accuracy of data in a database