Determining the Relative Accuracy of Attributes

Uploaded by alexa-scheidler on 2016-03-16

Yang Cao (1, 2), Wenfei Fan (1, 2), Wenyuan Yu (1)
1 University of Edinburgh, 2 Beihang University
ID: 258130



Presentation Transcript

Slide1

Determining the Relative Accuracy of Attributes

Yang Cao 1, 2   Wenfei Fan 1, 2   Wenyuan Yu 1
1 University of Edinburgh   2 Beihang University

Slide2

Slide3

FD: [FN, LN, team, height → date of birth]
CFD: [team = “Chicago Bulls” → arena = “United Center”]

FN      | LN     | height | date of birth | team          | arena
M       | J      | 7ft    | 1963          | Chicago Bulls | United Center
Michael | Jordan | 198cm  | 17/02/1963    | Chicago       | Chicago Stadium

Most accurate values: Michael, Jordan, 198cm, 17/02/1963, Chicago Bulls, United Center

An instance may be consistent, but its values may still be inaccurate.
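To make the point about consistency concrete, the following is a minimal sketch that checks the FD and the CFD from this slide against the two-tuple instance. The helper names `satisfies_fd` and `satisfies_cfd`, the key `dob` (short for date of birth), and the reading of the first row as FN = M, LN = J are assumptions made for illustration; they are not from the slides.

```python
from itertools import combinations

# Instance D from the slide ("dob" abbreviates "date of birth").
D = [
    {"FN": "M", "LN": "J", "height": "7ft", "dob": "1963",
     "team": "Chicago Bulls", "arena": "United Center"},
    {"FN": "Michael", "LN": "Jordan", "height": "198cm", "dob": "17/02/1963",
     "team": "Chicago", "arena": "Chicago Stadium"},
]

def satisfies_fd(rows, lhs, rhs):
    """FD lhs -> rhs: rows that agree on every lhs attribute must agree on rhs."""
    return all(
        r1[rhs] == r2[rhs]
        for r1, r2 in combinations(rows, 2)
        if all(r1[a] == r2[a] for a in lhs)
    )

def satisfies_cfd(rows, condition, consequence):
    """Constant CFD: rows matching every (attr, value) in condition must match consequence."""
    attr, value = consequence
    return all(
        r[attr] == value
        for r in rows
        if all(r[a] == v for a, v in condition.items())
    )

fd_ok = satisfies_fd(D, ["FN", "LN", "team", "height"], "dob")
cfd_ok = satisfies_cfd(D, {"team": "Chicago Bulls"}, ("arena", "United Center"))
print(fd_ok, cfd_ok)
```

Both checks pass, yet the first tuple still carries the imprecise values "1963" and "7ft": the constraints detect inconsistency, not inaccuracy.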

Applications: data fusion for big data, decision making, information systems

Data accuracy: a central problem that has not been formally studied

Data Accuracy

Find the most accurate values for Jordan within D

Slide4

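Accuracy Rules

Using accuracy rules (ARs) to capture data semantics

D:
tuple | FN      | LN     | height | unit | date of birth | team          | arena
t1    | M       | J      | 7      | ft   | 1963          | Chicago Bulls | United Center
t2    | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium

Dm:
tuple | FN      | LN     | team          | season
tm    | Michael | Jordan | Chicago Bulls | 94-95
…     | …       | …      | …             | …

ARs come in two forms, Form (1) and Form (2). [Rule bodies not preserved in the transcript.]

Values shown for the target: Michael, Jordan, 17/02/1963, Chicago Bulls, United Center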

Slide5

Inferring Relative Accuracy with ARs

A chase-like procedure with ARs to deduce relative accuracy

D:
tuple | FN      | LN     | height | unit | date of birth | team          | arena
t1    | M.      | J.     | 7      | ft   | 1963          | Chicago Bulls | United Center
t2    | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium

Dm:
tuple | FN      | LN     | team          | season
tm    | Michael | Jordan | Chicago Bulls | 94-95

ARs: φ1, φ2, φ3, φ4 [rule bodies not preserved in the transcript]

A chasing sequence: φ1, φ4, φ3

Deduced values: Michael, Jordan, 17/02/1963, Chicago Bulls, United Center

Slide6

D and Dm: same tables as on Slide5.

ARs: φ1, φ2, φ3, φ4 [rule bodies not preserved in the transcript]

Termination Problem: every chasing sequence terminates.
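The following is a minimal sketch of a chase-like fixpoint and of why it must stop. The two rules are hypothetical stand-ins, since the bodies of φ1 to φ4 are not preserved here; master data and form (2) rules are not modelled. A fact `(attr, a, b)` is read as "tuple b is at least as accurate as tuple a on attr".

```python
from itertools import permutations

# Two tuples for the same entity ("dob" abbreviates "date of birth").
D = {
    "t1": {"FN": "M.", "LN": "J.", "dob": "1963"},
    "t2": {"FN": "Michael", "LN": "Jordan", "dob": "17/02/1963"},
}

# Hypothetical rules, NOT the slide's φ1..φ4.
def rule_full_date(rows, order, a, b):
    # Assumption: a full date is more accurate than a bare year.
    if len(rows[a]["dob"]) < len(rows[b]["dob"]):
        yield ("dob", a, b)

def rule_dob_implies_name(rows, order, a, b):
    # Assumption: the tuple with the more accurate dob also has the more accurate name.
    if ("dob", a, b) in order:
        yield ("FN", a, b)
        yield ("LN", a, b)

def chase(rows, rules):
    """Apply rules until no rule adds a new fact; count the effective steps."""
    order, steps, changed = set(), 0, True
    while changed:
        changed = False
        for rule in rules:
            for a, b in permutations(rows, 2):
                new = set(rule(rows, order, a, b)) - order
                if new:
                    order |= new
                    steps += 1
                    changed = True
    return order, steps

def deduce_target(rows, order, attrs):
    """te[attr] is the value of a tuple that every other tuple is below on attr."""
    te = {}
    for attr in attrs:
        for b in rows:
            if all((attr, a, b) in order for a in rows if a != b):
                te[attr] = rows[b][attr]
    return te

order, steps = chase(D, [rule_dob_implies_name, rule_full_date])
te = deduce_target(D, order, ["FN", "LN", "dob"])
print(steps, te)
```

Every counted step adds at least one new fact, and there are at most (number of attributes) × n × (n − 1) facts over n tuples, so the loop cannot run forever. This mirrors the termination claim in this simplified setting only.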

 

 

 

 

 

Is every chasing sequence (φ1, φ4, φ3, ..., φj, ...) finite? Yes.

Slide7

D and Dm: same tables as on Slide5.

ARs: φ1, φ2, φ3, φ4, φ5 [rule bodies not preserved in the transcript]

Chase steps shown: φ1, φ4, φ3; φ5

Do different chasing sequences coincide? Not always.

Church-Rosser Problem: the Church-Rosser property is not guaranteed, but it can be checked in cubic time.
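As an illustration of why different chasing sequences may disagree, here is a small brute-force sketch. It is not the cubic-time check mentioned above: it runs one chasing sequence per rule ordering and collects the distinct end states. The rule encoding (a premise set plus one concluded fact) and the conflict test (a fact is blocked if its reverse is already present) are assumptions made for this example.

```python
from itertools import permutations

def run(rules):
    """Run one chasing sequence, always firing the first applicable rule."""
    order = set()
    progress = True
    while progress:
        progress = False
        for premise, (attr, a, b) in rules:
            fact, reverse = (attr, a, b), (attr, b, a)
            # A step is valid only if it adds a new fact and causes no conflict.
            if premise <= order and fact not in order and reverse not in order:
                order.add(fact)
                progress = True
                break
    return frozenset(order)

def distinct_outcomes(rules):
    """End states reached over all rule orderings (more than one: not Church-Rosser)."""
    return {run(p) for p in permutations(rules)}

# Hypothetical rules that contradict each other on the attribute "team".
rules_conflict = [
    (frozenset(), ("team", "t1", "t2")),
    (frozenset(), ("team", "t2", "t1")),
]

# Hypothetical rules that build on each other without conflict.
rules_ok = [
    (frozenset(), ("dob", "t1", "t2")),
    (frozenset({("dob", "t1", "t2")}), ("FN", "t1", "t2")),
]

print(len(distinct_outcomes(rules_conflict)), len(distinct_outcomes(rules_ok)))
```

More than one outcome is a witness that the property fails. A single outcome from this sketch is only evidence, since rule orderings do not cover every possible chasing sequence in general.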

Slide8

Fundamental Problems: Deducing candidate targets

D and Dm: same tables as on Slide5.

ARs: φ1, φ2, φ3, φ4 [rule bodies not preserved in the transcript]

Chasing sequence: φ1, φ4, φ3

The target tuple may be incomplete: the need to find candidate targets

tuple | FN      | LN     | height | unit | date of birth | team          | arena
te    | Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center
te'1  | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
te'2  | Michael | Jordan | 7      | ft   | 17/02/1963    | Chicago Bulls | United Center

Do candidate targets always exist? Not always.

Slide9

Fundamental Problems: Deducing candidate targets

D and Dm: same tables as on Slide5.

ARs: φ1, φ2, φ3, φ4, φ6 [rule bodies not preserved in the transcript]

tuple | FN      | LN     | height | unit | date of birth | team          | arena
te    | Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center
te'1  | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
te'2  | Michael | Jordan | 7      | ft   | 17/02/1963    | Chicago Bulls | United Center

(φ2, φ6)

It is NP-complete to determine whether candidate targets exist.

There can be exponentially many, or even infinitely many, candidate targets.
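The exponential growth can be seen in a simplified setting. The sketch below fills each null of te with values taken from the instance (an assumption: it ignores values from outside D, which is where infinitely many candidates can come from) and skips the validity check against the ARs that a real candidate target must pass. The function names are illustrative.

```python
from itertools import product

# Values of the two attributes that are null in te, taken from t1 and t2.
D = [
    {"height": 7, "unit": "ft"},
    {"height": 198, "unit": "cm"},
]

# Incomplete target tuple; None marks a null ("dob" abbreviates "date of birth").
te = {"FN": "Michael", "LN": "Jordan", "height": None, "unit": None,
      "dob": "17/02/1963", "team": "Chicago Bulls", "arena": "United Center"}

def active_domains(rows, attrs):
    """Distinct values per attribute that actually occur in the instance."""
    return {a: sorted({r[a] for r in rows}, key=str) for a in attrs}

def enumerate_completions(target, rows):
    """Yield every way of filling the nulls of target from the active domains."""
    nulls = [a for a, v in target.items() if v is None]
    domains = active_domains(rows, nulls)
    for combo in product(*(domains[a] for a in nulls)):
        yield {**target, **dict(zip(nulls, combo))}

completions = list(enumerate_completions(te, D))
print(len(completions))
```

With m null attributes and d distinct values each, this search space has d^m completions, which is why enumerating all of them does not scale.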

 Slide10

Fundamental Problems: Top-k candidate targets

D and Dm: same tables as on Slide5.

ARs: φ1, φ2, φ3, φ4 [rule bodies not preserved in the transcript]

Preference model (k, p(.)): p(.) is any monotone scoring function (e.g., occurrences).

Top-k candidate targets problem: whether there exists a k-set Te such that p(Te) > C.

The top-k candidate targets problem is NP-complete.

tuple | FN      | LN     | height | unit | date of birth | team          | arena
te    | Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center
te'1  | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
te'2  | Michael | Jordan | 7      | ft   | 17/02/1963    | Chicago Bulls | United Center

K = 2, p(Te) = 14

Slide11

A framework for deducing target tuples

[Flowchart; labels recovered from the figure]
Is S Church-Rosser? (IsCR) Yes / No
Complete te derived? Yes: return te. No: compute top-k candidate targets Te.
Compute top-k candidate targets Te, using the preference model (k, p(.)): RankJoinCT, TopKCT, TopKCTh
Feedback: Te, t'e

Slide12

Algorithms

Checking the Church-Rosser property
IsCR [complexity bound garbled in extraction: "(2+))"]

Top-k candidate targets
RankJoinCT: rank-join based top-k algorithm
TopKCT: priority queue based
TopKCTh: heuristic

Slide13

TopKCT: Brodal Queue based Top-k algorithm

Input: a Church-Rosser specification S; a preference model (k, p(.)); a heap for each attribute A with a null value in te
Output: the set Te of top-k scored candidate targets w.r.t. (k, p(.))

D:
tuple | FN      | LN     | height | unit | date of birth | team          | arena
t1    | M.      | J.     | 7      | ft   | 1963          | Chicago Bulls | United Center
t2    | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
t3    | Michael | Jordan | 200    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
t4    | Michael | Jordan | 198    | m    | 17/02/1963    | Chicago Bulls | United Center

te    | Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center

Heaps: H_height = {7, 198, 200}; H_unit = {ft, cm, m}

Top-1: te[height, unit] = (198, cm)

Slide14

TopKCT: Brodal Queue based Top-k algorithm

Input: a Church-Rosser specification S; a preference model (k, p(.)); a heap for each attribute A with a null value in te
Output: the set Te of top-k scored candidate targets w.r.t. (k, p(.))

Early termination: stops as soon as the top-k candidate targets are found.
Instance optimal: w.r.t. the number of visits of each heap, with optimality ratio 1.

TopKCT has the early termination property and is instance optimal.
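A rough sketch of the early-termination idea, under several assumptions: Python's binary heap `heapq` stands in for the Brodal queue, the score p(.) is the sum of occurrence counts, and every value combination is accepted as a candidate (the validity check against the ARs is omitted). The function names are hypothetical. The data are the height and unit columns of t1 to t4.

```python
import heapq
from collections import Counter

# height and unit columns of t1..t4.
D = [
    {"height": 7, "unit": "ft"},
    {"height": 198, "unit": "cm"},
    {"height": 200, "unit": "cm"},
    {"height": 198, "unit": "m"},
]

def ranked_values(rows, attr):
    """(count, value) pairs for attr, most frequent first."""
    counts = Counter(r[attr] for r in rows)
    return sorted(((c, v) for v, c in counts.items()), key=lambda cv: -cv[0])

def top_k_candidates(domains, k):
    """Best-first search over value combinations; stops after k results."""
    attrs = list(domains)

    def score(idx):
        return sum(domains[a][i][0] for a, i in zip(attrs, idx))

    start = tuple(0 for _ in attrs)
    frontier = [(-score(start), start)]  # max-heap via negated scores
    seen = {start}
    results = []
    while frontier and len(results) < k:
        neg, idx = heapq.heappop(frontier)
        results.append((-neg, {a: domains[a][i][1] for a, i in zip(attrs, idx)}))
        # Successors: move one attribute to its next-best value.
        for pos, a in enumerate(attrs):
            nxt = idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:]
            if nxt[pos] < len(domains[a]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return results

domains = {a: ranked_values(D, a) for a in ("height", "unit")}
top1 = top_k_candidates(domains, 1)
top3 = top_k_candidates(domains, 3)
print(top1)
```

Because the score is monotone in each attribute's rank, the best unseen combination is always on the frontier, so the search can stop after k pops instead of scoring all combinations. On this data the top-1 result agrees with the (198, cm) shown on Slide13.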

An algorithm A is said to be instance optimal if there exist constants c1 and c2 such that cost(A, I) ≤ c1 · cost(A', I) + c2 for all instances I and all algorithms A' in the same setting as A. The constant c1 is the optimality ratio.

Slide15

Experimental Study: Settings

Datasets
Med: sale records of medicines from various stores. 10K tuples for 2.7K entries; 2.4K tuples as master data; 105 ARs
CFP: calls for papers/participation found by Google. 503 tuples for 100 entries; 55 tuples as master data; 43 ARs
Rest: restaurant data*. 246 tuples 5149 entries; 131 ARs
Syn: synthetic data generator. 20 attributes; ARs: 75% of form (1), 25% of form (2)

Implementation
64-bit Linux on an Amazon EC2 High-CPU Extra Large Instance
7GB of memory and 20 EC2 Compute Units

Slide16

Experimental Study: IsCR

Effectiveness of IsCR

Complete target tuples: could be deduced for over 2/3 of the entries without user interaction
Non-null values: over 70% when ARs of both form (1) and form (2) are used

Slide17

Experimental Study: candidate targets

Computing top-k candidates

k doesn’t have to be large: k = 15 suffices for over 85% of the entries.
Master data does help, but even when it is not available, TopKCT still works well.

Slide18

Experimental Study: user interaction

User interactions

Few rounds of interaction are needed to deduce the targets for all the entries: at most 3 for Med and 4 for CFP.

Slide19

Experimental Study: efficiency

Efficiency

Slide20

Experimental Study: efficiency

Efficiency

For Syn with ||Ie|| = 1500, ||Im|| = 300 and a third parameter (symbol lost in extraction) = 50, TopKCTh, TopKCT and RankJoinCT took 159 ms, 271 ms and 1983 ms, respectively.

Slide21

Conclusion

Summary
A model for determining relative accuracy
Fundamental problems
A framework for deducing relative accuracy
Algorithms underlying the framework

Outlook
Discovery of ARs
Improve the accuracy of data in a database