
Progressive Approach to Relational Entity Resolution - PowerPoint Presentation

Uploaded On 2018-12-07



Presentation Transcript

Slide1

Progressive Approach to Relational Entity Resolution

Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

Slide2

Data Processing Flow

Data -> Analysis -> Decision

Quality of Data -> Quality of Analysis -> Quality of Decision

Data Quality Challenges

Erroneous Values.

Missing Values.

Duplication.

Data Cleaning Accounts For: 80%

Slide3

Entity Resolution (ER)

Digital World Entities and Real World Objects

Michael Jordan: Basketball Player

Michael Jordan: Professor @ UCB

Slide4

Entity Resolution (ER)

Id | Product Name | Price
p1 | IPad Two 16GB WiFi | $490
p2 | IPad 2nd Generatation 16GB WiFi | $469
p3 | Apple Phone 4 32 GB | $545
p4 | Apple iPod Shuffle 2GB | $49
p5 | IPhone 4th Generation 32GB | $520

[Figure: records grouped as {P1, P2}, {P4}, {P3, P5}]

Slide5

Blocking

Dataset

Id | Product Name | Price
p1 | IPad Two 16GB WiFi | $490
p2 | IPad 2nd Generatation 16GB WiFi | $469
p3 | Apple Phone 4 32 GB | $545
p4 | Apple iPod Shuffle 2GB | $49
p5 | IPhone 4th Generation 32GB | $520

Blocks: {p1, p2, p5}, {p3, p4}

BF = 1st char of product name
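A minimal sketch of the blocking step above, in Python. It applies the slide's blocking function (first character of the product name) to the five products from the Dataset table; the helper name `block_by` and the dictionary layout are assumptions made for illustration, not part of the presentation.

```python
from collections import defaultdict

# Product names copied from the Dataset table on this slide.
products = {
    "p1": "IPad Two 16GB WiFi",
    "p2": "IPad 2nd Generatation 16GB WiFi",
    "p3": "Apple Phone 4 32 GB",
    "p4": "Apple iPod Shuffle 2GB",
    "p5": "IPhone 4th Generation 32GB",
}


def block_by(records, blocking_fn):
    """Group record ids by the key that blocking_fn computes from each value."""
    blocks = defaultdict(list)
    for record_id, value in records.items():
        blocks[blocking_fn(value)].append(record_id)
    return dict(blocks)


# BF = first character of the product name.
blocks = block_by(products, lambda name: name[0])
print(blocks)
```

Only records that share a block are compared later, so under this blocking function p1 is never compared with p3 or p4.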

[Figure: blocking functions BF1 and BF2, each producing its own set of blocks]

Slide6

Similarity Computation

Id | Product Name | Price
p3 | Apple Phone 4 32 GB | $545
p4 | Apple iPod Shuffle 2GB | $49

Similarity Functions: ( , ) = 0.125

Resolve Function: Resolve( ) = duplicate, distinct, or uncertain
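The value 0.125 on this slide is consistent with a token-level Jaccard similarity between the two product names (1 shared token out of 8 distinct tokens). The transcript does not preserve the actual similarity formula, so the sketch below is an assumption that happens to reproduce the number; `jaccard` is a hypothetical helper name.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A intersect B| / |A union B|."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Product names of p3 and p4 from the table on this slide.
similarity = jaccard("Apple Phone 4 32 GB", "Apple iPod Shuffle 2GB")
print(similarity)
```

The only shared token is "Apple", and the union holds 8 distinct tokens, which gives 1/8.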

 

 

 

 

 Slide7

Progressive ER

Slide8

Real-time Analysis of Big Data

Applications: Event Monitoring, Situational Awareness, Real-time Alerts, Semantic Search, Anti-terrorism

Data Cleaning

Slide9

How Progressive ER Helps

Progressive Data Cleaning

Progressive Analysis

Continually Refined Results

Slide10

Relational Dataset

Venue

Id | Name | Papers
u1 | Very Large Data Bases | {p1}
u2 | ICDE Conference | {p2}
u3 | VLDB | {p3}
u4 | IEEE Data Eng. Bull | {p4}

Paper

Id | Title | Authors | Venue
p1 | Transaction Support in Read Optimized … | {a1, a2} | u1
p2 | Read Optimized File System Designs: | {a1} | u2
p3 | Transaction Support in Read Optimized … | {a3, a4} | u3
p4 | Berkeley DB: A Retrospective .. | {a3} | u4

Author

Id | Name | Papers
a1 | Marge Seltzer | {p1, p2}
a2 | Michael Stonebraker | {p1}
a3 | Margo I. Seltzer | {p3, p4}
a4 | M. Stonebraker | {p3}

Slide11

Graph Representation

[Figure: nodes (u1, u3) and (p1, p3), with Resolve and duplicate labels]

Slide12

Problem Definition

Given a relational dataset D and a cost budget BG, our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.

Slide13

ER Graph

[Figure: blocks R1, R2, S1, S2, T1, T2]

Slide14

ER Graph

[Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12]

Slide15

Partially Constructed Graph

[Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12]

Slide16

Overview

Window 1, Window 2, …, Window n

Plan Generation.

Plan Execution ( ).

Resolution Plan ( ):

Set of blocks ( ) to be instantiated.

Set of nodes ( ) to be resolved.
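The window-by-window flow above can be sketched as a loop that alternates plan generation and plan execution until the budget BG is spent. This is only an illustration under stated assumptions: `progressive_er`, `generate_plan`, and `execute_plan` are hypothetical names, and the toy plan generator below simply takes the next two pending nodes at unit cost rather than using the cost and benefit model of the presentation.

```python
from typing import Callable, List, Tuple


def progressive_er(
    budget: float,
    generate_plan: Callable[[float], Tuple[List[str], float]],
    execute_plan: Callable[[List[str]], None],
) -> int:
    """Run one window per iteration until the budget is spent.

    Returns the number of windows that were executed.
    """
    remaining = budget
    windows = 0
    while remaining > 0:
        plan, cost = generate_plan(remaining)
        # Stop when nothing is left to do or the plan does not fit.
        if not plan or cost <= 0 or cost > remaining:
            break
        execute_plan(plan)
        remaining -= cost
        windows += 1
    return windows


# Toy setup (hypothetical): every node costs 1 unit to resolve.
pending = ["v1", "v2", "v3", "v4", "v5"]
resolved: List[str] = []


def toy_generate(remaining: float) -> Tuple[List[str], float]:
    plan = pending[: min(2, int(remaining))]
    return plan, float(len(plan))


def toy_execute(plan: List[str]) -> None:
    for node in plan:
        pending.remove(node)
        resolved.append(node)


windows = progressive_er(3.0, toy_generate, toy_execute)
print(windows, resolved)
```

With a budget of 3 units the loop runs two windows and resolves three of the five nodes; the remaining nodes stay unresolved, which is the point of a progressive approach.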

Cost budget: BG

Slide17

Plan Execution Phase

[Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12]

Slide18

Plan Cost and Benefit

Slide19

Node Benefit

Direct Benefit

Indirect Benefit

[Figure: nodes v1 to v6 and their State]

Slide20

Probability Estimation

Noisy-OR Model

[Figure: three Cause nodes leading to one Effect node]

Effect:

Node v_i being duplicate.

Causes:

Influencing duplicate nodes of v_i.

Block to which v_i belongs.

Fraction of duplicate pairs in the block.
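For reference, the standard noisy-OR combination says the effect fails to occur only if every cause independently fails to trigger it, so P(effect) = 1 - (1 - p_1)(1 - p_2)...(1 - p_k). The sketch below assumes each cause (an influencing duplicate node, or the block's fraction of duplicate pairs) has already been turned into a probability p_k; how the presentation derives those values is not recoverable from the transcript, and `noisy_or` is a hypothetical helper name.

```python
from typing import Iterable


def noisy_or(cause_probs: Iterable[float]) -> float:
    """Probability that at least one independent cause triggers the effect."""
    p_no_cause_fires = 1.0
    for p in cause_probs:
        p_no_cause_fires *= 1.0 - p
    return 1.0 - p_no_cause_fires


# Hypothetical inputs: two causes, each triggering the effect with probability 0.5.
p_duplicate = noisy_or([0.5, 0.5])
print(p_duplicate)
```

Adding a cause can only raise the estimate, which matches the intuition that more duplicate neighbours make a node more likely to be a duplicate.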

Slide21

Example

[Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12; three nodes labeled duplicate and one labeled distinct; callouts for v1, v2 and S1]

Slide22

Node Impact

[Figure: nodes v1 to v7]

Dependent Nodes

Nearest Nodes (K=2)

Belief Update: NP-hard.

The Nearest Nodes are not always instantiated.

Why?

Slide23

Impact Model

[Figure: Case #1, Case #2, Case #3, each over nodes v1 to v7]

Slide24

Plan Generation Phase

Generate a plan such that ( ) is maximized.

Benefit-vs-Cost Analysis: Each node and block has an updated cost and benefit.
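Because the slide relates plan generation to a knapsack problem, one common heuristic is to rank candidates by benefit per unit cost and add them while the window's budget allows. The sketch below shows that generic greedy heuristic, not the plan generation algorithm of the presentation; the function name `greedy_plan`, the candidate tuples, and all cost and benefit numbers are made up for illustration, and dependencies between nodes and their blocks are ignored.

```python
from typing import List, Tuple

# (name, cost, benefit); costs are assumed to be strictly positive.
Candidate = Tuple[str, float, float]


def greedy_plan(candidates: List[Candidate], budget: float) -> Tuple[List[str], float]:
    """Pick candidates in decreasing benefit/cost order while they fit the budget."""
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    plan: List[str] = []
    spent = 0.0
    for name, cost, _benefit in ranked:
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan, spent


# Hypothetical candidates: three nodes and one block.
candidates = [
    ("v1", 2.0, 6.0),
    ("v2", 1.0, 2.0),
    ("R4", 5.0, 5.5),
    ("v6", 3.0, 3.0),
]
plan, spent = greedy_plan(candidates, budget=6.0)
print(plan, spent)
```

A greedy ranking gives no optimality guarantee for knapsack-style problems; it is shown here only to make the benefit-versus-cost trade-off concrete.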

NP-hard (Oregon-Trail Knapsack)

Slide25

Plan Generation Algorithm

Step#1, Step#2

Instantiated Unresolved Nodes: v1, v2, v4, v6, v7, v10, v13, v15, v16, v21

Uninstantiated Blocks: R1, R2, R4, R5, R6, R8, R9

[Figure: second row of nodes v1, v2, v6, v10, v16]

Slide26

Plan Generation Algorithm

Step#3

If ( ) > ( ) ... else return ( ) and ( )

[Figure: blocks R1, R8, R6, R2; nodes v1, v2, v6, v10, v16; nodes v1, v2, v10, v30; nodes v30, v32, v34, v36, v38, v40, v42, v45, v47, v48]

Slide27

Lazy Resolution with Workflow

How to resolve v1?

[Figure: v1 passed to Resolve, and v1 passed through the Workflow of v1; each path outputs duplicate or distinct]

Slide28

Contribution of Functions

For each function:

Positive Contribution ∈ [0,1]: The amount of positive evidence that the function is expected to provide when applied to a duplicate pair of entities.

Negative Contribution ∈ [0,1]: The amount of negative evidence that the function is expected to provide when applied to a distinct pair of entities.

Slide29

Workflow Generation

Compute a utility value for each function.

Sort functions in decreasing order based on their utility values.
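The two steps above (score every function, then sort in decreasing order of utility) can be sketched as follows. The utility formula is missing from the transcript, so the one used here, expected contribution divided by cost, is purely an assumption; `SimFunction`, `utility`, `generate_workflow`, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimFunction:
    name: str
    positive: float  # positive contribution in [0, 1]
    negative: float  # negative contribution in [0, 1]
    cost: float      # cost of applying the function, assumed > 0


def utility(f: SimFunction, p_dup: float) -> float:
    """Assumed utility: expected evidence per unit cost, given P(duplicate)."""
    expected = p_dup * f.positive + (1.0 - p_dup) * f.negative
    return expected / f.cost


def generate_workflow(functions: List[SimFunction], p_dup: float) -> List[SimFunction]:
    """Sort functions in decreasing order of utility."""
    return sorted(functions, key=lambda f: utility(f, p_dup), reverse=True)


functions = [
    SimFunction("title_sim", positive=0.9, negative=0.1, cost=2.0),
    SimFunction("year_sim", positive=0.2, negative=0.9, cost=2.0),
    SimFunction("abstract_sim", positive=0.5, negative=0.5, cost=4.0),
]

likely_dup = [f.name for f in generate_workflow(functions, p_dup=0.9)]
likely_distinct = [f.name for f in generate_workflow(functions, p_dup=0.1)]
print(likely_dup, likely_distinct)
```

Under this assumed utility the best order changes with the node's duplicate probability, which is one plausible reason to prepare several workflows in advance.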

 

Slide30

Workflow Generation

Workflows: ( )

Values of ( ): ( )

Pre-generate workflows: ( )

Slide31

Resolution Cost

Given a node v_i and workflow ( ):

Resolution Cost when v_i is duplicate.

Resolution Cost when v_i is distinct.

Slide32

Experimental Evaluation

CiteSeerX Dataset

Papers (P) = (Title, Abstract, Keywords, Authors, Venue).

Authors (A) = (Name, Email, Affiliation, Address, Paper).

Venues (U) = (Name, Year, Pages, Papers).

Entity Set | Number of Entities | Blocking Functions | Similarity Functions | Resolve Function
P | 30,000 | 2 | 3 | Naïve Bayes
A | 83,152 | 1 | 4 | Naïve Bayes
U | 30,000 | 1 | 3 | Naïve Bayes

Slide33

CiteSeerX - Blocking

Papers (P): First three characters of title. Last three characters of title.

Authors (A): First character of first name appended with the first two characters of last name.

Venues (U): First two characters of name appended with the first two digits of year.

Slide34

Experimental Evaluation

Algorithms:

DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.

Static: S. E. Whang et al. Joint entity resolution. ICDE.

Full: No lazy resolution strategy.

Random: Lazy resolution strategy but with random order.

[Figure: blocks of R (R1, R4, R5, …), T (T6, T1, T3) and S (S2, S6, S5)]

Slide35

Time vs. Recall

Slide36

Lazy Resolution with Workflow

Metric | Our Approach | Random | Full
Execution Time (sec) | 300.33 | 396.55 | 542.43
Plan Generation | 4.76% | 3.81% | 2.58%
Plan Execution | 95.11% | 96.17% | 97.40%
Reading Blocks | 4.70% | 3.75% | 2.90%
Graph Creation | 8.40% | 6.25% | 4.72%
Node Resolution | 82.01% | 86.17% | 89.78%

Plan Execution consists of Reading Blocks, Creating Nodes, and Resolving Nodes.

Slide37

Lazy Resolution with Workflow #2

Number of Sim Functions | P | A | U
Set_1 | 3 | 4 | 3
Set_2 | 2 | 2 | 2
Set_3 | 1 | 1 | 1

Slide38

Correlation Among Sim Functions

Slide39

Synthetic Dataset

Parameter | Description | Value
n | Number of entity-sets | 4
s | Number of entities per entity-set | 20,000
b | Number of blocks per entity-set | 100
d | Fraction of duplicate pairs in each entity-set | 0.2
z | Zipfian distribution exponent | 0.15
l | Probability of generating an influence | 0.3

Slide40

Duplicate Distribution

[Figure panels: Z = 0.00, Z = 0.15, Z = 0.30]

Slide41

Number Of Influences

[Figure panels: l = 0.0, l = 0.3, l = 0.6]

Slide42

Conclusion

Progressive Approach to Relational ER.

Cost and benefit model for generating a resolution plan.

Lazy resolution strategy to resolve nodes at the least cost.

Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.

Slide43

Questions