Progressive Approach to Relational Entity Resolution

Uploaded On 2020-06-23



Presentation Transcript

Slide1

Progressive Approach to Relational Entity Resolution

Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

Slide2

Progressive ER

Slide3

Relational Dataset

Paper

Id   Title                                     Authors    Venue
p1   Transaction Support in Read Optimized …   {a1, a2}   u1
p2   Read Optimized File System Designs: …     {a1}       u2
p3   Transaction Support in Read Optimized …   {a3, a4}   u3
p4   Berkeley DB: A Retrospective ..           {a3}       u4

Author

Id   Name                  Papers
a1   Marge Seltzer         {p1, p2}
a2   Michael Stonebraker   {p1}
a3   Margo I. Seltzer      {p3, p4}
a4   M. Stonebraker        {p3}

Venue

Id   Name                    Papers
u1   Very Large Data Bases   {p1}
u2   ICDE Conference         {p2}
u3   VLDB                    {p3}
u4   IEEE Data Eng. Bull     {p4}

Slide4

Graph Representation

[Figure: graph with nodes (p1, p3) and (u1, u3), annotated "Resolve" and "duplicate".]
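The graph view can be illustrated with a minimal Python sketch. It assumes that each graph node is a pair of records that might be duplicates, and that a paper pair is linked to the venue pair and author pairs it references. The `pair` and `build_pair_graph` helpers are hypothetical names, and enumerating all paper pairs (rather than only pairs that share a block) is a simplification, not the authors' construction.

```python
from itertools import combinations

# Toy Paper table transcribed from the slide (titles omitted).
papers = {
    "p1": {"authors": {"a1", "a2"}, "venue": "u1"},
    "p2": {"authors": {"a1"}, "venue": "u2"},
    "p3": {"authors": {"a3", "a4"}, "venue": "u3"},
    "p4": {"authors": {"a3"}, "venue": "u4"},
}


def pair(x, y):
    """Return an order-independent key for a candidate duplicate pair."""
    return tuple(sorted((x, y)))


def build_pair_graph(papers):
    """Map each paper-pair node to the venue/author pair nodes it touches."""
    graph = {}
    for p, q in combinations(sorted(papers), 2):
        neighbors = set()
        vp, vq = papers[p]["venue"], papers[q]["venue"]
        if vp != vq:
            neighbors.add(pair(vp, vq))
        for a in papers[p]["authors"]:
            for b in papers[q]["authors"]:
                if a != b:
                    neighbors.add(pair(a, b))
        graph[pair(p, q)] = neighbors
    return graph


graph = build_pair_graph(papers)
# Deciding that (p1, p3) is a duplicate is evidence for these related pairs.
print(sorted(graph[("p1", "p3")]))
```

In this sketch the node (p1, p3) is adjacent to (u1, u3), which mirrors the slide's example: resolving the two papers as duplicates gives a reason to revisit the two venue names.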

Slide5

Problem Definition

Given a relational dataset D and a cost budget BG, our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.

Slide6

ER Graph

[Figure: ER graph over blocks R1, R2, S1, S2, T1, T2.]

Slide7

ER Graph

[Figure: ER graph over blocks R1, R2, S1, S2, T1, T2, with nodes v1 to v12.]

Slide8

Partially Constructed Graph

[Figure: partially constructed graph over blocks R1, R2, S1, S2, T1, T2, with nodes v1 to v12.]

Slide9

Resolution Windows

Window 1, Window 2, ..., Window n (cost budget BG)

Plan Generation.

Plan Execution.

Resolution Plan:

Set of blocks to be instantiated.

Set of nodes to be resolved.

Lazy Resolution Strategy
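The window structure can be sketched as a loop that alternates plan generation and plan execution until the budget runs out. This is a minimal illustration under stated assumptions: `generate_plan`, `execute_plan`, `run_windows`, the per-window budget, and the node costs are all hypothetical, and the placeholder planner simply takes nodes in order rather than applying any benefit model.

```python
def generate_plan(pending, window_budget):
    """Hypothetical placeholder planner: take pending nodes in order while they fit."""
    plan, spent = [], 0
    for name, cost in pending.items():
        if spent + cost <= window_budget:
            plan.append(name)
            spent += cost
    return plan


def execute_plan(plan, pending):
    """Pretend to resolve each planned node and return the total cost spent."""
    return sum(pending.pop(name) for name in plan)


def run_windows(node_costs, total_budget, window_budget):
    """Alternate plan generation and plan execution until the budget is used up."""
    pending = dict(node_costs)
    remaining = total_budget
    history = []
    while pending and remaining > 0:
        plan = generate_plan(pending, min(window_budget, remaining))
        if not plan:  # nothing affordable is left
            break
        remaining -= execute_plan(plan, pending)
        history.append(plan)
    return history, remaining


history, left = run_windows(
    {"v1": 2, "v2": 3, "v3": 4, "v4": 1}, total_budget=8, window_budget=5
)
print(history, left)
```

Keeping the planner behind a single function makes it easy to swap the placeholder for a benefit-aware strategy without touching the loop.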

Slide10

Plan Cost and Benefit

Slide11

Node Benefit

Direct Benefit

Indirect Benefit

[Figure: nodes v1 to v6; label: State.]

Slide12

Plan Generation Phase

Generate a plan such that its benefit is maximized within the available cost budget.

This problem is NP-hard (Oregon-Trail Knapsack).

Benefit-vs-Cost Analysis:

Each node and block has an updated cost and benefit.
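Because the exact problem is NP-hard, a practical planner has to rely on a heuristic. The sketch below shows a generic greedy benefit-to-cost heuristic for a plain 0/1 knapsack. It is a swapped-in textbook technique, not the authors' algorithm: the function name `greedy_plan`, the item costs and benefits, and the budget are assumptions, and it ignores the dependency between nodes and the blocks that must be instantiated first.

```python
def greedy_plan(items, budget):
    """Pick items by descending benefit/cost ratio while the budget allows.

    items maps a name to (cost, benefit); every cost must be positive.
    """
    ranked = sorted(items, key=lambda n: items[n][1] / items[n][0], reverse=True)
    plan, spent, gained = [], 0.0, 0.0
    for name in ranked:
        cost, benefit = items[name]
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
            gained += benefit
    return plan, spent, gained


# Hypothetical costs and benefits for four unresolved nodes.
items = {"v1": (4.0, 8.0), "v2": (2.0, 6.0), "v3": (3.0, 3.0), "v4": (5.0, 5.5)}
plan, spent, gained = greedy_plan(items, budget=7.0)
print(plan, spent, gained)
```

A ratio-based greedy choice is fast and easy to re-run as costs and benefits are updated, but it carries no optimality guarantee for the general problem.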

Slide13

Plan Generation Algorithm

Step#1, Step#2

[Figure: uninstantiated blocks R1, R2, R4, R5, R6, R8, R9; instantiated unresolved nodes v1, v2, v4, v6, v7, v10, v13, v15, v16, v21; a second node list v1, v2, v6, v10, v16.]

Slide14

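Plan Generation Algorithm

Step#3

[Figure: blocks R1, R8, R6, R2; node lists v1, v2, v6, v10, v16 and v1, v2, v10, v30; further nodes v30, v32, v34, v36, v38, v40, v42, v45, v47, v48. The operands of the "If ... > ... else return ... and ..." rule were lost in extraction.]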

Slide15

Experimental Evaluation

CiteSeerX Dataset

Papers (P) = (Title, Abstract, Keywords, Authors, Venue).

Authors (A) = (Name, Email, Affiliation, Address, Paper).

Venues (U) = (Name, Year, Pages, Papers).

Entity Set   Number of Entities   Blocking Functions   Similarity Functions   Resolve Function
P            30,000               2                    3                      Naïve Bayes
A            83,152               1                    4                      Naïve Bayes
U            30,000               1                    3                      Naïve Bayes

Slide16

Experimental Evaluation

Algorithms:

DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.

Static: S. E. Whang et al. Joint entity resolution. ICDE.

Full: No lazy resolution strategy.

Random: Lazy resolution strategy but with random order.

[Figure: entity sets R, T, S with blocks R1, R4, R5; T6, T1, T3, …; S2, S6, S5, ….]

Slide17

Time vs. Recall

Slide18

                       Our Approach   Random    Full
Execution Time (sec)   300.33         396.55    542.43
Plan Generation        4.76%          3.81%     2.58%
Plan Execution         95.11%         96.17%    97.40%

Lazy Resolution with Workflow

                       Our Approach   Random    Full
Execution Time (sec)   300.33         396.55    542.43
Plan Generation        4.76%          3.81%     2.58%
Reading Blocks         4.70%          3.75%     2.90%
Graph Creation         8.40%          6.25%     4.72%
Node Resolution        82.01%         86.17%    89.78%

Reading Blocks. Creating Nodes. Resolving Nodes.
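The percentages are easier to compare once converted into seconds. The short sketch below does that arithmetic from the figures on this slide; the variable and function names are hypothetical, and the only assumption is that each percentage is a share of the execution time reported for the same approach.

```python
# Execution time in seconds, as reported on the slide.
total_sec = {"Our Approach": 300.33, "Random": 396.55, "Full": 542.43}

# Share of execution time per phase, in percent, as reported on the slide.
share_pct = {
    "Plan Generation": {"Our Approach": 4.76, "Random": 3.81, "Full": 2.58},
    "Reading Blocks": {"Our Approach": 4.70, "Random": 3.75, "Full": 2.90},
    "Graph Creation": {"Our Approach": 8.40, "Random": 6.25, "Full": 4.72},
    "Node Resolution": {"Our Approach": 82.01, "Random": 86.17, "Full": 89.78},
}


def seconds(phase, approach):
    """Convert a phase's percentage share into absolute seconds."""
    return total_sec[approach] * share_pct[phase][approach] / 100.0


def execution_share(approach):
    """Sum the three plan-execution sub-phases (everything except planning)."""
    return sum(
        share_pct[phase][approach]
        for phase in ("Reading Blocks", "Graph Creation", "Node Resolution")
    )


for approach in total_sec:
    row = ", ".join(f"{phase}: {seconds(phase, approach):.2f}s" for phase in share_pct)
    print(f"{approach}: {row}")
```

The three sub-phase shares add up to the Plan Execution share in the first table (for example 4.70% + 8.40% + 82.01% = 95.11% for Our Approach), so the second table is a finer breakdown of the same measurements.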

Slide19

Conclusion

Progressive Approach to Relational ER.

Cost and benefit model for generating a resolution plan.

Lazy resolution strategy to resolve nodes with the least amount of cost.

Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.

Slide20

Questions