Slide1
Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra
Slide2
Progressive ER
Slide3
Relational Dataset

Paper
Id | Title | Authors | Venue
p1 | Transaction Support in Read Optimized … | {a1, a2} | u1
p2 | Read Optimized File System Designs: … | {a1} | u2
p3 | Transaction Support in Read Optimized … | {a3, a4} | u3
p4 | Berkeley DB: A Retrospective .. | {a3} | u4

Author
Id | Name | Papers
a1 | Marge Seltzer | {p1, p2}
a2 | Michael Stonebraker | {p1}
a3 | Margo I. Seltzer | {p3, p4}
a4 | M. Stonebraker | {p3}

Venue
Id | Name | Papers
u1 | Very Large Data Bases | {p1}
u2 | ICDE Conference | {p2}
u3 | VLDB | {p3}
u4 | IEEE Data Eng. Bull | {p4}
Slide4
Graph Representation
[Figure: graph with nodes (u1, u3) and (p1, p3); labels: "Resolve", "duplicate", "duplicate"]
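The graph representation can be illustrated with a small sketch. It assumes, as a simplification not stated on the slides, that every pair of same-type records becomes a node and that a paper pair is linked to the pair formed by its venues; the function name `build_pair_graph` and the dictionary layout are hypothetical. Titles are abbreviated as on the Relational Dataset slide.

```python
from itertools import combinations

# Toy records copied from the Relational Dataset slide (titles abbreviated).
papers = {
    "p1": {"title": "Transaction Support in Read Optimized", "venue": "u1"},
    "p2": {"title": "Read Optimized File System Designs", "venue": "u2"},
    "p3": {"title": "Transaction Support in Read Optimized", "venue": "u3"},
    "p4": {"title": "Berkeley DB: A Retrospective", "venue": "u4"},
}
venues = ["u1", "u2", "u3", "u4"]


def build_pair_graph(papers, venues):
    """Return (nodes, edges) for a pair graph.

    Assumption: one node per pair of same-type records (no blocking),
    and one edge from each paper pair to the pair of their venues.
    """
    paper_nodes = list(combinations(sorted(papers), 2))
    venue_nodes = list(combinations(sorted(venues), 2))
    venue_node_set = set(venue_nodes)
    edges = []
    for a, b in paper_nodes:
        venue_pair = tuple(sorted((papers[a]["venue"], papers[b]["venue"])))
        # Skip pairs whose papers share a venue id: no venue pair to link to.
        if venue_pair[0] != venue_pair[1] and venue_pair in venue_node_set:
            edges.append(((a, b), venue_pair))
    return paper_nodes + venue_nodes, edges


nodes, edges = build_pair_graph(papers, venues)
```

In this toy graph the node (p1, p3) is connected to the node (u1, u3), which matches the two nodes shown on the slide: a decision on one of them can inform the other.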
Slide5
Problem Definition
Given a relational dataset D and a cost budget BG, our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.
Slide6
ER Graph
[Figure: regions labeled R1, S1, R2, T2, T1, S2]
Slide7
ER Graph
[Figure: regions labeled R1, S1, R2, T2, T1, S2, containing nodes v1 through v12]
Slide8
Partially Constructed Graph
[Figure: regions labeled R1, S1, T1, R2, T2, S2, containing nodes v1 through v12]
Slide9
Resolution Windows
Budget BG: Window 1, Window 2, …, Window n
Plan Generation.
Resolution Plan ( ):
Set of blocks ( ) to be instantiated.
Set of nodes ( ) to be resolved.
Plan Execution ( ).
Lazy Resolution Strategy
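The window structure can be sketched as a loop that alternates plan generation and plan execution until the budget is spent. This is only an illustrative sketch: the function names, the fixed `window_size`, the first-fit stub planner, and the toy costs are all assumptions, not details taken from the slides.

```python
def run_progressive(budget, window_size, generate_plan, execute_plan):
    """Spend `budget` cost units in windows of at most `window_size` units.

    `generate_plan(window_budget)` returns a plan (empty if nothing fits);
    `execute_plan(plan)` runs it and returns the cost actually used.
    """
    spent = 0.0
    windows = 0
    while spent < budget:
        window_budget = min(window_size, budget - spent)
        plan = generate_plan(window_budget)
        if not plan:
            break  # nothing left that fits in the remaining budget
        used = execute_plan(plan)
        if used <= 0:
            break  # guard against a plan that makes no progress
        spent += used
        windows += 1
    return spent, windows


# Hypothetical pending work: (node name, resolution cost).
pending = [("v1", 2.0), ("v2", 3.0), ("v3", 4.0), ("v4", 1.0)]


def generate_plan(window_budget):
    # Stub planner: first-fit in list order, ignoring benefit.
    plan, total = [], 0.0
    for item in pending:
        if total + item[1] <= window_budget:
            plan.append(item)
            total += item[1]
    return plan


def execute_plan(plan):
    # Stub executor: mark the planned nodes as done and report their cost.
    for item in plan:
        pending.remove(item)
    return sum(cost for _, cost in plan)


spent, windows = run_progressive(8.0, 5.0, generate_plan, execute_plan)
```

With these toy numbers the loop runs two windows and stops with one node unresolved, because that node no longer fits in the remaining budget.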
Slide10
Plan Cost and Benefit
Slide11
Node Benefit
[Figure: nodes v1 through v6, annotated with "Direct Benefit", "Indirect Benefit", and "State"]
Slide12
Plan Generation Phase
Generate a plan such that: … is maximized.
NP-hard (Oregon-Trail Knapsack).
Benefit-vs-Cost Analysis: Each node and block has an updated cost and benefit.
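Because the exact problem is NP-hard, a benefit-vs-cost analysis suggests a greedy heuristic. The sketch below is a generic knapsack-style greedy that ranks candidates by benefit per unit of cost; it is not the authors' algorithm, and the candidate tuples, their numbers, and the name `greedy_plan` are hypothetical. It also ignores blocks and dependencies between nodes.

```python
def greedy_plan(candidates, window_budget):
    """Pick candidates in decreasing benefit/cost order while they fit.

    `candidates` is a list of (name, cost, benefit) with cost > 0.
    Returns (chosen names, total cost, total benefit).
    """
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    plan, total_cost, total_benefit = [], 0.0, 0.0
    for name, cost, benefit in ranked:
        if total_cost + cost <= window_budget:
            plan.append(name)
            total_cost += cost
            total_benefit += benefit
    return plan, total_cost, total_benefit


# Hypothetical costs and benefits for four unresolved nodes.
candidates = [
    ("v1", 2.0, 6.0),   # ratio 3.0
    ("v2", 4.0, 4.0),   # ratio 1.0
    ("v6", 1.0, 2.5),   # ratio 2.5
    ("v10", 3.0, 6.0),  # ratio 2.0
]
plan, cost, benefit = greedy_plan(candidates, window_budget=6.0)
```

A ratio-based greedy is fast but not optimal in general, which is the usual trade-off when the exact selection problem is a knapsack variant.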
Slide13
Plan Generation Algorithm
Step#1
Step#2
[Figure: uninstantiated blocks R1, R2, R4, R5, R6, R8, R9; instantiated unresolved nodes v1, v2, v4, v6, v7, v10, v13, v15, v16, v21; a second node list v1, v2, v6, v10, v16]
Slide14
Plan Generation Algorithm
Step#3: If … > …; else return … and …
[Figure: blocks R1, R8, R6, R2, …; node lists v1, v2, v6, v10, v16; v1, v2, v10, v30; v30, v32, v34, v36, v38, v40, v42, v45, v47, v48]
Slide15
Experimental Evaluation
CiteSeerX Dataset

Papers (P) = (Title, Abstract, Keywords, Authors, Venue).
Authors (A) = (Name, Email, Affiliation, Address, Paper).
Venues (U) = (Name, Year, Pages, Papers).

Entity Set | Number of Entities | Blocking Functions | Similarity Functions | Resolve Function
P | 30,000 | 2 | 3 | Naïve Bayes
A | 83,152 | 1 | 4 | Naïve Bayes
U | 30,000 | 1 | 3 | Naïve Bayes
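A blocking function groups records that share a cheap key, so that only records in the same block are compared. The sketch below is a generic illustration using the author names from the Relational Dataset slide; the lowercase-last-name key and the helper names are assumptions, not the blocking functions used in the experiments.

```python
from collections import defaultdict


def block_by(records, key_fn):
    """Group record ids into blocks keyed by `key_fn(record)`."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    return dict(blocks)


def last_name_key(rec):
    # Hypothetical blocking key: the lowercased last token of the name.
    return rec["name"].split()[-1].lower()


authors = {
    "a1": {"name": "Marge Seltzer"},
    "a2": {"name": "Michael Stonebraker"},
    "a3": {"name": "Margo I. Seltzer"},
    "a4": {"name": "M. Stonebraker"},
}
blocks = block_by(authors, last_name_key)
```

Here a1 and a3 land in one block and a2 and a4 in another, so only two candidate pairs need a similarity comparison instead of all six.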
Slide16
Experimental Evaluation
Algorithms:
DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
Static: S. E. Whang et al. Joint entity resolution. ICDE.
Full: No lazy resolution strategy.
Random: Lazy resolution strategy but with random order.
[Figure: R, T, and S, with R1, R4, R5, …; T6, T1, T3, …; S2, S6, S5, …]
Slide17
Time vs. Recall
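Recall in this setting is commonly measured over duplicate pairs: the fraction of true duplicate pairs that have been found so far. The sketch below assumes that pairwise definition, which the slides do not spell out. The ground-truth list is hypothetical: the pairs (p1, p3) and (u1, u3) are marked as duplicates on the Graph Representation slide, while the two author pairs are assumed for illustration.

```python
def pairwise_recall(found_pairs, true_pairs):
    """Fraction of true duplicate pairs present in `found_pairs`.

    Pairs are unordered, so ("p3", "p1") matches ("p1", "p3").
    """
    true_set = {frozenset(p) for p in true_pairs}
    if not true_set:
        return 1.0  # nothing to find, so nothing was missed
    found_set = {frozenset(p) for p in found_pairs}
    return len(found_set & true_set) / len(true_set)


# Assumed ground truth for the toy dataset (author pairs are hypothetical).
true_pairs = [("p1", "p3"), ("u1", "u3"), ("a1", "a3"), ("a2", "a4")]
# Pairs reported after some amount of resolution time.
found_pairs = [("p3", "p1"), ("u1", "u3")]
recall = pairwise_recall(found_pairs, true_pairs)
```

Evaluating this function at successive points in time would give one point per measurement on a time-vs-recall curve.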
Slide18
Lazy Resolution with Workflow

Metric | Our Approach | Random | Full
Execution Time (sec) | 300.33 | 396.55 | 542.43
Plan Generation | 4.76% | 3.81% | 2.58%
Plan Execution | 95.11% | 96.17% | 97.40%

Metric | Our Approach | Random | Full
Execution Time (sec) | 300.33 | 396.55 | 542.43
Plan Generation | 4.76% | 3.81% | 2.58%
Reading Blocks | 4.70% | 3.75% | 2.90%
Graph Creation | 8.40% | 6.25% | 4.72%
Node Resolution | 82.01% | 86.17% | 89.78%

Workflow: Reading Blocks. Creating Nodes. Resolving Nodes.
Slide19
Conclusion
Progressive Approach to Relational ER.
Cost and benefit model for generating a resolution plan.
Lazy resolution strategy to resolve nodes with the least amount of cost.
Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.
Slide20
Questions