Slide1
Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra
Slide2
Data Processing Flow
Data → Analysis → Decision
Quality of Data → Quality of Analysis → Quality of Decision
Data Quality Challenges
Erroneous Values.
Missing Values.
Duplication.
…
Data Cleaning Accounts For: 80%
Slide3
Entity Resolution (ER)
Digital World: Entities
Real World: Objects
Michael Jordan: Basketball Player
Michael Jordan: Professor @ UCB
Slide4
Id   Product Name                     Price
p1   IPad Two 16GB WiFi               $490
p2   IPad 2nd Generation 16GB WiFi    $469
p3   Apple Phone 4 32 GB              $545
p4   Apple iPod Shuffle 2GB           $49
p5   IPhone 4th Generation 32GB       $520
Entity Resolution (ER)
Clusters: {P1, P2}, {P4}, {P3, P5}
Slide5
Blocking
Dataset
Id   Product Name                     Price
p1   IPad Two 16GB WiFi               $490
p2   IPad 2nd Generation 16GB WiFi    $469
p3   Apple Phone 4 32 GB              $545
p4   Apple iPod Shuffle 2GB           $49
p5   IPhone 4th Generation 32GB       $520
Blocking functions: BF1, BF2, …
BF = 1st char of product name
Blocks: {p1, p2, p5}, {p3, p4}, …
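The blocking step above can be illustrated with a minimal Python sketch. It applies the slide's blocking function (first character of the product name) to the five products; the helper names `first_char_key` and `build_blocks` are hypothetical and not taken from the presentation.

```python
from collections import defaultdict

# Product names copied from the slide's dataset.
products = {
    "p1": "IPad Two 16GB WiFi",
    "p2": "IPad 2nd Generation 16GB WiFi",
    "p3": "Apple Phone 4 32 GB",
    "p4": "Apple iPod Shuffle 2GB",
    "p5": "IPhone 4th Generation 32GB",
}


def first_char_key(name):
    """Blocking function from the slide: 1st char of the product name."""
    return name[0]


def build_blocks(records, blocking_function):
    """Group record ids by blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for record_id, name in records.items():
        blocks[blocking_function(name)].append(record_id)
    return dict(blocks)


blocks = build_blocks(products, first_char_key)
print(blocks)  # {'I': ['p1', 'p2', 'p5'], 'A': ['p3', 'p4']}
```

Only pairs inside the same block are later compared, so this function yields four candidate pairs instead of the ten pairs of an all-against-all comparison.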
Slide6
Similarity Computation
Id   Product Name              Price
p3   Apple Phone 4 32 GB       $545
p4   Apple iPod Shuffle 2GB    $49
Similarity Functions: similarity of (p3, p4) = 0.125
Resolve Function: Resolve(p3, p4) = duplicate, distinct, or uncertain
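A small Python sketch of the two steps, under stated assumptions. The slide does not name its similarity function; token-level Jaccard is used here only because it happens to reproduce the 0.125 shown for p3 and p4. The thresholds in `resolve` are hypothetical, since the slide does not specify its resolve function at this point.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def resolve(similarity, low=0.3, high=0.8):
    """Hypothetical threshold rule mapping a similarity to a decision."""
    if similarity >= high:
        return "duplicate"
    if similarity <= low:
        return "distinct"
    return "uncertain"


p3 = "Apple Phone 4 32 GB"
p4 = "Apple iPod Shuffle 2GB"
score = jaccard(p3, p4)  # one shared token ("apple") out of eight distinct tokens
print(score)             # 0.125
print(resolve(score))    # distinct
```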
Slide7
Progressive ER
Slide8
Real-time Analysis of Big Data
Applications: Event Monitoring, Situational Awareness, Real-time Alerts, Semantic Search, Anti-terrorism
Data Cleaning
Slide9
How Progressive ER Helps
Progressive Data Cleaning
Progressive Analysis
Continually Refined Results
Slide10
Relational Dataset
Venue
Id   Name                     Papers
u1   Very Large Data Bases    {p1}
u2   ICDE Conference          {p2}
u3   VLDB                     {p3}
u4   IEEE Data Eng. Bull      {p4}
Paper
Id   Title                                      Authors     Venue
p1   Transaction Support in Read Optimized …    {a1, a2}    u1
p2   Read Optimized File System Designs: …      {a1}        u2
p3   Transaction Support in Read Optimized …    {a3, a4}    u3
p4   Berkeley DB: A Retrospective ..            {a3}        u4
Author
Id   Name                   Papers
a1   Marge Seltzer          {p1, p2}
a2   Michael Stonebraker    {p1}
a3   Margo I. Seltzer       {p3, p4}
a4   M. Stonebraker         {p3}
Slide11
Graph Representation
Figure: nodes (p1, p3) and (u1, u3); Resolve labels each node as duplicate.
Slide12
Problem Definition
Given a relational dataset D and a cost budget BG, our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.
Slide13
ER Graph
Figure: blocks R1, R2, S1, S2, T1, T2.
Slide14
ER Graph
Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12.
Slide15
Partially Constructed Graph
Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12.
Slide16
Overview
BG: Window 1, Window 2, …, Window n
Each window: Plan Generation, then Plan Execution.
Resolution Plan:
Set of blocks to be instantiated.
Set of nodes to be resolved.
Slide17
Plan Execution Phase
Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12.
Slide18
Plan Cost and Benefit
Slide19
Node Benefit
Direct Benefit
Indirect Benefit
Figure: nodes v1 to v6; State.
Slide20
Probability Estimation
Noisy-OR Model: Cause, Cause, …, Cause → Effect
Effect:
Node vi being duplicate.
Causes:
Influencing duplicate nodes of vi.
Block to which vi belongs.
Fraction of duplicate pairs in the block.
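For reference, a noisy-OR model combines independent causes as P(effect) = 1 - (1 - leak) · Π(1 - p_i). The sketch below is a generic implementation of that formula. How the presentation turns influencing nodes and the block's duplicate fraction into cause probabilities is not shown on the slide, so treating the block fraction as one more cause, and all the numbers, are assumptions for illustration.

```python
def noisy_or(cause_probabilities, leak=0.0):
    """P(effect) under a noisy-OR model.

    cause_probabilities: for each active cause, the probability that this
    cause alone would produce the effect.
    leak: probability that the effect occurs with no active cause.
    """
    prob_no_effect = 1.0 - leak
    for p in cause_probabilities:
        prob_no_effect *= 1.0 - p
    return 1.0 - prob_no_effect


# Hypothetical inputs: two influencing duplicate nodes of v_i, plus the
# fraction of duplicate pairs in v_i's block treated as a third cause.
influence_strengths = [0.5, 0.5]
block_duplicate_fraction = 0.2
p_duplicate = noisy_or(influence_strengths + [block_duplicate_fraction])
print(round(p_duplicate, 3))  # 0.8
```

Each additional active cause can only raise the estimate, which matches the intuition that more resolved duplicate neighbors make a node more likely to be a duplicate.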
Slide21
Example
Figure: partially constructed graph over blocks R1, R2, S1, S2, T1, T2 with nodes v1 to v12.
Labels: duplicate, duplicate, duplicate, distinct. Callouts: v1, v2, S1.
Slide22
Node Impact
Figure: nodes v1 to v7.
Dependent Nodes
Nearest Nodes (K=2)
Belief Update: NP-hard.
The Nearest Nodes are not always instantiated. Why?
Slide23
Impact Model
Case #1, Case #2, Case #3
Figure: each case shows nodes v1 to v7.
Slide24
Plan Generation Phase
Generate a plan such that:
…
The plan benefit is maximized.
NP-hard (Oregon-Trail Knapsack).
Benefit-vs-Cost Analysis: Each node and block has an updated cost and benefit.
Slide25
Plan Generation Algorithm
Step#1: Instantiated Unresolved Nodes
Step#2: Uninstantiated Blocks
Figure: nodes v1, v2, v4, v6, v7, v10, v13, v15, v16, v21, then v1, v2, v6, v10, v16; blocks R1, R2, R4, R5, R6, R8, R9.
Slide26
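Plan Generation Algorithm
Step#3: If … > … else return … and …
Figure: blocks R1, R8, R6, R2, …; nodes v1, v2, v6, v10, v16; nodes v1, v2, v10, v30; nodes v30, v32, v34, v36, v38, v40, v42, v45, v47, v48.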
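The benefit-vs-cost idea behind plan generation can be sketched with a generic greedy knapsack heuristic: rank candidates by benefit per unit of cost and add them while the budget allows. This is not the three-step algorithm from the slides, whose details are elided. It also ignores the dependency that a node can only be resolved once its blocks are instantiated. The `Candidate` class and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str       # a node such as "v1" or a block such as "R1"
    cost: float     # estimated cost of resolving / instantiating it (> 0)
    benefit: float  # estimated benefit


def greedy_plan(candidates, budget):
    """Pick candidates by decreasing benefit/cost ratio within the budget."""
    plan, spent = [], 0.0
    ranked = sorted(candidates, key=lambda c: c.benefit / c.cost, reverse=True)
    for c in ranked:
        if spent + c.cost <= budget:
            plan.append(c.name)
            spent += c.cost
    return plan, spent


# Hypothetical costs and benefits, for illustration only.
candidates = [
    Candidate("v1", cost=2.0, benefit=1.0),
    Candidate("v2", cost=1.0, benefit=0.9),
    Candidate("v6", cost=4.0, benefit=1.2),
    Candidate("R1", cost=5.0, benefit=3.0),
]
plan, spent = greedy_plan(candidates, budget=8.0)
print(plan, spent)  # ['v2', 'R1', 'v1'] 8.0
```

A ratio-based greedy pass is a common approximation when the exact problem is NP-hard, as the slides state for plan generation.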
Slide27
Lazy Resolution with Workflow
How to resolve v1?
Workflow of v1: Resolve, Resolve, …
Each Resolve step may return duplicate or distinct.
Slide28
Contribution of Functions
For each function:
Positive Contribution ∈ [0,1]: The amount of positive evidence that the function is expected to provide when applied on a duplicate pair of entities.
Negative Contribution ∈ [0,1]: The amount of negative evidence that the function is expected to provide when applied on a distinct pair of entities.
Slide29
Workflow Generation
Compute a utility value for each function (for node vi):
Sort functions in a decreasing order based on their utility values.
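The sort step can be sketched as follows. The slide's utility formula did not survive extraction, so the `utility` function below is a hypothetical stand-in: the expected contribution (positive contribution weighted by the node's duplicate probability, negative contribution weighted by its complement) divided by the function's cost. Function names and numbers are also made up.

```python
from dataclasses import dataclass


@dataclass
class SimFunction:
    name: str
    cost: float      # cost of applying the function once (> 0)
    positive: float  # positive contribution, in [0, 1]
    negative: float  # negative contribution, in [0, 1]


def utility(f, p_duplicate):
    """Hypothetical utility: expected contribution per unit of cost."""
    expected = p_duplicate * f.positive + (1.0 - p_duplicate) * f.negative
    return expected / f.cost


def generate_workflow(functions, p_duplicate):
    """Sort functions in decreasing order of utility."""
    return sorted(functions, key=lambda f: utility(f, p_duplicate), reverse=True)


# Hypothetical similarity functions and numbers.
functions = [
    SimFunction("title_sim", cost=1.0, positive=0.6, negative=0.2),
    SimFunction("abstract_sim", cost=4.0, positive=0.8, negative=0.8),
    SimFunction("keyword_sim", cost=2.0, positive=0.2, negative=0.6),
]
likely_dup = [f.name for f in generate_workflow(functions, p_duplicate=0.9)]
likely_distinct = [f.name for f in generate_workflow(functions, p_duplicate=0.1)]
print(likely_dup)       # ['title_sim', 'abstract_sim', 'keyword_sim']
print(likely_distinct)  # ['keyword_sim', 'title_sim', 'abstract_sim']
```

Under this assumed utility, the order depends on how likely the node is to be a duplicate, which is one reason a workflow might be generated per node rather than once globally.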
Slide30
Workflow Generation (node vi)
Pre-generate workflows:
Values of …: …
Workflows: …
Slide31
Resolution Cost
Given a node vi and workflow:
Resolution Cost when vi is duplicate.
Resolution Cost when vi is distinct.
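To make the notion of resolution cost concrete, the sketch below simulates lazy resolution along a workflow. Functions are applied in order and their costs accumulate until the resolve rule stops returning "uncertain". The averaging rule, the thresholds, and the workflow values are hypothetical; the slide's actual cost formulas are not recoverable from the extracted text.

```python
def lazy_resolve(workflow, decide):
    """Apply workflow steps in order until decide() is no longer 'uncertain'.

    workflow: list of (cost, callable) pairs; each callable returns a similarity.
    decide: maps the similarities seen so far to
            'duplicate', 'distinct', or 'uncertain'.
    Returns (decision, total_cost).
    """
    similarities, total_cost = [], 0.0
    decision = "uncertain"
    for cost, function in workflow:
        similarities.append(function())
        total_cost += cost
        decision = decide(similarities)
        if decision != "uncertain":
            break
    return decision, total_cost


def decide_by_average(similarities, low=0.3, high=0.8):
    """Hypothetical resolve rule: threshold the mean similarity."""
    mean = sum(similarities) / len(similarities)
    if mean >= high:
        return "duplicate"
    if mean <= low:
        return "distinct"
    return "uncertain"


# Hypothetical workflow for one node: (cost, similarity function) pairs.
workflow = [(1.0, lambda: 0.5), (2.0, lambda: 0.0), (4.0, lambda: 0.9)]
result = lazy_resolve(workflow, decide_by_average)
print(result)  # ('distinct', 3.0)
```

Here the third and most expensive function is never applied, so the node costs 3.0 units instead of 7.0. The cost therefore depends on how early the evidence becomes decisive, which differs between duplicate and distinct nodes.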
Slide32
Experimental Evaluation
CiteSeerX Dataset
Papers (P) = (Title, Abstract, Keywords, Authors, Venue).
Authors (A) = (Name, Email, Affiliation, Address, Paper).
Venues (U) = (Name, Year, Pages, Papers).
Entity-set   Number of Entities   Blocking Functions   Similarity Functions   Resolve Function
P            30,000               2                    3                      Naïve Bayes
A            83,152               1                    4                      Naïve Bayes
U            30,000               1                    3                      Naïve Bayes
Slide33
CiteSeerX - Blocking
Papers (P): First three characters of title. Last three characters of title.
Authors (A): First one character of first name appended with the first two characters of last name.
Venues (U): First two characters of name appended with the first two digits of year.
Slide34
Experimental Evaluation
Algorithms:
DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
Static: S. E. Whang et al. Joint entity resolution. ICDE.
Full: No lazy resolution strategy.
Random: Lazy resolution strategy but with random order.
Figure: entity-sets R (R1, R4, R5, …), T (T6, T1, T3, …), S (S2, S6, S5, …).
Slide35
Time vs. Recall
Slide36
Lazy Resolution with Workflow
                        Our Approach   Random    Full
Execution Time (sec)    300.33         396.55    542.43
Plan Generation         4.76%          3.81%     2.58%
Plan Execution          95.11%         96.17%    97.40%
  Reading Blocks        4.70%          3.75%     2.90%
  Graph Creation        8.40%          6.25%     4.72%
  Node Resolution       82.01%         86.17%    89.78%
Plan Execution steps: Reading Blocks. Creating Nodes. Resolving Nodes.
Slide37
Lazy Resolution with Workflow #2
Number of Sim Functions
         P   A   U
Set_1    3   4   3
Set_2    2   2   2
Set_3    1   1   1
Slide38
Correlation Among Sim Functions
Slide39
Synthetic Dataset
Parameter   Description                                       Value
n           Number of entity-sets                             4
s           Number of entities per entity-set                 20,000
b           Number of blocks per entity-set                   100
d           Fraction of duplicate pairs in each entity-set    0.2
z           Zipfian distribution exponent                     0.15
l           Probability of generating an influence            0.3
Slide40
Duplicate Distribution
Panels: Z = 0.00, Z = 0.15, Z = 0.30
Slide41
Number Of Influences
Panels: l = 0.0, l = 0.3, l = 0.6
Slide42
Conclusion
Progressive Approach to Relational ER.
Cost and benefit model for generating a resolution plan.
Lazy resolution strategy to resolve nodes with the least amount of cost.
Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.
Slide43
Questions