Pei Li, University of Milan-Bicocca. Advisor: Andrea Maurino. Supervisors at AT&T Labs-Research: Xin Luna Dong, Divesh Srivastava. October 2012.
Slide1
Linking Records with Value Diversity
Pei Li
University of Milan-Bicocca
Advisor: Andrea Maurino
Supervisors @ AT&T Labs-Research: Xin Luna Dong, Divesh Srivastava
October 2012
Slide2
Some Statistics from DBLP
How many Wei Wang’s are there?
What are their authoring histories?
Slide3
Some Statistics from YellowPages
Are there any business chains?
If yes, which businesses are their members?
Slide4
Record Linkage
What is record linkage (entity resolution)?
Input: a set of records
Output: a clustering of records
A critical problem in data integration and data cleaning
“A reputation for world-class quality is profitable, a ‘business maker’.” – William E. Winkler
Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]):
assumes that records of the same entity are consistent
often focuses on different representations of the same value, e.g., “IBM” and “International Business Machines”
Slide5
New Challenges
In reality, we observe value diversity of entities
Values can evolve over time
Catholic Healthcare (1986 - 2012)
Dignity Health (2012 -)
Different records of the same group can have “local” values
Some sources may provide erroneous values
ID | Name | Address | Phone | URL
001 | F.B. Insurance | Vernon 76384 TX | 877 635-4684 | txfb-ins.com
002 | F.B. Insurance #1 | Lufkin 75901 TX | 936 634-7285 | txfb.org
003 | F.B. Insurance #5 | Cibolo 78108 TX | 877 635-4684 |
ID | Name | URL | Source
001 | Meekhof Tire Sales & Service Inc | www.meekhoftire.com | Src. 1
002 | Meekhof Tire Sales & Service Inc | www.napaautocare.com | Src. 2
Slide6
My Goal
To improve the linkage quality of integrated data with fairly high diversity
Linking temporal records [VLDB ’11] [VLDB ’12 demo] [FCS Journal ’12]
Linking records of the same group [Under preparation for SIGMOD ’13]
Slide7
Outline
Motivation
Linking temporal records
Decay
Temporal clustering
Demo
Linking records of the same group
Related work
Conclusions & Future work
Slide8
[Timeline figure: DBLP records r1-r12 placed on a year axis covering 1991 and 2004-2011]
r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r7: Dong Xin, University of Illinois
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r5: Xin Luna Dong, AT&T Labs-Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r6: Xin Luna Dong, AT&T Labs-Research
r12: Dong Xin, Microsoft Research
How many authors?
What are their authoring histories?
Slide9
[Timeline figure: records r1-r12 as on Slide8]
Ground truth: 3 authors
Slide10
[Timeline figure: records r1-r12 as on Slide8]
Solution 1: requiring high value consistency
5 authors
false negative
Slide11
[Timeline figure: records r1-r12 as on Slide8]
Solution 2: matching records with similar names
2 authors
false positive
Slide12
Opportunities
ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r12 | Dong Xin | Microsoft Research | He | 2011
Smooth transition
Seldom erratic changes
Continuity of history
Slide13
Intuitions
[Table: records r1-r12 with Name, Affiliation, Co-authors and Year, as on Slide12]
Less penalty on different values over time
Less reward on the same value over time
Consider records in time order for clustering
Slide14
Outline
Motivation
Linking temporal records
Decay
Temporal clustering
Demo
Linking records of the same group
Related work
Conclusions & Future work
Slide15
Disagreement Decay
Intuition: different values over a long time are not a strong indicator of referring to different entities.
University of Washington (01-07)
AT&T Labs-Research (07-date)
Definition (Disagreement decay)
The disagreement decay of attribute A over time ∆t is the probability that an entity changes its A-value within time ∆t.
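The definition above can be illustrated with a rough empirical estimate. The sketch below is a simplification assumed for illustration, not the learning procedure used in the talk: it counts only completed value spans (a value that was later replaced) and ignores the still-open last span. The names `full_life_spans`, `disagreement_decay` and the toy histories are hypothetical.

```python
from typing import Dict, List, Tuple

History = List[Tuple[int, str]]  # (year, value) pairs, sorted by year


def full_life_spans(history: History) -> List[int]:
    """Return how long each value was held before it changed."""
    if not history:
        return []
    spans = []
    start_year, current = history[0]
    for year, value in history[1:]:
        if value != current:
            spans.append(year - start_year)
            start_year, current = year, value
    return spans  # the last, still-open span is deliberately ignored


def disagreement_decay(histories: Dict[str, History], delta_t: int) -> float:
    """Fraction of completed value spans that lasted at most delta_t years."""
    spans = [s for h in histories.values() for s in full_life_spans(h)]
    if not spans:
        return 0.0
    return sum(1 for s in spans if s <= delta_t) / len(spans)


# Toy affiliation histories (illustrative values only).
histories = {
    "e1": [(2001, "University of Washington"),
           (2007, "AT&T Labs-Research"),
           (2010, "AT&T Labs-Research")],
    "e2": [(2004, "University of Illinois"),
           (2008, "Microsoft Research")],
}

for dt in (3, 5, 6):
    print(dt, disagreement_decay(histories, dt))
```

With these toy histories the completed spans are 6 and 4 years, so the estimate rises from 0 to 1 as ∆t grows from 3 to 6.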
Slide16
Agreement Decay
Intuition: the same value over a long time is not a strong indicator of referring to the same entity.
Adam Smith: (1723-1790)
Adam Smith: (1965-)
Definition (Agreement decay)
The agreement decay of attribute A over time ∆t is the probability that different entities share the same A-value within time ∆t.
Slide17
Decay Curves
Decay curves of address learnt from European Patent data
[Figure: curves for disagreement decay and agreement decay]
Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003
Slide18
Applying Decay
E.g.
r1 <Xin Dong, Uni. of Washington, 2004>
r2 <Xin Dong, AT&T Labs-Research, 2009>
No decayed similarity:
w(name) = w(affi.) = .5
sim(r1, r2) = .5*1 + .5*0 = .5 (Un-match)
Decayed similarity:
w(name, ∆t=5) = 1 - d_agree(name, ∆t=5) = .95
w(affi., ∆t=5) = 1 - d_disagree(affi., ∆t=5) = .1
sim(r1, r2) = (.95*1 + .1*0) / (.95 + .1) = .9 (Match)
Slide19
Applying Decay
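The decayed similarity of r1 <Xin Dong, Uni. of Washington, 2004> and r2 <Xin Dong, AT&T Labs-Research, 2009> can be reproduced with a short sketch. The function name `decayed_similarity`, the dictionary layout and the 0.5 cut-off between "agreeing" and "disagreeing" values are assumptions made for illustration; the decay values .05 and .9 are the ones implied by the weights .95 and .1 for ∆t=5.

```python
def decayed_similarity(value_sims, agree_decay, disagree_decay, agree_cutoff=0.5):
    """Weighted average of attribute similarities with decay-based weights.

    An agreeing attribute is weighted by 1 - agreement decay,
    a disagreeing attribute by 1 - disagreement decay.
    """
    numerator = 0.0
    denominator = 0.0
    for attr, sim in value_sims.items():
        if sim >= agree_cutoff:            # hypothetical cut-off for "same value"
            weight = 1.0 - agree_decay[attr]
        else:
            weight = 1.0 - disagree_decay[attr]
        numerator += weight * sim
        denominator += weight
    return numerator / denominator if denominator else 0.0


# r1 and r2 share the name (similarity 1) and differ in affiliation (similarity 0).
value_sims = {"name": 1.0, "affiliation": 0.0}
d_agree = {"name": 0.05}            # implied by w(name, dt=5) = .95
d_disagree = {"affiliation": 0.9}   # implied by w(affi., dt=5) = .1

plain = 0.5 * value_sims["name"] + 0.5 * value_sims["affiliation"]
decayed = decayed_similarity(value_sims, d_agree, d_disagree)
print(plain, round(decayed, 2))
```

Without decay the pair scores .5; with decay the disagreeing affiliation carries little weight after five years, and the score rises to about .9.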
[Table: records r1-r12 with Name, Affiliation, Co-authors and Year, as on Slide12]
All records are merged into the same cluster!!
Able to detect changes!
Slide20
Decayed Similarity & Traditional Clustering
Decay improves recall over baselines by 23-67%
Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003
Slide21
Outline
Motivation
Linking temporal records
Decay
Temporal clustering
Demo
Linking records of the same group
Related work
Conclusions & Future work
Slide22
Early Binding
Compare a new record with existing clusters
Make an eager merging decision for each record
Maintain the earliest/latest timestamp for its last value
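The steps above can be sketched as a single pass over the records in time order. This is a minimal sketch under stated assumptions, not the algorithm from the talk: the similarity function `last_value_similarity` (equal-weight exact matching against the cluster's most recent record) and the threshold are hypothetical, and no decay is applied.

```python
def early_binding(records, similarity, threshold):
    """Process records in time order; merge each one eagerly or open a new cluster."""
    clusters = []
    for rec in sorted(records, key=lambda r: r["year"]):
        best, best_sim = None, 0.0
        for cluster in clusters:
            s = similarity(rec, cluster)
            if s > best_sim:
                best, best_sim = cluster, s
        if best is not None and best_sim >= threshold:
            best.append(rec)        # eager decision: never revisited later
        else:
            clusters.append([rec])
    return clusters


def last_value_similarity(rec, cluster):
    """Hypothetical similarity: compare with the cluster's most recent record."""
    last = cluster[-1]
    name = 1.0 if rec["name"] == last["name"] else 0.0
    affi = 1.0 if rec["affiliation"] == last["affiliation"] else 0.0
    return 0.5 * name + 0.5 * affi


records = [
    {"id": "r2", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2004},
    {"id": "r7", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2004},
    {"id": "r3", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2005},
    {"id": "r9", "name": "Dong Xin", "affiliation": "Microsoft Research", "year": 2008},
]

clusters = early_binding(records, last_value_similarity, threshold=0.5)
cluster_ids = [[r["id"] for r in c] for c in clusters]
print(cluster_ids)
```

Because each decision is final, a record bound to the wrong cluster early on stays there, which is the weakness the later binding strategy addresses.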
Slide23
Early Binding
[Example: early binding applied to records r1-r12, producing clusters C1, C2, C3]
ID | Name | Affiliation | Co-authors | From | To
r2 | Xin Dong | Univ. of Washington | Halevy, Tatarinov | 2004 | 2004
r3 | Xin Dong | Univ. of Washington | Halevy | 2004 | 2005
r1 | Xin Dong | R. P. Institute | Wozny | 1991 | 1991
r7 | Dong Xin | University of Illinois | Han, Wah | 2004 | 2004
r8 | Dong Xin | University of Illinois | Wah | 2004 | 2007
r4 | Xin Luna Dong | Univ. of Washington | Halevy, Yu | 2004 | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009 | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2008 | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2009 | 2010
r12 | Dong Xin | Microsoft Research | He | 2008 | 2011
Earlier mistakes prevent later merging!!
Avoid a lot of false positives!
Slide24
Adjusted Binding
Compare earlier records with clusters created later
Proceed in EM-style
Initialization: start with the result of the initial clustering
Estimation: compute record-cluster similarity
Maximization: choose the optimal clustering
Termination: repeat until the results converge or oscillate
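The EM-style loop above can be sketched as follows. This is a simplified illustration, not the method from the talk: the maximization step here just moves each record to its best-scoring cluster, and `share_of_same_affiliation` is a hypothetical stand-in for the record-cluster similarity. The names `adjusted_binding`, `min_sim` and the toy records are assumptions.

```python
def adjusted_binding(records, initial, similarity, min_sim=0.5, max_iter=20):
    """Reassign records to clusters until the assignment converges or repeats."""
    assignment = dict(initial)               # record id -> cluster label
    seen = {frozenset(assignment.items())}   # past states, to detect oscillation
    for _ in range(max_iter):
        new_assignment = {}
        for rid, rec in records.items():
            # Estimation: similarity of the record to every cluster,
            # computed without the record itself.
            scores = {}
            for label in sorted(set(assignment.values())):
                members = [records[o] for o, l in assignment.items()
                           if l == label and o != rid]
                if members:
                    scores[label] = similarity(rec, members)
            # Maximization (per record): move to the best-scoring cluster.
            best = max(scores, key=scores.get) if scores else None
            if best is not None and scores[best] >= min_sim:
                new_assignment[rid] = best
            else:
                new_assignment[rid] = assignment[rid]
        if new_assignment == assignment:
            break                             # converged
        assignment = new_assignment
        state = frozenset(assignment.items())
        if state in seen:
            break                             # oscillating between states
        seen.add(state)
    return assignment


def share_of_same_affiliation(rec, members):
    """Hypothetical similarity: share of members with the same affiliation."""
    same = sum(1 for m in members if m["affiliation"] == rec["affiliation"])
    return same / len(members)


# Toy records; x4 starts in the wrong cluster.
records = {
    "x1": {"affiliation": "A", "year": 2004},
    "x2": {"affiliation": "A", "year": 2007},
    "x3": {"affiliation": "B", "year": 2008},
    "x4": {"affiliation": "A", "year": 2009},
    "x5": {"affiliation": "B", "year": 2009},
}
initial = {"x1": "C1", "x2": "C1", "x3": "C2", "x4": "C2", "x5": "C2"}

result = adjusted_binding(records, initial, share_of_same_affiliation)
print(result)
```

In this toy run x4 moves from C2 to C1 in the first iteration and the second iteration changes nothing, so the loop stops on convergence.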
Slide25
Adjusted Binding
Compute similarity by
Consistency: consistency in evolution of values
Continuity: continuity of records in time
[Figure: Cases 1-4, each showing the record time stamp r.t relative to the cluster time stamps C.early and C.late]
sim(r, C) = cont(r, C) * cons(r, C)
Slide26
Adjusted Binding
[Figure: records below shown with clusters C3, C4, C5]
r7: DongXin@UI, 2004
r8: DongXin@UI, 2007
r9: DongXin@MSR, 2008
r10: DongXin@UI, 2009
r11: DongXin@MSR, 2009
r12: DongXin@MSR, 2011
r10 has higher continuity with C4
r8 has higher continuity with C4
Once r8 is merged to C4, r7 has higher continuity with C4
Slide27
Adjusted Binding
[Table: records r1-r12 with Name, Affiliation, Co-authors and Year, as on Slide12, grouped into clusters C1, C2, C3]
All records are clustered correctly
Slide28
Temporal Clustering
Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003
Full algorithm has the best result
Adjusted clustering improves recall without reducing precision much
Slide29
Experimental Results
Data set | #Records | #Entities | Years
Patent | 1871 | 359 | 1978-2003
DBLP-XD | 72 | 8 | 1991-2010
DBLP-WW | 738 | 18+potpourri | 1992-2011
(a) Results of XD data
(b) Results of WW data
Slide30
Demonstration
CHRONOS: Facilitating History Discovery by Linking Temporal Records
ITIS Lab: http://www.itis.disco.unimib.it
Slide31
Outline
Motivation
Linking temporal records
Decay
Temporal clustering
Demo
Linking records of the same group
Related work
Conclusions & Future work
Slide32
Are there any business chains?
If yes, which businesses are their members?
Slide33
Ground Truth
2 chains
Slide34
Solution 1: require high value consistency
0 chains
Slide35
Solution 2: match records with the same name
1 chain
Slide36
Challenges
ID | name | phone | state | URL domain
r1 | Taco Casa | | AL | tacocasa.com
r2 | Taco Casa | 900 | AL | tacocasa.com
r3 | Taco Casa | 900 | AL | tacocasa.com, tacocasatexas.com
r4 | Taco Casa | 900 | AL |
r5 | Taco Casa | 900 | AL |
r6 | Taco Casa | 701 | TX | tacocasatexas.com
r7 | Taco Casa | 702 | TX | tacocasatexas.com
r8 | Taco Casa | 703 | TX | tacocasatexas.com
r9 | Taco Casa | 704 | TX |
r10 | Elva’s Taco Casa | | TX | tacodemar.com
Erroneous values
Different local values
Scalability: 6.8M records
Slide37
Two-Stage Linkage – Stage I
Stage I: Identify cores containing listings very likely to belong to the same chain
Require strong robustness in presence of possibly erroneous values
Graph theory
High scalability
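One way to picture the robustness requirement is as a graph check: connect listings that share a phone or URL domain, then ask whether a group stays connected when any single listing is removed. The sketch below is a hypothetical reading for illustration only; the talk does not define robustness at this level of detail, and the names `build_graph`, `is_connected` and `is_robust` are assumptions. The values come from the Taco Casa listings r1-r8.

```python
from itertools import combinations

# Each listing is reduced to the set of values that can link it to others.
records = {
    "r1": {"url:tacocasa.com"},
    "r2": {"phone:900", "url:tacocasa.com"},
    "r3": {"phone:900", "url:tacocasa.com", "url:tacocasatexas.com"},
    "r4": {"phone:900"},
    "r5": {"phone:900"},
    "r6": {"phone:701", "url:tacocasatexas.com"},
    "r7": {"phone:702", "url:tacocasatexas.com"},
    "r8": {"phone:703", "url:tacocasatexas.com"},
}


def build_graph(records):
    """Connect two listings when they share at least one value."""
    adj = {rid: set() for rid in records}
    for a, b in combinations(records, 2):
        if records[a] & records[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_connected(nodes, adj):
    """Depth-first search restricted to the given node set."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbour in adj[current] & nodes:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes


def is_robust(nodes, adj):
    """True if the group stays connected after removing any single listing."""
    nodes = set(nodes)
    return (len(nodes) >= 2
            and is_connected(nodes, adj)
            and all(is_connected(nodes - {n}, adj) for n in nodes))


adj = build_graph(records)
print(is_robust(records.keys(), adj))                      # whole group r1-r8
print(is_robust({"r1", "r2", "r3", "r4", "r5"}, adj))      # AL listings
print(is_robust({"r6", "r7", "r8"}, adj))                  # TX listings
```

The whole group r1-r8 is connected only through r3, whose second URL domain is possibly erroneous, so it fails the check, while the AL and TX groups each pass on their own.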
[Table: listings r1-r10 with name, phone, state and URL domain, as on Slide36]
Slide38
Two-Stage Linkage – Stage II
Stage II: Cluster cores and remaining records into chains.
Collect strong evidence from cores and leverage it in clustering
No penalty on local values
[Table: listings r1-r10 with name, phone, state and URL domain, as on Slide36]
Reward strong evidence
Slide39
Two-Stage Linkage – Stage II
[Same bullets and table as on the previous slide]
Reward strong evidence
Slide40
Two-Stage Linkage – Stage II
[Same bullets and table as on the previous slide]
Apply weak evidence
Slide41
Two-Stage Linkage – Stage II
[Same bullets and table as on the previous slide]
No penalty on local values
Slide42
Experimental Evaluation
Data set: 6.8M records from YellowPages.com
Effectiveness: Precision / Recall / F-measure (avg.): .96 / .96 / .96
Efficiency:
6.9 hrs for single-machine solution
40 mins for Hadoop solution
80K chains and 1M records in chains
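For readers unfamiliar with the effectiveness measures, the sketch below computes precision, recall and F-measure over pairs of records placed in the same cluster. The pairwise definition is an assumption; the slides do not say which variant was used. The names `co_clustered_pairs` and `pairwise_prf` and the toy clusterings are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(pair) for cluster in clusters
            for pair in combinations(cluster, 2)}


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall and F-measure of a clustering."""
    pred_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: r3 is placed in the wrong chain.
truth = [["r1", "r2", "r3"], ["r4", "r5"]]
predicted = [["r1", "r2"], ["r3", "r4", "r5"]]

scores = pairwise_prf(predicted, truth)
print(scores)
```

In the toy example two of the four predicted pairs are correct and two of the four true pairs are found, so all three measures equal 0.5.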
Chain name | #Stores
USPS - United States Post Office | 12,776
SUBWAY | 11,278
State Farm Insurance | 8,711
McDonald's | 7,450
Edward Jones | 6,781
Slide43
Experimental Evaluation II
Sample | #Records | #Chains | Chain size | #Single-biz records
Random | 2062 | 30 | [2, 308] | 503
AI | 2446 | 1 | 2446 | 0
UB | 322 | 7 | [2, 275] | 5
FBIns | 1149 | 14 | [33, 269] | 0
Slide44
Related Work
Record similarity:
Probabilistic linkage
Classification-based approaches: classify records by a probabilistic model [Fellegi, ’69]
Deterministic linkage
Distance-based approaches: apply a distance metric to compute the similarity of each attribute, and take the weighted sum as the record similarity [Dey, 08]
Rule-based approaches: apply domain knowledge to match records [Hernandez, 98]
Record clustering
Transitive rule [Hernandez, 98]
Optimization problem [Wijaya, 09]
…
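Two of the traditional techniques listed above, the weighted sum of attribute similarities and the transitive rule, can be combined in a few lines. This is a generic sketch, not code from the cited works: token Jaccard as the attribute metric, the weights .6/.4 and the thresholds are assumptions chosen for illustration, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def record_similarity(r1, r2, weights):
    """Weighted sum of per-attribute similarities."""
    return sum(w * jaccard(r1[attr], r2[attr]) for attr, w in weights.items())


def transitive_clusters(records, weights, threshold):
    """Link pairs above the threshold, then take the transitive closure."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    ids = list(records)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if record_similarity(records[a], records[b], weights) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for rid in ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(clusters.values())


records = {
    "r2": {"name": "Xin Dong", "affiliation": "University of Washington"},
    "r4": {"name": "Xin Luna Dong", "affiliation": "University of Washington"},
    "r5": {"name": "Xin Luna Dong", "affiliation": "AT&T Labs-Research"},
}
weights = {"name": 0.6, "affiliation": 0.4}

loose = transitive_clusters(records, weights, threshold=0.55)
strict = transitive_clusters(records, weights, threshold=0.7)
print(loose, strict)
```

With the looser threshold, r2 and r5 end up together only through r4, although their direct similarity is low; with the stricter threshold, the affiliation change splits r5 off. Neither setting takes time into account.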
Slide45
Conclusions
In some applications, record linkage needs to be tolerant of value diversity
When linking temporal records, time decay allows tolerance of evolving values
When linking group members, two-stage linkage allows leveraging strong evidence and tolerating different local values
Slide46
Future Work
Slide47
Thanks!