Linking Records with Value Diversity PowerPoint Presentation

Description

Xin Luna Dong, Database Department, AT&T Labs-Research. Collaborators: Pei Li, Andrea Maurino (Univ. of Milan-Bicocca), Songtao Guo (ATTi), Divesh Srivastava (AT&T). December, 2012.

Presentations text content in Linking Records with Value Diversity

Slide1

Linking Records with Value Diversity

Xin Luna Dong

Database Department, AT&T Labs-Research

Collaborators: Pei Li, Andrea Maurino (Univ. of Milan-Bicocca), Songtao Guo (ATTi), Divesh Srivastava (AT&T)

December, 2012

Slide2

Real Stories (I)

Slide3

Real Stories (II)

Luna’s DBLP entry

Slide4

Sorry, no entry is found for Xin Dong

Real Stories (III)

Lab visiting

Slide5

Another Example from DBLP

How many Wei Wangs are there? What are their authoring histories?

Slide6

An Example from YP.com

Are they the same business?

- A: the same business
- B: different businesses sharing the same phone#
- C: different businesses, only one correctly associated with the given phone#

Slide7

Another Example from YP.com

Are there any business chains?

If yes, which businesses are their members?

Slide8

Record Linkage

What is record linkage (entity resolution)?

- Input: a set of records
- Output: clustering of records

A critical problem in data integration and data cleaning

“A reputation for world-class quality is profitable, a ‘business maker’.” – William E. Winkler

Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]):

- assumes that records of the same entities are consistent
- often focuses on different representations of the same value, e.g., “IBM” and “International Business Machines”

Slide9

New Challenges

In reality, we observe value diversity of entities:

- Values can evolve over time, e.g., Catholic Healthcare (1986 - 2012) -> Dignity Health (2012 -)
- Different records of the same group can have “local” values
- Some sources may provide erroneous values

ID | Name | Address | Phone | URL
001 | F.B. Insurance | Vernon 76384 TX | 877 635-4684 | txfb-ins.com
002 | F.B. Insurance #1 | Lufkin 75901 TX | 936 634-7285 | txfb.org
003 | F.B. Insurance #5 | Cibolo 78108 TX | 877 635-4684 |

ID | Name | URL | Source
001 | Meekhof Tire Sales & Service Inc | www.meekhoftire.com | Src. 1
002 | Meekhof Tire Sales & Service Inc | www.napaautocare.com | Src. 2

Slide10

Our Goal

To improve the linkage quality of integrated data with fairly high diversity:

- Linking temporal records [VLDB ’11] [VLDB ’12 demo] [FCS Journal ’12]
- Linking records of the same group [Under submission]
- Linking records with erroneous values [VLDB ’10]

Slide11

Outline

- Motivation
- Linking temporal records
  - Decay
  - Temporal clustering
  - Demo
- Linking records of the same group
- Linking records with erroneous values
- Related work
- Conclusions

Slide12

[Figure: timeline from 1991 to 2011 showing twelve DBLP records]

r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r5: Xin Luna Dong, AT&T Labs-Research
r6: Xin Luna Dong, AT&T Labs-Research
r7: Dong Xin, University of Illinois
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r12: Dong Xin, Microsoft Research

How many authors?

What are their authoring histories?

Slide13

[Figure: the timeline of records r1-r12 from Slide12, with the ground-truth grouping marked]

Ground truth: 3 authors

Slide14

[Figure: the timeline of records r1-r12 from Slide12, with the grouping produced by Solution 1 marked]

Solution 1: requiring high value consistency

5 authors: false negatives

Slide15

[Figure: the timeline of records r1-r12 from Slide12, with the grouping produced by Solution 2 marked]

Solution 2: matching records with similar names

2 authors: false positives

Slide16

Opportunities

ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r12 | Dong Xin | Microsoft Research | He | 2011

Smooth transition

Seldom erratic changes

Continuity of history

Slide17

Intuitions

Less penalty on different values over time

Less reward on the same value over time

Consider records in time order for clustering

Slide19

Disagreement Decay

Intuition: different values over a long time are not a strong indicator of referring to different entities.

E.g., University of Washington (01-07), AT&T Labs-Research (07-date)

Definition (Disagreement decay): Disagreement decay of attribute A over time ∆t is the probability that an entity changes its A-value within time ∆t.

Slide20

Agreement Decay

Intuition: the same value over a long time is not a strong indicator of referring to the same entity.

E.g., Adam Smith (1723-1790), Adam Smith (1965-)

Definition (Agreement decay): Agreement decay of attribute A over time ∆t is the probability that different entities share the same A-value within time ∆t.

Slide21

Decay Curves

Decay curves of address learnt from European Patent data

[Figure: decay curves, one for disagreement decay and one for agreement decay]

Patent records: 1871

Real-world inventors: 359

In years: 1978 - 2003

Slide22

Learning Disagreement Decay

[Figure: affiliation timelines of entities E1, E2 and E3 (R. P. Institute, UW, AT&T, UIUC, MSR; years 1991-2010), with change points and last time points marked, and life spans t=1 to t=5 labelled as full or partial]

1. Full life span: [t, t_next)
   A value exists from t to t_next, for time (t_next - t)
2. Partial life span: [t, t_end+1)*
   A value exists since t, for at least time (t_end - t + 1)

Lp = {1, 2, 3}, Lf = {4, 5}

d(∆t=1) = 0/(2+3) = 0
d(∆t=4) = 1/(2+0) = 0.5
d(∆t=5) = 2/(2+0) = 1
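The three worked values on this slide are consistent with the estimate d(∆t) = |{l in Lf : l <= ∆t}| / (|Lf| + |{l in Lp : l >= ∆t}|). The Python sketch below is an illustration inferred from those numbers, not code from the talk; the function name and argument names are assumptions.

```python
def disagreement_decay(full_spans, partial_spans, dt):
    """Estimate d(dt) from observed life spans (formula inferred from the slide)."""
    # Full life spans of length <= dt are values that changed within dt.
    changed = sum(1 for span in full_spans if span <= dt)
    # Partial life spans of length >= dt are values known to last at least dt.
    unchanged = sum(1 for span in partial_spans if span >= dt)
    denominator = len(full_spans) + unchanged
    return changed / denominator if denominator else 0.0


full_spans = [4, 5]        # Lf on the slide
partial_spans = [1, 2, 3]  # Lp on the slide

for dt in (1, 4, 5):
    print(dt, disagreement_decay(full_spans, partial_spans, dt))
```

With the slide's Lf and Lp this reproduces 0, 0.5 and 1 for ∆t = 1, 4 and 5.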

Slide23

Applying Decay

E.g.

r1 <Xin Dong, Uni. of Washington, 2004>
r2 <Xin Dong, AT&T Labs-Research, 2009>

No decayed similarity:
w(name) = w(affi.) = .5
sim(r1, r2) = .5*1 + .5*0 = .5

Decayed similarity:
w(name, ∆t=5) = 1 - d_agree(name, ∆t=5) = .95
w(affi., ∆t=5) = 1 - d_disagree(affi., ∆t=5) = .1
sim(r1, r2) = (.95*1 + .1*0)/(.95 + .1) = .9

Match

Un-match
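Both similarities on this slide are weighted averages of per-attribute similarities; only the weights differ. The sketch below recomputes them in Python. The function name and dictionary layout are assumptions made for illustration, and the weights are copied from the slide rather than learnt from data.

```python
def weighted_similarity(attribute_sims, weights):
    """Weighted average of per-attribute similarities."""
    total_weight = sum(weights.values())
    return sum(weights[a] * attribute_sims[a] for a in attribute_sims) / total_weight


# r1 and r2 agree on name (similarity 1) and disagree on affiliation (similarity 0).
sims = {"name": 1.0, "affiliation": 0.0}

no_decay = weighted_similarity(sims, {"name": 0.5, "affiliation": 0.5})
# Weights after decay over a 5-year gap, as given on the slide.
decayed = weighted_similarity(sims, {"name": 0.95, "affiliation": 0.1})

print(round(no_decay, 2), round(decayed, 2))
```

The disagreement on affiliation counts for little after five years, so the decayed score rises from .5 to about .9.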

Slide24

Applying Decay

All records are merged into the same cluster!!

 Able to detect changes!

Slide25

Decayed Similarity & Traditional Clustering

Decay improves recall over baselines by 23-67%

Patent records: 1871

Real-world inventors: 359

In years: 1978 - 2003

Slide27

Early Binding

- Compare a new record with existing clusters
- Make eager merging decision for each record
- Maintain the earliest/latest timestamp for its last value
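The three steps above can be sketched as a single pass over the records in time order. This is a minimal illustration, not the authors' implementation: `toy_similarity`, the 0.5 threshold and the record layout are all assumptions, and a real system would plug in the decayed similarity instead.

```python
def early_binding(records, similarity, threshold):
    """Process records in time order and merge each one eagerly."""
    clusters = []
    for record in sorted(records, key=lambda r: r["year"]):
        # Compare the new record with every existing cluster.
        best = max(clusters, key=lambda c: similarity(record, c), default=None)
        if best is not None and similarity(record, best) >= threshold:
            best.append(record)         # eager, irreversible merge
        else:
            clusters.append([record])   # start a new cluster
    return clusters


def toy_similarity(record, cluster):
    """Hypothetical similarity: compare with the cluster's latest record only."""
    last = cluster[-1]
    name_sim = 1.0 if record["name"] == last["name"] else 0.0
    affiliation_sim = 1.0 if record["affiliation"] == last["affiliation"] else 0.0
    return 0.5 * name_sim + 0.5 * affiliation_sim


records = [
    {"id": "r1", "name": "Xin Dong", "affiliation": "R. Polytechnic Institute", "year": 1991},
    {"id": "r2", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2004},
    {"id": "r7", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2004},
]

clusters = early_binding(records, toy_similarity, threshold=0.5)
print([[r["id"] for r in c] for c in clusters])
```

Because each decision is final, a merge made on thin evidence (here r1 with r2, on name alone) cannot be undone when later records arrive.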

Slide28

Early Binding

[Figure: records added one at a time to clusters C1, C2 and C3; the rows shown on the slide are listed below]

ID | Name | Affiliation | Co-authors | From | To
r2 | Xin Dong | Univ. of Washington | Halevy, Tatarinov | 2004 | 2004
r3 | Xin Dong | Univ. of Washington | Halevy | 2004 | 2005
r1 | Xin Dong | R. P. Institute | Wozny | 1991 | 1991
r7 | Dong Xin | University of Illinois | Han, Wah | 2004 | 2004
r8 | Dong Xin | University of Illinois | Wah | 2004 | 2007
r4 | Xin Luna Dong | Univ. of Washington | Halevy, Yu | 2004 | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009 | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2008 | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2009 | 2010
r12 | Dong Xin | Microsoft Research | He | 2008 | 2011

Earlier mistakes prevent later merging!!

 Avoid a lot of false positives!

Slide29

Late Binding

Keep all evidence in record-cluster comparison

Make a global decision at the end

Facilitate with a bi-partite graph

Slide30

Late Binding

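[Figure: bipartite graph linking records r1 (XinDong@R.P.I, 1991), r2 (XinDong@UW, 2004) and r7 (DongXin@UI, 2004) to clusters C1, C2 and C3, with a probability on each edge]

r1: p(r1, C1)=1
r2: create C2; p(r2, C1)=.5, p(r2, C2)=.5
r7: create C3; p(r7, C1)=.33, p(r7, C2)=.22, p(r7, C3)=.45

Choose the possible world with highest probability

Cluster | ID | Name | Affiliation | Co-authors | Year | Probability
C1 | r1 | X.D | R.P. I. | Wozny | 1991 | 1
C1 | r2 | X.D | UW | Halevy, Tatarinov | 2004 | .5
C1 | r7 | D.X | UI | Han, Wah | 2004 | .33
C2 | r2 | X.D | UW | Halevy, Tatarinov | 2004 | .5
C2 | r7 | D.X | UI | Han, Wah | 2004 | .22
C3 | r7 | D.X | UI | Han, Wah | 2004 | .45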
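A rough Python sketch of the bookkeeping behind late binding follows. It assumes that record-cluster scores are normalised into probabilities and that each record can be decided independently at the end, which is a simplification of choosing the most probable possible world. The function names are hypothetical; the probabilities are the ones shown on the slide.

```python
def to_probabilities(scores):
    """Normalise raw record-cluster scores so that they sum to 1."""
    total = sum(scores.values())
    return {cluster: score / total for cluster, score in scores.items()}


def most_likely_assignment(evidence):
    """Pick, per record, the cluster with the highest kept probability.

    Ties are broken by insertion order in this sketch.
    """
    return {record: max(probs, key=probs.get) for record, probs in evidence.items()}


# Equal raw scores give equal probabilities, as for r2 on the slide.
print(to_probabilities({"C1": 2.0, "C2": 2.0}))

# All evidence is kept until the end instead of deciding record by record.
evidence = {
    "r1": {"C1": 1.0},
    "r2": {"C1": 0.5, "C2": 0.5},
    "r7": {"C1": 0.33, "C2": 0.22, "C3": 0.45},
}
assignment = most_likely_assignment(evidence)
print(assignment)
```

The point of the design is that no record is committed to a cluster while later records can still change the picture.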

Slide31

Late Binding

[Clusters C1 to C5 are marked on the table in the original slide]

ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r12 | Dong Xin | Microsoft Research | He | 2011
r10 | Dong Xin | University of Illinois | Ling, He | 2009

 Failed to merge C3, C4, C5

 Correctly split r1, r10 from C2

Slide32

Adjusted Binding

Compare earlier records with clusters created later

Proceed in EM-style:

- Initialization: Start with the result of early/late binding
- Estimation: Compute record-cluster similarity
- Maximization: Choose the optimal clustering
- Termination: Repeat until the results converge or oscillate
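The EM-style loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: the record IDs, cluster labels and `same_affiliation_share` are hypothetical, and "choose the optimal clustering" is reduced to moving each record to its most similar cluster.

```python
def adjusted_binding(record_ids, initial, similarity, max_rounds=20):
    """Iteratively reassign records until the clustering repeats."""
    assignment = dict(initial)
    seen = {frozenset(assignment.items())}
    for _ in range(max_rounds):
        clusters = {}
        for rid, cid in assignment.items():
            clusters.setdefault(cid, []).append(rid)
        # Estimation: score every record against every cluster (minus itself).
        # Maximization: move the record to its best-scoring cluster.
        new_assignment = {}
        for rid in record_ids:
            scores = {cid: similarity(rid, [m for m in members if m != rid])
                      for cid, members in clusters.items()}
            new_assignment[rid] = max(scores, key=scores.get)
        state = frozenset(new_assignment.items())
        if state in seen:  # converged, or oscillating back to an earlier result
            return new_assignment
        seen.add(state)
        assignment = new_assignment
    return assignment


# Hypothetical data: five records with an affiliation each.
affiliation = {"x1": "A", "x2": "A", "x3": "B", "x4": "B", "x5": "A"}


def same_affiliation_share(rid, members):
    """Toy similarity: share of cluster members with the same affiliation."""
    if not members:
        return 0.0
    return sum(affiliation[m] == affiliation[rid] for m in members) / len(members)


initial = {"x1": "K1", "x2": "K1", "x3": "K2", "x4": "K2", "x5": "K3"}
result = adjusted_binding(list(affiliation), initial, same_affiliation_share)
print(result)
```

In this toy run the singleton x5 joins the cluster whose members share its affiliation, and the loop stops once the assignment repeats.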

Slide33

Adjusted Binding

Compute similarity by

- Consistency: consistency in evolution of values
- Continuity: continuity of records in time

[Figure: four cases (Case 1 to Case 4) for the position of the record time stamp r.t relative to the cluster time stamps C.early and C.late]

sim(r, C) = cont(r, C) * cons(r, C)

Slide34

Adjusted Binding

[Figure: clusters C3, C4 and C5 over records r7 (DongXin@UI, 2004), r8 (DongXin@UI, 2007), r9 (DongXin@MSR, 2008), r10 (DongXin@UI, 2009), r11 (DongXin@MSR, 2009) and r12 (DongXin@MSR, 2011)]

- r10 has higher continuity with C4
- r8 has higher continuity with C4
- Once r8 is merged to C4, r7 has higher continuity with C4

Slide35

Adjusted Binding

[Clusters C1, C2 and C3 are marked on the table in the original slide]

ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r12 | Dong Xin | Microsoft Research | He | 2011

 Correctly cluster all records

Slide36

Temporal Clustering

Patent records: 1871

Real-world inventors: 359

In years: 1978 - 2003

Full algorithm has the best result

Adjusted Clustering improves recall without reducing precision much

Slide37

Comparison of Clustering Algorithms

Early has a lower precision

Late has a lower recall

Adjust improves over both

Slide38

Accuracy on DBLP Data – Xin Dong

Data set: Xin Dong data set from DBLP

- 72 records, 8 entities, in 1991-2010
- Compare name, affiliation, title & co-authors
- Gold standard: by manually checking

Adjust improves over baseline by 37-43%

Slide39

Error We Fixed

Records with affiliation University of Nebraska–Lincoln

Slide40

We Only Made One Mistake

Authors’ affiliations on journal papers are out of date

Slide41

Accuracy on DBLP Data (Wei Wang)

Data set: Wei Wang data set from DBLP

- 738 records, 18 entities + potpourri, in 1992-2011
- Compare name, affiliation & co-authors
- Gold standard: from DBLP + manually checking

Adjust improves over baseline by 11-15%

High precision (.98) and high recall (.97)

Slide42

Mistakes We Made

1 record @ 2006

72 records @ 2000-2011

Slide43

Mistakes We Made

Purdue University

Concordia University

Univ. of Western Ontario

Slide44

Errors We Fixed … despite some mistakes

546 records in potpourri

- Correctly merged 63 records to existing Wei Wang entries
- Wrongly merged 61 records
  - 26 records: due to missing department information
  - 35 records: due to high similarity of affiliation, e.g., Northwest University of Science & Technology vs. Northeast University of Science & Technology

Precision and recall of .94 with consideration of these records

Slide45

Demonstration

CHRONOS: Facilitating History Discovery by Linking Temporal Records

ITIS Lab: http://www.itis.disco.unimib.it

Slide47

Are there any business chains?

If yes, which businesses are their members?

Slide48

Ground Truth

2 chains

Slide49

Solution 1: require high value consistency

0 chains
Slide50

Solution 2: match records with the same name

1 chain
Slide51

Challenges

ID | name | phone | state | URL domain
r1 | Taco Casa | | AL | tacocasa.com
r2 | Taco Casa | 900 | AL | tacocasa.com
r3 | Taco Casa | 900 | AL | tacocasa.com, tacocasatexas.com
r4 | Taco Casa | 900 | AL |
r5 | Taco Casa | 900 | AL |
r6 | Taco Casa | 701 | TX | tacocasatexas.com
r7 | Taco Casa | 702 | TX | tacocasatexas.com
r8 | Taco Casa | 703 | TX | tacocasatexas.com
r9 | Taco Casa | 704 | TX |
r10 | Elva’s Taco Casa | | TX | tacodemar.com

- Erroneous values
- Different local values
- Scalability: 18M records

Slide52

Two-Stage Linkage – Stage I

Stage I: Identify cores containing listings very likely to belong to the same chain

- Require robustness in presence of possibly erroneous values -> graph theory
- High scalability
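To see why robustness matters, the sketch below builds a graph over the Taco Casa listings from the Challenges slide, with an edge whenever two listings share a phone number or a URL domain, and then checks which single listing would split a group if it were removed. This is a simplified stand-in chosen for illustration, not the authors' core-identification algorithm; the edge rule and all function names are assumptions.

```python
from itertools import combinations

records = {
    "r1": {"phone": None, "urls": {"tacocasa.com"}},
    "r2": {"phone": "900", "urls": {"tacocasa.com"}},
    "r3": {"phone": "900", "urls": {"tacocasa.com", "tacocasatexas.com"}},
    "r4": {"phone": "900", "urls": set()},
    "r5": {"phone": "900", "urls": set()},
    "r6": {"phone": "701", "urls": {"tacocasatexas.com"}},
    "r7": {"phone": "702", "urls": {"tacocasatexas.com"}},
    "r8": {"phone": "703", "urls": {"tacocasatexas.com"}},
    "r9": {"phone": "704", "urls": set()},
    "r10": {"phone": None, "urls": {"tacodemar.com"}},
}


def build_graph(records):
    """Connect two listings if they share a phone number or a URL domain."""
    graph = {rid: set() for rid in records}
    for a, b in combinations(records, 2):
        same_phone = records[a]["phone"] is not None and records[a]["phone"] == records[b]["phone"]
        shared_url = bool(records[a]["urls"] & records[b]["urls"])
        if same_phone or shared_url:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def components(graph, nodes):
    """Connected components of the graph restricted to `nodes`."""
    remaining, result = set(nodes), []
    while remaining:
        start = remaining.pop()
        group, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            for neighbour in graph[current] & remaining:
                remaining.discard(neighbour)
                group.add(neighbour)
                frontier.append(neighbour)
        result.append(group)
    return result


def cut_records(graph, group):
    """Listings whose removal disconnects the rest of the group."""
    return sorted(r for r in group if len(components(graph, group - {r})) > 1)


graph = build_graph(records)
biggest = max(components(graph, records), key=len)
print(sorted(biggest), cut_records(graph, biggest))
```

Plain connectivity puts r1 to r8 in one group, but the group is held together only by r3, the single listing that carries both URL domains. Requiring a core to stay connected when any one listing is removed would keep one possibly erroneous value from merging the AL and TX listings.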

Slide53

Two-Stage Linkage – Stage II

Stage II: Cluster cores and remaining records into chains.

- Collect strong evidence from cores and leverage in clustering
- No penalty on local values

Reward strong evidence

Slide54

Two-Stage Linkage – Stage II

Stage II: Cluster cores and remaining records into chains.

- Collect strong evidence from cores and leverage in clustering
- No penalty on local values

Reward strong evidence

Slide55

Two-Stage Linkage – Stage II

Stage II: Cluster cores and remaining records into chains.

- Collect strong evidence from cores and leverage in clustering
- No penalty on local values

Apply weak evidence

Slide56

Two-Stage Linkage – Stage II

Stage II: Cluster cores and remaining records into chains.

- Collect strong evidence from cores and leverage in clustering
- No penalty on local values

No penalty on local values

Slide57

Experimental Evaluation

Data set: 18M records from YP.com

- Effectiveness: Precision / Recall / F-measure (avg.): .96 / .96 / .96
- Efficiency: 8.3 hrs for single-machine solution, 40 mins for Hadoop solution
- .6M chains and 2.7M listings in chains

Chain name | # Stores
SUBWAY | 21,912
Bank of America | 21,727
U-Haul | 21,638
USPS - United States Post Office | 19,225
McDonald's | 17,289

Slide58

Experimental Evaluation II

Sample | #Records | #Chains | Chain size | #Single-biz records
Random | 2062 | 30 | [2, 308] | 503
AI | 2446 | 1 | 2446 | 0
UB | 322 | 7 | [2, 275] | 5
FBIns | 1149 | 14 | [33, 269] | 0

Slide60

Limitations of Current Solution

SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way
s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W.
s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way
s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s7 | MS Corp. | xxx-1255 | 1 Microsoft Way
s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s8 | MS Corp. | xxx-1255 | 1 Microsoft Way
s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s10 | MS Corp. | xxx-0500 | 2 Sylvan Way

Locally resolving conflicts for linked records may overlook important global evidence

Erroneous values may prevent correct matching

Traditional techniques may fall short when exceptions to the uniqueness constraints exist

(Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)

(Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.)
Slide61

Our Solution

- Perform linkage and fusion simultaneously: able to identify incorrect values from the beginning, so can improve linkage
- Make global decisions: consider sources that associate a pair of values in the same record, so can improve fusion
- Allow a small number of violations to capture possible exceptions in the real world
Slide62

Clustering Performance

MDM: Our Model:

Precision | Recall | F-measure
0.946 | 0.963 | 0.954

Precision | Recall | F-measure
0.981 | 0.868 | 0.923

Slide63

Example I (True Positive)

# | SRC_ID | SRC | NAME | PHONE# | ADDRESS
1 | 40430735 | A | Yepes Olga Lucia DDS | (818) 242-9595 | 1217 S CENTRAL AVE
2 | 17003624 | CI | Yepes Olga Lucia DDS | (818) 242-9595 | 1217 S CENTRAL AVE
3 | 17003624 | SP | Yepes Olga Lucia DDS | (818) 242-9595 | 1217 S CENTRAL AVE
4 | 37977223 | V | Olga Lucia Dds | (818) 242-9595 | 1217 S CENTRAL AVE
5 | 12318966 | V | Olga Lucia DDS | (818) 242-9595 | 1217 S CENTRAL AVE
6 | 247896 | CS | Yepes, Olga Lucia, Dds - Olga Yepes Professional Dental | (818) 242-9595 | 1217 S CENTRAL AVE

MDM clusters

Cluster1: YP_ID = 9622348 [1,2,3,4,5]
Yepes Olga Lucia DDS, (818) 242-9595, 1217 S CENTRAL AVE

Cluster2: YP_ID = 22548385 [6]
Yepes, Olga Lucia, Dds - Olga Yepes Professional Dentall, (818) 242-9595, 1217 S CENTRAL AVE

Our cluster

Cluster1:
CLUSTER REPRESENTATIVES={Yepes Olga Lucia DDS,8182429595,1217 S CENTRAL AVE}
BUSINESS_NAME(s): Yepes, Olga Lucia, Dds - Olga Yepes Professional Dental|Yepes Olga Lucia DDS|Yepes Olga Lucia Dds
PHONE(s): 8182429595
ADDRESS(es): 1217 S CENTRAL AVE

Slide64

Example II (True Positive)

# | SRC_ID | SRC | NAME | PHONE# | ADDRESS
1 | 12317074 | V | Standard Parking Corporation | 8189565880 | 330 N BRAND BLVD
2 | 37975426 | V | Standard Parking Corporation | 8189565880 | 330 N BRAND BLVD
3 | 145031720 | SP | Standard Parking Corporation | 8189565880 | 330 N BRAND BL
4 | 37975400 | V | Standard Parking Corp of Calif | 8185458560 | 330 N BRAND BLVD
5 | 12317051 | V | Standard Parking Corp of Calif | 8185458560 | 330 N BRAND BLVD
6 | 17138241 | SP | Standard Parking | 8185458560 | 330 N BRAND BL
7 | 12636915 | A | Standard Parking Corporation | 8189565880 | 330 N BRAND BLVD

MDM clusters

Cluster1: YP_ID = 2304258 [1,2,3]
Standard Parking Corporation (null) (818) 956-5880

Cluster2: YP_ID = 8037494 [4,5,6,7]
Standard Parking Corporation 330 N Brand Blvd (818) 545-8560

Our cluster

Cluster1:
CLUSTER REPRESENTATIVES={Standard Parking Corporation, 8189565880, 330 N BRAND BLVD}
BUSINESS_NAME(s): Standard Parking Corp of Calif | Standard Parking | Standard Parking Corporation
PHONE(s): 8189565880
ADDRESS(es): 330 N BRAND BLVD

Slide65

Example III (True Positive)

# | SRC_ID | SRC | NAME | PHONE# | ADDRESS
1 | 151827586 | D | Brandwood Hotel | 8182443820 | 33912 N BRAND BLVD
2 | 151827586 | A | Brandwood Hotel | 8182443820 | 3391 2 N BRAND BLVD
3 | 245891 | CS | Brentwood Hotel | 8182443820 | 339 1/2 N BRAND BLVD
4 | 136879332 | D | Brandwood Hotel | 8182443820 | 339 1/2 N BRAND BLVD
5 | 12316985 | V | Brandwood Hotel | 8182443820 | 339 1/2 N BRAND BLVD
6 | 37975338 | V | Brandwood Hotel | 8182443820 | 339 1/2 N BRAND BLVD
7 | 136879332 | SP | Brandwood Hotel | 8182443820 | 339 1-2 N BRAND BL
8 | 2031962 | A | Brandwood Hotel | 8182443820 | 339 1/2 N BRAND BLVD
9 | 159061355 | A | Brandwood Hotel | 8182443820 | 302 N BRAND BLVD
10 | 159061355 | A | Brandwood Hotel | 8182443820 | 302 N BRAND BLVD

MDM clusters

Cluster1: YP_ID = 20464165 [1,2]
Brandwood Hotel (null) (818) 244-3820

Cluster2: YP_ID = 1045190 [3,4,5,6,7,8]
Brandwood Hotel 339 1/2 N Brand Blvd (818) 244-3820

Cluster3: YP_ID = 17959938 [9,10]
Brandwood Hotel 302 N Brand Blvd (818) 244-3820

Our cluster

Cluster1:
CLUSTER REPRESENTATIVES={Brandwood Hotel, 8182443820, 339 1/2 N BRAND BLVD}
BUSINESS_NAME(s): Brandwood Hotel|Brentwood Hotel
PHONE(s): 8182443820
ADDRESS(es): 33912 N BRAND BLVD|3391 2 N BRAND BLVD|339 1/2 N BRAND BLVD|339 1-2 N BRAND BL

Slide66

Example IV (False Positive)

# | SRC_ID | SRC | NAME | PHONE# | ADDRESS
1 | 247195 | CS | Gwynn Allen Chevrolet | (818) 240-5720 | 1400 S BRAND BLVD
2 | 24963507 | VLT | Allen Gwynn Chevrolet | (818) 240-5720 | 1400 S BRAND BLVD
3 | 25807138 | VLT | Allen Gwynn Chevrolet | (818) 551-7266 | 1400 S BRAND BLVD
4 | 147986010 | SP | Allen Gwynn Chevrolet | (818) 241-0440 | 1400 S BRAND BLVD
5 | 147986009 | SP | Allen Gwynn Chevrolet | (818) 240-2878 | 1400 S BRAND BLVD
6 | 200901140 | JPMW61CMR | Allen Gwynn Chevrolet | (888) 799-7733 | 1400 S BRAND BLVD
7 | 37977470 | VLT | Chevrolet Authorized Sales & Service Allen Gwynn Chevrolet | (818) 551-7266 | 1400 S BRAND BLVD
8 | 22779608 | VLT | Chevrolet Authorized Sales & Service /Allen Gwynn Chevrolet | (818) 551-7266 | 1400 S BRAND BLVD
9 | 12319256 | VLT | Gwynn Allen Chevrolet | (818) 240-5720 | 1400 S BRAND BLVD
10 | 12319255 | VLT | Chevrolet Authorized Sales & Service | (818) 240-5720 | 1400 S BRAND BLVD
11 | 144348375 | SP | Chevy Authorized Sales & Service | (818) 551-7266 | 1400 S BRAND BLVD
12 | 85774433 | SP | Chevy Authorized Sales & Service | (818) 551-7266 | 1400 S BRAND BLVD
13 | 67270550 | AMA | Allen Gwynn Chevrolet | (818) 240-0000 | 1400 S BRAND BLVD
14 | 22779606 | VLT | Allen Gwynn Chevrolet | (818) 551-7266 | 1400 S BRAND BLVD
15 | 21348765 | VLT | Allen Gwynn Chevrolet | (818) 242-2232 | 1400 S BRAND BLVD
16 | 12319301 | VLT | Allen Gwynn Chevrolet | (818) 240-0000 | 1400 S BRAND BLVD
17 | 147049159 | SP | Allen Gwynn Chevrolet | (818) 242-2232 | 1400 S BRAND BL
18 | 147137314 | SP | Allen Gwynn Chevrolet | (818) 240-5720 | 1400 S BRAND BL
19 | 42595980 | CS | Chevrolet-Allen Gwynn | (818) 240-5612 | 1400 S BRAND BLVD
20 | 19561543 | SP | Chevrolet-Allen Gwynn | (818) 240-5612 | 1400 S BRAND BLVD
21 | 143813191 | SP | Chevrolet-Allen Gwynn | (818) 240-5612 | 1400 S BRAND BL

Slide67

Example V (False Positive)

# | SRC_ID | SRC | NAME | PHONE# | ADDRESS
1 | 37973654 | VLT | Geo Systems of Calif. Inc. | (818) 500-9533 | 312 WESTERN AVE
2 | 12315143 | VLT | Geo Systems of Calif. Inc. | (818) 500-9533 | 312 WESTERN AVE
3 | 143812833 | SP | Geo Systems of Calif. Inc. | (818) 500-9533 | 312 WESTERN AVE
4 | 12315142 | VLT | Cal Geosystems Inc. | (818) 500-9533 | 312 WESTERN AVE
5 | 85156451 | SP | Cal. Geosystems Inc. | (818) 500-9533 | 312 WESTERN AVE
6 | 12315274 | VLT | Geosystems Of California | (818) 500-9533 | 1545 VICTORY BLVD
7 | 37973770 | VLT | Geosystems of California | (818) 500-9533 | 1545 VICTORY BLVD
8 | 144127258 | SP | Calif. Geo-Systems Inc | (818) 500-9533 |
9 | 143812831 | SP | Calif Geo-Systems Inc | (818) 500-9533 |
10 | 685180616 | AMA | Cal Geosystems Inc | (818) 500-9533 | 1545 VICTORY BLVD
11 | 685180617 | AMA | Calif Geo Systems Inc See Geo Systems of Calif Inc | (818) 500-9533 | 1545 VICTORY BLVD

Slide68

Related Work

Record similarity:

- Probabilistic linkage
  - Classification-based approaches: classify records by probabilistic model [Fellegi, ’69]
- Deterministic linkage
  - Distance-based approaches: apply a distance metric to compute the similarity of each attribute, and take the weighted sum as record similarity [Dey, 08]
  - Rule-based approaches: apply domain knowledge to match records [Hernandez, 98]

Record clustering:

- Transitive rule [Hernandez, 98]
- Optimization problem [Wijaya, 09]
- …

Slide69

Conclusions

In some applications record linkage needs to be tolerant of value diversity:

- When linking temporal records, time decay allows tolerance on evolving values
- When linking group members, two-stage linkage allows leveraging strong evidence and allows tolerance on different local values

Slide70

Thanks!
