Linking Records with Value Diversity

Uploaded On 2016-03-27



Presentation Transcript

Slide1

Linking Records with Value Diversity

Pei Li

University of Milan – Bicocca

Advisor: Andrea Maurino

Supervisors @ AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava

October 2012

Slide2

Some Statistics from DBLP

How many Wei Wangs are there?

What are their authoring histories?

Slide3

Some Statistics from YellowPages

Are there any business chains?

If yes, which businesses are their members?

Slide4

Record Linkage

What is record linkage (entity resolution)?

Input: a set of records

Output: a clustering of records

A critical problem in data integration and data cleaning

“A reputation for world-class quality is profitable, a ‘business maker’.” – William E. Winkler

Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]):

assumes that records of the same entity are consistent

often focuses on different representations of the same value, e.g., “IBM” and “International Business Machines”

Slide5

New Challenges

In reality, we observe value diversity of entities

Values can evolve over time: Catholic Healthcare (1986 - 2012) → Dignity Health (2012 -)

Different records of the same group can have “local” values

ID | Name | Address | Phone | URL
001 | F.B. Insurance | Vernon 76384 TX | 877 635-4684 | txfb-ins.com
002 | F.B. Insurance #1 | Lufkin 75901 TX | 936 634-7285 | txfb.org
003 | F.B. Insurance #5 | Cibolo 78108 TX | 877 635-4684 |

Some sources may provide erroneous values

ID | Name | URL | Source
001 | Meekhof Tire Sales & Service Inc | www.meekhoftire.com | Src. 1
002 | Meekhof Tire Sales & Service Inc | www.napaautocare.com | Src. 2

Slide6

My Goal

To improve the linkage quality of integrated data with fairly high diversity

linking temporal records [VLDB ’11] [VLDB ’12 demo] [FCS Journal ’12]

linking records of the same group [Under preparation for SIGMOD ’13]

Slide7

Outline

Motivation

Linking temporal records

Decay

Temporal clustering

Demo

Linking records of the same group

Related work

Conclusions & Future work

Slide8

[Figure: twelve DBLP records placed on a timeline (1991, 2004 - 2011)]

r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r5: Xin Luna Dong, AT&T Labs-Research
r6: Xin Luna Dong, AT&T Labs-Research
r7: Dong Xin, University of Illinois
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r12: Dong Xin, Microsoft Research

How many authors?

What are their authoring histories?

Slide9

[Figure: timeline of records r1-r12, as on Slide 8]

Ground truth: 3 authors

Slide10

[Figure: timeline of records r1-r12, as on Slide 8]

Solution 1: requiring high value consistency

5 authors

false negative

Slide11

[Figure: timeline of records r1-r12, as on Slide 8]

Solution 2: matching records w. similar names

2 authors

false positive

Slide12

Opportunities

ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r12 | Dong Xin | Microsoft Research | He | 2011

Smooth transition

Seldom erratic changes

Continuity of history

Slide13

Intuitions

[Table: records r1-r12, as on Slide 12]

Less penalty on different values over time

Less reward on the same value over time

Consider records in time order for clustering

Slide14

Outline

Motivation

Linking temporal records

Decay

Temporal clustering

Demo

Linking records of the same group

Related work

Conclusions & Future work

Slide15

Disagreement Decay

Intuition: different values over a long time are not a strong indicator of referring to different entities.

University of Washington (01-07)

AT&T Labs-Research (07-date)

Definition (Disagreement decay)

Disagreement decay of attribute A over time ∆t is the probability that an entity changes its A-value within time ∆t.

Slide16

Agreement Decay

Intuition: the same value over a long time is not a strong indicator of referring to the same entity.

Adam Smith: (1723-1790)

Adam Smith: (1965-)

Definition (Agreement decay)

Agreement decay of attribute A over time ∆t is the probability that different entities share the same A-value within time ∆t.
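Both decay definitions are probabilities, so they can be estimated from records whose entities are already known. The Python sketch below is a simplified pair-counting estimator and only an assumption about how this might be done; the slides do not spell out the learning procedure. The function name `estimate_decay` and the sample data are hypothetical.

```python
from itertools import combinations


def estimate_decay(records, delta_t):
    """Estimate (disagreement decay, agreement decay) for one attribute.

    records: list of (entity_id, value, year) tuples with known entities.
    Simplifying assumption: look at record pairs whose time gap is at most
    delta_t and count how often values differ (same entity) or coincide
    (different entities).
    """
    same_total = same_diff = 0      # pairs from the same entity
    other_total = other_equal = 0   # pairs from different entities
    for (e1, v1, t1), (e2, v2, t2) in combinations(records, 2):
        if abs(t1 - t2) > delta_t:
            continue
        if e1 == e2:
            same_total += 1
            same_diff += v1 != v2
        else:
            other_total += 1
            other_equal += v1 == v2
    disagreement = same_diff / same_total if same_total else 0.0
    agreement = other_equal / other_total if other_total else 0.0
    return disagreement, agreement


# Hypothetical labeled records: (entity, affiliation, year)
labeled = [
    ("luna", "University of Washington", 2004),
    ("luna", "University of Washington", 2005),
    ("luna", "AT&T Labs-Research", 2009),
    ("dong", "University of Illinois", 2004),
    ("dong", "Microsoft Research", 2008),
]
d_disagree, d_agree = estimate_decay(labeled, delta_t=5)
```

On this toy input most same-entity pairs within five years carry different affiliations, so the estimated disagreement decay is high, while no two different entities ever share an affiliation.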

Slide17

Decay Curves

Decay curves of address learnt from European Patent data

[Figure: disagreement decay and agreement decay curves]

Patent records: 1871

Real-world inventors: 359

In years: 1978 - 2003

Slide18

Applying Decay

E.g.

r1 <Xin Dong, Uni. of Washington, 2004>

r2 <Xin Dong, AT&T Labs-Research, 2009>

No decayed similarity:

w(name) = w(affi.) = .5

sim(r1, r2) = .5*1 + .5*0 = .5 (Un-match)

Decayed similarity:

w(name, ∆t=5) = 1 - d_agree(name, ∆t=5) = .95

w(affi., ∆t=5) = 1 - d_disagree(affi., ∆t=5) = .1

sim(r1, r2) = (.95*1 + .1*0)/(.95 + .1) = .9 (Match)

Slide19

Applying Decay
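The worked example for r1 (2004) and r2 (2009) can be reproduced with a few lines of Python. This is a minimal sketch, assuming binary per-attribute similarity (1 if the values agree, 0 otherwise) and the decay values implied by the slide's weights (d_agree(name, 5) = .05, d_disagree(affi., 5) = .9). The function names are hypothetical.

```python
def decayed_weight(values_agree, d_agree, d_disagree):
    """Weight of one attribute: 1 - agreement decay if the two values agree,
    1 - disagreement decay if they differ."""
    return 1 - d_agree if values_agree else 1 - d_disagree


def weighted_similarity(attr_sims, weights):
    """Weighted average of per-attribute similarities."""
    total = sum(weights.values())
    return sum(weights[a] * attr_sims[a] for a in weights) / total


# r1 and r2 agree on name and disagree on affiliation.
attr_sims = {"name": 1.0, "affiliation": 0.0}

# Without decay: fixed, equal weights.
plain = weighted_similarity(attr_sims, {"name": 0.5, "affiliation": 0.5})

# With decay over a gap of 5 years (decay values assumed from the slide).
# An attribute that can only agree or only disagree here gets 0.0 for the
# decay that does not apply to it.
weights = {
    "name": decayed_weight(True, d_agree=0.05, d_disagree=0.0),
    "affiliation": decayed_weight(False, d_agree=0.0, d_disagree=0.9),
}
decayed = weighted_similarity(attr_sims, weights)
```

The disagreeing affiliation keeps only a small weight after five years, so it pulls the overall similarity down far less than in the undecayed case.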

[Table: records r1-r12, as on Slide 12]

All records are merged into the same cluster!!

Able to detect changes!

Slide20

Decayed Similarity & Traditional Clustering

Decay improves recall over baselines by 23-67%

Patent records: 1871

Real-world inventors: 359

In years: 1978 - 2003

Slide21

Outline

Motivation

Linking temporal records

Decay

Temporal clustering

Demo

Linking records of the same group

Related work

Conclusions & Future work

Slide22

Early Binding

Compare a new record with existing clusters

Make eager merging decision for each record

Maintain the earliest/latest timestamp for its last value
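The early-binding steps above can be sketched as a single greedy pass over the records in time order. This is an illustrative sketch, not the talk's implementation: the `similarity` plug-in, the `threshold` parameter, and the toy name-based similarity are assumptions, and the bookkeeping of earliest/latest timestamps per value is left out.

```python
def early_binding(records, similarity, threshold):
    """Greedy single-pass clustering of records in time order.

    records: list of dicts with at least a "year" key.
    similarity: function (record, cluster) -> float in [0, 1]; a hypothetical
        plug-in, e.g. a decayed similarity against the cluster's records.
    threshold: minimum similarity needed to merge (assumed parameter).
    """
    clusters = []
    for record in sorted(records, key=lambda r: r["year"]):
        best, best_sim = None, threshold
        for cluster in clusters:
            s = similarity(record, cluster)
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            clusters.append([record])   # no existing cluster is similar enough
        else:
            best.append(record)         # eager merge, never revisited
    return clusters


def name_similarity(record, cluster):
    # Toy similarity for the demo: 1.0 if the name equals that of the
    # cluster's most recent record, else 0.0.
    return 1.0 if record["name"] == cluster[-1]["name"] else 0.0


records = [
    {"id": "r2", "name": "Xin Dong", "year": 2004},
    {"id": "r7", "name": "Dong Xin", "year": 2004},
    {"id": "r3", "name": "Xin Dong", "year": 2005},
    {"id": "r8", "name": "Dong Xin", "year": 2007},
]
clusters = early_binding(records, name_similarity, threshold=0.5)
ids = [[r["id"] for r in c] for c in clusters]
```

Because each merge is final, a wrong early decision cannot be undone later, which is the weakness the following slides address.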

Slide23

Early Binding

ID | Name | Affiliation | Co-authors | From | To
r2 | Xin Dong | Univ. of Washington | Halevy, Tatarinov | 2004 | 2004
r3 | Xin Dong | Univ. of Washington | Halevy | 2004 | 2005
r1 | Xin Dong | R. P. Institute | Wozny | 1991 | 1991
r7 | Dong Xin | University of Illinois | Han, Wah | 2004 | 2004
r8 | Dong Xin | University of Illinois | Wah | 2004 | 2007
r4 | Xin Luna Dong | Univ. of Washington | Halevy, Yu | 2004 | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009 | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2008 | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2009 | 2010
r12 | Dong Xin | Microsoft Research | He | 2008 | 2011

[Figure: the records above are grouped into clusters C1, C2, C3]

Earlier mistakes prevent later merging!!

Avoid a lot of false positives!

Slide24

Adjusted Binding

Compare earlier records with clusters created later

Proceed in EM-style

Initialization: Start with the result of initialized clustering

Estimation: Compute record-cluster similarity

Maximization: Choose the optimal clustering

Termination: Repeat until the results converge or oscillate

Slide25

Adjusted Binding

Compute similarity by

Consistency: consistency in evolution of values

Continuity: continuity of records in time

[Figure: four cases (Case 1 to Case 4) for the position of the record time stamp r.t relative to the cluster time stamps C.early and C.late]

sim(r, C) = cont(r, C) * cons(r, C)
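The EM-style refinement with sim(r, C) = cont(r, C) * cons(r, C) can be sketched as repeated reassignment of records to their most similar cluster. This is a simplified sketch under stated assumptions: the greedy per-record move stands in for "choose the optimal clustering", `min_sim` and `max_rounds` are invented parameters, and the toy `cont` and `cons` functions are placeholders for the talk's continuity and consistency measures. The starting clusters for r7-r12 are hypothetical.

```python
def adjusted_binding(records, assignment, cont, cons, min_sim=0.2, max_rounds=10):
    """records: dict id -> record; assignment: dict id -> cluster label."""
    assignment = dict(assignment)
    for _ in range(max_rounds):            # bounded in case results oscillate
        changed = False
        for rid, rec in records.items():
            # Current clusters, leaving the record itself out.
            clusters = {}
            for other_id, label in assignment.items():
                if other_id != rid:
                    clusters.setdefault(label, []).append(records[other_id])
            # Estimation: sim(r, C) = cont(r, C) * cons(r, C)
            sims = {label: cont(rec, members) * cons(rec, members)
                    for label, members in clusters.items()}
            if not sims:
                continue
            current = sims.get(assignment[rid], 0.0)
            best = max(sims, key=sims.get)
            # Maximization (greedy): move only on a strict, sufficient gain.
            if sims[best] > current and sims[best] >= min_sim:
                assignment[rid] = best
                changed = True
        if not changed:                    # converged
            break
    return assignment


def cont(rec, members):
    # Toy continuity: closer in time to some cluster record is better.
    gap = min(abs(rec["year"] - m["year"]) for m in members)
    return 1.0 / (1.0 + gap)


def cons(rec, members):
    # Toy consistency: share of cluster records with the same name.
    return sum(m["name"] == rec["name"] for m in members) / len(members)


records = {
    "r7": {"name": "Dong Xin", "year": 2004},
    "r8": {"name": "Dong Xin", "year": 2007},
    "r9": {"name": "Dong Xin", "year": 2008},
    "r10": {"name": "Dong Xin", "year": 2009},
    "r11": {"name": "Dong Xin", "year": 2009},
    "r12": {"name": "Dong Xin", "year": 2011},
}
start = {"r7": "C3", "r8": "C3", "r9": "C4", "r10": "C5", "r11": "C4", "r12": "C4"}
final = adjusted_binding(records, start, cont, cons)
```

With these toy measures, r8 and r10 move to C4 in the first round and r7 follows in the second, after which nothing changes.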

Slide26

Adjusted Binding

[Figure: records r7 (DongXin@UI-2004), r8 (DongXin@UI-2007), r9 (DongXin@MSR-2008), r10 (DongXin@UI-2009), r11 (DongXin@MSR-2009), r12 (DongXin@MSR-2011) and clusters C3, C4, C5]

r10 has higher continuity with C4

r8 has higher continuity with C4

Once r8 is merged to C4, r7 has higher continuity with C4

Slide27

Adjusted Binding

[Table: records r1-r12 with the columns of Slide 12, listed as r1, r2-r6, r7-r12 and grouped into clusters C1, C2, C3]

Correctly cluster all records

Slide28

Temporal Clustering

Patent records: 1871

Real-world inventors: 359

In years: 1978 - 2003

Full algorithm has the best result

Adjusted Clustering improves recall without reducing precision much

Slide29

Experimental Results

Data set | #Records | #Entities | Years
Patent | 1871 | 359 | 1978-2003
DBLP-XD | 72 | 8 | 1991-2010
DBLP-WW | 738 | 18+potpourri | 1992-2011

[Figure: (a) Results of XD data; (b) Results of WW data]

Slide30

Demonstration

CHRONOS: Facilitating History Discovery by Linking Temporal Records

ITIS Lab, http://www.itis.disco.unimib.it

Slide31

Outline

Motivation

Linking temporal records

Decay

Temporal clustering

Demo

Linking records of the same group

Related work

Conclusions & Future work

Slide32

Are there any business chains?

If yes, which businesses are their members?

Slide33

Ground Truth

2 chains

Slide34

Solution 1: Require high value consistency

0 chains

Slide35

Solution 2: Match records w. same name

1 chain

Slide36

Challenges

ID | name | phone | state | URL domain
r1 | Taco Casa | | AL | tacocasa.com
r2 | Taco Casa | 900 | AL | tacocasa.com
r3 | Taco Casa | 900 | AL | tacocasa.com, tacocasatexas.com
r4 | Taco Casa | 900 | AL |
r5 | Taco Casa | 900 | AL |
r6 | Taco Casa | 701 | TX | tacocasatexas.com
r7 | Taco Casa | 702 | TX | tacocasatexas.com
r8 | Taco Casa | 703 | TX | tacocasatexas.com
r9 | Taco Casa | 704 | TX |
r10 | Elva’s Taco Casa | | TX | tacodemar.com

Erroneous values

Different local values

Scalability: 6.8M Records

Slide37

Two-Stage Linkage – Stage I

Stage I: Identify cores containing listings very likely to belong to the same chain

Require strong robustness in presence of possibly erroneous values → Graph theory

High Scalability

[Table: records r1-r10, as on Slide 36]

Slide38

Two-Stage Linkage – Stage II

Stage II: Cluster cores and remaining records into chains.

Collect strong evidence from cores and leverage in clustering

No penalty on local values

[Table: records r1-r10, as on Slide 36]

Reward strong evidence

Slide39

Two-Stage Linkage – Stage II

[Stage II text and table as on Slide 38]

Reward strong evidence

Slide40

Two-Stage Linkage – Stage II

[Stage II text and table as on Slide 38]

Apply weak evidence

Slide41

Two-Stage Linkage – Stage II

[Stage II text and table as on Slide 38]

No penalty on local values

Slide42

Experimental Evaluation

Data set: 6.8M records from YellowPages.com

Effectiveness: Precision / Recall / F-measure (avg.): .96 / .96 / .96

Efficiency:

6.9 hrs for single-machine solution

40 mins for Hadoop solution

80K chains and 1M records in chains
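The slides report precision, recall and F-measure without stating how they were computed. One common choice for clustering output is the pairwise definition, sketched below in Python. Treat it as an assumption rather than the talk's evaluation procedure; the "(avg.)" on the slide suggests the reported figures are averages that this sketch does not reproduce. Function names and sample clusters are hypothetical.

```python
from itertools import combinations


def pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall and F-measure of a predicted clustering."""
    pred, gold = pairs(predicted), pairs(truth)
    tp = len(pred & gold)                      # pairs linked in both
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: two true chains, one record placed in the wrong chain.
truth = [["r1", "r2", "r3"], ["r6", "r7"]]
predicted = [["r1", "r2"], ["r3", "r6", "r7"]]
p, r, f = pairwise_prf(predicted, truth)
```

Here two of the four predicted pairs are correct and two of the four true pairs are found, so all three measures come out equal.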

Chain name | # Stores
USPS - United States Post Office | 12,776
SUBWAY | 11,278
State Farm Insurance | 8,711
McDonald's | 7,450
Edward Jones | 6,781

Slide43

Experimental Evaluation II

Sample | #Records | #Chains | Chain size | #Single-biz records
Random | 2062 | 30 | [2, 308] | 503
AI | 2446 | 1 | 2446 | 0
UB | 322 | 7 | [2, 275] | 5
FBIns | 1149 | 14 | [33, 269] | 0

Slide44

Related Work

Record similarity:

Probabilistic linkage

Classification-based approaches: classify records by probabilistic model [Fellegi, ’69]

Deterministic linkage

Distance-based approaches: apply distance metric to compute similarity of each attribute, and take the weighted sum as record similarity [Dey, 08]

Rule-based approaches: apply domain knowledge to match records [Hernandez, 98]

Record clustering:

Transitive rule [Hernandez, 98]

Optimization problem [Wijaya, 09]

…

Slide45

Conclusions

In some applications, record linkage needs to be tolerant of value diversity

When linking temporal records, time decay allows tolerance on evolving values

When linking group members, two-stage linkage allows leveraging strong evidence and allows tolerance on different local values

Slide46

Future Work

Slide47

Thanks!
