Active Sampling for Entity Matching

Presentation Transcript


Active Sampling for Entity Matching

Aditya Parameswaran, Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)

Entity Matching

Goal: Find duplicate entities in a given data set
Fundamental data cleaning primitive → decades of prior work
Especially important at Yahoo! (and other web companies)

Example duplicate pair:
"Homma's Brown Rice Sushi", California Avenue, Palo Alto
"Homma's Sushi", Cal Ave, Palo Alto
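As a toy illustration (not from the slides) of how such a pair of listings becomes input for a matcher, the sketch below computes a few similarity features for the example records; the field names and the Jaccard featurization are assumptions for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# The duplicate pair from the slide, as toy records (illustrative only).
r1 = {"name": "Homma's Brown Rice Sushi", "street": "California Avenue", "city": "Palo Alto"}
r2 = {"name": "Homma's Sushi", "street": "Cal Ave", "city": "Palo Alto"}

features = [
    jaccard(r1["name"], r2["name"]),       # partial name overlap
    jaccard(r1["street"], r2["street"]),   # abbreviation defeats exact token match
    1.0 if r1["city"] == r2["city"] else 0.0,
]
print(features)  # [0.5, 0.0, 1.0]: a feature vector for the duplicate-pair classifier
```

A classifier over such pair-level feature vectors is exactly what the rest of the talk learns actively.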

Why is it important?

[Diagram] Websites, databases, and content providers (Yelp, Zagat, Foursquare, ...) feed in dirty entities; finding duplicates (the ??? step) yields deduplicated entities.

Applications:
Business Listings in Y! Local
Celebrities in Y! Movies
Events in Y! Upcoming
...

How?

Reformulated Goal: Construct a high-quality classifier identifying duplicate entity pairs
Problem: How do we select training data?
Answer: Active Learning with Human Experts!

Reformulated Workflow

[Diagram] Websites, databases, and content providers feed in dirty entities; our technique produces deduplicated entities.

Active Learning (AL) Primer

Properties of an AL algorithm:
Label complexity
Time complexity
Consistency

Prior work:
Uncertainty Sampling
Query by Committee
...
Importance Weighted Active Learning (IWAL): online, without constraints, implemented in Vowpal Wabbit (VW); 0-1 metric; time and label efficient; provably consistent; works even under noisy settings
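The slide only names IWAL; as a rough, hedged illustration of the importance-weighting idea (a simplified disagreement-based query rule, not the IWAL algorithm from the literature or the paper's method), a minimal loop might look like this:

```python
import random

def importance_weighted_al(stream, hypotheses, loss, p_min=0.1):
    """Toy importance-weighted active learning loop (illustration only).

    stream: iterable of (x, get_label) pairs; calling get_label() queries the oracle.
    hypotheses: candidate classifiers h(x) -> {0, 1}.
    loss: function (prediction, label) -> float, e.g. 0-1 loss.
    """
    weighted_loss = {id(h): 0.0 for h in hypotheses}
    for x, get_label in stream:
        predictions = {h(x) for h in hypotheses}
        # Query with certainty only when the hypotheses disagree on x.
        p = 1.0 if len(predictions) > 1 else p_min
        if random.random() < p:
            y = get_label()  # spend a label
            for h in hypotheses:
                # Reweighting by 1/p keeps the loss estimates unbiased.
                weighted_loss[id(h)] += loss(h(x), y) / p
    return min(hypotheses, key=lambda h: weighted_loss[id(h)])
```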

Problem One: Imbalanced Data

Typical to have a 100:1 ratio of non-matches to matches, even after blocking.
On such data, the trivial solution that labels every pair a non-match gets 0-1 error ≈ 0 and precision 100%, yet correctly identifies no matches.
Solution: metric from [Arasu11]: maximize recall (% of correct matches identified), such that precision > τ.
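To make the imbalance problem concrete, here is a small synthetic check (not from the paper): on a 100:1 pair set, the all-non-matches predictor looks excellent under 0-1 error while finding nothing.

```python
labels = [1] * 10 + [0] * 1000      # 1 = match, 0 = non-match (synthetic 100:1 data)
preds = [0] * len(labels)           # trivial classifier: predict non-match for every pair

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

zero_one_error = sum(p != y for p, y in zip(preds, labels)) / len(labels)
precision = tp / (tp + fp) if tp + fp else 1.0   # vacuously perfect: no pair was predicted a match
recall = tp / (tp + fn)

print(zero_one_error)  # ~0.0099: looks like a great classifier
print(precision)       # 1.0
print(recall)          # 0.0: it never finds a single duplicate
```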

Problem Two: Guarantees

Prior work on entity matching:
No guarantees on recall/precision
Even when there are guarantees, the methods have high time + label complexity

Can we adapt prior work on AL to the new objective (maximize recall, such that precision > τ) with:
Sub-linear label complexity
Efficient time complexity

Overview of Our Approach

[Diagram] Recall optimization with precision constraint → (Reduction: convex-hull search in a relaxed Lagrangian; this talk) → Weighted 0-1 error → (Reduction: rejection sampling; paper) → Active learning with 0-1 error

Objective

Given: hypothesis class H, threshold τ in [0,1]
Objective: find h in H that maximizes recall(h), such that precision(h) >= τ
Equivalently: maximize -falseneg(h), such that ε·truepos(h) - falsepos(h) >= 0, where ε = (1-τ)/τ
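The equivalence is a one-line consequence of the definition of precision (my own working of the step the slide skips; recall is maximized exactly when falseneg is minimized, since truepos(h) + falseneg(h) equals the fixed number of true matching pairs):

```latex
\[
\mathrm{precision}(h) \ge \tau
\;\Longleftrightarrow\; \frac{tp(h)}{tp(h)+fp(h)} \ge \tau
\;\Longleftrightarrow\; (1-\tau)\,tp(h) \ge \tau\,fp(h)
\;\Longleftrightarrow\; \underbrace{\tfrac{1-\tau}{\tau}}_{\varepsilon}\,tp(h) - fp(h) \ge 0 .
\]
```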

Unconstrained Objective

Current formulation: maximize -falseneg(h), such that ε·truepos(h) - falsepos(h) >= 0
Call these X(h) and Y(h), respectively.
If we introduce a Lagrange multiplier λ, "maximize X(h) + λ·Y(h)" can be rewritten as:
Minimize δ·falseneg(h) + (1 - δ)·falsepos(h)
i.e., a weighted 0-1 objective
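One way to see the rewrite (my own working, not the paper's derivation; P denotes the fixed number of true matching pairs, so tp(h) = P - fn(h)):

```latex
\[
X(h) + \lambda\,Y(h)
  = -fn(h) + \lambda\bigl(\varepsilon\,tp(h) - fp(h)\bigr)
  = -(1+\lambda\varepsilon)\,fn(h) \;-\; \lambda\,fp(h) \;+\; \lambda\varepsilon P .
\]
```

Since λεP does not depend on h, maximizing X(h) + λ·Y(h) amounts to minimizing δ·falseneg(h) + (1 - δ)·falsepos(h) with δ = (1 + λε)/(1 + λε + λ); the paper's exact normalization of δ may differ.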

Convex Hull of Classifiers

[Figure: each classifier h plotted as a point (X(h), Y(h)); we want the classifier furthest along X among those with Y(h) >= 0]
Objective: maximize X(h), such that Y(h) >= 0
The convex hull is the shape formed by joining the classifiers that strictly dominate the others; it can have an exponential number of points inside.

Convex Hull of Classifiers

[Figure: the same (X(h), Y(h)) plane, with a hull edge between vertex classifiers u and v]
Objective: maximize X(h), such that Y(h) >= 0
For any λ > 0, there is a point or edge of the hull with the largest value of X + λ·Y.
If λ = -1/slope of a hull edge, we get a classifier on that edge; otherwise we get a vertex classifier.
Plug λ into the weighted objective to get the classifier h with the highest X(h) + λ·Y(h).
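The slope statement is just the geometry of the level sets of the weighted objective (my paraphrase):

```latex
\[
X + \lambda Y = c
\quad\Longrightarrow\quad
Y = -\tfrac{1}{\lambda}\,X + \tfrac{c}{\lambda},
\]
```

so each level set is a line of slope -1/λ; sweeping c upward, the last hull point (or edge) touched is the maximizer, which is why a λ matching the negative reciprocal of an edge's slope selects that edge.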

Convex Hull of Classifiers

[Figure: the same plane; in the worst case the weighted objective returns a vertex far from the constrained optimum]
Objective: maximize X(h), such that Y(h) >= 0
Naive strategy: try all λ (equivalently, try all slopes) → too long!
Instead, do binary search for λ.
Problem: when to stop? Handled via (1) bounds and (2) discretization of λ → details in the paper!

Algorithm I (Ours → Weighted)

Given: AL black box C for weighted 0-1 error
Goal: the precision-constrained objective
Range of λ: [Λmin, Λmax]
Don't enumerate all candidate λ → too expensive: O(n³)
Instead, discretize λ using a factor θ → see paper!
Binary search over the discretized values → same complexity as binary search: O(log n)
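A rough sketch of the binary-search driver, under assumptions not in the slides: a hypothetical weighted-0-1 active learner C(delta) that returns a classifier together with estimates of X(h) and Y(h), a monotone trade-off between the two along the λ grid, and an illustrative λ-to-δ mapping. The paper's stopping rule and bounds are more careful than this.

```python
def delta_from_lambda(lam, eps):
    """Illustrative weight on false negatives implied by the Lagrangian rewrite."""
    return (1 + lam * eps) / (1 + lam * eps + lam)

def precision_constrained_search(C, lambdas, eps):
    """Binary search over an increasing, discretized grid of multipliers.

    C(delta) -> (h, X, Y): hypothetical weighted 0-1 AL black box; returns a
        classifier h with estimates X = -falseneg(h) and
        Y = eps * truepos(h) - falsepos(h).
    Larger lambda puts more weight on the precision constraint Y(h) >= 0,
    at the cost of X(h) (i.e., recall).
    """
    best_h, best_x = None, float("-inf")
    lo, hi = 0, len(lambdas) - 1
    while lo <= hi:                          # O(log n) calls to C
        mid = (lo + hi) // 2
        h, x, y = C(delta_from_lambda(lambdas[mid], eps))
        if y >= 0:                           # feasible: keep it, then try a smaller lambda
            if x > best_x:
                best_h, best_x = h, x
            hi = mid - 1
        else:                                # infeasible: push harder on precision
            lo = mid + 1
    return best_h
```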

Algorithm II (Weighted → 0-1)

Given: AL black box B for 0-1 error
Goal: AL black box C for weighted 0-1 error
Use a trick from supervised learning [Zadrozny03]: cost-sensitive objective → binary reduction by rejection sampling
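A minimal sketch of the rejection-sampling ("costing") idea from [Zadrozny03], with an illustrative per-example cost; the slides give no code, so the interface here is assumed. Each example is kept with probability proportional to its misclassification cost, after which ordinary 0-1 learning on the accepted sample optimizes the weighted objective in expectation.

```python
import random

def reject_sample(examples, max_cost):
    """Reduce a cost-sensitive (weighted 0-1) stream to a plain 0-1 stream.

    examples: iterable of (x, y, cost) triples, where cost is the penalty for
        misclassifying this example (e.g. delta for a true match and 1 - delta
        for a non-match under the weighted objective above).
    Accepting each example with probability cost / max_cost makes the ordinary
    0-1 error on the accepted sample an unbiased proxy for the weighted error.
    """
    accepted = []
    for x, y, cost in examples:
        if random.random() < cost / max_cost:
            accepted.append((x, y))   # feed these to the 0-1 AL black box B
    return accepted
```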

Overview of Our Approach

[Diagram, revisited] Recall optimization with precision constraint → (Reduction: convex-hull search in a relaxed Lagrangian; this talk; O(log n)) → Weighted 0-1 error → (Reduction: rejection sampling; paper; O(log n)) → Active learning with 0-1 error
Overall: Labels = O(log² n) · L(B), Time = O(log² n) · T(B)

Experiments

Four real-world data sets; all labels known; active learning is simulated.
Two approaches for AL with a precision constraint:
Ours, with Vowpal Wabbit as the 0-1 AL black box
Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity

Data Set                    Size     Ratio (+/-)   Features
Y! Local Businesses         3958     0.115         5
UCI Person Linkage          574913   0.004         9
DBLP-ACM Bibliography       494437   0.005         7
Scholar-DBLP Bibliography   589326   0.009         7

Results I (Runtime with #Features)

[Plot: computational complexity (runtime vs. number of features) on UCI Person]

Results II (Quality & #Label Queries)

[Plots: quality and number of label queries on the Business and Person data sets]

Results II (Contd.)

[Plots: quality and number of label queries on the DBLP-ACM and Scholar data sets]

Results III (0-1 Active Learning)

[Plot: precision constraint satisfaction % of 0-1 AL]

Conclusion

Active learning for entity matching:
Can use any 0-1 AL as a black box
Great real-world performance: computationally efficient (600k examples in 25 seconds); label efficient and better F-1 on four real-world tasks
Guaranteed precision of the matcher
Guarantees on time and label complexity