Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi (Yahoo! Research)
Entity Matching
Goal: Find duplicate entities in a given data set.
A fundamental data cleaning primitive, with decades of prior work.
Especially important at Yahoo! (and other web companies).
Example of a duplicate pair: "Homma's Brown Rice Sushi, California Avenue, Palo Alto" vs. "Homma's Sushi, Cal Ave, Palo Alto".
Why is it important?
Data pipeline: websites, databases, and content providers (e.g., Yelp, Zagat, Foursq) produce dirty entities; finding duplicates (???) yields deduplicated entities.
Applications:
- Business listings in Y! Local
- Celebrities in Y! Movies
- Events in Y! Upcoming
- ...
How?
Reformulated goal: Construct a high-quality classifier identifying duplicate entity pairs.
Problem: How do we select training data?
Answer: Active learning with human experts!
Reformulated Workflow
Websites, databases, and content providers → dirty entities → our technique → deduplicated entities.
Active Learning (AL) Primer
Properties of an AL algorithm: label complexity, time complexity, consistency.
Prior work: uncertainty sampling, query by committee, ..., Importance Weighted Active Learning (IWAL).
Online IWAL without constraints, implemented in Vowpal Wabbit (VW):
- 0-1 metric
- Time- and label-efficient
- Provably consistent
- Works even under noisy settings
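The importance-weighting idea that underlies IWAL can be sketched in a few lines. This is our own minimal illustration of the unbiasedness trick (query a label with probability p, and when queried, weight its loss by 1/p), not the actual IWAL query rule; all numbers and names are illustrative.

```python
# Sketch: importance-weighted label querying keeps the loss estimate unbiased
# while querying only a fraction of the labels. Illustrative values only.
import random

rng = random.Random(0)
losses = [0.2, 0.9, 0.4, 0.7, 0.1] * 200      # per-example losses of some fixed classifier
probs = [0.5 + 0.5 * l for l in losses]       # query more often where the loss is larger

total, queried = 0.0, 0
for l, p in zip(losses, probs):
    if rng.random() < p:                      # query this example's label
        total += l / p                        # importance weight 1/p
        queried += 1

est = total / len(losses)
true = sum(losses) / len(losses)
print(est, true, queried)   # estimate is close to the true mean 0.46, using ~3/4 of the labels
```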
Problem One: Imbalanced Data
Non-matches typically outnumber matches 100:1, even after blocking.
Under 0-1 error, predicting "non-match" for everything achieves error ≈ 0 (and vacuously 100% precision) while identifying no matches at all.
Solution, using the metric from [Arasu11]: maximize recall (% of correct matches identified) such that precision > τ.
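A tiny numeric sketch of why 0-1 error misleads at this ratio (the 100:1 sample below is illustrative):

```python
# Why 0-1 error fails at a 100:1 non-match:match ratio.
# Labels: 1 = match, 0 = non-match.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no predicted matches: vacuously precise
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

y_true = [1] + [0] * 100       # 1 match per 100 non-matches
y_pred = [0] * 101             # trivial classifier: always predict "non-match"

error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
prec, rec = precision_recall(y_true, y_pred)
print(error, prec, rec)   # 0-1 error ≈ 0.0099, precision 1.0 (vacuous), recall 0.0
```

Recall exposes what 0-1 error hides: the trivial classifier finds none of the matches.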
Problem Two: Guarantees
Prior work on entity matching gives no guarantees on recall/precision; even when it does, it has high time and label complexity.
Can we adapt prior work on AL to the new objective (maximize recall, such that precision > τ) with sub-linear label complexity and efficient time complexity?
Overview of Our Approach
Recall optimization with precision constraint
→ Reduction: convex-hull search in relaxed Lagrangian (this talk)
Weighted 0-1 error
→ Reduction: rejection sampling (in the paper)
Active learning with 0-1 error
Objective
Given: hypothesis class H, threshold τ in [0,1].
Objective: find h in H that maximizes recall(h), such that precision(h) >= τ.
Equivalently: maximize -falseneg(h), such that ε truepos(h) - falsepos(h) >= 0, where ε = (1-τ)/τ.
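The equivalence between the precision constraint and its linear form can be checked exactly with rational arithmetic: precision(h) = TP/(TP+FP) >= τ rearranges to ((1-τ)/τ)·TP - FP >= 0. A quick exhaustive-sampling check of this algebra:

```python
# Exact check (no float noise) that precision >= tau is equivalent to
# eps * truepos - falsepos >= 0 with eps = (1 - tau) / tau.
from fractions import Fraction
import random

rng = random.Random(0)
for _ in range(1000):
    tau = Fraction(rng.randint(1, 99), 100)   # tau in (0, 1)
    tp = rng.randint(0, 50)
    fp = rng.randint(0, 50)
    if tp + fp == 0:
        continue                              # precision undefined with no predicted matches
    eps = (1 - tau) / tau
    lhs = Fraction(tp, tp + fp) >= tau        # precision constraint
    rhs = eps * tp - fp >= 0                  # linear form
    assert lhs == rhs, (tau, tp, fp)
print("equivalence verified exactly on all sampled cases")
```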
Unconstrained Objective
Current formulation: maximize X(h) = -falseneg(h), such that Y(h) = ε truepos(h) - falsepos(h) >= 0.
Introducing a Lagrange multiplier λ, "maximize X(h) + λ Y(h)" can be rewritten as a weighted 0-1 objective:
minimize δ falseneg(h) + (1 - δ) falsepos(h).
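This rewrite can be sanity-checked numerically. When the number of true matches P is fixed, truepos = P - falseneg, so X(h) + λ Y(h) equals a constant minus a positive combination of falseneg and falsepos. The normalization of δ below is our own illustrative choice (the slides defer the exact form to the paper):

```python
# The Lagrangian and the weighted 0-1 error pick the same classifier,
# since one is a constant minus a positive multiple of the other.

def lagrangian(fn, fp, lam, eps, P):
    """X(h) + lam*Y(h) = -falseneg + lam*(eps*truepos - falsepos), truepos = P - falseneg."""
    return -fn + lam * (eps * (P - fn) - fp)

def weighted_01(fn, fp, lam, eps):
    """delta*falseneg + (1-delta)*falsepos, with delta = (1 + lam*eps) / (1 + lam*eps + lam)."""
    delta = (1 + lam * eps) / (1 + lam * eps + lam)
    return delta * fn + (1 - delta) * fp

P, lam, eps = 100, 2.0, 0.25
# A frontier of candidate classifiers: fewer false negatives costs more false positives.
candidates = [(fn, (P - fn) // 2) for fn in range(0, P + 1, 5)]
best_lagr = max(candidates, key=lambda c: lagrangian(c[0], c[1], lam, eps, P))
best_wtd = min(candidates, key=lambda c: weighted_01(c[0], c[1], lam, eps))
print(best_lagr, best_wtd)   # the two objectives agree on the argmax
```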
Convex Hull of Classifiers
Plot each classifier h as a point (Y(h), X(h)); we want the classifier that maximizes X(h) such that Y(h) >= 0.
These points form a convex shape whose boundary joins the classifiers that strictly dominate the others; an exponential number of classifiers can lie inside the hull.
For any λ > 0, there is a point (or an edge) of the hull attaining the largest value of X + λ Y: plugging λ into the weighted objective yields the classifier h with highest X(h) + λ Y(h).
If λ = -1/slope of an edge (u, v), we get a classifier on that edge; otherwise we get a vertex classifier.
In the worst case, the returned vertex can be far from the constrained optimum.
Naïve strategy: try all λ (equivalently, try all slopes). Too long!
Instead, binary search for λ. Problem: when to stop? Answered via (1) bounds and (2) discretization of λ. Details in the paper!
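The binary search can be sketched schematically. Here each classifier is reduced to its (X, Y) pair, and the weighted-0-1 black box is simulated by an exact argmax oracle; this, plus the fixed iteration count, is an assumption for illustration (the real stopping rule and discretization are in the paper). The example also shows the worst-case caveat above: the constrained optimum may lie on an edge, while the search returns the best feasible vertex.

```python
# Schematic binary search over the Lagrange multiplier lambda.
# Each "classifier" is summarized by its (X, Y) value; the weighted-0-1
# black box is simulated by an oracle maximizing X + lam * Y.

def oracle(classifiers, lam):
    return max(classifiers, key=lambda h: h[0] + lam * h[1])

def binary_search_lambda(classifiers, lam_lo=0.0, lam_hi=16.0, iters=50):
    """Find (approximately) the smallest lam whose oracle output satisfies Y >= 0."""
    best = oracle(classifiers, lam_hi)      # assume the constraint holds at lam_hi
    for _ in range(iters):
        mid = (lam_lo + lam_hi) / 2
        h = oracle(classifiers, mid)
        if h[1] >= 0:                       # feasible: try a smaller lam (more weight on X)
            best, lam_hi = h, mid
        else:                               # infeasible: need a larger lam
            lam_lo = mid
    return best

# Hull vertices (X, Y): higher X trades off against lower Y.
hull = [(-10, 5), (-6, 2), (-3, -1), (-1, -4)]
print(binary_search_lambda(hull))   # -> (-6, 2): the best-X vertex with Y >= 0
```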
Algorithm I (Ours Weighted)
Given: AL black box C for weighted 0-1 error. Goal: the precision-constrained objective.
Range of λ: [Λmin, Λmax]. Don't enumerate all candidate λ: too expensive, O(n³).
Instead, discretize using a factor θ (see paper!) and binary search over the discretized values: same complexity as binary search, O(log n).
Algorithm II (Weighted 0-1)
Given: AL black box B for 0-1 error. Goal: AL black box C for weighted 0-1 error.
Use a trick from supervised learning [Zadrozny03]: reduce the cost-sensitive objective to a binary one by rejection sampling.
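The [Zadrozny03] rejection-sampling reduction can be sketched as follows: keep each labeled example with probability proportional to its misclassification cost, then train any plain 0-1 learner on the kept sample. The δ value, data, and names below are illustrative.

```python
# Rejection sampling: reduce a cost-sensitive (weighted 0-1) problem to a
# plain 0-1 problem by keeping each example with probability cost / max_cost.
import random

def rejection_sample(examples, costs, seed=7):
    """examples: list of (features, label); costs: per-example misclassification costs."""
    rng = random.Random(seed)
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]

# Matches (label 1) get weight delta, non-matches weight 1 - delta:
delta = 0.9
examples = [((i,), i % 2) for i in range(1000)]             # toy features/labels
costs = [delta if y == 1 else 1 - delta for _, y in examples]
kept = rejection_sample(examples, costs)
frac_matches = sum(y for _, y in kept) / len(kept)
print(frac_matches)   # matches are now heavily over-represented (~0.9)
```

Training an unweighted 0-1 learner on `kept` then optimizes (in expectation) the weighted objective on the original data.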
Overview of Our Approach
Recall optimization with precision constraint
→ Reduction: convex-hull search in relaxed Lagrangian (this talk): O(log n)
Weighted 0-1 error
→ Reduction: rejection sampling (in the paper): O(log n)
Active learning with 0-1 error
Overall: Labels = O(log² n) · L(B), Time = O(log² n) · T(B)
Experiments
Four real-world data sets, with all labels known, so we can simulate active learning.
Two approaches for AL with a precision constraint:
- Ours, with Vowpal Wabbit as the 0-1 AL black box
- Monotone [Arasu11], which assumes monotonicity of similarity features and has high computational + label complexity

Data Set                   | Size   | Ratio (+/-) | Features
Y! Local Businesses        | 3958   | 0.115       | 5
UCI Person Linkage         | 574913 | 0.004       | 9
DBLP-ACM Bibliography      | 494437 | 0.005       | 7
Scholar-DBLP Bibliography  | 589326 | 0.009       | 7
Results I (Runtime with #Features)
Computational complexity on UCI Person.
Results II (Quality & #Label Queries)
(Plots for the Business and Person data sets.)
Results II (Contd.)
(Plots for the DBLP-ACM and Scholar data sets.)
Results III (0-1 Active Learning)
Precision constraint satisfaction % of 0-1 AL.
Conclusion
Active learning for entity matching that can use any 0-1 AL algorithm as a black box, with guaranteed precision of the matcher and bounded time and label complexity.
Great real-world performance: computationally efficient (600k examples in 25 seconds), label-efficient, and better F-1 on four real-world tasks.