Nearly optimal classification for semimetrics
Lee-Ad Gottlieb (Ariel U.), Aryeh Kontorovich (Ben Gurion U.), Pinhas Nisnevitch (Tel Aviv U.)
Classification problem

A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,+1}
- Learner observes a sample S of n points (x,y) drawn i.i.d. from P
- Wants to predict the labels of other points in X
- Produces a hypothesis h: X → {-1,+1} with an empirical error (on S) and a true error (under P)
- Goal: true error close to empirical error, uniformly over h, with high probability
Generalization bounds

- How do we upper-bound the true error? Use a generalization bound.
- Roughly speaking (and with high probability): true error ≤ empirical error + √[(complexity of h)/n]
- More complex classifier ↔ "easier" to fit to arbitrary data
- VC-dimension: the size of the largest point set that can be shattered by the hypothesis class
Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus:
- Metric space data
- Semimetric space data
Semimetric space

First, recall: (X, ρ) is a metric space if
- X = set of points
- ρ = distance function, ρ: X × X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- Triangle inequality: ρ(x,x′) ≤ ρ(x,x′′) + ρ(x′,x′′)
Semimetric space

(X, ρ) is a semimetric space if
- X = set of points
- ρ = distance function, ρ: X × X → ℝ
- Nonnegative: ρ(x,x′) ≥ 0, and ρ(x,x′) = 0 ⇔ x = x′
- Symmetric: ρ(x,x′) = ρ(x′,x)
- The triangle inequality is NOT required

Hierarchy: inner product ⊂ norm ⊂ metric ⊂ semimetric
Semimetric examples

- Jensen-Shannon divergence
- Euclidean-squared (ℓ₂²)
- Fractional ℓp spaces (p < 1), with ‖a−b‖ₚ = (∑ᵢ |aᵢ − bᵢ|^p)^(1/p). Example: p = ½

[Figure: two planar examples in which the triangle inequality fails; e.g. under ℓ_{1/2}, the points (0,0), (0,2), (2,0) are at pairwise distances 2, 2, and 8, so 2 + 2 < 8.]
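A quick numerical check of the ℓ_{1/2} example above (a minimal sketch; `lp_dist` is a helper name introduced here):

```python
def lp_dist(a, b, p):
    """Fractional "l_p distance" for p < 1: (sum |a_i - b_i|^p)^(1/p).

    For p < 1 this is only a semimetric: the triangle inequality can fail.
    """
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

x, y, z = (0, 0), (0, 2), (2, 0)
p = 0.5
print(lp_dist(x, y, p), lp_dist(x, z, p))  # ~2.0 and ~2.0
print(lp_dist(y, z, p))                    # ~8.0 > 2.0 + 2.0: triangle inequality fails
```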
Semimetric examples

- Hausdorff distance: determined by the point in A farthest from B
- 1-rank Hausdorff distance: determined by the point in A closest to B
- k-rank Hausdorff distance: determined by the point in A k-th closest to B

[Figure: two point sets A and B illustrating the Hausdorff variants.]
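A sketch of the k-rank variant as described on this slide (the directed version; helper names are mine, and any symmetrization details may differ from the paper's definition):

```python
def dist_to_set(x, B, dist):
    # Distance from a point x to the set B: distance to the nearest point of B.
    return min(dist(x, b) for b in B)

def k_rank_hausdorff(A, B, dist, k):
    """Directed k-rank Hausdorff distance, per the slide's description.

    Rank the points of A by their distance to B and take the k-th smallest:
    k = 1 gives the 1-rank variant (point of A closest to B), and
    k = len(A) recovers the classical directed Hausdorff distance
    (point of A farthest from B).
    """
    ds = sorted(dist_to_set(a, B, dist) for a in A)
    return ds[k - 1]
```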
Semimetric examples

Note: semimetrics are often unintuitive.
Example: diameter > 2 × radius.

[Figure: under ℓ_{1/2}, the points (0,0), (0,2), (2,0) lie in a ball of radius 2 around (0,0), yet the diameter is 8 > 2 × 2.]
Classification for semimetrics?

Problem: no vector representation
- No notion of dot product (and hence no kernel)
What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Provably high distortion!
- Use some NN heuristic? The NN classifier has infinite VC-dimension!
Our approach:
- Sample compression
- NN classification
Result: strong generalization bounds
Classification for semimetrics?

Central contribution: we show that the complexity of the classifier is controlled by
- the margin ɣ of the sample, and
- the density dimension of the space,
with bounds close to optimal.
Density dimension

Definition: the ball B(x,r) is the set of all points within distance r of x.
The density constant c (of a semimetric space) is the minimum value bounding the number of points in any ball B(x,r) at mutual distance r/2; the density dimension is dens = log₂ c.

Examples, for a set of n d-dimensional vectors:
- JS divergence: O(d)
- ℓp (p < 1): O(d/p)
- k-rank Hausdorff: O(k(d + log n))
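A brute-force way to probe the density constant on a finite point set (a sketch under my own naming; greedy packing only lower-bounds the true constant, which is a supremum over all centers, radii, and packings):

```python
def l2_squared(a, b):
    # Euclidean-squared distance: a classic semimetric.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def packing_in_ball(points, dist, center, r):
    """Greedily pick points of B(center, r) at mutual distance >= r/2."""
    packed = []
    for p in points:
        if dist(center, p) <= r and all(dist(p, q) >= r / 2 for q in packed):
            packed.append(p)
    return len(packed)

def density_constant_estimate(points, dist, radii):
    # Lower-bound estimate: best packing found over sampled centers and radii.
    return max(packing_in_ball(points, dist, c, r)
               for c in points for r in radii)

# Example: a small 2-D grid under the squared-Euclidean semimetric.
grid = [(i, j) for i in range(4) for j in range(4)]
print(density_constant_estimate(grid, l2_squared, radii=[1, 2, 4, 8]))
```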
Classifier construction

Recall our approach: compress the sample, then classify by NN.
Initial approach: a classifier consistent with respect to the sample, i.e. NN reconstructs the sample labels exactly.
Solution: a ɣ-net C of the sample (a subset whose points are at mutual distance > ɣ, with every sample point within distance ɣ of some net point).
Classifier construction

Solution: a ɣ-net C of the sample S.
- Must be consistent with the sample
- Brute-force construction: O(n²)
Crucial question: how many points are in the net? (See the sketch below.)
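A minimal sketch of the brute-force net construction and the resulting NN classifier (my naming; this shows only the covering step, while the paper additionally ensures the net is consistent with the sample labels under the margin assumption):

```python
def greedy_gamma_net(sample, dist, gamma):
    """Brute-force O(n^2) construction of a gamma-net of the sample.

    Kept points are at mutual distance > gamma; every discarded point is
    within gamma of some kept point, so the net covers the sample.
    sample: list of (point, label) pairs; dist: a semimetric on points.
    """
    net = []
    for x, y in sample:
        if all(dist(x, c) > gamma for c, _ in net):
            net.append((x, y))
    return net

def nn_classify(net, dist, x):
    # Predict with 1-NN against the compressed sample (the net).
    nearest_point, label = min(net, key=lambda cl: dist(x, cl[0]))
    return label
```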
Classifier construction

Theorem: If C is an optimally small ɣ-net of the sample S, then |C| ≤ (radius(S)/ɣ)^dens(S).

Proof sketch:
- C ← at most 2^dens(S) points at mutual distance radius(S)/2
- Associate each point not in C with its NN in C
- Repeat, halving the scale, log(radius(S)/ɣ) times
- Runtime: n log(radius(S)/ɣ)

Optimality: the constructed net may not be optimally small, but it is NP-hard to do much better.
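The size bound follows from the halving construction: each level multiplies the number of net points by at most 2^dens(S), so after log₂(radius(S)/ɣ) levels,

```latex
\[
|C| \;\le\; \left(2^{\mathrm{dens}(S)}\right)^{\log_2\!\frac{\mathrm{radius}(S)}{\gamma}}
\;=\; \left(\frac{\mathrm{radius}(S)}{\gamma}\right)^{\mathrm{dens}(S)}.
\]
```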
Generalization bounds

Upshot: a sample compression bound in terms of the net size |C|.
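The slide's formula was rendered as an image; a standard consistent-case sample-compression bound of this general shape (not necessarily the talk's exact statement) is, for a compression set of size k = |C|, with probability at least 1 − δ:

```latex
\[
\mathrm{err}(h) \;\le\; \frac{1}{n-k}\left(k \ln n + \ln\frac{1}{\delta}\right).
\]
```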
Generalization bounds

Upshot: what if we allow the classifier εn errors on the sample?
- A better bound is possible, where k is the achieved compression size
- But it is NP-hard to optimize this bound, so the consistent net is the best we can do efficiently
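Again the formula itself was an image; lossy sample-compression bounds typically take roughly this shape (a hedged sketch, not the paper's exact constants), where ε̂ is the empirical error of the compressed classifier:

```latex
\[
\mathrm{err}(h) \;\le\; \hat{\varepsilon}
\;+\; O\!\left(\sqrt{\frac{k \ln n + \ln\frac{1}{\delta}}{n - k}}\right).
\]
```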
Generalization bounds

Upshot: even under margin assumptions, a sample of size exponential in dens is required for some distributions, so the bounds above are nearly optimal.