Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)
Classification problem

A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,1}
- A learner observes a sample S of n points (x, y) drawn i.i.d. from P and wants to predict the labels of other points in X.
- It produces a hypothesis h: X → {-1,1} with empirical error êrr(h) (its error rate on S) and true error err(h) = P(h(x) ≠ y).
- Goal: êrr(h) converges to err(h) uniformly over h, in probability.

[Figure: sample points labeled -1 and +1.]
Generalization bounds

How do we upper bound the true error? Use a generalization bound. Roughly speaking (and with high probability):
    true error ≤ empirical error + (complexity of h)/n
A more complex classifier is "easier" to fit to arbitrary data. The VC-dimension of a hypothesis class is the size of the largest point set it can shatter.
Popular approach for classification

Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus: metric space data.
Metric space

(X, d) is a metric space if:
- X is a set of points, and
- d(·,·) is a distance function that is nonnegative, symmetric, and satisfies the triangle inequality.
An inner product induces a norm, and a norm induces a metric, but the converse implications do not hold.

[Figure: road distances illustrating the triangle inequality: Haifa to Tel Aviv 95 km, Tel Aviv to Be'er Sheva 113 km, Haifa to Be'er Sheva 208 km.]
Classification for metric data?

Advantage: often much more natural, and a much weaker assumption:
- strings (e.g., edit distance)
- images (earthmover distance)
Problem: no vector representation, hence no notion of dot-product (and no kernel).
What to do?
- Invent a kernel (e.g., embed into Euclidean space)? Possibly high distortion!
- Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!
Preliminaries: Lipschitz constant

The Lipschitz constant L of a function f: X → R measures its smoothness: it is the smallest value L satisfying
    |f(x_i) − f(x_j)| ≤ L · d(x_i, x_j)   for all points x_i, x_j in X,
and is denoted ‖f‖_Lip. Suppose a hypothesis h: S → {-1,1} is consistent with the sample S. The Lipschitz constant of h is determined by the closest pair of differently labeled points; equivalently, ‖h‖_Lip ≥ 2/d(S+, S−), where d(S+, S−) denotes the minimum distance between a positive and a negative sample point.
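To make the last point concrete, here is a minimal sketch (helper names are illustrative, not from the paper; any metric d can be plugged in) of the smallest Lipschitz constant of a classifier consistent with a labeled sample:

```python
def sample_lipschitz_constant(points, labels, d):
    """Smallest Lipschitz constant of a hypothesis consistent with the
    sample: ||h||_Lip = 2 / d(S+, S-), where d(S+, S-) is the minimum
    distance between a positive and a negative sample point."""
    s_pos = [p for p, y in zip(points, labels) if y == +1]
    s_neg = [p for p, y in zip(points, labels) if y == -1]
    d_sep = min(d(p, q) for p in s_pos for q in s_neg)  # d(S+, S-)
    return 2.0 / d_sep

# Example on the real line with d(x, y) = |x - y|:
# sample_lipschitz_constant([0.0, 1.0, 3.0], [-1, -1, +1],
#                           lambda x, y: abs(x - y))  # -> 2/2.0 = 1.0
```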
Preliminaries: Lipschitz extension

Lipschitz extension is a classic problem in analysis: given a function f: S → R for S ⊆ X, extend f to all of X without increasing its Lipschitz constant.

[Figure: example on the real line with f(1) = 1 and f(-1) = -1, showing a Lipschitz extension between the two points. Credit: A. Oberman.]
Classification for metric data

A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04):
- Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
- Evaluation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, that is, to the Lipschitz extension problem. For example,
      f(x) = min_i [y_i + 2 d(x, x_i)/d(S+, S−)]   over all (x_i, y_i) in S.
- Evaluating h thus reduces to exact nearest neighbor search (NNS), a strong theoretical motivation for the NNS classification heuristic (see the sketch below).
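A sketch of this rule as code (a brute-force exact nearest-neighbor scan, with illustrative names; not the paper's data structure):

```python
def lipschitz_extension_classifier(sample, d):
    """Given a labeled sample [(x_i, y_i)] with y_i in {-1, +1} and a
    metric d, build f(x) = min_i [y_i + 2 d(x, x_i) / d(S+, S-)] and
    classify new points by the sign of f."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    d_sep = min(d(p, q) for p in pos for q in neg)  # d(S+, S-)

    def f(x):
        return min(y + 2.0 * d(x, xi) / d_sep for xi, y in sample)

    return lambda x: +1 if f(x) >= 0 else -1
```

Only the nearest positive and the nearest negative sample point can attain the minimum, which is exactly why evaluating f reduces to nearest neighbor search.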
More on von Luxburg & Bousquet (cont'd)

*modulo some cheating

[Further details on these two slides were graphical.]
Two new directions

The framework of [vLB '04] leaves open two further questions:
1. Constructing h: handling noise, and the bias-variance tradeoff. Which sample points in S should h ignore?
2. Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?

[Figure: a query point q at distance ~1 from both a -1 point and a +1 point.]
Doubling dimension

Definition: the ball B(x, r) is the set of all points within distance r of x. The doubling constant λ(M) of a metric M is the minimum value λ > 0 such that every ball in M can be covered by λ balls of half the radius. It was first used by [Assouad '83], and first used algorithmically by [Clarkson '97]. The doubling dimension is ddim(M) = log₂ λ(M). A metric is doubling if its doubling dimension is constant. For Euclidean space, ddim(R^d) = O(d).

Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points.

[Figure: a ball covered by half-radius balls; here λ ≥ 7.]
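For intuition, here is a brute-force sketch (illustrative only, and slow) that estimates the doubling constant of a finite metric by greedily covering each ball with half-radius balls:

```python
import itertools

def doubling_constant(points, d):
    """Estimate the doubling constant of the finite metric (points, d):
    the largest number of half-radius balls a greedy r/2-net needs to
    cover any ball B(x, r).  The greedy net upper-bounds the optimal
    cover, so this is an estimate, not an exact value."""
    lam = 1
    radii = {d(p, q) for p, q in itertools.combinations(points, 2)}
    for x in points:
        for r in radii:
            ball = [p for p in points if d(x, p) <= r]
            net = []  # greedy r/2-net; its r/2-balls cover B(x, r)
            for p in ball:
                if all(d(p, c) > r / 2 for c in net):
                    net.append(p)
            lam = max(lam, len(net))
    return lam  # doubling dimension estimate: log2(lam)
```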
Applications of doubling dimension

Major application to databases: recall that exact NNS requires Θ(n) time in an arbitrary metric space, yet there exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n.

Database/network structures and tasks analyzed via the doubling dimension:
- Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- Image recognition (vision) [KG --]
- Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- Clustering [Tal '04, ABS '08, FM '10]
- Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- Traveling Salesperson [Tal '04]
- Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- Machine learning [BLL '09, KKL '10, KKL --]
Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].
Message: this is an active line of research.
Fat-shattering dimension

[Alon, Ben-David, Cesa-Bianchi, Haussler '97]: an analogue of the VC-dimension for real-valued functions. (Roughly, a point set is γ-fat-shattered by a function class if every ±1 labeling of the points can be realized with margin at least γ around suitable thresholds.)

Fat-shattering and generalization

[The formal definitions and bounds on these slides were graphical; the quantitative bound appears in the Statistical contribution slides below.]
Our dual use of doubling dimension

Interestingly, the doubling dimension contributes in two different areas:
- Statistical (function complexity): we bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.
- Computational: efficient approximate NNS.
Statistical contribution

We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension. (vLB provided similar bounds using covering numbers and Rademacher averages.)

Fat-shattering analysis:
- If the L-Lipschitz functions shatter a set, its inter-point distances are at least 2/L.
- By the packing property, such a set has at most (diam · L)^O(ddim) points.
This is the fat-shattering dimension of the classifier class on the space, and a good measure of its complexity.
Statistical contribution

[BST '99]: for any f that classifies a sample of size n correctly, with probability at least 1 − δ,
    P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log₂(578n) + log(4/δ)).
Likewise, if f is correct on all but k examples, with probability at least 1 − δ,
    P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2}.
In both cases d is bounded by the fat-shattering dimension, d ≤ (diam · L)^ddim + 1.

Done with the statistical contribution; on to the computational contribution.
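A small numeric sketch of the second bound (a hypothetical helper, with the constants copied from the slide):

```python
import math

def risk_bound(n, k, diam, L, ddim, delta=0.05):
    """Evaluate the quoted [BST '99] bound for a classifier with k sample
    errors, with d bounded by the fat-shattering dimension."""
    d = (diam * L) ** ddim + 1  # d <= (diam * L)^ddim + 1
    slack = (2.0 / n) * (d * math.log(34 * math.e * n / d)
                         * math.log2(578 * n)
                         + math.log(4 / delta))
    return k / n + math.sqrt(slack)
```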
Computational contribution

Evaluation of h for new points in X: the Lipschitz extension function
    f(x) = min_i [y_i + 2 d(x, x_i)/d(S+, S−)]
requires exact nearest neighbor search, which can be expensive!

New tool: (1+ε)-approximate nearest neighbor search, in time 2^O(ddim) log n + ε^(−O(ddim)) [KL '04, HM '06, BKL '06, CG '06].

If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
    g(x) = (1+ε) f(x) + ε
    e(x) = (1+ε) f(x) − ε.
Note that g(x) ≥ f(x) ≥ e(x). Both g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximate function, generalize well.

[Figure: the band of width 2ε between g(x) and e(x) around f(x).]
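The paper's structure handles general doubling metrics; purely as an illustration, here is a Euclidean sketch (all names are illustrative) using SciPy's KD-tree, whose `eps` argument returns (1+eps)-approximate nearest neighbors:

```python
import numpy as np
from scipy.spatial import cKDTree

def fit(X, y):
    """Index S+ and S- separately and compute d(S+, S-)."""
    pos, neg = X[y == +1], X[y == -1]
    tp, tn = cKDTree(pos), cKDTree(neg)
    d_sep = min(tn.query(p)[0] for p in pos)  # exact d(S+, S-)
    return tp, tn, d_sep

def predict(x, tp, tn, d_sep, eps=0.1):
    """Approximate evaluation of f(x) = min_i [y_i + 2 d(x,x_i)/d(S+,S-)]:
    only the nearest positive and nearest negative distances matter."""
    d_pos = tp.query(x, eps=eps)[0]  # within (1+eps) of the true distance
    d_neg = tn.query(x, eps=eps)[0]
    f = min(+1 + 2 * d_pos / d_sep, -1 + 2 * d_neg / d_sep)
    return 1 if f >= 0 else -1
```

Usage: with X an (n, dim) array and y an array of ±1 labels, classify a point via `predict(x, *fit(X, y))`.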
Final problem: bias-variance tradeoff

Which sample points in S should h ignore? If f is correct on all but k examples, then with probability at least 1 − δ,
    P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2},
where d ≤ (diam · L)^ddim + 1.
Structural Risk Minimization

Algorithm:
- Fix a target Lipschitz constant L (O(n²) possible values, one per pairwise distance).
- Locate all pairs of points from S+ and S− whose distance is less than 2/L; at least one point of each such pair has to be counted as an error.
- Goal: remove as few points as possible.
This is a minimum vertex cover problem on the graph of violating pairs:
- NP-complete in general, but it admits a 2-approximation in O(E) time (see the sketch below).
- Here the graph is bipartite, so minimum vertex cover is equivalent to maximum matching (Kőnig's theorem) and admits an exact solution in O(n^2.376) randomized time [MS '04].
Efficient SRM

Naive algorithm: for each of the O(n²) values of L,
- run the matching algorithm to find the minimum number of errors, and
- evaluate the generalization bound for this value of L.
This takes O(n^4.376) randomized time.

Better algorithm: binary search over the O(n²) values of L; for each candidate value,
- run the greedy 2-approximation to approximate the minimum error in O(n² log n) time, and
- evaluate the approximate generalization bound for this value of L.
(A sketch follows below.)
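Putting the pieces together, an illustrative sketch of the SRM loop, reusing `risk_bound` and `vertex_cover_2approx` from the earlier sketches (for simplicity it sweeps all candidate values of L rather than binary searching):

```python
def efficient_srm(points, labels, d, diam, ddim, delta=0.05):
    """Sweep the O(n^2) candidate Lipschitz constants L = 2/d(x_i, x_j)
    over differently labeled pairs; for each, approximate the minimum
    number of errors k via the greedy vertex cover, and keep the L that
    minimizes the (approximate) risk bound."""
    n = len(points)
    pos = [i for i, y in enumerate(labels) if y == +1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    candidates = sorted({d(points[i], points[j]) for i in pos for j in neg})
    best = (float("inf"), None)
    for d_sep in candidates:  # target separation; L = 2/d_sep
        edges = [(i, j) for i in pos for j in neg
                 if d(points[i], points[j]) < d_sep]
        k = len(vertex_cover_2approx(edges))  # approx. minimum error count
        bound = risk_bound(n, k, diam, 2.0 / d_sep, ddim, delta)
        if bound < best[0]:
            best = (bound, 2.0 / d_sep)
    return best  # (approximate risk bound, chosen L)
```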
Conclusion

Results:
- Generalization bounds for Lipschitz classifiers in doubling spaces.
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS.
- Efficient Structural Risk Minimization.
Continuing research (continuous labels):
- Risk bound via the doubling dimension.
- Classifier h determined via an LP.
- Faster LP via low-hop, low-stretch spanners [GR '08a, GR '08b]: fewer constraints, and each variable appears in a bounded number of constraints.
Application: earthmover distance

[Figure: two point sets S and T compared under the earthmover distance.]