
Presentation Transcript

Slide1

Efficient classification for metric data

Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)

Slide2

Classification problem

A fundamental problem in learning:
Point space X
Probability distribution P on X × {-1,1}
Learner observes a sample S of n points (x,y) drawn i.i.d. ~ P
Wants to predict the labels of other points in X
Produces a hypothesis h: X → {-1,1} with empirical error and true error (see below)
Goal: uniformly over h, in probability
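The slide's empirical-error and true-error formulas were rendered as images in the original deck; the following is a reconstruction of the standard definitions (my phrasing, not copied from the slide):

```latex
% Standard definitions assumed here; the slide's own formulas were images.
\[
  \widehat{\mathrm{err}}_S(h) \;=\; \frac{1}{n}\sum_{(x_i,y_i)\in S} \mathbf{1}\{h(x_i)\neq y_i\},
  \qquad
  \mathrm{err}_P(h) \;=\; \Pr_{(x,y)\sim P}\big[h(x)\neq y\big].
\]
% Goal: \sup_h |err_P(h) - \widehat{err}_S(h)| -> 0 in probability as n grows.
```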

Slide5

Generalization bounds

How do we upper bound the true error? Use a generalization bound.
Roughly speaking (and w.h.p.): true error ≤ empirical error + (complexity of h)/n
More complex classifier ↔ “easier” to fit to arbitrary data
VC-dimension: the largest point set that can be shattered by h

Slide6

Popular approach for classification

Assume the points are in Euclidean space!
Pros:
  Existence of an inner product
  Efficient algorithms (SVM)
  Good generalization bounds (max margin)
Cons:
  Many natural settings are non-Euclidean
  Euclidean structure is a strong assumption
Recent popular focus: metric space data

Slide7

Metric space

(X,d) is a metric space if:
  X = set of points
  d() = distance function: nonnegative, symmetric, triangle inequality
inner product → norm, norm → metric, but the reverse implications don't hold
(Figure: the distances between Haifa, Tel Aviv and Beer Sheva (95km, 113km, 208km) satisfy the triangle inequality.)
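As a tiny illustration of the triangle inequality using the figure's distances (my pairing of each distance with the city pair it plausibly labels):

```python
# Illustration only (my pairing of the figure's distances with city pairs).
dist_km = {
    ("Haifa", "Tel Aviv"): 95,
    ("Tel Aviv", "Beer Sheva"): 113,
    ("Haifa", "Beer Sheva"): 208,
}

def d(a, b):
    if a == b:
        return 0
    return dist_km.get((a, b), dist_km.get((b, a)))  # symmetry

cities = ["Haifa", "Tel Aviv", "Beer Sheva"]
for x in cities:
    for y in cities:
        for z in cities:
            assert d(x, z) <= d(x, y) + d(y, z)  # triangle inequality holds
```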

Slide8

Classification for metric data?

Advantage: often much more natural
  much weaker assumption
  strings
  images (earthmover distance)

Problem: no vector representation

No notion of dot-product (and no kernel)

What to do?
  Invent a kernel (e.g. embed into Euclidean space)? .. Possible high distortion!
  Use some NN heuristic? .. The NN classifier has infinite VC-dimension!

Slide9

Preliminaries: Lipschitz constant

The Lipschitz constant L of a function f: X → R measures its smoothness.
It is the smallest value L satisfying |f(xi) − f(xj)| ≤ L·d(xi,xj) for all points xi, xj in X; denoted ‖f‖_Lip.
Suppose hypothesis h: S → {-1,1} is consistent with sample S.
The Lipschitz constant of h is determined by the closest pair of differently labeled points,
or equivalently L(h) ≥ 2/d(S+,S−) (see the sketch below).
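A minimal sketch (my own code, not from the paper) of the quantity on this slide: for a hypothesis consistent with the sample, the smallest Lipschitz constant is 2 divided by the distance between the closest oppositely labeled pair.

```python
# Sketch: Lipschitz constant of a +/-1 labeling of a finite sample.
def sample_lipschitz_constant(points, labels, dist):
    """points: sample points, labels: +1/-1, dist: metric d(x, y)."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    margin = min(dist(p, q) for p in pos for q in neg)  # d(S+, S-)
    return 2.0 / margin                                  # L(h) = 2 / d(S+, S-)
```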

Slide10

Preliminaries: Lipschitz extension

Lipschitz extension: a classic problem in analysis.
Given a function f: S → R for S in X, extend f to all of X without increasing the Lipschitz constant.
Example: points on the real line, with f(1) = 1 and f(-1) = -1.
(Figure credit: A. Oberman)

Slide11

Classification for metric data

A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04).
Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
Estimation of h on X: the problem of evaluating h for new points in X reduces to the problem of finding a Lipschitz function consistent with h (the Lipschitz extension problem).
For example, f(x) = min_i [f(xi) + 2·d(x,xi)/d(S+,S−)] over all (xi,yi) in S (sketched below).
Evaluation of h reduces to exact Nearest Neighbor Search.
Strong theoretical motivation for the NNS classification heuristic.
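A minimal sketch of the extension classifier written on this slide; this is my illustration of the construction (with dist standing in for the metric d), not the authors' code:

```python
# f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ]; the predicted label is sgn(f(x)).
def make_lipschitz_classifier(points, labels, dist):
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    margin = min(dist(p, q) for p in pos for q in neg)   # d(S+, S-)

    def f(x):
        return min(y + 2.0 * dist(x, p) / margin for p, y in zip(points, labels))

    def predict(x):
        return 1 if f(x) >= 0 else -1

    return f, predict
```

Note that the minimum is attained either at the nearest positive or the nearest negative sample point, which is why evaluating f reduces to nearest neighbor search.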

Slide12

More on von Luxburg & Bousquet

*modulo some cheating

Slide13

von Luxburg & Bousquet cont’d

Slide14

Two new directions

The framework of [vLB '04] leaves open two further questions:
Constructing h: handling noise
  Bias-variance tradeoff
  Which sample points in S should h ignore?
Evaluating h on X
  In an arbitrary metric space, exact NNS requires Θ(n) time
  Can we do better?

Slide15

Doubling dimension

Definition: Ball B(x,r) = all points within distance r from x.
The doubling constant λ(M) (of a metric M) is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius.
First used by [Assouad '83], algorithmically by [Clarkson '97].
The doubling dimension is ddim(M) = log2 λ(M).
A metric is doubling if its doubling dimension is constant.
Euclidean: ddim(R^d) = O(d).
Packing property of doubling spaces:
A set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points.
(Figure: a ball covered by half-radius balls; here λ = 7.)
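A rough sketch (my illustration, and only a brute-force upper bound) of estimating the doubling constant of a finite metric by greedily covering each ball with half-radius balls centered at sample points:

```python
def doubling_constant_upper_bound(points, dist):
    """Greedy covering gives an upper bound on the doubling constant lambda;
    ddim <= log2(lambda)."""
    radii = sorted({dist(p, q) for p in points for q in points if p is not q})
    lam = 1
    for x in points:
        for r in radii:
            ball = [p for p in points if dist(x, p) <= r]
            uncovered = set(range(len(ball)))
            centers = 0
            while uncovered:
                c = ball[next(iter(uncovered))]        # pick an uncovered point as a center
                uncovered -= {i for i in uncovered if dist(c, ball[i]) <= r / 2}
                centers += 1
            lam = max(lam, centers)
    return lam
```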

Slide16

Applications of doubling dimension

Major application to databases:
Recall that exact NNS requires Θ(n) time in an arbitrary metric space.
There exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n.
Database/network structures and tasks analyzed via the doubling dimension:
  Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
  Image recognition (Vision) [KG --]
  Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
  Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
  Clustering [Tal '04, ABS '08, FM '10]
  Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
  Travelling Salesperson [Tal '04]
  Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
  Machine learning [BLL '09, KKL '10, KKL --]
Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].
Message: this is an active line of research…

Slide17

Fat-shattering dimension

Alon, Ben-David, Cesa-Bianchi, Haussler ‘97

Analogue of VC-dimension for continuous functions

Slide18

Fat-shattering and generalization

Slide19

Our dual use of doubling dimension

Interestingly, the doubling dimension contributes in two different areas:
Statistical: function complexity
  We bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.
Computational: efficient approximate NNS

Slide20

Statistical contribution

We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
vLB provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis (see the note below):
  L-Lipschitz functions shatter a set → inter-point distance is at least 2/L
  Packing property → the set has at most (diam·L)^O(ddim) points
This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity.
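The two bullets above can be written out as follows (my paraphrase of the argument, taking margin 1 around zero):

```latex
% If an L-Lipschitz f realizes every labeling of x_1,...,x_k with f >= +1 on the
% positives and f <= -1 on the negatives, then oppositely labeled points satisfy
\[
  2 \;\le\; |f(x_i)-f(x_j)| \;\le\; L\, d(x_i,x_j)
  \;\;\Rightarrow\;\;
  d(x_i,x_j) \;\ge\; \frac{2}{L},
\]
% and the packing property of doubling spaces then bounds the shattered set:
\[
  k \;\le\; \left(\frac{\mathrm{diam}}{2/L}\right)^{O(\mathrm{ddim})}
        \;=\; (\mathrm{diam}\cdot L)^{O(\mathrm{ddim})}.
\]
```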

Slide21

Statistical contribution

[BST '99]:
For any f that classifies a sample of size n correctly, we have with probability at least 1−δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n)·(d log(34en/d)·log₂(578n) + log(4/δ)).
Likewise, if f is correct on all but k examples, we have with probability at least 1−δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d ln(34en/d)·log₂(578n) + ln(4/δ))]^(1/2).
In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam·L)^ddim + 1.
Done with the statistical contribution… On to the computational contribution.

Slide22

Computational contribution

Evaluation of h for new points in X:
The Lipschitz extension function f(x) = min_i [yi + 2·d(x, xi)/d(S+,S−)] requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search, in time 2^O(ddim)·log n + ε^(−O(ddim)) [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
  g(x) = (1+ε)·f(x) + ε
  e(x) = (1+ε)·f(x) − ε
Note that g(x) ≥ f(x) ≥ e(x).
g(x) and e(x) have Lipschitz constant (1+ε)·L, so they, and hence the approximate function, generalize well.
(Figure: g(x), f(x) and e(x); the gap between g(x) and e(x) is 2ε.)
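A minimal sketch (my illustration) of the evaluation step: f(x) depends only on the distances from x to S+ and to S−, so plugging in a (1+ε)-approximate nearest-neighbor distance oracle (the nn_dist argument below is a hypothetical stand-in for such a structure) gives a value sandwiched as on the slide.

```python
def evaluate_extension(x, pos, neg, dist, nn_dist=None):
    """pos/neg: positively/negatively labeled sample points; nn_dist(q, S) is an
    exact or (1+eps)-approximate distance from q to its nearest neighbor in S."""
    if nn_dist is None:
        nn_dist = lambda q, S: min(dist(q, p) for p in S)      # exact fallback
    margin = min(dist(p, q) for p in pos for q in neg)          # d(S+, S-)
    L = 2.0 / margin
    # f(x) = min( +1 + L*d(x, S+), -1 + L*d(x, S-) ); the sign is the label
    value = min(1.0 + L * nn_dist(x, pos), -1.0 + L * nn_dist(x, neg))
    return 1 if value >= 0 else -1
```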

Slide23

Final problem: bias-variance tradeoff

Which sample points in S should h ignore?
If f is correct on all but k examples, we have with probability at least 1−δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n)·(d ln(34en/d)·log₂(578n) + ln(4/δ))]^(1/2),
where d ≤ (diam·L)^ddim + 1.

Slide24

Structural Risk Minimization

Algorithm:
Fix a target Lipschitz constant L (O(n²) possibilities)
Locate all pairs of points from S+ and S− whose distance is less than 2/L
At least one point of each such pair has to be taken as an error
Goal: remove as few points as possible

Slide25

Structural Risk Minimization

Algorithm:
Fix a target Lipschitz constant L (O(n²) possibilities)
Locate all pairs of points from S+ and S− whose distance is less than 2/L
At least one point of each such pair has to be taken as an error
Goal: remove as few points as possible
This is minimum vertex cover:
  NP-complete in general
  Admits a 2-approximation in O(E) time

Slide26

Structural Risk Minimization

Algorithm:
Fix a target Lipschitz constant L (O(n²) possibilities)
Locate all pairs of points from S+ and S− whose distance is less than 2/L
At least one point of each such pair has to be taken as an error
Goal: remove as few points as possible (see the sketch below)
This is minimum vertex cover:
  NP-complete in general
  Admits a 2-approximation in O(E) time
Here it is minimum vertex cover on a bipartite graph:
  Equivalent to maximum matching (Kőnig's theorem)
  Admits an exact solution in O(n^2.376) randomized time [MS '04]
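A rough sketch (mine, not the authors' code) of this step: build the bipartite conflict graph for a target constant L and remove a vertex cover. The greedy rule below is the classic 2-approximation; an exact bipartite cover could instead be obtained via maximum matching.

```python
def conflict_edges(pos, neg, dist, L):
    """Pairs (p in S+, q in S-) that violate the target Lipschitz constant L."""
    return [(p, q) for p in pos for q in neg if dist(p, q) < 2.0 / L]

def greedy_vertex_cover(edges):
    """Classic 2-approximation: take both endpoints of any uncovered edge.
    Assumes points are hashable (e.g. tuples)."""
    cover = set()
    for p, q in edges:
        if p not in cover and q not in cover:
            cover.update((p, q))
    return cover  # sample points to discard; |cover| is the error count k
```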

Slide27

Efficient SRM

Algorithm:
For each of O(n²) values of L:
  Run the matching algorithm to find the minimum error
  Evaluate the generalization bound for this value of L
Total: O(n^4.376) randomized time

Better algorithm (see the sketch below):
Binary search over the O(n²) values of L; for each value:
  Run the greedy 2-approximation
  Approximate minimum error, in O(n² log n) time overall
  Evaluate the approximate generalization bound for this value of L
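A rough sketch (my illustration) of the outer loop: scan candidate Lipschitz constants, count the errors forced by each (e.g. with the greedy cover sketched after the previous slide, passed in here as count_errors), and keep the value minimizing the [BST '99] bound; the slide's binary search would replace the plain loop.

```python
import math

def risk_bound(k, n, d, delta=0.05):
    """The bound from the [BST '99] slide (assumes d is small relative to n)."""
    return k / n + math.sqrt((2.0 / n) * (d * math.log(34 * math.e * n / d)
                                          * math.log2(578 * n) + math.log(4 / delta)))

def select_lipschitz_constant(candidate_Ls, count_errors, n, diam, ddim, delta=0.05):
    """count_errors(L) -> number of sample points discarded for target constant L."""
    best = None
    for L in sorted(candidate_Ls):              # O(n^2) candidates on the slide
        k = count_errors(L)
        d = (diam * L) ** ddim + 1              # fat-shattering dimension bound
        bound = risk_bound(k, n, d, delta)
        if best is None or bound < best[0]:
            best = (bound, L, k)
    return best                                  # (bound, chosen L, errors k)
```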

Slide28

Conclusion

Results:
  Generalization bounds for Lipschitz classifiers in doubling spaces
  Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
  Efficient Structural Risk Minimization
Continuing research: continuous labels
  Risk bound via the doubling dimension
  Classifier h determined via an LP
  Faster LP: low-hop low-stretch spanners [GR '08a, GR '08b] → fewer constraints, and each variable appears in a bounded number of constraints

Slide29

Application: earthmover distance

(Figure: earthmover distance between two point sets S and T.)
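For completeness, a tiny brute-force sketch (my illustration; practical EMD solvers use min-cost flow) of the earthmover distance between two equal-size point sets S and T:

```python
from itertools import permutations

def earthmover(S, T, dist):
    """Minimum-cost perfect matching between equal-size point sets (brute force)."""
    return min(sum(dist(s, t) for s, t in zip(S, perm)) for perm in permutations(T))
```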