
Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning

University of Toronto Machine Learning Seminar, Feb 21, 2013

Kevin Swersky, Ilya Sutskever, Laurent Charlin, Richard Zemel, Danny Tarlow

Distance Metric Learning

Distance metrics are everywhere. But they're arbitrary! Dimensions are scaled weirdly, and even if they're normalized, it's not clear that Euclidean distance means much.

So learning sounds nice, but what you learn should depend on the task. A really common task is kNN. Let's look at how to learn distance metrics for that.

Popular Approaches for Distance Metric Learning

Large margin nearest neighbors (LMNN) [Weinberger et al., NIPS 2006]: "target neighbors" must be chosen ahead of time.

Some satisfying properties: based on local structure (doesn't have to pull all points into one region).

Some unsatisfying properties: the initial choice of target neighbors is difficult. The choice of objective function has reasonable forces (pushes and pulls), but beyond that it is pretty heuristic. No probabilistic interpretation.
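For reference (not from the slides), the LMNN objective of Weinberger et al. is roughly a pull term on target neighbors plus a push (hinge) term against differently labeled impostors:

\[
\min_{M \succeq 0} \;\; \sum_{j \rightsquigarrow i} d_M(x_i, x_j) \;+\; \mu \sum_{j \rightsquigarrow i} \sum_{l : y_l \neq y_i} \big[\, 1 + d_M(x_i, x_j) - d_M(x_i, x_l) \,\big]_+ ,
\]

where \(d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)\) and \(j \rightsquigarrow i\) means j was chosen ahead of time as a target neighbor of i.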

Probabilistic Formulations for Distance Metric Learning

Our goal: give a probabilistic interpretation of kNN and properly learn a model based upon this interpretation. Related work that kind of does this: Neighborhood Components Analysis (NCA). Our approach is a direct generalization.

Generative Model

Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]

Given a query point i, we select neighbors randomly according to d.

Question: what is the probability that a randomly selected neighbor will belong to the correct (blue) class?


Another way to write this (y are the class labels): the probability that the selected neighbor's label matches the query's.

Objective: maximize the log-likelihood of stochastically selecting neighbors of the same class.
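The displayed equations did not survive extraction; the standard NCA quantities from Goldberger et al. (2004), with a learned linear map A (the talk's notation may differ slightly), are:

\[
p_{ij} \;=\; \frac{\exp(-\lVert A x_i - A x_j \rVert^2)}{\sum_{l \neq i} \exp(-\lVert A x_i - A x_l \rVert^2)}, \qquad p_{ii} = 0,
\]
\[
p(\text{same class} \mid i) \;=\; \sum_{j : y_j = y_i} p_{ij}, \qquad \text{objective:} \;\; \max_A \sum_i \log \sum_{j : y_j = y_i} p_{ij}.
\]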

After Learning

We might hope to learn a projection that looks like this.

Problem with 1-NCA

NCA is happy if points pair up and ignore global structure. This is not ideal if we want k > 1.

k-Neighborhood Component Analysis (k-NCA)

NCA:
k-NCA:
(S is all sets of k neighbors of point i.)
Setting k = 1 recovers NCA.
[·] is the indicator function.
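The displayed formulas are missing from the transcript. A reconstruction consistent with the description above, with S ranging over size-k neighbor sets of point i, Maj denoting the majority label, and \(w_{ij} = \exp(-\lVert A x_i - A x_j \rVert^2)\), is:

\[
\text{NCA:} \quad p(\text{correct} \mid i) \;=\; \sum_{j : y_j = y_i} \frac{w_{ij}}{\sum_{l \neq i} w_{il}},
\qquad
\text{k-NCA:} \quad p(\text{correct} \mid i) \;=\; \frac{\sum_{S : |S| = k} [\,\mathrm{Maj}(y_S) = y_i\,] \prod_{j \in S} w_{ij}}{\sum_{S : |S| = k} \prod_{j \in S} w_{ij}}.
\]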

k-Neighborhood Component Analysis (k-NCA)

Numerator of the distribution: stochastically choose k neighbors such that the majority is blue.

Denominator of the distribution: stochastically choose any subset of k neighbors.

k-NCA Intuition

k-NCA puts more pressure on points to form bigger clusters.

k-NCA Objective

Given: inputs X, labels y, neighborhood size k.
Learning: find A that (locally) maximizes this objective.
Technical challenge: efficiently compute the objective and its gradient.
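As a concrete (if naive) illustration of what is being maximized, here is a brute-force sketch in Python that enumerates all size-k neighbor sets. Function and variable names are mine, and the enumeration is exponential in N, so this is only a sanity check, not the authors' algorithm (their efficient computation follows on the next slides).

import itertools
import numpy as np

def knca_log_likelihood(X, y, A, k):
    # X: (N, D) inputs, y: (N,) integer label array, A: (d, D) linear map.
    Z = X @ A.T                          # project into the learned space
    N = len(X)
    total = 0.0
    for i in range(N):
        others = [j for j in range(N) if j != i]
        # unnormalized weight of selecting j as a neighbor of i
        w = {j: np.exp(-np.sum((Z[i] - Z[j]) ** 2)) for j in others}
        num = den = 0.0
        for S in itertools.combinations(others, k):
            p = np.prod([w[j] for j in S])
            den += p
            ys = y[list(S)]
            same = np.sum(ys == y[i])
            # strict majority: more neighbors of i's class than of any other class
            if same > 0 and all(np.sum(ys == c) < same for c in set(ys) - {y[i]}):
                num += p
        total += np.log(num / den)       # assumes num > 0 for every query point
    return total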

Factor Graph Formulation

Focus on a single i.

Factor Graph Formulation

Step 1: split the Majority function into cases (i.e., use gates):
Switch(k' = |y_S = y_i|)   // number of chosen neighbors with label y_i
Maj(y_S) = 1 iff, for all c != y_i, |y_S = c| < k'

Step 2: constrain the total number of neighbors chosen to be k.

Assume y_i = "blue". Per query point i, the factor graph contains:
a binary variable for each candidate j: is j chosen as a neighbor?
a count of the total number of "blue" neighbors chosen, constrained so that exactly k' "blue" neighbors are chosen;
a constraint that fewer than k' "pink" neighbors are chosen;
a count of the total number of neighbors chosen, constrained so that exactly k total neighbors are chosen.

At this point, everything is just a matter of inference in these factor graphs:
the partition functions Z give the objective;
the marginals give the gradients.

Sum-Product Inference

Messages carry counts: the number of neighbors chosen from the first two "blue" points, from the first three "blue" points, from the "blue" or "pink" classes, and the total number of neighbors chosen.

Lower-level messages: O(k) time each.
Upper-level messages: O(k^2) time each.
Total runtime: O(Nk + Ck^2).*

* Slightly better is possible asymptotically; see Tarlow et al., UAI 2012.
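A minimal sketch of this computation (my own function names and interface, not the authors' code): the per-class counts reduce to elementary symmetric polynomials of the neighbor weights, which the dynamic program below combines into the numerator and denominator for one query point. p(correct | i) is then numerator / denominator; the gradients would come from the corresponding marginals.

import numpy as np

def elem_sym_poly(weights, kmax):
    # e[m] = sum over all size-m subsets of 'weights' of the product of their
    # entries, for m = 0..kmax (standard O(n * kmax) dynamic program).
    e = np.zeros(kmax + 1)
    e[0] = 1.0
    for w in weights:
        for m in range(kmax, 0, -1):     # reverse order: each weight used once
            e[m] += w * e[m - 1]
    return e

def knca_partition_functions(weights_by_class, query_class, k):
    # weights_by_class: dict mapping class label -> array of neighbor weights
    #   exp(-||A x_i - A x_j||^2) for the candidates j of that class
    #   (the query's own class must appear as a key).
    # Returns (numerator, denominator):
    #   denominator = sum over all size-k subsets of the product of weights,
    #   numerator   = the same sum restricted to subsets whose strict-majority
    #                 class is query_class.
    E = {c: elem_sym_poly(w, k) for c, w in weights_by_class.items()}

    # Denominator: combine the per-class polynomials and read off order k.
    den_poly = np.array([1.0])
    for e in E.values():
        den_poly = np.convolve(den_poly, e)[: k + 1]
    denominator = den_poly[k]

    # Numerator: condition on k' = number of neighbors of the query's class.
    numerator = 0.0
    for kp in range(1, k + 1):
        # The remaining k - kp neighbors come from the other classes, each class
        # contributing strictly fewer than kp of them (truncate at order kp - 1).
        other_poly = np.array([1.0])
        for c, e in E.items():
            if c == query_class:
                continue
            other_poly = np.convolve(other_poly, e[:kp])[: k + 1]
        if k - kp < len(other_poly):
            numerator += E[query_class][kp] * other_poly[k - kp]
    return numerator, denominator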

Alternative Version

Instead of the Majority(y_S) = y_i function, use All(y_S) = y_i.
Computation gets a little easier (just one k' needed).
Loses the kNN interpretation.
Exerts more pressure for homogeneity; tries to create a larger margin between classes.
Usually works a little better.
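In the notation sketched earlier (my reconstruction, not the slide's own formula), the "All" variant replaces the majority indicator with the requirement that every selected neighbor share the query's label:

\[
p_{\text{all}}(\text{correct} \mid i) \;=\; \frac{\sum_{S \subseteq \{j : y_j = y_i\},\, |S| = k} \;\prod_{j \in S} w_{ij}}{\sum_{S : |S| = k} \;\prod_{j \in S} w_{ij}}.
\]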

Unsupervised Learning with t-SNE [van der Maaten and Hinton, 2008]

Visualize the structure of data in a 2D embedding. Each input point x maps to an embedding point e. SNE tries to preserve relative pairwise distances as faithfully as possible.

[Turian, http://metaoptimize.com/projects/wordreprs/]
[van der Maaten & Hinton, JMLR 2008]

Problem with t-SNE (also based on k = 1)

[van der Maaten & Hinton, JMLR 2008]

Unsupervised Learning with t-SNE [van der Maaten and Hinton, 2008]

Distances:
Data distribution:
Embedding distribution:
Objective (minimize wrt e):
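The formulas were lost in extraction. One standard way to write them, in the conditional per-query form with per-point bandwidths omitted (the published t-SNE symmetrizes P, so treat this as a sketch), with \(d_{ij} = \lVert x_i - x_j \rVert^2\) and \(\hat d_{ij} = \lVert e_i - e_j \rVert^2\):

\[
p_{j \mid i} = \frac{\exp(-d_{ij})}{\sum_{l \neq i} \exp(-d_{il})}, \qquad
q_{j \mid i} = \frac{(1 + \hat d_{ij})^{-1}}{\sum_{l \neq i} (1 + \hat d_{il})^{-1}}, \qquad
\min_{e} \sum_i \mathrm{KL}\big(P_i \,\|\, Q_i\big) = \sum_i \sum_{j \neq i} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}.
\]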

kt-SNE

Data distribution:
Embedding distribution:
Objective: minimize wrt e.
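Again the displayed formulas are missing; by analogy with k-NCA (a reconstruction, not the slide's own notation), kt-SNE places distributions over size-k neighbor sets in both spaces and matches them:

\[
P_i(S) \propto \prod_{j \in S} \exp(-\lVert x_i - x_j \rVert^2), \qquad
Q_i(S) \propto \prod_{j \in S} (1 + \lVert e_i - e_j \rVert^2)^{-1}, \qquad |S| = k,
\]
\[
\min_{e} \; \sum_i \mathrm{KL}\big(P_i \,\|\, Q_i\big).
\]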

kt-SNE

kt-SNE can potentially lead to better higher-order structure preservation (exponentially many more distance constraints).
Gives another "dial to turn" in order to obtain better visualizations.

Experiments

WINE embeddings: "All" method vs. "Majority" method.

IRIS (worst kNCA relative performance, full D): training and testing kNN accuracy.

ION (best kNCA relative performance, full D): training and testing accuracy.

USPS kNN classification (0% noise, 2D): training and testing accuracy.
USPS kNN classification (25% noise, 2D): training and testing accuracy.
USPS kNN classification (50% noise, 2D): training and testing accuracy.

NCA Objective Analysis on Noisy USPS (0%, 25%, 50% noise)

Y-axis: objective of 1-NCA, evaluated at the parameters learned from k-NCA with varying k and neighbor method.

t-SNE vs kt-SNE

t-SNE and 5t-SNE embeddings, compared by kNN accuracy.

Discussion

Local is good, but 1-NCA is too local.
The objective is not quite expected kNN accuracy, but this doesn't seem to change the results.
Expected Majority computation may be useful elsewhere?

Thank You!