Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning
University of Toronto Machine Learning Seminar, Feb 21, 2013
Kevin Swersky, Ilya Sutskever, Laurent Charlin, Richard Zemel, Danny Tarlow
Distance Metric Learning
Distance metrics are everywhere. But they're arbitrary! Dimensions are scaled in arbitrary ways, and even if they're normalized, it's not clear that Euclidean distance means much.
So learning a metric sounds nice, but what you learn should depend on the task. A really common task is kNN classification. Let's look at how to learn distance metrics for that.
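To fix ideas, here is a minimal sketch of kNN classification under a learned linear metric, i.e., d(x, x') = ||Ax − Ax'||²; the function and names are ours, for illustration only, not something from the talk.

```python
import numpy as np

def knn_predict(A, X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points,
    measuring distance in the linearly transformed space: ||A x - A x'||^2."""
    Z_train = X_train @ A.T                 # (N, d_out) projected training points
    z_query = x_query @ A.T                 # (d_out,) projected query
    d = np.sum((Z_train - z_query) ** 2, axis=1)
    nn = np.argsort(d)[:k]                  # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]        # majority vote
```

Metric learning then amounts to choosing A so that this classifier works well.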
Popular Approaches for Distance Metric Learning
Large margin nearest neighbors (LMNN) [Weinberger et al., NIPS 2006]: "target neighbors" must be chosen ahead of time.
Some satisfying properties: it is based on local structure (it doesn't have to pull all points into one region).
Some unsatisfying properties: the initial choice of target neighbors is difficult; the objective function has reasonable forces (pushes and pulls), but beyond that it is pretty heuristic; and there is no probabilistic interpretation.
Probabilistic Formulations for Distance Metric Learning
Our goal: give a probabilistic interpretation of kNN and properly learn a model based on this interpretation. Related work that partly does this: Neighborhood Components Analysis (NCA). Our approach is a direct generalization.
Generative Model
Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]
Given a query point i, we select neighbors randomly according to d. Question: what is the probability that a randomly selected neighbor will belong to the correct (blue) class?
Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]
Another way to write this: p(same class | i) = Σ_{j : y_j = y_i} p_{ij}, where p_{ij} ∝ exp(−d(x_i, x_j)) and y are the class labels.
Neighborhood Component Analysis (NCA) [Goldberger et al., 2004]
Objective: maximize the log-likelihood of stochastically selecting neighbors of the same class.
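A minimal NumPy sketch of this objective, assuming the standard NCA parameterization d(x_i, x_j) = ||A x_i − A x_j||² with a learned linear map A (the function and variable names are ours):

```python
import numpy as np

def nca_log_likelihood(A, X, y):
    """Sum over queries i of log P(the stochastically selected neighbor of i has label y_i),
    with selection probabilities softmax_j(-||A x_i - A x_j||^2)."""
    Z = X @ A.T
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                                 # a point never selects itself
    logits = -sq - (-sq).max(axis=1, keepdims=True)              # stabilize the softmax
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                            # P[i, j] = prob. that i selects j
    same = (y[:, None] == y[None, :])
    return np.sum(np.log((P * same).sum(axis=1)))
```

Learning ascends the gradient of this quantity with respect to A.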
After Learning
We might hope to learn a projection that looks like this.
Problem with 1-NCA
NCA is happy if points pair up and ignore global structure. This is not ideal if we want k > 1.
k-Neighborhood Component Analysis (k-NCA)
NCA stochastically selects a single neighbor j of the query i with probability proportional to exp(−d(x_i, x_j)).
k-NCA instead selects a whole set S of k neighbors, with probability proportional to ∏_{j∈S} exp(−d(x_i, x_j)), where S ranges over all sets of k neighbors of point i; [·] denotes the indicator function.
Setting k = 1 recovers NCA.
k-Neighborhood Component Analysis (k-NCA)
Computing the numerator of the distribution: stochastically choose k neighbors such that the majority is blue.
Computing the denominator of the distribution: stochastically choose any set of k neighbors.
k-NCA Intuition
k-NCA puts more pressure on points to form bigger clusters.
k-NCA Objective
Given: inputs X, labels y, neighborhood size k.
Learning: find A that (locally) maximizes the sum over queries i of log P(Majority(y_S) = y_i).
Technical challenge: efficiently compute the constrained partition functions (numerator and denominator) and their gradients.
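Before the efficient inference, it helps to spell the objective out naively. The sketch below (our names; it assumes the weight of a set is the product of the individual exp(−d(x_i, x_j)) terms, and it enumerates every size-k subset, so it is only a sanity check for tiny problems) computes log P(Majority(y_S) = y_i) for a single query i:

```python
import itertools
from collections import Counter
import numpy as np

def knca_log_prob_bruteforce(A, X, y, i, k):
    """log P(the majority label of the stochastically chosen k-neighbor set of i equals y_i),
    computed by explicit enumeration of all size-k neighbor sets."""
    Z = X @ A.T
    d = np.sum((Z - Z[i]) ** 2, axis=1)                  # squared distances from the query
    candidates = [j for j in range(len(X)) if j != i]
    num = den = 0.0
    for S in itertools.combinations(candidates, k):
        w = np.exp(-d[list(S)].sum())                    # weight of this neighbor set
        den += w
        counts = Counter(y[j] for j in S)
        k_i = counts.get(y[i], 0)                        # neighbors sharing the query's label
        if all(c == y[i] or n < k_i for c, n in counts.items()):
            num += w                                     # query's class strictly beats every other class
    return np.log(num) - np.log(den)
```

The full objective sums this over all queries i; the factor-graph machinery described next computes the same quantities in polynomial time.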
Factor Graph Formulation
Focus on a single query point i.
Factor Graph Formulation
Step 1: split the Majority function into cases (i.e., use gates):
Switch(k' = |{j ∈ S : y_j = y_i}|)   // number of chosen neighbors with label y_i
Maj(y_S) = 1 iff for all c ≠ y_i, |{j ∈ S : y_j = c}| < k'
Step 2: constrain the total number of neighbors chosen to be k.
Assume y_i = "blue". For each candidate point j there is a binary variable: is j chosen as a neighbor? The factors (inside the partition functions Z(·)) count the total number of "blue" neighbors chosen, require exactly k' "blue" neighbors to be chosen, require fewer than k' "pink" neighbors to be chosen, count the total number of neighbors chosen, and require exactly k neighbors to be chosen in total.
At this point, everything is just a matter of inference in these factor graphs. Partition functions give the objective; marginals give the gradients.
Sum-Product Inference
The messages are distributions over counts: the number of neighbors chosen from the first two "blue" points, from the first three "blue" points, and so on; the number chosen from the "blue" or "pink" classes; and the total number of neighbors chosen.
Lower-level messages: O(k) time each. Upper-level messages: O(k^2) time each. Total runtime: O(Nk + Ck^2).*
* Slightly better is possible asymptotically; see Tarlow et al., UAI 2012.
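The workhorse of this inference is a counting dynamic program over how many neighbors have been chosen so far. Below is a hedged sketch for the two-class case (the query's class versus everything else), where "majority" reduces to "strictly more than half"; the multi-class case on the slides additionally enforces a per-class count below k' for every other class. The names and the weight convention w_j = exp(−d(x_i, x_j)) are ours.

```python
import numpy as np

def subset_weight_table(w, kmax):
    """T[m] = total weight of choosing exactly m items from w, where each subset is weighted by
    the product of its members' weights (elementary symmetric polynomials), in O(len(w) * kmax)."""
    T = np.zeros(kmax + 1)
    T[0] = 1.0
    for wj in w:
        T[1:] = T[1:] + wj * T[:-1]      # each item is either skipped or included
    return T

def knca_majority_prob_twoclass(w_same, w_other, k):
    """P(the query's class holds a strict majority among the k chosen neighbors),
    with two groups of candidates: same-class points and all other points."""
    S = subset_weight_table(np.asarray(w_same), k)     # S[a]: weight of choosing a same-class points
    O = subset_weight_table(np.asarray(w_other), k)    # O[b]: weight of choosing b other points
    den = sum(S[a] * O[k - a] for a in range(k + 1))               # exactly k chosen in total
    num = sum(S[a] * O[k - a] for a in range(k + 1) if 2 * a > k)  # same-class count beats the rest
    return num / den
```

Differentiating these tables (equivalently, computing the marginals of the factor graph) gives the gradients, matching the O(Nk + Ck^2) cost quoted above.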
Alternative Version
Instead of the Majority(y_S) = y_i function, use All(y_S) = y_i.
Computation gets a little easier (just one k' is needed), but this loses the kNN interpretation.
It exerts more pressure for homogeneity, trying to create a larger margin between classes, and usually works a little better.
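In the "All" variant the numerator only involves subsets drawn entirely from the query's class, so it falls out of the same counting tables. A small sketch, reusing subset_weight_table from the previous snippet (again for the two-group case, with our naming):

```python
import numpy as np  # requires subset_weight_table from the previous sketch

def knca_all_prob_twoclass(w_same, w_other, k):
    """P(all k chosen neighbors share the query's class): the "All" alternative to Majority."""
    S = subset_weight_table(np.asarray(w_same), k)
    O = subset_weight_table(np.asarray(w_other), k)
    den = sum(S[a] * O[k - a] for a in range(k + 1))   # any k neighbors
    return S[k] / den                                  # numerator: every chosen neighbor is same-class
```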
Unsupervised Learning with t-SNE [van der Maaten and Hinton, 2008]
Visualize the structure of data in a 2D embedding. Each input point x maps to an embedding point e. SNE tries to preserve relative pairwise distances as faithfully as possible.
[Turian, http://metaoptimize.com/projects/wordreprs/]
[van der Maaten & Hinton, JMLR 2008]
Problem with t-SNE (also based on k=1)
[van der Maaten & Hinton, JMLR 2008]
Unsupervised Learning with t-SNE [van der Maaten and Hinton, 2008]
Distances: pairwise distances between input points x and between embedding points e.
Data distribution: pairwise probabilities p_ij defined by a Gaussian kernel on the input distances.
Embedding distribution: pairwise probabilities q_ij defined by a Student-t kernel on the embedding distances.
Objective (minimize wrt e): KL(P || Q).
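A minimal sketch of this objective, assuming the standard symmetric t-SNE setup (the construction of P from the inputs, with per-point bandwidths chosen to match a target perplexity, is omitted):

```python
import numpy as np

def tsne_objective(P, E):
    """KL(P || Q): P is the fixed pairwise data distribution (N x N, symmetric, zero diagonal,
    summing to 1); E is the N x 2 embedding; Q uses a Student-t kernel with one degree of freedom."""
    sq = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)   # pairwise embedding distances
    Q = 1.0 / (1.0 + sq)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                                                 # normalize over all pairs
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

t-SNE minimizes this with respect to E by gradient descent.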
kt-SNE
Data distribution and embedding distribution: as in t-SNE, but defined over sets of k neighbors (as in k-NCA).
Objective: minimize with respect to e.
kt-SNE
kt-SNE can potentially lead to better preservation of higher-order structure (exponentially many more distance constraints). It gives another "dial to turn" in order to obtain better visualizations.
Experiments
WINE embeddings
Embeddings learned with the "All" method and the "Majority" method.
IRIS: worst kNCA relative performance (full D)
kNN training and testing accuracy.
ION: best kNCA relative performance (full D)
kNN training and testing accuracy.
USPS kNN classification (0% noise, 2D): training and testing accuracy.
USPS kNN classification (25% noise, 2D): training and testing accuracy.
USPS kNN classification (50% noise, 2D): training and testing accuracy.
NCA Objective Analysis on Noisy USPS
Panels: 0%, 25%, and 50% noise. Y-axis: the 1-NCA objective, evaluated at the parameters learned by k-NCA with varying k and neighbor method.
t-SNE vs. kt-SNE
Embeddings produced by t-SNE and by 5t-SNE, compared on kNN accuracy.
Discussion
Local is good, but 1-NCA is too local.
The objective is not quite expected kNN accuracy, but this doesn't seem to change the results.
Might the expected-majority computation be useful elsewhere?
Thank You!