Presentation Transcript


Clustering and Dimensionality Reduction

Brendan and Yifang

April 21, 2015

Pre-knowledge

We define a set A of candidate summaries and find the element of A that minimizes the error (risk). For a set of centers C, the risk is $R(C) = \mathbb{E}\,\lVert X - \Pi_C[X]\rVert^2$, where $\Pi_C[X]$ is the point in C closest to X.

We can think of the empirical risk $\hat{R}(C) = \frac{1}{n}\sum_{i=1}^{n}\lVert X_i - \Pi_C[X_i]\rVert^2$ as a sample version of $R(C)$.

Clustering methods

K-means clustering
Hierarchical clustering (agglomerative clustering, divisive clustering)
Level set clustering
Modal clustering

K-partition clustering

In a K-partition problem, our goal is to find k points $C = \{c_1, \dots, c_k\}$.

We define the risk $R(C) = \mathbb{E}\,\lVert X - \Pi_C[X]\rVert^2$.

So the elements of C are the cluster centers. They partition the space into k sets $T_1, \dots, T_k$, where $T_j = \{x : \lVert x - c_j\rVert \le \lVert x - c_i\rVert \text{ for all } i \ne j\}$.

K-partition clustering, cont'd

Given the data set $X_1, \dots, X_n$, our goal is to find the minimizer $\hat{C}$ of the empirical risk, where $\hat{R}(C) = \frac{1}{n}\sum_{i=1}^{n}\min_j \lVert X_i - c_j\rVert^2$.

K-means clustering

1. Choose k centers $c_1, \dots, c_k$ at random from the data.
2. Form the clusters $C_1, \dots, C_k$, where $C_j = \{i : c_j \text{ is the closest center to } X_i\}$.
3. Let $n_j$ denote the number of points in cluster $C_j$ and update each center to the mean of its cluster, $c_j \leftarrow \frac{1}{n_j}\sum_{i \in C_j} X_i$.
4. Repeat Steps 2 and 3 until convergence.
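
A minimal NumPy sketch of the Lloyd iteration described above; the function name, random initialization, and convergence test are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """K-means on an (n, d) array X with k centers (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # Step 1: random centers from the data
    for _ in range(n_iter):
        # Step 2: assign every point to its closest center
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                    # Step 4: stop once the centers stop moving
            break
        centers = new_centers
    return centers, labels

# toy example: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
```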

Circular data vs. spherical data

Question: Why is K-means clustering good for spherical data? (Grace)

Question: What is the relationship between K-means and Naïve Bayes?

Answer: They have the following in common:
1. Both estimate a probability density function.
2. Naïve Bayes assigns the closest category/label to the target point; K-means assigns the closest centroid to the target point.

They differ in these aspects:
1. Naïve Bayes is a supervised algorithm; K-means is an unsupervised method.
2. K-means is an optimization task carried out as an iterative process; Naïve Bayes is not.
3. K-means is like multiple runs of Naïve Bayes in which the labels are adaptively adjusted on each run.

Question: Why does K-means not work well for Figure 35.6? Why does spectral clustering help? (Grace)

Answer: Spectral clustering maps data points in $\mathbb{R}^d$ to data points in $\mathbb{R}^k$. Circle-shaped groups of points in $\mathbb{R}^d$ become spherically shaped groups in $\mathbb{R}^k$, which K-means can then separate. But spectral clustering involves a matrix decomposition, which is rather time-consuming.

Agglomerative clustering

Requires a pairwise distance between clusters. There are three commonly employed linkage distances:
Single linkage
Complete linkage (max linkage)
Average linkage

1. Start with each point in a separate cluster.
2. Merge the two closest clusters.
3. Go back to Step 2.
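
A short sketch of agglomerative clustering with the three linkages listed above, using SciPy's hierarchical-clustering routines; the toy data and the choice of three clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(30, 2)                               # toy data

for method in ("single", "complete", "average"):          # the three linkages above
    Z = linkage(X, method=method)                         # repeatedly merge the two closest clusters
    labels = fcluster(Z, t=3, criterion="maxclust")       # cut the dendrogram into 3 clusters
    print(method, labels)
```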

An Example, Single Linkage

Question: Is Figure 35.6 meant to illustrate when one type of linkage is better than another? (Brad)

An example, Complete Linkage

Divisive clustering

Starts with one large cluster and then recursively divides the larger clusters into smaller clusters, using any feasible clustering algorithm.

A divisive algorithm example:
1. Build a minimum spanning tree.
2. Create a new clustering by removing the link corresponding to the largest distance.
3. Go back to Step 2.
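
A sketch of the MST-based divisive example above, built on SciPy's minimum-spanning-tree and connected-components routines; the helper name and the requested number of clusters are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_divisive(X, n_clusters):
    D = squareform(pdist(X))                          # pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()          # Step 1: build the minimum spanning tree
    for _ in range(n_clusters - 1):                   # Step 2: remove the largest remaining link
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    _, labels = connected_components(mst, directed=False)   # remaining components = clusters
    return labels

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 8])
print(mst_divisive(X, n_clusters=2))
```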

Level set clustering

For a fixed non-negative number $\lambda$, define the level set $L_\lambda = \{x : p(x) > \lambda\}$.

We decompose $L_\lambda$ into a collection of bounded, connected, disjoint sets (the clusters): $L_\lambda = C_1 \cup \dots \cup C_k$.

Level Set Clustering, cont'd

Estimate the density function p: use a KDE $\hat{p}_h$.
Decide $\lambda$: fix a small number.
Decide $\varepsilon$: the connection radius used to recover the connected components of the estimated level set (see the Cuevas–Fraiman algorithm below).

Cuevas–Fraiman Algorithm

Set j = 0 and I = {1, ..., n}.
1. Choose a point from $\{X_i : i \in I\}$ and call it $X_1$. Find the nearest point to $X_1$, call it $X_2$, and let $r_1 = \lVert X_1 - X_2\rVert$.
2. If $r_1 > 2\varepsilon$: set $j \leftarrow j + 1$, remove the index of $X_1$ from I, and go to Step 1.
3. If $r_1 \le 2\varepsilon$, let $X_3$ be the point closest to the set $\{X_1, X_2\}$ and let $r_2 = \min\{\lVert X_3 - X_1\rVert, \lVert X_3 - X_2\rVert\}$.
4. If $r_2 > 2\varepsilon$: set $j \leftarrow j + 1$ and go back to Step 1. Otherwise, keep adding points, setting $r_k = \min\{\lVert X_{k+1} - X_i\rVert : i = 1, \dots, k\}$, and continue until $r_k > 2\varepsilon$. Then set $j \leftarrow j + 1$ and go back to Step 1.
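
A rough sketch of level set clustering in the spirit of the algorithm above: estimate the density with a KDE, keep the points above the level λ, and join kept points lying within 2ε of each other (here via a radius graph and connected components rather than the sequential Cuevas–Fraiman scan). The values of λ, ε, and the default bandwidth are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform
from scipy.stats import gaussian_kde

def level_set_clusters(X, lam, eps):
    p_hat = gaussian_kde(X.T)                         # KDE estimate of the density
    keep = p_hat(X.T) > lam                           # points in the estimated level set {p_hat > lam}
    Xk = X[keep]
    adj = (squareform(pdist(Xk)) <= 2 * eps).astype(float)   # join points closer than 2*eps
    n_clusters, labels = connected_components(adj, directed=False)
    return keep, labels, n_clusters

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 6])
keep, labels, k = level_set_clusters(X, lam=0.01, eps=0.5)
print(k, "clusters among", int(keep.sum()), "high-density points")
```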

An example

Question: Can you give an example to illustrate level set clustering? (Tavish)

[Figure: a toy set of numbered example points illustrating the level set clusters]

Modal Clustering

A point x belongs to $T_j$ if and only if the steepest ascent path beginning at x leads to the mode $m_j$; the data are thus clustered to their closest mode. However, p may not have a finite number of modes, so a refinement is introduced: $p_h$, a smoothed-out version of p obtained using a Gaussian kernel.

Mean shift algorithm

1. Choose a number of points $x_1, \dots, x_N$. Set t = 0.
2. Let t = t + 1. For j = 1, ..., N set
$x_j^{(t)} = \dfrac{\sum_{i=1}^{n} X_i\, K\!\big((x_j^{(t-1)} - X_i)/h\big)}{\sum_{i=1}^{n} K\!\big((x_j^{(t-1)} - X_i)/h\big)}$
3. Repeat Step 2 until convergence.
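
A compact sketch of the mean shift iteration with a Gaussian kernel, starting the paths at the data points themselves; the bandwidth and stopping tolerance are illustrative.

```python
import numpy as np

def mean_shift(X, h=1.0, n_iter=200, tol=1e-6):
    """Move each path point uphill by repeated kernel-weighted averaging of the data."""
    Y = X.copy()                                             # Step 1: starting points
    for _ in range(n_iter):                                  # Step 2: the update for every path point
        diff = Y[:, None, :] - X[None, :, :]
        w = np.exp(-0.5 * (diff ** 2).sum(axis=2) / h ** 2)  # Gaussian kernel K((y - X_i)/h)
        Y_new = (w[:, :, None] * X[None, :, :]).sum(axis=1) / w.sum(axis=1, keepdims=True)
        if np.abs(Y_new - Y).max() < tol:                    # Step 3: repeat until convergence
            return Y_new
        Y = Y_new
    return Y

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
modes = mean_shift(X, h=1.0)        # points attracted to the same mode end up (nearly) identical
```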

Question: Can you point out the differences and similarities between the different clustering algorithms? (Tavish) Can you compare the pros and cons of the clustering algorithms, and what situations is each of them suited to? (Sicong) What is the relationship between the clustering algorithms? What assumptions do they make? (Yuankai)

Answer: K-means
Pros:
1. Simple and very intuitive; applicable to almost any scenario and any data set.
2. Fast.
Cons:
1. Does not work well for data with varying density.

Contour matters

K-means Cons:
2. Does not work well when the data groups present special contours.

K-means Cons:
3. Does not handle outliers well.
4. Requires K to be specified.

Hierarchical clustering

Pros:
1. Its result contains clusterings at every level of granularity; any number of clusters can be obtained by cutting the dendrogram at the corresponding level.
2. It provides a dendrogram, which can be visualized as a hierarchical tree.
3. Does not require a specified K.
Cons:
1. Slower than K-means.
2. Hard to decide where to cut the dendrogram.

Level set clustering

Pros:
1. Works well when data groups present special contours, e.g., circles.
2. Handles outliers well, because we estimate a density function.
3. Handles varying density well.
Cons:
1. Even slower than hierarchical clustering: KDE is $O(n^2)$, and the Cuevas–Fraiman algorithm is also $O(n^2)$.

Question: Does K-means clustering guarantee convergence? (Jiyun)
Answer: Yes. Its time complexity has an upper bound of $O(n^4)$.

Question: In the Cuevas–Fraiman algorithm, does the choice of the start vertex matter? (Jiyun)
Answer: No, the choice of the start vertex does not matter.

Question: Does the choice of the starting points $x_j$ in the mean shift algorithm matter?
Answer: No. The $x_j$ converge to the modes during the iterative process; the initial values do not matter.

Dimension Reduction

Motivation

Recall: our overall goal is to find low-dimensional structure.
We want to find a representation expressible with fewer dimensions, and it should minimize some quantification of the associated projection error.
In clustering, we found sets of points such that similar points are grouped together.
In dimension reduction, we focus more on the space itself (i.e., can we transform the space such that a projection preserves the data's properties while shedding excess dimensionality?).

Question – Dimension Reduction Benefits

Dimensionality reduction aims at reducing the number of random variables in the data before processing. However, this seems counterintuitive, as it can remove distinct features from the data set, leading to poor results in succeeding steps. So, how does it help? – Tavish

The implicit assumption is that our data contain more features than are useful or necessary (e.g., features that are highly correlated or purely noise). This is common in big data and common when data are naively recorded. Reducing the number of dimensions produces a more compact representation and helps with the curse of dimensionality. Some methods (e.g., manifold-based ones) avoid loss.

Principal Component Analysis (PCA)

A simple dimension reduction technique.

Intuition: project the data onto a k-dimensional linear subspace.

Question – Linear Subspace

In Principal Component Analysis, the data are projected onto linear subspaces. Could you explain a bit about what a linear subspace is? – Yuankai

A linear subspace is a subset of a higher-dimensional vector space that is itself a vector space (closed under addition and scalar multiplication).

Example – Linear Subspace

Example – Subspace Projection

PCA Objective

Let $X \in \mathbb{R}^d$ and let $\mathcal{L}_k$ denote the set of all k-dimensional linear subspaces.
The k-th principal subspace (our goal, given k) is $\ell_k = \operatorname{argmin}_{\ell \in \mathcal{L}_k} \mathbb{E}\,\lVert X - \pi_\ell X \rVert^2$, where $\pi_\ell X$ is the projection of X onto $\ell$.
The dimension-reduced data, $Y_i = \pi_{\ell_k} X_i$, are the projections of the $X_i$ onto $\ell_k$.
Let $\lambda_1 \ge \dots \ge \lambda_d$ be the eigenvalues of the covariance matrix of X. We can restate the risk of the k-th principal subspace as $\sum_{j=k+1}^{d} \lambda_j$.
Recall: this is the same kind of risk as in the pre-knowledge slide, the expected squared distance from X to its projection.

PCA Algorithm

1. Compute the sample covariance matrix, $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T$.
2. Compute the eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$ and eigenvectors $e_1, \dots, e_d$ of the sample covariance matrix.
3. Choose a dimension, k.
4. Define the dimension-reduced data, $Y_i = \big(e_1^T(X_i - \bar{X}), \dots, e_k^T(X_i - \bar{X})\big)$.
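
A minimal NumPy version of the four steps above; the function name and the toy data are illustrative.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                         # work with centered data
    S = np.cov(Xc, rowvar=False)                    # Step 1: sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)            # Step 2: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]               # sort from largest to smallest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Y = Xc @ eigvecs[:, :k]                         # Steps 3-4: project onto the top-k eigenvectors
    return Y, eigvals

X = np.random.randn(200, 5) @ np.random.randn(5, 5)     # correlated toy data
Y, eigvals = pca(X, k=2)
print(Y.shape, np.round(eigvals, 2))
```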

Question – Choosing a Good Dimension

In Principal Component Analysis, the algorithm needs to project the data onto a lower-dimensional space. Can you elaborate on how to make this dimensionality reduction effective and efficient? More specifically, in Step 3 of the PCA algorithm on page 709, how do we choose a good dimension? Or does it matter? – Sicong

Larger values of k reduce the risk (because we are removing fewer dimensions), so we have a trade-off. A good value of k should be as small as possible while meeting some error threshold. In practice, it should also reflect your computational limitations.

PCA – Choosing k

The book recommends fixing some threshold and choosing the smallest k for which the proportion of the total eigenvalue mass retained, $\sum_{j \le k}\lambda_j / \sum_{j}\lambda_j$, exceeds that threshold.
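
A small snippet for the variance-explained rule sketched above: keep the smallest k whose retained eigenvalue mass reaches the chosen threshold. The 90% threshold and the example eigenvalues are illustrative.

```python
import numpy as np

def choose_k(eigvals, threshold=0.90):
    """Smallest k whose top-k eigenvalues account for at least `threshold` of the total."""
    explained = np.cumsum(eigvals) / eigvals.sum()     # eigvals assumed sorted in decreasing order
    return int(np.searchsorted(explained, threshold) + 1)

eigvals = np.array([5.0, 2.0, 0.5, 0.3, 0.2])
print(choose_k(eigvals))    # -> 3, since the top three eigenvalues explain 93.75% >= 90%
```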

Example – PCA: d=2, k=1

Multidimensional Scaling

Instead of the mean projection error, we can instead try to preserve pairwise distances.
Given a mapping of points, $X_i \mapsto Y_i$, we can define a pairwise-distance loss function $L(Y) = \sum_{i < j}\big(\lVert X_i - X_j\rVert - \lVert Y_i - Y_j\rVert\big)^2$.
This gives us an alternative to the risk.
We call the objective of finding the map that minimizes $L$ "multidimensional scaling."
Insight: you can find the minimizer by constructing a span out of the first k principal components.
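
A sketch of classical (metric) multidimensional scaling computed from the pairwise-distance matrix; with Euclidean distances it reproduces the PCA configuration, which is the insight mentioned above. Names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(X, k):
    D2 = squareform(pdist(X)) ** 2                   # squared pairwise distances
    n = len(D2)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ D2 @ J                            # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]            # top-k eigenpairs
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

X = np.random.randn(100, 5)
Y = classical_mds(X, k=2)                            # 2-D configuration preserving pairwise distances
print(Y.shape)
```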

Example – Multidimensional Scaling

Kernel PCA

Suppose our data representation is less naïve, i.e., we are given a "feature map" $\Phi : \mathbb{R}^d \to \mathcal{F}$.
We would still like to be able to perform PCA in the new feature space using pairwise distances.
Recall that the covariance matrix in PCA was $\hat{\Sigma} = \frac{1}{n}\sum_i (X_i - \bar{X})(X_i - \bar{X})^T$; in the feature space this becomes $\hat{\Sigma}_\Phi = \frac{1}{n}\sum_i \big(\Phi(X_i) - \bar{\Phi}\big)\big(\Phi(X_i) - \bar{\Phi}\big)^T$.

The Kernel Trick

Using the empirical feature-space covariance to minimize the risk requires a diagonalization that is very expensive when the feature space is high-dimensional.
We can instead calculate the Gram matrix $K$ with $K_{ij} = \langle \Phi(X_i), \Phi(X_j)\rangle = k(X_i, X_j)$.
This lets us substitute kernel evaluations for the actual feature vectors.
Requires a "Mercer kernel", i.e., a positive semi-definite kernel.

Kernel PCA Algorithm

1. Center the kernel with $\tilde{K} = K - \mathbf{1}_n K - K\mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$, where $\mathbf{1}_n$ is the $n \times n$ matrix with every entry equal to $1/n$.
2. Compute $\tilde{K}$.
3. Diagonalize $\tilde{K}$ to obtain eigenvalues $\lambda_i$ and eigenvectors $\alpha_i$.
4. Normalize the eigenvectors $\alpha_i$ such that $\lambda_i \langle \alpha_i, \alpha_i\rangle = 1$.
5. Compute the projection of a test point x onto an eigenvector $v_i$ with $\langle v_i, \Phi(x)\rangle = \sum_{j=1}^{n} \alpha_{ij}\, k(X_j, x)$.
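
A sketch of the steps above with an RBF (Mercer) kernel; the bandwidth, the tiny eigenvalue floor, and the projection of the training points themselves are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_pca(X, k, gamma=0.5):
    K = np.exp(-gamma * squareform(pdist(X, "sqeuclidean")))   # Gram matrix of an RBF kernel
    n = len(K)
    one_n = np.ones((n, n)) / n
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n        # center the kernel
    eigvals, eigvecs = np.linalg.eigh(K_c)                     # diagonalize
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))  # lambda_i <a_i, a_i> = 1
    return K_c @ alphas                                        # projections of the training points

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
Z = kernel_pca(X, k=2)
print(Z.shape)
```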

Local Linear Embedding (LLE)

Same goal as PCA (i.e., reduce the dimensionality while minimizing the projection error), but it also tries to preserve the local (i.e., small-scale) structure of the data.
Assumes that the data reside on a low-dimensional manifold embedded in a high-dimensional space.
Generates a nonlinear mapping of the data to low dimensions that preserves the local geometric features.
Requires a parameter that determines how many nearest neighbors are used (i.e., the scale of "local").
Produces a weighted sparse graph representation of the data.

Question – Manifolds

I wanted to know what exactly "manifold" referred to. – Brad

"A manifold is a topological space that is locally Euclidean" – Wolfram. For example, the Earth appears flat on a human scale, but we know it is roughly spherical. Maps are useful because they preserve the surface features despite being a projection.

Example – Manifolds

LLE Algorithm

1. Compute the K nearest neighbors for each point (expensive to brute force, but KNN has good approximation algorithms).
2. Compute the local reconstruction weights $W_{ij}$ (i.e., each point is represented as a linear combination of its neighbors) by minimizing $\mathcal{E}(W) = \sum_i \big\lVert X_i - \sum_j W_{ij} X_j \big\rVert^2$.
3. Compute the outputs $Y_i$ by computing the first k eigenvectors with nonzero eigenvalues of the $n \times n$ matrix $M = (I - W)^T (I - W)$.
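
Implementing the weight and eigenvector steps by hand is fiddly, so a hedged sketch using scikit-learn's LocallyLinearEmbedding (an assumption of this write-up, not something the slides prescribe) shows the two inputs the algorithm needs: the number of neighbors and the target dimension.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05)       # 3-D data lying on a 2-D manifold
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)                                 # nonlinear 2-D embedding
print(Y.shape, lle.reconstruction_error_)
```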

LLE Objective

Wants to minimize the error in the reconstructed points.
Step 3 of the algorithm is equivalent to minimizing $\Phi(Y) = \sum_i \big\lVert Y_i - \sum_j W_{ij} Y_j \big\rVert^2$ over the low-dimensional outputs.
Recall: the weights $W_{ij}$ were chosen in Step 2 to reconstruct each point from its neighbors.

Example – LLE Toy Examples

Isomap

Similar to LLE in its preservation of the original structure; it also provides a "manifold" representation of the higher-dimensional data.
It assesses object similarity differently (distance, as a metric, is computed using graph path length) and constructs the low-dimensional mapping differently (using metric multidimensional scaling).

Isomap Algorithm

1. Compute the K nearest neighbors for each point and record them in a nearest-neighbor graph G with vertices $X_1, \dots, X_n$.
2. Compute the graph distances $d_G(i, j)$ using Dijkstra's algorithm.
3. Embed the points into low dimensions using metric multidimensional scaling.
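
The same toy data run through scikit-learn's Isomap (again an illustrative choice of library): internally it performs the three steps above, i.e., KNN graph, shortest-path distances, metric MDS.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, noise=0.05)
iso = Isomap(n_neighbors=10, n_components=2)    # KNN graph -> graph distances -> metric MDS
Y = iso.fit_transform(X)
print(Y.shape)
```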

Laplacian Eigenmaps

Another similar manifold method, analogous to kernel PCA (?).
Given kernel-generated weights $W_{ij}$, the graph Laplacian is given by $L = D - W$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$.
Calculate the first k eigenvectors $v_1, \dots, v_k$ of $L$; they give an embedding $Y_i = \big(v_1(i), \dots, v_k(i)\big)$.
Intuition: each eigenvector minimizes the weighted graph norm $\sum_{i,j} W_{ij}\big(v(i) - v(j)\big)^2$.
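
A compact sketch of Laplacian eigenmaps with heat-kernel weights restricted to a KNN graph; the neighbor count, kernel width, and the use of the generalized eigenproblem Lv = λDv (a common variant) are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmaps(X, k, n_neighbors=10, sigma=1.0):
    D2 = squareform(pdist(X, "sqeuclidean"))
    W = np.exp(-D2 / (2 * sigma ** 2))                        # kernel-generated weights
    idx = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]        # each point's nearest neighbors
    mask = np.zeros_like(W, dtype=bool)
    mask[np.repeat(np.arange(len(X)), n_neighbors), idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)                       # keep only (symmetrized) KNN edges
    D = np.diag(W.sum(axis=1))                                # degree matrix
    L = D - W                                                 # graph Laplacian
    _, eigvecs = eigh(L, D)                                   # generalized eigenproblem L v = lam D v
    return eigvecs[:, 1:k + 1]                                # skip the trivial constant eigenvector

X = np.vstack([np.random.randn(50, 3), np.random.randn(50, 3) + 5])
Y = laplacian_eigenmaps(X, k=2)
print(Y.shape)
```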

Estimating the Manifold Dimension

Recall: these manifold methods all assume an embedded low-dimensional manifold.
We haven't addressed how to find it (or its dimension d) if it isn't given; it can be estimated from the provided data.
Consider a small ball $B(x, \varepsilon)$ around some point x on the manifold, and assume the density is approximately constant on the ball.
The number of data points that fall in the ball then scales like $\varepsilon^d$, where d is the dimension of the manifold.
Let $B(x, r_j(x))$ be the smallest d-dimensional Euclidean ball around x containing j points.

Estimating the Manifold Dimension Cont.

Given a sample $X_1, \dots, X_n$, an estimate $\hat{d}$ of the dimension is obtained by comparing how the number of points falling in $B(x, \varepsilon)$ grows with $\varepsilon$ (the count scales like $\varepsilon^d$).
Using KNN instead, the estimate is based on the radii $r_j(x)$ of the smallest balls containing j points.
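
One concrete way to turn the KNN idea above into code is the Levina–Bickel maximum-likelihood estimator (averaging log-ratios of nearest-neighbor radii); this particular estimator is my choice of illustration, not necessarily the formula on the slide.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension(X, j=10):
    nn = NearestNeighbors(n_neighbors=j + 1).fit(X)
    dist, _ = nn.kneighbors(X)                       # column 0 is each point itself (distance 0)
    r = dist[:, 1:]                                  # radii r_1(x) <= ... <= r_j(x)
    m = np.log(r[:, -1:] / r[:, :-1]).mean(axis=1)   # mean log-ratio of the largest radius to the others
    return float(1.0 / m.mean())                     # invert the average (MacKay-Ghahramani variant)

# a 2-D plane embedded in 10 dimensions: the estimate should come out close to 2
Z = np.random.randn(2000, 2)
X = Z @ np.random.randn(2, 10)
print(round(intrinsic_dimension(X), 2))
```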

Principal Curves and Manifolds

We can, more generally, replace our linear subspaces with non-linear ones. This non-parametric generalization of the principal components is called "principal manifolds."
More formally, if we let $\mathcal{F}$ be a set of functions from $[0, 1]^k$ to $\mathbb{R}^d$, the principal manifold is the function $f \in \mathcal{F}$ that minimizes the risk $R(f) = \mathbb{E}\big[\min_{z}\lVert X - f(z)\rVert^2\big]$.

Principal Curves and Manifolds Cont.

For computability, $\mathcal{F}$ should be governed by some kernel (e.g., quadratic functions, Gaussian, etc.), so that f is expressible in the form $f(z) = \sum_j \alpha_j K(z, z_j)$; for a Gaussian kernel, $K(z, z') = \exp\!\big(-\lVert z - z'\rVert^2 / h^2\big)$.
Define a latent set of variables $z_1, \dots, z_n$, where $z_i = \operatorname{argmin}_z \lVert X_i - f(z)\rVert^2$.
The minimizer is then found by solving for the coefficients $\alpha_j$ with the latent variables held fixed, and iterating.

Random Projection

Making a random projection is often good enough.
Key insight: when n points in d dimensions are projected randomly down onto k dimensions, pairwise point distances are well preserved.
More formally, let $k \ge C \log n / \varepsilon^2$ for some $\varepsilon \in (0, 1)$ and constant C. Then for a random projection L, with high probability,
$(1 - \varepsilon)\,\lVert X_i - X_j\rVert^2 \le \lVert L X_i - L X_j\rVert^2 \le (1 + \varepsilon)\,\lVert X_i - X_j\rVert^2$ for all pairs i, j.

Question – Making a Random Projection

Section 35.10 points out that the power of random projection is very surprising (for instance, the Johnson–Lindenstrauss Lemma). Can you talk in more detail about how to make a random projection in class? – Sicong

The algorithm itself is very simple:
1. Pick a k-dimensional subspace at random.
2. Project all n points onto the subspace (orthogonally).
3. Calculate the maximum divergence in pairwise distances.
4. Repeat until the maximum divergence is acceptable.
Computing the maximum divergence means checking all $O(n^2)$ pairs, and with k chosen as in the Johnson–Lindenstrauss lemma only a small (constant in expectation) number of repetitions is needed, so the whole procedure stays cheap.
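
A sketch of the procedure described above, using an i.i.d. Gaussian matrix rather than an exactly orthonormal subspace; the target dimension, tolerance, and retry cap are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

def random_projection(X, k, max_distortion=0.3, max_tries=20, seed=0):
    rng = np.random.default_rng(seed)
    orig = pdist(X)                                          # original pairwise distances
    for _ in range(max_tries):
        L = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)    # pick a random projection
        Y = X @ L                                            # project all n points
        distortion = np.abs(pdist(Y) / orig - 1.0).max()     # maximum pairwise-distance divergence
        if distortion <= max_distortion:                     # repeat until it is acceptable
            return Y
    return Y                                                 # fall back to the last draw

X = np.random.randn(200, 1000)                               # 200 points in 1000 dimensions
Y = random_projection(X, k=300)
print(Y.shape)
```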

Question – Distance Randomization

Why is it important that the factor by which the distance between pairs of points changes during a random projection is limited? I can see how it's a good thing, but what is the significance of the randomness? – Brad

Randomness makes the projection trivial (algorithmically and computationally) to compute.
For some applications, preserving pairwise distances is sufficient, e.g., multidimensional scaling.