Clustering: K-Means





Presentation Transcript


Clustering: K-Means

Machine Learning 10-601, Fall 2014

Bhavana Dalvi Mishra, PhD student, LTI, CMU

Slides are based on materials from Prof. Eric Xing, Prof. William Cohen, and Prof. Andrew Ng.

Outline

What is clustering?
How are similarity measures defined?
Different clustering algorithms
  K-Means
  Gaussian Mixture Models
  Expectation Maximization
Advanced topics
  How to seed clustering?
  How to choose #clusters
Application: Gloss finding for a Knowledge Base

Clustering


Classification vs. Clustering

[Figure: classification (supervision available) vs. clustering (unsupervised).]

Unsupervised learning: learning from raw (unlabeled) data.

Learning from supervised data: example classifications are given.

Clustering

The process of grouping a set of objects into clusters:
  high intra-cluster similarity
  low inter-cluster similarity

How many clusters?

How to identify them?

Applications of Clustering

Google news: clusters news stories from different sources about the same event.
Computational biology: group genes that perform the same functions.
Social media analysis: group individuals that have similar political views.
Computer graphics: identify similar objects from pictures.

Examples

People

Images

Species

What is a natural grouping among these objects?


Similarity Measures


What is Similarity?

The real meaning of similarity is a philosophical question; it depends on the representation and the algorithm. For many representations and algorithms, it is easier to think in terms of a distance (rather than a similarity) between vectors.

Hard to define! But we know it when we see it.

Intuitions behind desirable distance measure properties

D(A,B) = D(B,A)   Symmetry
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."

D(A,A) = 0   Constancy of Self-Similarity
Otherwise you could claim "Alex looks more like Bob than Bob does."

D(A,B) = 0 iff A = B   Identity of Indiscernibles
Otherwise there are objects in your world that are different, but you cannot tell apart.

D(A,B) ≤ D(A,C) + D(B,C)   Triangular Inequality
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."

Distance Measures: Minkowski Metric

Suppose two objects x and y both have p features. The Minkowski metric is defined as follows; the most common Minkowski metrics are the special cases listed below.
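The formula itself did not survive in the transcript; the standard Minkowski (L_q) distance it refers to is

$$ d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^{q} \right)^{1/q}, $$

with the usual special cases q = 1 (Manhattan / city-block distance), q = 2 (Euclidean distance), and q → ∞ (sup distance, max_i |x_i − y_i|).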

An Example

[Figure: two points x and y whose coordinates differ by 4 and 3.]
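Assuming the figure shows x and y differing by 4 in one coordinate and 3 in the other, the three common Minkowski metrics give: Manhattan distance 4 + 3 = 7, Euclidean distance sqrt(4^2 + 3^2) = 5, and sup distance max(4, 3) = 4.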

Hamming distance

Manhattan distance is called Hamming distance when all features are binary.

[Figure: gene expression levels under 17 conditions (1 = high, 0 = low).]
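As a small made-up example (not from the slide): the binary profiles 1 0 1 1 0 and 1 1 1 0 0 differ in two positions, so their Hamming distance is 2, which equals the Manhattan distance sum_i |x_i − y_i| = 0 + 1 + 0 + 1 + 0 = 2.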

Similarity Measures: Correlation Coefficient

[Figure: expression level of Gene A and Gene B over time, shown in three panels: (1) positively correlated, (2) uncorrelated, (3) negatively correlated.]

Similarity Measures: Correlation Coefficient

Pearson correlation coefficient. Special case: cosine distance.
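The definition is missing from the transcript; the standard Pearson correlation between feature vectors x and y is

$$ s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}. $$

When the vectors are already centered (\bar{x} = \bar{y} = 0), this reduces to the cosine similarity x·y / (‖x‖‖y‖), which is where the cosine special case comes from.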

Clustering Algorithm

K-Means

K-means Clustering: Step 1

K-means Clustering: Step 2

K-means Clustering: Step 3

K-means Clustering: Step 4

K-means Clustering: Step 5

K-Means: Algorithm

Decide on a value for k.
Initialize the k cluster centers (randomly, if necessary).
Repeat until no object changes its cluster assignment:
  Decide the cluster memberships of the N objects by assigning them to the nearest cluster centroid.
  Re-estimate the k cluster centers, assuming the memberships found above are correct.
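A minimal NumPy sketch of this loop (illustrative only; the function and variable names are mine, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k groups with plain Lloyd's iterations."""
    rng = np.random.default_rng(seed)
    # Initialize the k cluster centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: give each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no object changed its cluster, so stop
        assignments = new_assignments
        # Update step: re-estimate each center as the mean of its assigned points.
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignments
```

One design choice in this sketch: a cluster that ends up empty simply keeps its previous center; re-seeding empty clusters is another reasonable option.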

K-Means is widely used in practice

Extremely fast and scalable: used in a variety of applications.
Can be easily parallelized; easy MapReduce implementation (a toy sketch follows below):
  Mapper: assigns each data point to the nearest cluster.
  Reducer: takes all points assigned to a cluster, and re-computes the centroids.
Sensitive to starting points or random seed initialization (similar to neural networks).
There are extensions like K-Means++ that try to solve this problem.
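A toy sketch of that map/reduce split (plain Python, not tied to any particular MapReduce framework; the function names are illustrative):

```python
import numpy as np

def mapper(point, centers):
    """Emit (nearest_cluster_id, point) for a single data point."""
    cluster_id = int(np.argmin([np.linalg.norm(point - c) for c in centers]))
    return cluster_id, point

def reducer(cluster_id, points):
    """Re-compute the centroid from all points assigned to this cluster."""
    return cluster_id, np.mean(points, axis=0)
```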

Outliers


Clustering Algorithm

Gaussian Mixture Model


Density estimation

An aircraft testing facility measures Heat and Vibration parameters for every newly built aircraft.

Estimate the density function P(x) given unlabeled data points X1 to Xn.

Mixture of Gaussians


Mixture Models

A density model p(x) may be multi-modal.
We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
Each mode may correspond to a different sub-population (e.g., male and female).

Gaussian Mixture Models (GMMs)

Consider a mixture of K Gaussian components (the density is reproduced below, with the mixture proportion and the mixture component labelled).
This model can be used for unsupervised clustering.
This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
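The formula does not survive in the transcript; the standard Gaussian mixture density it refers to is

$$ p(x_n \mid \theta) = \sum_{k=1}^{K} \underbrace{\pi_k}_{\text{mixture proportion}} \; \underbrace{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}_{\text{mixture component}}, \qquad \sum_{k=1}^{K} \pi_k = 1. $$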

Learning mixture models

In fully observed iid settings, the log-likelihood decomposes into a sum of local terms.
With latent variables, all the parameters become coupled together via marginalization.
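In symbols (my notation, matching the GMM above): with fully observed pairs (x_n, z_n),

$$ \ell(\theta; D) = \sum_{n} \log p(x_n, z_n \mid \theta) = \sum_{n} \log p(z_n \mid \pi) + \sum_{n} \log p(x_n \mid z_n, \mu, \Sigma), $$

so the mixing-proportion terms and the Gaussian terms can be maximized separately. With z_n unobserved,

$$ \ell(\theta; D) = \sum_{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), $$

and the log of a sum no longer decomposes, coupling all the parameters.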

MLE for GMM

If we are doing MLE for completely observed data, the data log-likelihood decomposes as above and the MLE has a closed form, exactly as in Gaussian Naïve Bayes.
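The closed-form estimates in question are the standard ones (stated here in my notation, since the slide's equations are not in the transcript):

$$ \hat{\pi}_k = \frac{1}{N}\sum_{n} \mathbb{1}[z_n = k], \qquad \hat{\mu}_k = \frac{\sum_{n} \mathbb{1}[z_n = k]\, x_n}{\sum_{n} \mathbb{1}[z_n = k]}, $$

with the analogous weighted formula for each covariance \hat{\Sigma}_k.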

Learning GMM (z’s are unknown)


Expectation Maximization (EM)


Expectation-Maximization (EM)

Start: "guess" the mean and covariance of each of the K Gaussians.
Loop.


Expectation-Maximization (EM)

Start: "guess" the centroid and covariance of each of the K clusters.
Loop.

The Expectation-Maximization (EM) Algorithm

E Step: Guess values of Z's.

The Expectation-Maximization (EM) Algorithm

M Step: Update parameter estimates.

EM Algorithm for GMM

E Step: Guess values of Z's.

M Step: Update parameter estimates.
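The update equations themselves are missing from the transcript; the standard EM updates for a GMM are:

E step (responsibilities):

$$ \gamma_{nk} = p(z_n = k \mid x_n, \theta^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})} $$

M step:

$$ \pi_k^{(t+1)} = \frac{1}{N}\sum_n \gamma_{nk}, \qquad \mu_k^{(t+1)} = \frac{\sum_n \gamma_{nk}\, x_n}{\sum_n \gamma_{nk}}, \qquad \Sigma_k^{(t+1)} = \frac{\sum_n \gamma_{nk}\,(x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^{\top}}{\sum_n \gamma_{nk}}. $$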

K-means is a hard version of EM

In the K-means "E-step" we do hard assignment.
In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1.
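In the notation used above, the hard versions are

$$ z_n = \arg\max_k \; p(z_n = k \mid x_n, \theta) \quad (\text{for identical spherical covariances: the nearest centroid}), \qquad \mu_k = \frac{\sum_n \mathbb{1}[z_n = k]\, x_n}{\sum_n \mathbb{1}[z_n = k]}. $$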

Soft vs. Hard EM assignments

[Figure: soft assignments (GMM) vs. hard assignments (K-Means).]

Theory underlying EM

What are we doing? Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data. But we do not observe z, so computing that likelihood is difficult! What shall we do?
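Concretely, the quantity that becomes hard to optimize is the marginal log-likelihood written out earlier,

$$ \ell(\theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), $$

where the sum over the unobserved z inside the log rules out a closed-form maximizer.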

Intuition behind the EM algorithm


Jensen’s Inequality

For a convex function f(x):

Similarly, for a concave function f(x):
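The inequalities themselves are not in the transcript; the standard statements are

$$ f(\mathbb{E}[X]) \le \mathbb{E}[f(X)] \quad \text{for convex } f, \qquad f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)] \quad \text{for concave } f. $$

EM uses the concave case with f = log to lower-bound the marginal log-likelihood above.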

Jensen’s Inequality: concave f(x)


EM and Jensen’s Inequality


Advanced Topics


How Many Clusters?

Number of clusters K is given: partition n documents into a predetermined number of topics.
Solve an optimization problem: penalize the number of clusters.
Information-theoretic approaches: AIC, BIC criteria for model selection.
Trade-off between having clearly separable clusters and having too many clusters.
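For reference (standard definitions, not spelled out in the transcript): with L the maximized likelihood, k the number of free parameters, and n the number of data points,

$$ \mathrm{AIC} = 2k - 2\log L, \qquad \mathrm{BIC} = k \log n - 2\log L, $$

and one keeps the number of clusters that minimizes the chosen criterion.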

Seed Choice: K-Means++

K-Means results can vary based on random seed selection.

K-Means++:
1. Choose one center uniformly at random among the given data points.
2. For each data point x, compute D(x), the distance between x and the nearest center chosen so far.
3. Choose one new data point at random as a new center, with probability P(x) proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Run standard K-Means with this centroid initialization.
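A small NumPy sketch of this seeding procedure (illustrative only; the function name is mine):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers using the D(x)^2 weighting described above."""
    rng = np.random.default_rng(seed)
    # Step 1: the first center is chosen uniformly at random among the data points.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: D(x) = distance from x to the nearest center chosen so far.
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        # Step 3: sample the next center with probability proportional to D(x)^2.
        probs = dists ** 2
        centers.append(X[rng.choice(len(X), p=probs / probs.sum())])
    return np.array(centers)
```

The returned centers would then seed a standard K-Means run (e.g., the `kmeans` sketch earlier in this transcript).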

Semi-supervised K-Means


Supervised Learning

Unsupervised Learning

Semi-supervised Learning

Automatic Gloss Finding for a Knowledge Base

Glosses: natural language definitions of named entities. E.g., "Microsoft" is an American multinational corporation headquartered in Redmond that develops, manufactures, licenses, supports and sells computer software, consumer electronics and personal computers and services ...

Input: a Knowledge Base, i.e., a set of concepts (e.g., company) and entities belonging to those concepts (e.g., Microsoft), and a set of potential glosses.

Output: candidate glosses matched to relevant entities in the KB. E.g., "Microsoft is an American multinational corporation headquartered in Redmond ..." is mapped to the entity "Microsoft" of type "Company".

[Automatic Gloss Finding for a Knowledge Base using Ontological Constraints, Bhavana Dalvi Mishra, Einat Minkov, Partha Pratim Talukdar, and William W. Cohen, 2014, under submission]

Example: Gloss finding


Training a clustering model

Train: Unambiguous glosses

Test: Ambiguous glosses

Fruit

Company

GLOFIN: Clustering glosses


GLOFIN on NELL Dataset

275 categories, 247K candidate glosses, #train = 20K, #test = 227K

GLOFIN on Freebase Dataset

66 categories, 285K candidate glosses, #train = 25K, #test = 260K

Summary

What is clustering?
What are similarity measures?
K-Means clustering algorithm
Mixture of Gaussians (GMM)
Expectation Maximization
Advanced topics
  How to seed clustering
  How to decide #clusters
Application: Gloss finding for a Knowledge Base

Thank You

Questions?
