Clustering: K-Means
Machine Learning 10-601, Fall 2014
Bhavana Dalvi Mishra, PhD student, LTI, CMU
Slides are based on materials from Prof. Eric Xing, Prof. William Cohen, and Prof. Andrew Ng.
Outline
What is clustering?
How are similarity measures defined?
Different clustering algorithms: K-Means, Gaussian Mixture Models, Expectation Maximization
Advanced topics: how to seed clustering, how to choose the number of clusters
Application: gloss finding for a Knowledge Base
Clustering
Classification vs. Clustering
Classification: supervision available. Clustering: unsupervised.
Unsupervised learning: learning from raw (unlabeled) data.
Learning from supervised data: example classifications are given.
Clustering
The process of grouping a set of objects into clusters, with high intra-cluster similarity and low inter-cluster similarity.
How many clusters?
How to identify them?
Applications of Clustering
Google News: clusters news stories from different sources about the same event.
Computational biology: group genes that perform the same functions.
Social media analysis: group individuals that have similar political views.
Computer graphics: identify similar objects from pictures.
Examples
People
Images
Species
What is a natural grouping among these objects?
Similarity Measures
What is Similarity?
The real meaning of similarity is a philosophical question; in practice it depends on the representation and the algorithm. For many representations/algorithms, it is easier to think in terms of a distance (rather than a similarity) between vectors.
Hard to define! But we know it when we see it.
Intuitions behind desirable distance measure properties
D(A,B) = D(B,A) (Symmetry): otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0 (Constancy of Self-Similarity): otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B (Identity of Indiscernibles): otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C) (Triangle Inequality): otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features. The Minkowski metric is defined by d(x, y) = (Σ_{i=1..p} |x_i − y_i|^r)^(1/r). The most common Minkowski metrics are r = 2 (Euclidean distance), r = 1 (Manhattan distance), and r = ∞ (sup distance, max_i |x_i − y_i|).
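As a quick illustration (a minimal Python sketch, not from the original slides), the Minkowski distance for an arbitrary r, with Euclidean, Manhattan, and sup distances as special cases:

```python
def minkowski(x, y, r):
    """Minkowski distance between equal-length vectors x and y.
    r = 1 gives Manhattan, r = 2 gives Euclidean, r = inf gives sup distance."""
    if r == float('inf'):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

x, y = (0, 0), (4, 3)
print(minkowski(x, y, 2))             # Euclidean: 5.0
print(minkowski(x, y, 1))             # Manhattan: 7.0
print(minkowski(x, y, float('inf')))  # sup: 4.0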
An Example
Example: points x and y that differ by 4 in one feature and 3 in the other. Euclidean distance = √(4² + 3²) = 5, Manhattan distance = 4 + 3 = 7, sup distance = max(4, 3) = 4.
Hamming distance
Manhattan distance is called Hamming distance when all features are binary.
Example: gene expression levels under 17 conditions (1 = high, 0 = low).
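For instance (an illustrative sketch, not from the slides), the Hamming distance between two binary expression profiles simply counts the positions where they differ:

```python
def hamming(a, b):
    """Number of positions where binary vectors a and b differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

gene_a = [1, 0, 1, 1, 0, 0, 1]
gene_b = [1, 1, 1, 0, 0, 0, 1]
print(hamming(gene_a, gene_b))  # 2
```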
Similarity Measures: Correlation Coefficient
(Figure: expression levels of Gene A and Gene B over time, illustrating positively correlated, uncorrelated, and negatively correlated expression profiles.)
Pearson correlation coefficient: s(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / √(Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)²).
Special case: when the data are mean-centered (x̄ = ȳ = 0), this reduces to the cosine similarity x·y / (‖x‖ ‖y‖), from which the cosine distance is derived.
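A small Python sketch (not from the slides) computing both quantities directly from their definitions:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine(x, y):
    """Cosine similarity between two equal-length sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
print(pearson(a, b))  # 1.0 (perfectly positively correlated)
print(cosine(a, b))   # 1.0 (same direction)
```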
Clustering Algorithm: K-Means
K-means Clustering: Step 1
K-means Clustering: Step 2
K-means Clustering: Step 3
K-means Clustering: Step 4
K-means Clustering: Step 5
K-Means: Algorithm
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the cluster memberships of the N objects by assigning them to the nearest cluster centroid.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. Repeat steps 3 and 4 until no object changes its cluster assignment.
A minimal code sketch of these steps follows below.
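This is an illustrative Python sketch of the steps above (not the course's reference implementation); the toy data, k, and random initialization are placeholders:

```python
import random

def kmeans(points, k, max_iters=100):
    """Basic K-Means on a list of numeric feature vectors."""
    centers = random.sample(points, k)            # step 2: random initialization
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # step 3: assign each point to the nearest centroid (squared Euclidean)
        new_assignments = [
            min(range(k), key=lambda c: sum((p - m) ** 2 for p, m in zip(x, centers[c])))
            for x in points
        ]
        if new_assignments == assignments:        # step 5: stop when nothing changes
            break
        assignments = new_assignments
        # step 4: re-estimate each center as the mean of its assigned points
        for c in range(k):
            members = [x for x, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignments

data = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
print(kmeans(data, k=2))
```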
K-Means is widely used in practice
Extremely fast and scalable; used in a variety of applications.
Can be easily parallelized, with an easy MapReduce implementation:
  Mapper: assigns each data point to the nearest cluster.
  Reducer: takes all points assigned to a cluster and re-computes the centroid.
Sensitive to starting points / random seed initialization (similar to neural networks); extensions like K-Means++ try to address this problem.
Outliers
Clustering Algorithm: Gaussian Mixture Model
Density estimation
An aircraft testing facility measures Heat and Vibration parameters for every newly built aircraft.
Estimate the density function P(x) given unlabeled data points X1 to Xn.
Mixture of Gaussians
Mixture Models
A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population (e.g., male and female).
Gaussian Mixture Models (GMMs)
Consider a mixture of K Gaussian components: p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k), where π_k is the mixture proportion of component k and N(x | μ_k, Σ_k) is the corresponding mixture component.
This model can be used for unsupervised clustering.
This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
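For concreteness, a small sketch (an illustration, not from the slides) that evaluates this mixture density for a 1-D GMM; the parameter values are made up:

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, pis, mus, variances):
    """p(x) = sum_k pi_k * N(x | mu_k, var_k)."""
    return sum(pi * normal_pdf(x, mu, v) for pi, mu, v in zip(pis, mus, variances))

# made-up two-component mixture
print(gmm_pdf(1.5, pis=[0.4, 0.6], mus=[0.0, 3.0], variances=[1.0, 2.0]))
```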
Learning mixture models
In fully observed i.i.d. settings, the log-likelihood decomposes into a sum of local terms: ℓ(θ) = Σ_n log p(x_n, z_n; θ) = Σ_n [log p(z_n; θ_z) + log p(x_n | z_n; θ_x)].
With latent variables, all the parameters become coupled together via marginalization: ℓ(θ) = Σ_n log Σ_z p(x_n, z; θ), and the log of a sum no longer decomposes.
MLE for GMM
If we are doing MLE for completely observed data (both the x's and the z's are known), the data log-likelihood decomposes and the MLE of each parameter has a closed form; this is essentially Gaussian Naïve Bayes: estimate each mixture proportion from label counts, and each component's mean and covariance from the points labeled with that component.
Learning GMM (z’s are unknown)
Expectation Maximization (EM)
Expectation-Maximization (EM)
Start: "guess" the mean and covariance of each of the K Gaussians, then loop over the E and M steps described next.
The Expectation-Maximization (EM) Algorithm
E Step: guess the values of the Z's, i.e., compute for each data point the posterior probability that it belongs to each component, given the current parameters.
M Step: update the parameter estimates (mixture proportions, means, and covariances) using those soft assignments.
EM Algorithm for GMM
E Step: guess the values of the Z's (compute the responsibilities).
M Step: update the parameter estimates.
A minimal code sketch of both steps for a GMM appears below.
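The following is an illustrative Python sketch of EM for a 1-D GMM (not the course's reference code); initialization and stopping criteria are simplified:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, pis, mus, variances, n_iters=50):
    """EM for a 1-D Gaussian mixture; returns updated (pis, mus, variances)."""
    K, N = len(pis), len(xs)
    for _ in range(n_iters):
        # E step: responsibilities r[n][k] = P(z_n = k | x_n, current parameters)
        r = []
        for x in xs:
            weights = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(weights)
            r.append([w / total for w in weights])
        # M step: re-estimate mixture proportions, means, and variances
        for k in range(K):
            nk = sum(r[n][k] for n in range(N))
            pis[k] = nk / N
            mus[k] = sum(r[n][k] * xs[n] for n in range(N)) / nk
            variances[k] = sum(r[n][k] * (xs[n] - mus[k]) ** 2 for n in range(N)) / nk
    return pis, mus, variances

data = [0.1, -0.2, 0.3, 4.9, 5.2, 5.1]
print(em_gmm(data, pis=[0.5, 0.5], mus=[0.0, 1.0], variances=[1.0, 1.0]))
```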
K-means is a hard version of EM
In the K-means "E-step" we make a hard assignment: each point is assigned entirely to its nearest centroid.
In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1.
Soft vs. Hard EM assignments
GMM: soft assignments (each point has a probability of belonging to each cluster).
K-Means: hard assignments (each point belongs to exactly one cluster).
Theory underlying EM
What are we doing? Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data. But we do not observe z, so computing the marginal likelihood (which sums over all possible values of z) is difficult! What shall we do?
Intuition behind the EM algorithm
Jensen’s Inequality
For a convex function f(x): f(E[X]) ≤ E[f(X)].
Similarly, for a concave function f(x): f(E[X]) ≥ E[f(X)].
Jensen’s Inequality: concave f(x)
EM and Jensen’s Inequality
Advanced Topics
How Many Clusters?
Sometimes the number of clusters K is given, e.g., partition n documents into a predetermined number of topics.
Otherwise, solve an optimization problem that penalizes the number of clusters.
Information-theoretic approaches: AIC and BIC criteria for model selection.
Tradeoff between having clearly separable clusters and having too many clusters.
A small example of BIC-based model selection is sketched below.
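As one possible illustration (assuming scikit-learn and NumPy are available; not from the original slides), BIC can be used to compare Gaussian mixture models fit with different numbers of components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy 2-D data with two obvious groups
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])

# fit GMMs with different K and pick the one with the lowest BIC
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)
print(bics, "best K:", best_k)
```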
Seed Choice: K-Means++
K-Means results can vary based on random seed selection. K-Means++ initialization:
1. Choose one center uniformly at random among the given data points.
2. For each data point x, compute D(x), the distance from x to the nearest center chosen so far.
3. Choose one new data point at random as a new center, with probability proportional to D(x)².
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Run standard K-Means with this centroid initialization.
A minimal sketch of this seeding procedure appears below.
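An illustrative Python sketch of the seeding steps above (the data and distance function are placeholders):

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k):
    """Pick k initial centers using K-Means++ seeding."""
    centers = [random.choice(points)]                        # step 1
    while len(centers) < k:                                  # step 4
        # step 2: squared distance from each point to its nearest chosen center
        d2 = [min(squared_dist(x, c) for c in centers) for x in points]
        # step 3: sample a new center with probability proportional to D(x)^2
        total = sum(d2)
        r = random.uniform(0, total)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

data = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]]
print(kmeanspp_init(data, k=3))
```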
Semi-supervised K-Means
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Automatic Gloss Finding for a Knowledge Base
Glosses: natural-language definitions of named entities. E.g., "Microsoft" is an American multinational corporation headquartered in Redmond that develops, manufactures, licenses, supports and sells computer software, consumer electronics and personal computers and services ...
Input: a Knowledge Base, i.e., a set of concepts (e.g., Company) and entities belonging to those concepts (e.g., Microsoft), plus a set of potential glosses.
Output: candidate glosses matched to the relevant entities in the KB, e.g., "Microsoft is an American multinational corporation headquartered in Redmond …" is mapped to the entity "Microsoft" of type "Company".
[Automatic Gloss Finding for a Knowledge Base using Ontological Constraints, Bhavana Dalvi Mishra, Einat Minkov, Partha Pratim Talukdar, and William W. Cohen, 2014, under submission]
Example: Gloss finding
Training a clustering model
Train: unambiguous glosses. Test: ambiguous glosses. (Example classes in the figure: Fruit, Company.)
GLOFIN: Clustering glosses
GLOFIN on NELL Dataset
275 categories, 247K candidate glosses, #train = 20K, #test = 227K.
GLOFIN on Freebase Dataset
66 categories, 285K candidate glosses, #train = 25K, #test = 260K.
Summary
What is clustering?
What are similarity measures?
K-Means clustering algorithm
Mixture of Gaussians (GMM)
Expectation Maximization
Advanced topics: how to seed clustering, how to decide the number of clusters
Application: gloss finding for a Knowledge Base
Thank You
Questions?