
Slide 1

Cluster Validation

Ke Chen

Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011]

Slide 2

Outline

Motivation and Background

Internal index

Motivation and general ideas

Variance-based internal indexes

Application: finding the “proper” cluster number

External index

Motivation and general ideas

Rand Index

Application: Weighted clustering ensemble

Summary

Slide 3

Motivation and Background

Motivation

Supervised classification

Class labels known for ground truth

Accuracy: based on labels given

Clustering analysis

No class labels, but evaluation is still demanded. Validation is needed to:

Compare clustering algorithms

Solve the number of clusters

Avoid finding patterns in noise

Find the “best” clusters from data

[Figure: supervised classification of oranges vs. apples; 8 of 10 points labelled correctly, so Accuracy = 8/10 = 80%]

Slide 4

Illustrative Example: which one is the “best”?

Motivation and Background

[Figure: Data Set (Random Points) clustered three ways — Agglomerative (Complete Link), K-means (K=3), K-means (K=3)]

Slide 5

Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion.

How to be “quantitative”: to employ the measures.

How to be “objective”: to validate the measures!

[Figure: validation pipeline — INPUT: DataSet (X) and Clustering Algorithm(s) under different settings/configurations produce partitions P; a Validity Index selects the best model m*]

Motivation and Background

Slide 6

Internal Criteria (Indexes)

Validate without external information

With different numbers of clusters

Solve the number of clusters

External Criteria (Indexes)

Validate against “ground truth”

Compare two partitions (how similar?)

Motivation and Background

Slide 7

Internal Index

Ground truth is unavailable, but unsupervised validation must be done with “common sense” or “a priori knowledge”.

There are a variety of internal indexes:

Variance-based methods

Rate-distortion methods

Davies-Bouldin index (DBI)

Bayesian Information Criterion (BIC)

Silhouette Coefficient

Minimum description length (MDL) principle

Stochastic complexity (SC)

Modified Huber’s Г (MHГ) index

Slide 8

Internal Index

Variance-based methods

Minimise within-cluster variance (SSW)

Maximise between-cluster variance (SSB)

[Figure: intra-cluster variance is minimised within each cluster; inter-cluster variance between centroids is maximised; c marks a centroid]

Slide 9

Internal Index

Variance-based methods (cont.)

Assume an algorithm leads to a partition of K clusters where cluster i has n_i data points and c_i is its centroid, and d(·,·) is the distance used in this algorithm.

Within-cluster variance (SSW):

SSW = \sum_{i=1}^{K} \sum_{x \in C_i} d(x, c_i)^2

Between-cluster variance (SSB):

SSB = \sum_{i=1}^{K} n_i \, d(c_i, c)^2

where c is the global mean (centroid) of the whole data set.
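The two quantities can be computed directly from a partition. Below is a minimal sketch in Python/NumPy, assuming d(·,·) is the Euclidean distance (so squared distances are summed); the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def ssw(X, labels, centroids):
    """Within-cluster variance: sum of squared distances from each
    point to the centroid of its own cluster."""
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

def ssb(X, labels, centroids):
    """Between-cluster variance: squared distance from each cluster
    centroid to the global centroid, weighted by cluster size n_i."""
    c_global = X.mean(axis=0)
    return sum(np.sum(labels == i) * np.sum((c - c_global) ** 2)
               for i, c in enumerate(centroids))

# Two well-separated clusters of two points each
X = np.array([[0., 0.], [0., 2.], [10., 0.], [10., 2.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0., 1.], [10., 1.]])
# ssw -> 4.0, ssb -> 100.0; note SSW + SSB equals the total sum of
# squares of the data around the global centroid.
```

A useful sanity check on any implementation is exactly that decomposition: SSW + SSB must equal the total sum of squares about the global mean.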

Slide 10

Internal Index

Variance-based F-ratio index

Measures the ratio of the between-cluster variance against the within-cluster variance (as in the original F-test).

F-ratio index (W-B index) for a partition of K clusters
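The slide's formula itself is in the omitted figure, so the sketch below uses one common variance-ratio form, K · SSW / SSB, which is minimised over candidate cluster numbers; treat the exact form as an assumption and the names as illustrative.

```python
import numpy as np

def variance_ratio(X, labels):
    """K * SSW / SSB: small when clusters are compact (small SSW)
    and well separated (large SSB)."""
    ks = np.unique(labels)
    c_global = X.mean(axis=0)
    ssw = ssb = 0.0
    for k in ks:
        pts = X[labels == k]
        c = pts.mean(axis=0)
        ssw += np.sum((pts - c) ** 2)
        ssb += len(pts) * np.sum((c - c_global) ** 2)
    return len(ks) * ssw / ssb

# Three well-separated pairs of points; candidate partitions with K=2, 3.
X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.], [10., 0.], [10., 1.]])
candidates = {2: np.array([0, 0, 0, 0, 1, 1]),
              3: np.array([0, 0, 1, 1, 2, 2])}   # the natural grouping
best_K = min(candidates, key=lambda K: variance_ratio(X, candidates[K]))
# best_K -> 3
```

This is exactly the use case of the next two slides: sweep K, evaluate the index on each resulting partition, and take the minimising K as the “proper” cluster number.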

Slide 11

Internal Index

Application: finding the “proper” cluster number

[Figure: S1 data set; index values for Algorithm 1 and Algorithm 2]

Slide 12

Internal Index

Application: finding the “proper” cluster number

[Figure: index values for Algorithm 1 and Algorithm 2]

Slide 13

External Index

“Ground truth” is available, but a clustering algorithm doesn’t use such information during unsupervised learning.

There are a variety of external indexes:

Rand Index

Adjusted Rand Index

Pair counting index

Information theoretic index

Set matching index

DVI index

Normalised mutual information (NMI) index

Slide 14

External Index

Main issues

If the “ground truth” is known, the validity of a clustering can be verified by comparing the class or clustering labels.

However, this is much more complicated than in supervised classification (where labels are used in training).

The cluster IDs in a partition resulting from clustering have been assigned arbitrarily due to unsupervised learning – the permutation problem.

The number of clusters may be different from the number of classes (the “ground truth”) – the inconsistency problem.

The most important problem for external indexes is how to find all possible correspondences between the “ground truth” and a partition (or between two candidate partitions in the case of comparison).

Slide 15

External Index

Rand Index

This is the first external index proposed by Rand (1971) to address the “correspondence” problem.

Basic idea: consider all pairs of points in the data set, looking into both agreement and disagreement against the “ground truth”.

The index is defined as RI(X, Y) = (a + d) / (a + b + c + d)

X \ Y                          | Y: pairs in the same class | Y: pairs in different classes
X: pairs in the same cluster   | a                          | b
X: pairs in different clusters | c                          | d
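The definition translates into code by enumerating every point pair once and incrementing the matching counter; a minimal sketch (function name illustrative):

```python
from itertools import combinations

def rand_index(X_labels, Y_labels):
    """RI(X, Y) = (a + d) / (a + b + c + d): the fraction of point
    pairs on which partition X and ground truth Y agree."""
    a = b = c = d = 0
    for i, j in combinations(range(len(X_labels)), 2):
        same_cluster = X_labels[i] == X_labels[j]
        same_class = Y_labels[i] == Y_labels[j]
        if same_cluster and same_class:
            a += 1          # agreement: together in both
        elif same_cluster:
            b += 1          # together in X, apart in Y
        elif same_class:
            c += 1          # apart in X, together in Y
        else:
            d += 1          # agreement: apart in both
    return (a + d) / (a + b + c + d)

# A partition identical to the ground truth (up to label names) scores 1.0
# rand_index([0, 0, 1, 1], ['p', 'p', 'q', 'q']) -> 1.0
```

Because only same/different comparisons are made, the index is automatically invariant to the arbitrary permutation of cluster IDs discussed on the previous slide.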

Slide 16

External Index

Rand Index (cont.)

Example: for a 5-point data set, we have

To calculate a, b, c and d, we have to list all possible pairs (excluding pairing any data point with itself):

[1, 2], [1, 3], [1, 4], [1, 5], [2, 3], [2, 4], [2, 5], [3, 4], [3, 5], [4, 5]

Point | 1 | 2  | 3  | 4 | 5
X     | i | ii | ii | i | i
Y     | q | p  | q  | p | q

Slide 17

External Index

Rand Index (cont.)

Example: for a 5-point data set, we have

Initialisation: a, b, c, d ← 0

For data point pair [1, 2], we have:

X: [1, 2] assigned to (i, ii) –> in different clusters

Y: [1, 2] labelled by (q, p) –> in different classes

Thus, d ← d + 1 = 1

Slide 18

External Index

Rand Index (cont.)

Example: for a 5-point data set, we have

Current status: a=0, b=0, c=0, d=1

For data point pair [1, 3], we have:

X: [1, 3] assigned to (i, ii) –> in different clusters

Y: [1, 3] labelled by (q, q) –> in the same class

Thus, c ← c + 1 = 1

Slide 19

External Index

Rand Index (cont.)

Example: for a 5-point data set, we have

Current status: a=0, b=0, c=1, d=1

For data point pair [1, 4], we have:

X: [1, 4] assigned to (i, i) –> in the same cluster

Y: [1, 4] labelled by (q, p) –> in different classes

Thus, b ← b + 1 = 1

Slide 20

External Index

Rand Index (cont.)

Example: for a 5-point data set, we have

Current status: a=0, b=1, c=1, d=1

For data point pair [1, 5], we have:

X: [1, 5] assigned to (i, i) –> in the same cluster

Y: [1, 5] labelled by (q, q) –> in the same class

Thus, a ← a + 1 = 1

Slide 21

External Index

Rand Index (cont.)

Example: for a 5-point data set, we have

Current status: a=1, b=1, c=1, d=1

In-class exercise: continue until you have the final values of a, b, c and d.
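The remaining six pairs follow the same update loop, which can be checked mechanically. A short script running it over all ten pairs of the slide's example, with the cluster IDs and class labels copied from the table:

```python
from itertools import combinations

X = {1: 'i', 2: 'ii', 3: 'ii', 4: 'i', 5: 'i'}   # cluster IDs
Y = {1: 'q', 2: 'p', 3: 'q', 4: 'p', 5: 'q'}     # class labels

a = b = c = d = 0
for p, q in combinations(sorted(X), 2):          # [1,2], [1,3], ..., [4,5]
    same_cluster = X[p] == X[q]
    same_class = Y[p] == Y[q]
    if same_cluster and same_class:
        a += 1
    elif same_cluster:
        b += 1
    elif same_class:
        c += 1
    else:
        d += 1

RI = (a + d) / (a + b + c + d)
```

Running this reproduces the slide's partial counts after the first four pairs and completes the exercise.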

Slide 22

External Index

Rand Index: Contingency Table

In general,

a, b, c

and

d

are calculated from a contingency table. Assume there are k clusters in X and l classes in Y.

Slide 23

External Index

Rand Index: Contingency Table (cont.)

Example: for a 5-point data set, we have

Point | 1 | 2  | 3  | 4 | 5
X     | i | ii | ii | i | i
Y     | q | p  | q  | p | q

Contingency table (rows: clusters in X; columns: classes in Y):

      | p | q | total
i     | 1 | 2 | 3
ii    | 1 | 1 | 2
total | 2 | 3 | 5
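From such a table, a, b, c and d follow from standard pair-counting identities: each entry n_ij contributes C(n_ij, 2) agreeing pairs, and the row and column totals give the within-cluster and within-class pair counts. A sketch (function name illustrative):

```python
from math import comb

def pair_counts(table):
    """a, b, c, d from a k x l contingency table n_ij
    (rows: clusters in X; columns: classes in Y)."""
    n = sum(sum(row) for row in table)
    a = sum(comb(nij, 2) for row in table for nij in row)
    b = sum(comb(sum(row), 2) for row in table) - a        # same cluster, diff class
    c = sum(comb(sum(col), 2) for col in zip(*table)) - a  # diff cluster, same class
    d = comb(n, 2) - a - b - c                             # apart in both
    return a, b, c, d

# The slide's table: rows i, ii; columns p, q
a, b, c, d = pair_counts([[1, 2],
                          [1, 1]])
# (a, b, c, d) -> (1, 3, 3, 3), so RI = (1 + 3) / 10 = 0.4
```

This agrees with the pair-by-pair enumeration of the preceding slides, but needs only the k × l table rather than all C(n, 2) pairs.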

Slide 24

External Index

Rand Index: measures the number of pairs in

a: the same cluster/class in X and Y

b: the same cluster in X / different classes in Y

c: different clusters in X / the same class in Y

d: different clusters/classes in X and Y

Ex. 4

Slide 25

COMP24111 Machine Learning

Weighted Clustering Ensemble

Motivation

Evidence-accumulation based clustering ensemble

(Fred & Jain, 2005) is a simple clustering ensemble algorithm that uses the evidence accumulated from multiple yet diversified partitions generated with different algorithms, initial conditions, different distance metrics, …

However, this clustering ensemble algorithm is subject to limitations:

It doesn’t distinguish between “non-trivial” and “trivial” partitions; i.e., all partitions used in a clustering ensemble are treated as equally important.

It is sensitive to the cluster distance used in the hierarchical clustering for reaching a consensus.

Clustering validity indexes provide an effective way to measure the “non-trivialness” or “importance” of a partition:

Internal index: directly measures the importance of a partition quantitatively and objectively.

External index: indirectly measures the importance of a partition via comparison with the other partitions used in the clustering ensemble, for another round of “evidence accumulation”.

Slide 26

Chen & Yang: Temporal Data Clustering with Different Representations

Weighted Clustering Ensemble

Yun Yang and Ke Chen, “Temporal data clustering via weighted clustering ensemble with different representations,” IEEE Transactions on Knowledge and Data Engineering 23(2), pp. 307–320, 2011.

Use validity index values as weights

Slide 27

Weighted Clustering Ensemble

Example: convert clustering results into a binary “distance” matrix

[Figure: points A, B, C, D partitioned into Cluster 1 (C1) and Cluster 2 (C2), with the corresponding binary “distance” matrix]
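The conversion illustrated here takes only a few lines: for a partition given as a label vector, entry (i, j) of the binary “distance” matrix is 0 when objects i and j share a cluster and 1 otherwise. A sketch; the grouping below is hypothetical, since the original figure is not reproduced:

```python
import numpy as np

def partition_to_distance(labels):
    """Binary "distance" matrix: D[i, j] = 0 if objects i and j are
    in the same cluster, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

# Hypothetical grouping of A, B, C, D: C1 = {A, C, D}, C2 = {B}
D = partition_to_distance([1, 2, 1, 1])
# D[0, 2] == 0.0 (A and C together), D[0, 1] == 1.0 (A and B apart)
```

The matrix is symmetric with a zero diagonal, so it can be consumed directly by hierarchical clustering in the consensus step.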

Slide 28

Weighted Clustering Ensemble

Example: convert clustering results into a binary “distance” matrix

[Figure: the same points A, B, C, D partitioned into Cluster 1 (C1), Cluster 2 (C2) and Cluster 3 (C3), with the corresponding binary “distance” matrix]

Slide 29

Weighted Clustering Ensemble

Evidence accumulation: form the collective weighted “distance” matrix
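The accumulation step can be sketched as a weighted average of the per-partition binary matrices, assuming (per the earlier slide) that each partition's validity-index value serves as its weight after normalisation; names are illustrative:

```python
import numpy as np

def weighted_distance_matrix(partitions, index_values):
    """Collective weighted "distance" matrix: weighted average of the
    binary distance matrices of the individual partitions, with weights
    proportional to each partition's validity-index value."""
    w = np.asarray(index_values, dtype=float)
    w = w / w.sum()                          # normalise index values to weights
    n = len(partitions[0])
    D = np.zeros((n, n))
    for labels, wi in zip(partitions, w):
        labels = np.asarray(labels)
        D += wi * (labels[:, None] != labels[None, :])
    return D   # fed to hierarchical clustering to reach the consensus

# Two partitions of three objects, equally weighted
D = weighted_distance_matrix([[0, 0, 1], [0, 1, 1]], [1.0, 1.0])
# D[0, 1] == 0.5: together in one partition, apart in the other
```

Entries near 0 mark pairs that trusted partitions consistently group together; down-weighting trivial partitions is exactly what distinguishes this from plain evidence accumulation.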

Slide 30

Weighted Clustering Ensemble

Clustering analysis with our weighted clustering ensemble on CAVIAR database

annotated video sequences of pedestrians, a set of 222 high-quality moving trajectories

clustering analysis of trajectories is useful for many applications

Experimental setting

Representations:

PCF, DCF, PLS and PDWT

Initial clustering analysis:

K-means algorithm (4 < K < 20), 6 initial settings

Ensemble: 320 partitions in total (80 partitions per representation)

Slide 31

Weighted Clustering Ensemble

Clustering results on CAVIAR database

Slide 32

Weighted Clustering Ensemble

Application:

UCR time series benchmarks

Slide 33

Weighted Clustering Ensemble

Application:

Rand index

(%) values of clustering ensembles (Yang & Chen 2011)

Slide 34

Summary

Cluster validation is a process that evaluates clustering results with a pre-defined criterion.

Two different types of cluster validation methods:

Internal indexes

no “ground truth” available

defined based on “common sense” or “a priori knowledge”

Application: finding the “proper” number of clusters, …

External indexes

“ground truth” known or reference given (“relative index”)

Application: performance evaluation of clustering with reference information

Apart from direct evaluation, both kinds of indexes may be applied in weighted clustering ensembles, leading to better and more robust results by ignoring trivial partitions.

K. Wang et al., “CVAP: Validation for cluster analysis,” Data Science Journal, vol. 8, May 2009. [Code available online: http://www.mathworks.com/matlabcentral/fileexchange/authors/24811]
