
Clustering - PowerPoint Presentation



Presentation Transcript

Slide1

Clustering: Basic Concepts and Algorithms

Bamshad Mobasher

DePaul University

Slide2

What is Clustering in Data Mining?

Cluster:

a collection of data objects that are “similar” to one another and thus can be treated collectively as one group

but as a collection, they are sufficiently different from other groups

Clustering: unsupervised classification; no predefined classes

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters

Helps users understand the natural grouping or structure in a data set

Slide3

Applications of Cluster Analysis

Data reduction

Summarization: preprocessing for regression, PCA, classification, and association analysis

Compression: image processing (vector quantization)

Hypothesis generation and testing

Prediction based on groups

Cluster & find characteristics/patterns for each groupFinding K-nearest NeighborsLocalizing search to one or a small number of clustersOutlier detection: Outliers are often viewed as those “far away” from any cluster

Slide4

Basic Steps to Develop a Clustering Task

Feature selection / preprocessing

Select info concerning the task of interest

Minimal information redundancy

May need to do normalization/standardization (see the sketch after this list)

Distance/similarity measure: similarity of two feature vectors

Clustering criterion: expressed via a cost function or some rules

Clustering algorithms

Choice of algorithms

Validation of the results

Interpretation of the results with applications
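A minimal sketch of the normalization/standardization step mentioned above, assuming numeric feature vectors held in a NumPy array (the data values are hypothetical):

```python
import numpy as np

# Hypothetical data matrix: rows are objects, columns are features.
X = np.array([[170.0, 65.0],
              [160.0, 80.0],
              [180.0, 72.0]])

# Min-max normalization: rescale each feature to [0, 1]
# (assumes each feature has a nonzero range).
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: zero mean, unit variance per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```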

Slide5

Distance or Similarity Measures

Common Distance Measures:

Manhattan distance:

Euclidean distance:

Cosine similarity:
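A short NumPy sketch of these three measures on two hypothetical vectors:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([2.0, 1.0, 2.0])

manhattan = np.abs(x - y).sum()              # sum of absolute differences
euclidean = np.sqrt(((x - y) ** 2).sum())    # square root of sum of squared differences
cosine = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))  # dot product over product of norms
```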

Slide6

More Similarity Measures

Simple Matching

Cosine Coefficient

Dice’s Coefficient

Jaccard’s Coefficient

In the vector-space model, many similarity measures can be used in clustering
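A sketch computing these coefficients for two hypothetical binary (term presence/absence) vectors, using their standard set-based definitions:

```python
import numpy as np

# Hypothetical binary term-occurrence vectors for two documents.
x = np.array([1, 1, 0, 1, 0, 0])
y = np.array([1, 0, 0, 1, 1, 0])

both = np.sum((x == 1) & (y == 1))     # terms present in both documents
either = np.sum((x == 1) | (y == 1))   # terms present in at least one document

simple_matching = np.mean(x == y)               # fraction of agreeing positions
cosine = both / np.sqrt(x.sum() * y.sum())      # binary cosine coefficient
dice = 2 * both / (x.sum() + y.sum())           # Dice's coefficient
jaccard = both / either                         # Jaccard's coefficient
```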

Slide7

Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters:

high intra-class similarity: cohesive within clusters

low inter-class similarity: distinctive between clusters

The quality of a clustering method depends on:

the similarity measure used,

its implementation, and

its ability to discover some or all of the hidden patterns
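One rough way to quantify these two criteria, assuming Euclidean distance and integer cluster labels 0..k-1 (function and variable names are illustrative):

```python
import numpy as np

def cluster_quality(X, labels):
    """Intra-cluster cohesion: average distance of each object to its own cluster
    centroid (lower is better). Inter-cluster separation: average distance between
    distinct cluster centroids (higher is better). Assumes labels are 0..k-1."""
    k = labels.max() + 1
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    intra = np.mean(np.linalg.norm(X - centroids[labels], axis=1))
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                     for i in range(k) for j in range(i + 1, k)])
    return intra, inter
```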

Slide8

Major Clustering Approaches

Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS

Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue

Model-based approach: a model is hypothesized for each of the clusters, and the goal is to find the best fit of the data to the given models. Typical methods: EM, SOM, COBWEB

Slide9

Partitioning Approaches

The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster

cluster representatives can be actual items in the cluster or other “virtual” representatives such as the centroid

this methodology reduces the number of similarity computations in clustering

clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made

Reallocation-Based Partitioning Methods

Start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning

Most common algorithm: k-means

Slide10

The K-Means Clustering Method

Given the number of desired clusters k, the k-means algorithm follows four steps:

1. Randomly assign objects to create k nonempty initial partitions (clusters)

2. Compute the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)

3. Assign each object to the cluster with the nearest centroid (reallocation step)

4. Go back to Step 2; stop when the assignment does not change
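A minimal NumPy sketch of these four steps (random initial assignment, centroid computation, reallocation to the nearest centroid, repeat until the assignment stabilizes). The function name and parameters are illustrative, and empty clusters are not handled:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means: X is an (n, d) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign objects to k nonempty initial clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 3: reallocate each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment no longer changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```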

Slide11

K-Means Example: Document Clustering

Initial (arbitrary) assignment:

C1 = {D1,D2},

C2 = {D3,D4},

C3 = {D5,D6}

Cluster Centroids

Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).
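A sketch of this computation with a hypothetical document-term matrix (the slide's actual matrix is not reproduced here), using the dot product between cluster centroids and document vectors:

```python
import numpy as np

# Hypothetical document-term matrix: one row per document D1..D8.
D = np.random.default_rng(1).integers(0, 3, size=(8, 5)).astype(float)

# Initial (arbitrary) assignment: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}.
clusters = [[0, 1], [2, 3], [4, 5]]

# Cluster centroids: mean of the member documents' term vectors.
centroids = np.array([D[idx].mean(axis=0) for idx in clusters])

# Cluster-document similarity matrix using the dot product.
sim = centroids @ D.T            # shape: (3 clusters, 8 documents)

# Reallocation: each document goes to the cluster with the highest similarity.
assignment = sim.argmax(axis=0)
```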

Slide12

Example (Continued)

For each document, reallocate the document to the cluster to which it has the highest similarity (shown in red in the above table). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation…

Slide13

Example (Continued)

Now compute new cluster centroids using the original document-term matrix

This will lead to a new cluster-document similarity matrix similar to the one on the previous slide. Again, the items are reallocated to the clusters with the highest similarity.

Previous assignment: C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

New assignment: C1 = {D2,D6,D8}, C2 = {D1,D3,D4}, C3 = {D5,D7}

Note: this process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.

Slide14

K-Means Algorithm

Strengths of k-means:

Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n

Often terminates at a local optimum

Weaknesses of k-means:

Applicable only when the mean is defined; what about categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Variations of k-means usually differ in:

Selection of the initial k means

Distance or similarity measures used

Strategies to calculate cluster means

Slide15

A Disk Version of k-Means

k-means can be implemented with data on disk

In each iteration, it scans the database once

The centroids are computed incrementally

It can be used to cluster large datasets that do not fit in main memory

We need to control the number of iterations; in practice, a limit is set (< 50)

There are better algorithms that scale up for large data sets, e.g., BIRCH
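A sketch of the incremental centroid computation: the data is read chunk by chunk (e.g., from disk), and only the running sums and counts per cluster are kept in memory. The chunk iterator and names are illustrative:

```python
import numpy as np

def update_centroids_from_disk(chunks, centroids):
    """One pass over the data, read chunk by chunk (e.g., from disk);
    centroids are recomputed incrementally from running sums and counts."""
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for X in chunks:                      # each chunk fits in memory
        # Assign the chunk's objects to the nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            sums[c] += members.sum(axis=0)
            counts[c] += len(members)
    # New centroids from the accumulated sums (guarding against empty clusters).
    return sums / np.maximum(counts, 1)[:, None]
```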

Slide16

BIRCH

Designed for very large data sets

Time and memory are limited

Incremental and dynamic clustering of incoming objects

Only one scan of data is necessary

Does not need the whole data set in advance

Two key phases:

Scans the database to build an in-memory tree

Applies a clustering algorithm to cluster the leaf nodes
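If scikit-learn is available, its Birch estimator follows this two-phase design; a minimal usage sketch on hypothetical data:

```python
import numpy as np
from sklearn.cluster import Birch

# Hypothetical large data set.
X = np.random.default_rng(0).normal(size=(10_000, 2))

# Phase 1: one scan of the data builds the in-memory CF tree (controlled by
# threshold and branching_factor). Phase 2: the leaf entries are clustered
# into n_clusters groups.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
```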

Slide17

Hierarchical Clustering Algorithms

Two main types of hierarchical clustering

Agglomerative:

Start with the points as individual clusters

At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

Divisive:

Start with one, all-inclusive cluster

At each step, split a cluster until each cluster contains a point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix

Merge or split one cluster at a time

Slide18

Hierarchical Clustering Algorithms

Use the distance/similarity matrix as the clustering criterion

Does not require the number of clusters as input, but needs a termination condition

[Figure: agglomerative clustering merges singleton clusters a, b, c, d, e step by step into a single cluster abcde (Step 0 through Step 4); divisive clustering performs the same steps in the reverse direction.]

Slide19

Hierarchical Agglomerative Clustering

Basic procedure

1. Place each of N items into a cluster of its own.

2. Compute all pairwise item-item similarity coefficients (a total of N(N-1)/2 coefficients).

3. Form a new cluster by combining the most similar pair of current clusters i and j (methods for determining which clusters to merge: single-link, complete-link, group average, etc.); update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j.

4. Repeat step 3 if the number of clusters left is greater than 1.
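A sketch of this procedure using SciPy's hierarchical clustering routines (assuming SciPy is available); the merge criterion of step 3 is selected with the method argument:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical items as points in the plane.
X = np.random.default_rng(0).normal(size=(8, 2))

# Step 2: all N(N-1)/2 pairwise distances (condensed distance matrix).
dists = pdist(X, metric='euclidean')

# Steps 3-4: repeatedly merge the closest pair of clusters; 'average' is the
# group-average merge criterion (could also be 'single' or 'complete').
Z = linkage(dists, method='average')

# Cut the resulting dendrogram into, e.g., 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
```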

Slide20

Hierarchical Agglomerative Clustering: Example

[Figure: a set of nested clusters over a small set of numbered points and the corresponding dendrogram showing the order in which they are merged.]

Slide21

Distance Between Two Clusters

The basic procedure varies based on the method used to determine inter-cluster distances or similarities

Different methods result in different variants of the algorithm (a combined sketch of the first three follows this list):

Single link

Complete link

Average link

Ward's method

Etc.
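A combined sketch of the single-link, complete-link, and average-link definitions as a single inter-cluster distance function (Euclidean distance between objects is assumed; names are illustrative):

```python
import numpy as np

def cluster_distance(Ci, Cj, method='single'):
    """Inter-cluster distance between two clusters given as (m, d) and (n, d)
    arrays of points, using Euclidean distance between individual objects."""
    # All pairwise distances between objects of Ci and objects of Cj.
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    if method == 'single':      # distance of the two closest (most similar) objects
        return pairwise.min()
    if method == 'complete':    # distance of the two furthest (least similar) objects
        return pairwise.max()
    if method == 'average':     # average of all pairwise distances
        return pairwise.mean()
    raise ValueError(f"unknown method: {method}")
```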

Slide22

Single Link Method

The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster

It can find arbitrarily shaped clusters, but it may cause the undesirable "chain effect" due to noisy points

[Figure: two natural clusters are split into two by the chain effect.]

Slide23

Distance between two clusters

Single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj

The distance is defined by the two most similar objects

[Figure: two clusters of numbered points, with the single-link distance drawn between their two closest members.]

Slide24

Complete Link Method

The distance between two clusters is the distance between the two furthest data points in the two clusters

It is sensitive to outliers because they are far away

Slide25

Distance between two clusters

Complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj

The distance is defined by the two least similar objects

[Figure: two clusters of numbered points, with the complete-link distance drawn between their two furthest members.]

Slide26

Average Link and Centroid Methods

Average link: a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects

In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters

Centroid method: in this method, the distance between two clusters is the distance between their centroids
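A small sketch contrasting the two definitions on hypothetical clusters: the group-average distance averages all pairwise object distances, while the centroid method measures only the centroid-to-centroid distance:

```python
import numpy as np

Ci = np.array([[0.0, 0.0], [2.0, 0.0]])    # hypothetical cluster i
Cj = np.array([[0.0, 3.0], [2.0, 3.0]])    # hypothetical cluster j

# Group-average link: mean of all pairwise object-to-object distances.
avg_link = np.mean(np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2))

# Centroid method: distance between the two cluster centroids.
centroid_dist = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
```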

Slide27

Distance between two clusters

Group-average distance between clusters Ci and Cj is the average distance between objects in Ci and objects in Cj

The distance is defined by the average of the pairwise similarities

[Figure: two clusters of numbered points, with all pairwise distances between them averaged.]

Slide28

Clustering: Basic Concepts and Algorithms

Bamshad Mobasher

DePaul University