CSE 881: Data Mining
Lecture 23: Miscellaneous
Large-Scale Clustering
What is large-scale?
Large number of data points to be clustered
Streaming (high velocity) data
Strategies
Sampling/Approximation approach
Online/Incremental clustering approach
Parallel/Distributed approach
Large-Scale Clustering
K-means has a time complexity of O(kN) per iteration
It requires making several passes over the database
Can we design a one- or two-pass algorithm?
You can only get 1 or 2 looks at the data
You only have bounded memory (i.e., you can’t store everything in memory)
A common strategy is to initially
create intermediate clusters
(with a single pass of the data) and then
refine the intermediate clusters
(with or without making another pass over the data)
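A minimal instance of the single-pass "intermediate clusters" idea is the leader algorithm: each point joins the nearest existing cluster if it is within a distance threshold, otherwise it starts a new cluster. The sketch below is illustrative (the function name, threshold value, and toy data are not from the slides):

```python
import math

def leader_cluster(points, threshold):
    """One-pass clustering: each point joins the nearest leader
    within `threshold`; otherwise it becomes a new leader."""
    leaders = []          # representative point of each cluster
    assignments = []      # cluster index assigned to each input point
    for p in points:
        best, best_d = None, math.inf
        for idx, q in enumerate(leaders):
            d = math.dist(p, q)
            if d < best_d:
                best, best_d = idx, d
        if best is not None and best_d <= threshold:
            assignments.append(best)
        else:
            leaders.append(p)
            assignments.append(len(leaders) - 1)
    return leaders, assignments

leaders, labels = leader_cluster(
    [(0, 0), (0.5, 0), (10, 10), (10.2, 9.9)], threshold=2.0)
print(len(leaders))  # 2 intermediate clusters
```

Each point is examined exactly once and only the leaders are kept in memory, which is what makes this a bounded-memory, one-pass strategy; the resulting intermediate clusters can then be refined.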
Example (BIRCH Clustering)
Scan database once to create subclusters
Merge subclusters to produce final clusters
Example: BIRCH Algorithm

[Figure: two CF-trees, each with a root, leaf nodes LN1–LN3, and subclusters sc1–sc7.]
BIRCH Algorithm
Designed for clustering large databases (does not require the data to fit into main memory)
Has (at least) 2 steps
Construct a CF-tree from the data
Apply hierarchical clustering to merge the CFs
Other steps
Refine the CF-tree (requires re-scanning the data)
Refine the clusters (requires re-assigning the data points using the initial clusters found)
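BIRCH summarizes each subcluster by a clustering feature CF = (N, LS, SS): the point count, the per-dimension linear sum, and the sum of squared norms. CFs are additive, so merging two subclusters during the hierarchical clustering step is just componentwise addition. A small sketch of this property (pure Python, illustrative names):

```python
def cf(points):
    """Clustering feature of a set of d-dimensional points:
    (N, LS, SS) = (count, per-dimension linear sum, sum of squared norms)."""
    d = len(points[0])
    n = len(points)
    ls = [sum(p[i] for p in points) for i in range(d)]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CF additivity: the CF of a union of two disjoint subclusters
    is the componentwise sum of their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

a = [(1.0, 2.0), (3.0, 4.0)]
b = [(5.0, 6.0)]
# Merging the summaries equals summarizing the merged points.
assert merge_cf(cf(a), cf(b)) == cf(a + b)
```

This additivity is why BIRCH never needs the raw points after the first scan: centroids, diameters, and merges can all be computed from the CF triples alone.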
BIRCH Algorithm

Pipeline: CF-Tree Construction → CF-Tree Refinement (optional) → Clustering → Cluster Refinement (optional)
CF-Tree Construction
Starting from the root, find a path to the leaf node
Path is selected based on a distance measure between the data point and the CF
Adding data point to a CF in the leaf node
Check whether the data point can be added to one of the CFs in the leaf node
A data point can be added as long as the cluster diameter does not exceed threshold T
If data point cannot be added, create a new CF for the data point; if there is no room for the new CF, convert the leaf node into an internal node and distribute the CFs accordingly
Update the CFs along the path to the leaf node
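The leaf-level insertion rule can be sketched without the tree machinery: keep a flat list of CFs, add the point to the closest CF if the resulting diameter stays within T, and otherwise start a new CF. This simplified sketch (flat list instead of a tree; threshold and data are illustrative) uses the standard diameter formula derivable from the CF statistics, D² = (2·N·SS − 2·|LS|²) / (N·(N−1)):

```python
import math

def diameter(n, ls, ss):
    """Cluster diameter (root of average pairwise squared distance),
    computed from the CF triple alone."""
    if n < 2:
        return 0.0
    ls_sq = sum(x * x for x in ls)
    return math.sqrt((2 * n * ss - 2 * ls_sq) / (n * (n - 1)))

def insert_point(cfs, p, T):
    """Leaf-level BIRCH insertion rule (flat sketch, no tree):
    add p to the closest CF if its diameter stays <= T, else open a new CF."""
    best, best_d = None, math.inf
    for i, (n, ls, ss) in enumerate(cfs):
        centroid = [x / n for x in ls]
        d = math.dist(p, centroid)
        if d < best_d:
            best, best_d = i, d
    if best is not None:
        n, ls, ss = cfs[best]
        cand = (n + 1,
                [a + b for a, b in zip(ls, p)],
                ss + sum(x * x for x in p))
        if diameter(*cand) <= T:
            cfs[best] = cand   # absorb the point into the existing CF
            return best
    cfs.append((1, list(p), sum(x * x for x in p)))  # new subcluster
    return len(cfs) - 1

cfs = []
for p in [(0.0, 0.0), (0.3, 0.0), (5.0, 5.0)]:
    insert_point(cfs, p, T=1.0)
print(len(cfs))  # two subclusters: the far point opens a new CF
```

In the real algorithm the same threshold test happens at a leaf reached by descending the tree, and the CFs along the root-to-leaf path are then updated by the additivity property.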
Example of Insertion in BIRCH

[Figure: a new subcluster sc8 is created and inserted under leaf node LN1.]
Example of Insertion in BIRCH

April 13, 2021

If the branching factor of a leaf node cannot exceed 3, inserting sc8 forces LN1 to split into LN1’ and LN1”.

[Figure: CF-tree after the leaf split, with leaf nodes LN1’, LN1”, LN2, and LN3 under the root.]
Example of Insertion in BIRCH

If the branching factor of a non-leaf node cannot exceed 3, the root is split and the height of the CF-tree increases by one.

[Figure: CF-tree after the root split, with new non-leaf nodes NLN1 and NLN2 above the leaf nodes LN1’, LN1”, LN2, and LN3.]
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good a model is
Accuracy, precision, recall, F-measure
For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
Measures of Cluster Validity

Internal Index (Unsupervised): measures the goodness of a clustering structure without respect to external information
Example: Sum of Squared Error (SSE)
External Index (Supervised): measures the extent to which cluster labels match externally supplied class labels
Example: Entropy
Internal Index: Correlation
Two matrices
Proximity Matrix
“Incidence” Matrix
One row and one column for each data point
An entry is 1 if the associated pair of points belongs to the same cluster
An entry is 0 if the associated pair of points belongs to different clusters
Compute the correlation between the proximity and incidence matrices
Since both matrices are symmetric, only the n(n-1)/2 entries above the diagonal need to be compared
High correlation indicates that points belonging to the same cluster are close to each other
Not a good measure for some density- or contiguity-based clusters
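The computation over the n(n−1)/2 distinct pairs can be sketched in a few lines of pure Python (the function name and the toy data are illustrative):

```python
import math
from itertools import combinations

def cluster_validity_correlation(points, labels):
    """Pearson correlation between pairwise distance (proximity matrix)
    and same-cluster membership (incidence matrix), computed over the
    n*(n-1)/2 distinct pairs of points."""
    dist, inc = [], []
    for i, j in combinations(range(len(points)), 2):
        dist.append(math.dist(points[i], points[j]))
        inc.append(1.0 if labels[i] == labels[j] else 0.0)
    n = len(dist)
    md, mi = sum(dist) / n, sum(inc) / n
    cov = sum((d - md) * (c - mi) for d, c in zip(dist, inc))
    sd = math.sqrt(sum((d - md) ** 2 for d in dist))
    si = math.sqrt(sum((c - mi) ** 2 for c in inc))
    return cov / (sd * si)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
print(cluster_validity_correlation(pts, labels))
```

For a good clustering of well-separated data the result is strongly negative: same-cluster pairs (incidence 1) have small distances, while different-cluster pairs (incidence 0) have large ones.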
Internal Index: Correlation

Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets:
Corr = -0.9235
Corr = -0.5810
Internal Index: SSE
SSE is good for comparing two clusterings or two clusters
Can also be used to estimate the number of clusters
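Computing SSE for a given clustering is straightforward: sum each point's squared distance to its assigned centroid. To estimate the number of clusters, compute SSE for k = 1, 2, 3, … and look for the "elbow" where the curve flattens. A minimal sketch (toy data and names are illustrative):

```python
def sse(points, labels, centroids):
    """Sum of squared errors: squared distance of each point
    to the centroid of its assigned cluster."""
    total = 0.0
    for p, c in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
labels = [0, 0, 1]
centroids = {0: (0.0, 1.0), 1: (10.0, 10.0)}
print(sse(pts, labels, centroids))  # 1.0 + 1.0 + 0.0 = 2.0
```

SSE always decreases as k grows (with k = N it reaches zero), which is why the elbow, rather than the minimum, is the useful signal.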
Internal Index: SSE

[Figure: SSE curve for a more complicated data set; SSE of the clusters found using K-means.]
Internal Index: Silhouette Coefficient

The silhouette coefficient combines both cluster cohesion and separation
For each data point i:
Calculate a = average distance of i to the points in its own cluster
Calculate b = min (average distance of i to the points in another cluster)
The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, not the usual case)
Typically between 0 and 1; the closer to 1, the better
Can calculate the average silhouette for a cluster or an entire clustering
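The per-point computation above can be sketched directly (pure Python; function name and toy data are illustrative):

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient of point i:
    a = average distance to the other points in i's own cluster,
    b = min over other clusters of the average distance to that cluster,
    s = 1 - a/b if a < b, else b/a - 1."""
    own = [math.dist(points[i], p) for j, p in enumerate(points)
           if labels[j] == labels[i] and j != i]
    a = sum(own) / len(own)
    b = math.inf
    for c in set(labels) - {labels[i]}:
        other = [math.dist(points[i], p)
                 for j, p in enumerate(points) if labels[j] == c]
        b = min(b, sum(other) / len(other))
    return 1 - a / b if a < b else b / a - 1

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
print(silhouette(pts, labels, 0))  # close to 1 for well-separated clusters
```

Averaging s over all points of a cluster (or of the whole clustering) gives the cluster-level and clustering-level silhouette scores mentioned above.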
Internal Index for Hierarchical Clustering

[Figure: distance matrix and single-link dendrogram.]
Internal Index: CPCC

CPCC (Cophenetic Correlation Coefficient)
Correlation between the original distance matrix and the cophenetic distance matrix

[Figure: cophenetic distance matrix and dendrogram for single link.]
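The cophenetic distance between two points is the merge height at which they first end up in the same cluster of the dendrogram; CPCC is the Pearson correlation between these values and the original distances. A naive single-link sketch (pure Python, O(n³), toy data; names are illustrative):

```python
import math
from itertools import combinations

def single_link_cophenetic(points):
    """Naive single-link agglomeration; returns, for each pair of points,
    the cophenetic distance (height at which the pair first merges)."""
    clusters = [{i} for i in range(len(points))]
    coph = {}
    dist = lambda i, j: math.dist(points[i], points[j])
    while len(clusters) > 1:
        best = None
        # single link: distance between clusters = min pairwise distance
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        for i in clusters[a]:
            for j in clusters[b]:
                coph[frozenset((i, j))] = d   # pair first joins at height d
        clusters[a] |= clusters[b]
        del clusters[b]
    return coph

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
coph = single_link_cophenetic(pts)
pairs = list(combinations(range(len(pts)), 2))
orig = [math.dist(pts[i], pts[j]) for i, j in pairs]
cop = [coph[frozenset((i, j))] for i, j in pairs]
print(pearson(orig, cop))  # CPCC: near 1 when the dendrogram fits the data
```

In practice the same quantity is available via `scipy.cluster.hierarchy.cophenet`; the sketch above just makes the definition concrete.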
Internal Index: CPCC

[Figure: CPCC comparison for single link and complete link.]
External Index: Entropy and Purity