CSE 881: Data Mining
Lecture 23: Miscellaneous
Large-Scale Clustering
What is large-scale?
Large number of data points to be clustered
Streaming (high velocity) data
Strategies
Sampling/Approximation approach
Online/Incremental clustering approach
Parallel/Distributed approach
Large-Scale Clustering
K-means has a time complexity of O(kN) per iteration
It requires making several passes over the database
Can we design a one- or two-pass algorithm?
You can only get 1 or 2 looks at the data
You only have bounded memory (i.e., you can’t store everything in memory)
A common strategy is to initially
create intermediate clusters
(with a single pass of the data) and then
refine the intermediate clusters
(with or without making another pass over the data)
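A minimal instance of the single-pass "intermediate clusters" idea is the leader algorithm: each point joins the nearest existing cluster if it is within a distance threshold, otherwise it starts a new cluster. The sketch below is illustrative (the function name, threshold value, and toy data are not from the slides):

```python
import math

def leader_cluster(points, threshold):
    """One-pass clustering: each point joins the nearest leader
    within `threshold`; otherwise it becomes a new leader."""
    leaders = []          # representative point of each cluster
    assignments = []      # cluster index assigned to each input point
    for p in points:
        best, best_d = None, math.inf
        for idx, q in enumerate(leaders):
            d = math.dist(p, q)
            if d < best_d:
                best, best_d = idx, d
        if best is not None and best_d <= threshold:
            assignments.append(best)
        else:
            leaders.append(p)
            assignments.append(len(leaders) - 1)
    return leaders, assignments

leaders, labels = leader_cluster(
    [(0, 0), (0.5, 0), (10, 10), (10.2, 9.9)], threshold=2.0)
print(len(leaders))  # 2 intermediate clusters
```

Each point is examined exactly once and only the leaders are kept in memory, which is what makes this a bounded-memory, one-pass strategy; the resulting intermediate clusters can then be refined.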
Example (BIRCH Clustering)
Scan database once to create subclusters
Merge subclusters to produce final clusters
Example: BIRCH Algorithm

[Figure: two CF-trees, each with a root, leaf nodes LN1–LN3, and subclusters sc1–sc7.]
BIRCH Algorithm
Designed for clustering large databases (does not require the data to fit into main memory)
Has (at least) 2 steps
Construct a CF-tree from the data
Apply hierarchical clustering to merge the CFs
Other steps
Refine the CF-tree (requires re-scanning the data)
Refine the clusters (requires re-assigning the data points using the initial clusters found)
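BIRCH summarizes each subcluster by a clustering feature CF = (N, LS, SS): the point count, the per-dimension linear sum, and the sum of squared norms. CFs are additive, so merging two subclusters during the hierarchical clustering step is just componentwise addition. A small sketch of this property (pure Python, illustrative names):

```python
def cf(points):
    """Clustering feature of a set of d-dimensional points:
    (N, LS, SS) = (count, per-dimension linear sum, sum of squared norms)."""
    d = len(points[0])
    n = len(points)
    ls = [sum(p[i] for p in points) for i in range(d)]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CF additivity: the CF of a union of two disjoint subclusters
    is the componentwise sum of their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

a = [(1.0, 2.0), (3.0, 4.0)]
b = [(5.0, 6.0)]
# Merging the summaries equals summarizing the merged points.
assert merge_cf(cf(a), cf(b)) == cf(a + b)
```

This additivity is why BIRCH never needs the raw points after the first scan: centroids, diameters, and merges can all be computed from the CF triples alone.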
BIRCH Algorithm

Pipeline: CF-Tree Construction → CF-Tree Refinement (optional) → Clustering → Cluster Refinement (optional)
CF-Tree Construction
Starting from the root, find a path to the leaf node
Path is selected based on a distance measure between the data point and the CF
Adding data point to a CF in the leaf node
Check whether the data point can be added to one of the CFs in the leaf node
A data point can be added as long as the cluster diameter does not exceed threshold T
If data point cannot be added, create a new CF for the data point; if there is no room for the new CF, convert the leaf node into an internal node and distribute the CFs accordingly
Update the CFs along the path to the leaf node
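The leaf-level insertion rule can be sketched without the tree machinery: keep a flat list of CFs, add the point to the closest CF if the resulting diameter stays within T, and otherwise start a new CF. This simplified sketch (flat list instead of a tree; threshold and data are illustrative) uses the standard diameter formula derivable from the CF statistics, D² = (2·N·SS − 2·|LS|²) / (N·(N−1)):

```python
import math

def diameter(n, ls, ss):
    """Cluster diameter (root of average pairwise squared distance),
    computed from the CF triple alone."""
    if n < 2:
        return 0.0
    ls_sq = sum(x * x for x in ls)
    return math.sqrt((2 * n * ss - 2 * ls_sq) / (n * (n - 1)))

def insert_point(cfs, p, T):
    """Leaf-level BIRCH insertion rule (flat sketch, no tree):
    add p to the closest CF if its diameter stays <= T, else open a new CF."""
    best, best_d = None, math.inf
    for i, (n, ls, ss) in enumerate(cfs):
        centroid = [x / n for x in ls]
        d = math.dist(p, centroid)
        if d < best_d:
            best, best_d = i, d
    if best is not None:
        n, ls, ss = cfs[best]
        cand = (n + 1,
                [a + b for a, b in zip(ls, p)],
                ss + sum(x * x for x in p))
        if diameter(*cand) <= T:
            cfs[best] = cand   # absorb the point into the existing CF
            return best
    cfs.append((1, list(p), sum(x * x for x in p)))  # new subcluster
    return len(cfs) - 1

cfs = []
for p in [(0.0, 0.0), (0.3, 0.0), (5.0, 5.0)]:
    insert_point(cfs, p, T=1.0)
print(len(cfs))  # two subclusters: the far point opens a new CF
```

In the real algorithm the same threshold test happens at a leaf reached by descending the tree, and the CFs along the root-to-leaf path are then updated by the additivity property.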
Example of Insertion in BIRCH

[Figure: a new subcluster sc8 is created and inserted under leaf node LN1.]
Example of Insertion in BIRCH

April 13, 2021

If the branching factor of a leaf node cannot exceed 3, inserting sc8 forces LN1 to split into LN1’ and LN1”.

[Figure: CF-tree after the leaf split, with leaf nodes LN1’, LN1”, LN2, and LN3 under the root.]
Example of Insertion in BIRCH

If the branching factor of a non-leaf node cannot exceed 3, the root is split and the height of the CF-tree increases by one.

[Figure: CF-tree after the root split, with new non-leaf nodes NLN1 and NLN2 above the leaf nodes LN1’, LN1”, LN2, and LN3.]
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good a model is
Accuracy, precision, recall, F-measure
For cluster analysis, the analogous question is: how do we evaluate the “goodness” of the resulting clusters?
Measures of Cluster Validity

Internal Index (Unsupervised): measures the goodness of a clustering structure without respect to external information
Example: Sum of Squared Error (SSE)
External Index (Supervised): measures the extent to which cluster labels match externally supplied class labels
Example: Entropy
Internal Index: Correlation
Two matrices
Proximity Matrix
“Incidence” Matrix
One row and one column for each data point
An entry is 1 if the associated pair of points belongs to the same cluster
An entry is 0 if the associated pair of points belongs to different clusters
Compute the correlation between the proximity and incidence matrices
Since both matrices are symmetric, only the n(n-1)/2 entries above the diagonal need to be compared
High correlation indicates that points belonging to the same cluster are close to each other
Not a good measure for some density- or contiguity-based clusters
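The computation over the n(n−1)/2 distinct pairs can be sketched in a few lines of pure Python (the function name and the toy data are illustrative):

```python
import math
from itertools import combinations

def cluster_validity_correlation(points, labels):
    """Pearson correlation between pairwise distance (proximity matrix)
    and same-cluster membership (incidence matrix), computed over the
    n*(n-1)/2 distinct pairs of points."""
    dist, inc = [], []
    for i, j in combinations(range(len(points)), 2):
        dist.append(math.dist(points[i], points[j]))
        inc.append(1.0 if labels[i] == labels[j] else 0.0)
    n = len(dist)
    md, mi = sum(dist) / n, sum(inc) / n
    cov = sum((d - md) * (c - mi) for d, c in zip(dist, inc))
    sd = math.sqrt(sum((d - md) ** 2 for d in dist))
    si = math.sqrt(sum((c - mi) ** 2 for c in inc))
    return cov / (sd * si)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
print(cluster_validity_correlation(pts, labels))
```

For a good clustering of well-separated data the result is strongly negative: same-cluster pairs (incidence 1) have small distances, while different-cluster pairs (incidence 0) have large ones.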
Internal Index: Correlation

Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets:
Corr = -0.9235
Corr = -0.5810
Internal Index: SSE
SSE is good for comparing two clusterings or two clusters
Can also be used to estimate the number of clusters
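Computing SSE for a given clustering is straightforward: sum each point's squared distance to its assigned centroid. To estimate the number of clusters, compute SSE for k = 1, 2, 3, … and look for the "elbow" where the curve flattens. A minimal sketch (toy data and names are illustrative):

```python
def sse(points, labels, centroids):
    """Sum of squared errors: squared distance of each point
    to the centroid of its assigned cluster."""
    total = 0.0
    for p, c in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
labels = [0, 0, 1]
centroids = {0: (0.0, 1.0), 1: (10.0, 10.0)}
print(sse(pts, labels, centroids))  # 1.0 + 1.0 + 0.0 = 2.0
```

SSE always decreases as k grows (with k = N it reaches zero), which is why the elbow, rather than the minimum, is the useful signal.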
Internal Index: SSE

[Figure: SSE curve for a more complicated data set; SSE of the clusters found using K-means.]
Internal Index: Silhouette Coefficient

The silhouette coefficient combines both cluster cohesion and separation
For each data point i:
Calculate a = average distance of i to the points in its own cluster
Calculate b = min (average distance of i to the points in another cluster)
The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, not the usual case)
Typically between 0 and 1; the closer to 1, the better
Can calculate the average silhouette for a cluster or an entire clustering
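The per-point computation above can be sketched directly (pure Python; function name and toy data are illustrative):

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient of point i:
    a = average distance to the other points in i's own cluster,
    b = min over other clusters of the average distance to that cluster,
    s = 1 - a/b if a < b, else b/a - 1."""
    own = [math.dist(points[i], p) for j, p in enumerate(points)
           if labels[j] == labels[i] and j != i]
    a = sum(own) / len(own)
    b = math.inf
    for c in set(labels) - {labels[i]}:
        other = [math.dist(points[i], p)
                 for j, p in enumerate(points) if labels[j] == c]
        b = min(b, sum(other) / len(other))
    return 1 - a / b if a < b else b / a - 1

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
print(silhouette(pts, labels, 0))  # close to 1 for well-separated clusters
```

Averaging s over all points of a cluster (or of the whole clustering) gives the cluster-level and clustering-level silhouette scores mentioned above.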
Internal Index for Hierarchical Clustering

[Figure: distance matrix and single-link dendrogram.]
Internal Index: CPCC

CPCC (Cophenetic Correlation Coefficient)
Correlation between the original distance matrix and the cophenetic distance matrix

[Figure: cophenetic distance matrix and dendrogram for single link.]
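The cophenetic distance between two points is the merge height at which they first end up in the same cluster of the dendrogram; CPCC is the Pearson correlation between these values and the original distances. A naive single-link sketch (pure Python, O(n³), toy data; names are illustrative):

```python
import math
from itertools import combinations

def single_link_cophenetic(points):
    """Naive single-link agglomeration; returns, for each pair of points,
    the cophenetic distance (height at which the pair first merges)."""
    clusters = [{i} for i in range(len(points))]
    coph = {}
    dist = lambda i, j: math.dist(points[i], points[j])
    while len(clusters) > 1:
        best = None
        # single link: distance between clusters = min pairwise distance
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        for i in clusters[a]:
            for j in clusters[b]:
                coph[frozenset((i, j))] = d   # pair first joins at height d
        clusters[a] |= clusters[b]
        del clusters[b]
    return coph

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
coph = single_link_cophenetic(pts)
pairs = list(combinations(range(len(pts)), 2))
orig = [math.dist(pts[i], pts[j]) for i, j in pairs]
cop = [coph[frozenset((i, j))] for i, j in pairs]
print(pearson(orig, cop))  # CPCC: near 1 when the dendrogram fits the data
```

In practice the same quantity is available via `scipy.cluster.hierarchy.cophenet`; the sketch above just makes the definition concrete.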
Internal Index: CPCC

[Figure: CPCC comparison for single link and complete link.]
External Index: Entropy and Purity