Grouping Data
Methods of cluster analysis
Goals 1
We want to identify groups of similar artifacts, features, sites, graves, etc., that represent cultural, functional, or chronological differences
We want to create groups as a measurement technique to see how they vary with external variables
Goals 2
We want to cluster artifacts or sites based on their location to identify spatial clusters
Real vs. Created Types
Differences in goals
Real types are the aim of Goal 1
Created types are the aim of Goal 2
Debate over whether Real types can be discovered with any degree of certainty
Cluster analysis guarantees groups; you must confirm their utility
Initial Decisions 1
What variables to use?
All possible
Constructed variables (from principal components, correspondence analysis, or multi-dimensional scaling)
Restricted set of variables that support the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)
Initial Decisions 2
How to transform the variables? (see the sketch after this list)
Log transforms
Conversion to percentages (to weight rows equally)
Size standardization (dividing by geometric mean)
Z-scores (to weight columns equally)
Conversion of categorical variables
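A minimal sketch of the numeric transformations above in base R; dat is a hypothetical matrix of positive values standing in for real counts or measurements:

dat <- matrix(rpois(40, 20) + 1, nrow = 10)    # hypothetical: 10 rows, 4 variables
logged  <- log(dat)                            # log transform
pct     <- prop.table(dat, margin = 1) * 100   # row percentages (rows weighted equally)
gm      <- apply(dat, 1, function(x) exp(mean(log(x))))  # row geometric means
sizestd <- sweep(dat, 1, gm, "/")              # size standardization by geometric mean
zsc     <- scale(dat)                          # z-scores (columns weighted equally)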
Initial Decisions 3
How to measure distance?
Types of variables
Goals of the analysis
If uncertain, try multiple methods (see the sketch below)
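A sketch of two common choices; dist() is in base R and daisy() is in the cluster package, and the data frame here is hypothetical:

library(cluster)
dat <- data.frame(length   = rnorm(10, 50, 5),   # hypothetical interval variable
                  material = factor(sample(c("chert", "quartzite"), 10, replace = TRUE)))
d1 <- dist(dat["length"], method = "euclidean")  # interval/ratio variables only
d2 <- daisy(dat, metric = "gower")               # handles mixed numeric/categorical variables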
Methods of Grouping
Partitioning Methods – divide the data into groups
Hierarchical Methods
Agglomerative – from n clusters to 1 cluster
Divisive – from 1 cluster to k clusters
Partitioning
K-Means, K-Medoids, Fuzzy
Uses a distance measure, but does not need to compute the full distance matrix
Specify the number of groups in advance
Minimizes within-group variability
Finds spherical clusters
Procedure
Start with centers for k groups (user-supplied or random)
Repeat up to iter.max times (default 10)
Allocate rows to their closest center
Recalculate the center positions
Stop
Different criteria for allocation
Use multiple starts (e.g. 5 to 15); a sketch of the procedure follows
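A minimal sketch of this procedure with base R's kmeans(); the data matrix and the choice of 3 groups are illustrative:

dat <- matrix(rnorm(100), nrow = 25)  # hypothetical: 25 rows, 4 variables
km  <- kmeans(scale(dat), centers = 3, iter.max = 10, nstart = 10)
km$cluster       # allocation of each row to its closest center
km$centers       # final center positions
km$tot.withinss  # total within-group sum of squares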
Evaluation 1
Compute groups for a range of cluster sizes and plot within-group sums of squares to look for sharp increases
Cluster randomized versions of the data and compare the results
Examine table of statistics by group
Evaluation 2
Plot groups in two dimensions with PCA, CA, or MDS
Compare the groups using data or information not included in the analysis (see the sketch below)
Partitioning Using R
Base R includes kmeans() for forming groups by partitioning
Rcmdr includes KMeans() to iterate kmeans() for the best solution
The cluster package includes pam(), which uses medoids for more robust grouping, and fanny(), which forms fuzzy clusters (usage sketched below)
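A hedged sketch of the cluster-package alternatives; the data matrix is hypothetical and k = 3 is illustrative:

library(cluster)
dat <- matrix(rnorm(100), nrow = 25)  # hypothetical data
pm <- pam(scale(dat), k = 3)          # partitioning around medoids; robust to outliers
fz <- fanny(scale(dat), k = 3)        # fuzzy clustering with graded membership
pm$clustering                         # hard group assignments
fz$membership                         # strength of each row's membership in each group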
Example
DarlPoints (not DartPoints) has 4 measurements for 23 Darl points
Create Z-scores to weight variables equally with Data | Manage variables in active data set | Standardize variables …
(or could use PCA and PC scores); a base R equivalent is sketched below
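A sketch of the same standardization without the menus, assuming the raw measurement columns are named Length, Thickness, Weight, and Width (hypothetical names chosen to match the Z. variables used later):

vars <- c("Length", "Thickness", "Weight", "Width")          # assumed raw column names
DarlPoints[, paste0("Z.", vars)] <- scale(DarlPoints[, vars])  # column z-scores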
Example (cont)
Use Rcmdr to partition the data into 5, 4, 3, and 2 groups
Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis …
TWSS = 15.42, 19.78, 25.83, 34.24
Select the group number and have Rcmdr add the groups to the data set (a TWSS sketch follows)
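A sketch of the same TWSS sequence with kmeans(), using columns 6:9 (the Z-scores, as in the functions later in the deck); random starts mean the values will only approximately match those above:

sapply(5:2, function(k)
  kmeans(DarlPoints[, 6:9], centers = k, iter.max = 10, nstart = 10)$tot.withinss)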
Evaluation
Evaluate groups against randomized data
Randomly permute each variable
Run k-means
Compare random and non-random results
Evaluate groups against external criteria (location, material, age, etc)
# Plot total within-group sum of squares for ming to maxg clusters
KMPlotWSS <- function(data, ming, maxg) {
  WSS <- sapply(ming:maxg, function(x)
    kmeans(data, centers = x, iter.max = 10, nstart = 10)$tot.withinss)
  plot(ming:maxg, WSS, las = 1, type = "b",
       xlab = "Number of Groups",
       ylab = "Total Within Sum of Squares", pch = 16)
  print(WSS)
}
# Cluster 'samples' randomized copies of the data and return quantiles
# of their within-group sums of squares for min to max clusters
KMRandWSS <- function(data, samples, min, max) {
  KRand <- function(data, min, max) {
    Rnd <- apply(data, 2, sample)  # permute each column independently
    sapply(min:max, function(y)
      kmeans(Rnd, y, iter.max = 10, nstart = 5)$tot.withinss)
  }
  Sim <- sapply(1:samples, function(x) KRand(data, min, max))
  t(apply(Sim, 1, quantile,
          c(0, .005, .01, .025, .5, .975, .99, .995, 1)))
}
# Compare data to randomized sets
KMPlotWSS(DarlPoints[, 6:9], 1, 10)
Qtiles <- KMRandWSS(DarlPoints[, 6:9], 2000, 1, 10)
matlines(1:10, Qtiles[, c(1, 5, 9)],  # min, median, max of the random runs
         lty = c(3, 2, 3), lwd = 2, col = "dark gray")
legend("topright", c("Observed", "Median (Random)", "Max/Min Random"),
       col = c("black", "dark gray", "dark gray"),
       lwd = c(1, 2, 2), lty = c(1, 2, 3))
Hierarchical Methods
Agglomerative – successive merging
Divisive – successive splitting
Monothetic – binary data
Polythetic – interval/ratio
Agglomerative
At the start all rows are in separate groups (n groups or clusters)
At each stage two rows are merged, a row and a group are merged, or two groups are merged
The process stops when all rows are in a single cluster
Agglomeration Methods
How should clusters be formed? (each method below corresponds to a value of hclust()'s method argument; see the sketch after this list)
Single Linkage – irregular shape groups
Average Linkage – spherical groups
Complete Linkage – spherical groups
Ward's Method – spherical groups
Median – dendrogram inversions
Centroid – dendrogram inversions
McQuitty – similarity by reciprocal pairs
Agglomerating with R
Base R includes hclust() for forming groups by agglomeration
The cluster package includes agnes()
Rcmdr uses hclust() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis … (a command-line sketch follows)
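A sketch of the Darl point analysis from the command line; Rcmdr constructs a similar call, though its exact arguments may differ:

library(cluster)                                                # for agnes()
HClust.1 <- hclust(dist(DarlPoints[, 6:9]), method = "ward.D")  # base R
AgnesV   <- agnes(DarlPoints[, 6:9], method = "ward")           # cluster package equivalent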
HClust
Rcmdr menus provide
Cluster analysis and plot
Summary statistics by group
Adding cluster to data set
To get a traditional dendrogram:
plot(HClust.1, hang = -1, main = "Darl Points",
     xlab = "Catalog Number", sub = "Method=Ward; Distance=Euclidean")
rect.hclust(HClust.1, 3)
summary(as.factor(cutree(HClust.1, k = 3)))  # Cluster Sizes
 1  2  3
11  6  6

by(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight + Z.Width,
                DarlPoints),
   as.factor(cutree(HClust.1, k = 3)), mean)  # Cluster Centroids

INDICES: 1
   Z.Length Z.Thickness    Z.Weight     Z.Width
 -0.1345150  -0.1585615  -0.2523805  -0.1241642
------------------------------------------------------------
INDICES: 2
   Z.Length Z.Thickness    Z.Weight     Z.Width
 -1.1085541  -0.9209550  -0.9400026  -0.8200594
------------------------------------------------------------
INDICES: 3
  Z.Length Z.Thickness    Z.Weight     Z.Width
  1.355165    1.211651    1.402700    1.047694

> biplot(princomp(model.matrix(~ -1 + Z.Length + Z.Thickness +
        Z.Weight + Z.Width, DarlPoints)),
      xlabs = as.character(cutree(HClust.1, k = 3)))
The merge history and heights (negative entries are single rows, positive entries refer to earlier merge steps, and the third column is the height at which the merge occurred):

> cbind(HClust.1$merge, HClust.1$height)
      [,1] [,2]       [,3]
 [1,]  -12  -13  0.3983821
 [2,]   -2   -3  0.5112670
 [3,]   -9  -14  0.5247650
 [4,]  -10  -17  0.5572146
 [5,]  -15    3  0.7362171
 [6,]   -1  -11  0.7471874
 [7,]   -6  -18  0.8120594
 [8,]   -7   -8  0.8491895
 [9,]    4    5  0.9841552
[10,]    2    6  1.2150606
[11,]  -19  -21  1.2300507
[12,]    1   10  1.4059158
[13,]  -22   11  1.4963400
[14,]  -16  -20  1.5800167
[15,]   -4    9  1.6195709
[16,]   -5   12  2.1556543
[17,]  -23   13  2.4007863
[18,]    7   14  2.4252670
[19,]    8   17  3.2632812
[20,]   16   18  4.9021149
[21,]   15   20  6.6290417
[22,]   19   21 18.7730146
Divisive
At the start all rows are considered to be a single group
At each stage a group is divided into two groups based on the average dissimilarities
The process stops when all rows are in separate clusters (see the sketch below)
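The slides do not name a routine, but the standard divisive method in R is diana() from the cluster package; a minimal sketch on the Darl point Z-scores:

library(cluster)
dv <- diana(DarlPoints[, 6:9])          # divisive analysis clustering; Euclidean by default
plot(dv, which.plots = 2)               # dendrogram
groups <- cutree(as.hclust(dv), k = 3)  # cut into 3 groups for comparison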