Grouping Data

Methods of cluster analysis

trish-goza · Uploaded 2015-11-24



Grouping Data

Methods of cluster analysis

Goals 1

We want to identify groups of similar artifacts, features, sites, graves, etc. that represent cultural, functional, or chronological differences

We want to create groups as a measurement technique to see how they vary with external variables

Goals 2

We want to cluster artifacts or sites based on their location to identify spatial clusters

Real vs. Created Types

Differences in goals

Real types are the aim of Goal 1

Created types are the aim of Goal 2

Debate over whether Real types can be discovered with any degree of certainty

Cluster analysis always produces groups, whether or not they are meaningful – you must confirm their utility
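A quick sketch of why the groups must be confirmed: k-means will happily partition pure noise. (Hypothetical random data, not from the Darl points.)

```r
set.seed(7)
x <- matrix(runif(100), ncol = 2)           # 50 random points: no real structure
km <- kmeans(x, centers = 3, nstart = 10)   # ask for 3 groups anyway
km$size                                     # three "clusters" are returned regardless
```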

Initial Decisions 1

What variables to use?

All possible

Constructed variables (from principal components, correspondence analysis, or multi-dimensional scaling)

Restricted set of variables that support the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)

Initial Decisions 2

How to transform the variables?

Log transforms

Conversion to percentages (to weight rows equally)

Size standardization (dividing by geometric mean)

Z-scores (to weight columns equally)

Conversion of categorical variables
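These transformations can be sketched in a few lines of R. The `counts` data frame is made up for illustration:

```r
# Hypothetical artifact-count data: 3 assemblages (rows), 3 artifact types (columns)
counts <- data.frame(flakes = c(10, 40, 5), blades = c(2, 8, 1), cores = c(1, 2, 4))

logd    <- log(counts + 1)                                  # log transform (+1 guards against zeros)
pct     <- prop.table(as.matrix(counts), margin = 1) * 100  # row percentages: weight rows equally
z       <- scale(counts)                                    # Z-scores: weight columns equally
gm      <- apply(counts, 1, function(r) exp(mean(log(r))))  # geometric mean of each row
sizestd <- counts / gm                                      # size standardization
```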

Initial Decisions 3

How to measure distance?

Types of variables

Goals of the analysis

If uncertain, try multiple methods
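In R, trying several distance measures is just a matter of the `method` argument to `dist()` (illustrated here on made-up standardized data):

```r
set.seed(5)
x <- scale(matrix(rnorm(40), ncol = 4))   # 10 rows, 4 standardized variables
d_euc <- dist(x)                          # Euclidean (the default)
d_man <- dist(x, method = "manhattan")    # Manhattan / city-block
# For mixed or categorical variables, cluster::daisy() with Gower distance is an option
```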

Methods of Grouping

Partitioning Methods – divide the data into groups

Hierarchical Methods

Agglomerating – from n clusters to 1 cluster

Divisive – from 1 cluster to k clusters

Partitioning

K-means, K-medoids, fuzzy methods

Measure of distance – but do not need to compute full distance matrix

Specify number of groups in advance

Minimizing within group variability

Finds spherical clusters

Procedure

Start with centers for k groups (user-supplied or random)

Repeat up to iter.max times (default 10)

Allocate rows to their closest center

Recalculate the center positions

Stop

Different criteria for allocation

Use multiple starts (e.g. 5–15)
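The procedure above maps directly onto the arguments of base R's `kmeans()` (shown on made-up data):

```r
set.seed(42)
x  <- matrix(rnorm(100), ncol = 2)   # 50 hypothetical points
# centers: number of groups (or user-supplied starting centers)
# iter.max: maximum allocate/recalculate cycles; nstart: number of random starts
km <- kmeans(x, centers = 3, iter.max = 10, nstart = 10)
km$centers        # final center positions
km$tot.withinss   # within-group sum of squares that was minimized
```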

Evaluation 1

Compute groups for a range of cluster sizes and plot the within-group sums of squares to look for sharp increases

Cluster randomized versions of the data and compare the results

Examine a table of statistics by group

Evaluation 2

Plot groups in two dimensions with PCA, CA, or MDS

Compare the groups using data or information not included in the analysis

Partitioning Using R

Base R includes kmeans() for forming groups by partitioning

Rcmdr includes KMeans() to iterate kmeans() for the best solution

Package cluster includes pam(), which uses medoids for more robust grouping, and fanny(), which forms fuzzy clusters
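A minimal sketch of the two cluster-package functions, on made-up standardized data:

```r
library(cluster)   # ships with R as a recommended package
set.seed(1)
x <- scale(matrix(rnorm(60), ncol = 3))   # 20 hypothetical rows
p <- pam(x, k = 2)     # partitioning around medoids: more robust to outliers
f <- fanny(x, k = 2)   # fuzzy clustering: f$membership gives degrees of membership
p$medoids              # the two representative rows
```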

Example

Darl Points (not Dart Points) has 4 measurements for 23 Darl points

Create Z-scores to weight variables equally with Data | Manage variables in active data set | Standardize variables …

(or could use PCA and PC scores)

Example (cont)

Use Rcmdr to partition the data into 5, 4, 3, and 2 groups

Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis …

TWSS = 15.42, 19.78, 25.83, 34.24

Select a group number and have Rcmdr add the group assignment to the data set

Evaluation

Evaluate groups against randomized data

Randomly permute each variable

Run k-means

Compare random and non-random results

Evaluate groups against external criteria (location, material, age, etc.)

KMPlotWSS <- function(data, ming, maxg) {
  # Total within-group sum of squares for ming..maxg clusters
  WSS <- sapply(ming:maxg, function(x)
    kmeans(data, centers = x, iter.max = 10, nstart = 10)$tot.withinss)
  plot(ming:maxg, WSS, las = 1, type = "b", xlab = "Number of Groups",
       ylab = "Total Within Sum of Squares", pch = 16)
  print(WSS)
}

KMRandWSS <- function(data, samples, min, max) {
  # Permute each column independently to destroy any structure, then run kmeans
  KRand <- function(data, min, max) {
    Rnd <- apply(data, 2, sample)
    sapply(min:max, function(y)
      kmeans(Rnd, y, iter.max = 10, nstart = 5)$tot.withinss)
  }
  Sim <- sapply(1:samples, function(x) KRand(data, min, max))
  # Quantiles of the randomized WSS at each cluster size
  t(apply(Sim, 1, quantile, c(0, .005, .01, .025, .5, .975, .99, .995, 1)))
}

# Compare data to randomized sets
KMPlotWSS(DarlPoints[, 6:9], 1, 10)
Qtiles <- KMRandWSS(DarlPoints[, 6:9], 2000, 1, 10)
matlines(1:10, Qtiles[, c(1, 5, 9)], lty = c(3, 2, 3), lwd = 2,
         col = "dark gray")
legend("topright", c("Observed", "Median (Random)", "Max/Min Random"),
       col = c("black", "dark gray", "dark gray"),
       lwd = c(1, 2, 2), lty = c(1, 2, 3))

Hierarchical Methods

Agglomerative – successive merging

Divisive - successive splitting

Monothetic – binary data

Polythetic – interval/ratio data

Agglomerative

At the start all rows are in separate groups (n groups or clusters)

At each stage two rows are merged, a row and a group are merged, or two groups are merged

The process stops when all rows are in a single cluster

Agglomeration Methods

How should clusters be formed?

Single Linkage – irregular-shaped groups

Average Linkage – spherical groups

Complete Linkage – spherical groups

Ward’s Method – spherical groups

Median – dendrogram inversions

Centroid – dendrogram inversions

McQuitty – similarity by reciprocal pairs
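In base R these agglomeration methods are selected via the `method` argument to `hclust()`, sketched here on made-up standardized data:

```r
set.seed(2)
x <- scale(matrix(rnorm(40), ncol = 4))   # 10 standardized rows
d <- dist(x)                              # Euclidean distance matrix
hc_single <- hclust(d, method = "single")    # tends to chain into irregular shapes
hc_ward   <- hclust(d, method = "ward.D2")   # compact, roughly spherical groups
hc_median <- hclust(d, method = "median")    # can produce dendrogram inversions
```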

Agglomerating with R

Base R includes hclust() for hierarchical clustering by agglomeration

Package cluster includes agnes()

Rcmdr uses hclust() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis …

HClust

Rcmdr menus provide

Cluster analysis and plot

Summary statistics by group

Adding cluster to data set

To get a traditional dendrogram:

plot(HClust.1, hang = -1, main = "Darl Points",
     xlab = "Catalog Number", sub = "Method=Ward; Distance=Euclidean")
rect.hclust(HClust.1, 3)   # draw rectangles around the 3 clusters

summary(as.factor(cutree(HClust.1, k = 3)))   # Cluster Sizes
 1  2  3
11  6  6

by(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight + Z.Width, DarlPoints),
   as.factor(cutree(HClust.1, k = 3)), mean)  # Cluster Centroids

INDICES: 1
  Z.Length Z.Thickness    Z.Weight     Z.Width
-0.1345150  -0.1585615  -0.2523805  -0.1241642
------------------------------------------------------------
INDICES: 2
  Z.Length Z.Thickness    Z.Weight     Z.Width
-1.1085541  -0.9209550  -0.9400026  -0.8200594
------------------------------------------------------------
INDICES: 3
  Z.Length Z.Thickness    Z.Weight     Z.Width
  1.355165    1.211651    1.402700    1.047694

biplot(princomp(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight + Z.Width,
                             DarlPoints)),
       xlabs = as.character(cutree(HClust.1, k = 3)))

cbind(HClust.1$merge, HClust.1$height)   # merge history and merge heights
      [,1] [,2]       [,3]
 [1,]  -12  -13  0.3983821
 [2,]   -2   -3  0.5112670
 [3,]   -9  -14  0.5247650
 [4,]  -10  -17  0.5572146
 [5,]  -15    3  0.7362171
 [6,]   -1  -11  0.7471874
 [7,]   -6  -18  0.8120594
 [8,]   -7   -8  0.8491895
 [9,]    4    5  0.9841552
[10,]    2    6  1.2150606
[11,]  -19  -21  1.2300507
[12,]    1   10  1.4059158
[13,]  -22   11  1.4963400
[14,]  -16  -20  1.5800167
[15,]   -4    9  1.6195709
[16,]   -5   12  2.1556543
[17,]  -23   13  2.4007863
[18,]    7   14  2.4252670
[19,]    8   17  3.2632812
[20,]   16   18  4.9021149
[21,]   15   20  6.6290417
[22,]   19   21 18.7730146

Divisive

At the start all rows are considered to be a single group

At each stage a group is divided into two groups based on the average dissimilarities

The process stops when all rows are in separate clusters
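The divisive procedure above is implemented by diana() in the cluster package; a minimal sketch on made-up standardized data:

```r
library(cluster)
set.seed(3)
x  <- scale(matrix(rnorm(40), ncol = 4))   # 10 standardized rows
dv <- diana(x)                      # divisive analysis: splits from 1 group downward
grp <- cutree(as.hclust(dv), k = 3) # stop the splitting at 3 groups
table(grp)                          # rows per group
```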