Measuring Distance
Input for Multidimensional Scaling and Clustering
Distances and Similarities
Both are ways of measuring how similar two objects are
Distances increase as objects are less similar. The distance of an object to itself is 0
Similarities increase as objects are more similar. The similarity of an object to itself is the maximum value for the similarity measure
Distance Examples
Mileage between two towns measured as straight-line (Euclidean) distance ("as the crow flies"), as driving distance, or as great-circle (spherical) distance
Instead of geographic locations, we can treat measurements such as length, width, and thickness of an artifact as defining its position
Similarity Examples
The number of characteristics two objects have in common (cultural traits, genes, presence/absence traits)
Similarity measures can be converted to distances by subtracting each similarity from the maximum possible similarity
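The conversion from similarity to distance can be sketched in R as follows. The trait matrix and object names are hypothetical, and similarity is assumed here to be the number of traits on which two objects agree, so the maximum possible similarity is the number of traits.

```r
# Hypothetical presence/absence data: 3 objects scored on 5 traits
traits <- matrix(c(1, 1, 0, 1, 0,
                   1, 0, 0, 1, 1,
                   0, 1, 1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), paste0("t", 1:5)))

n_traits <- ncol(traits)
n_obj <- nrow(traits)

# Similarity: number of traits on which two objects agree
sim <- matrix(0, n_obj, n_obj,
              dimnames = list(rownames(traits), rownames(traits)))
for (i in seq_len(n_obj)) {
  for (j in seq_len(n_obj)) {
    sim[i, j] <- sum(traits[i, ] == traits[j, ])
  }
}

# Distance: subtract each similarity from the maximum possible similarity
d <- as.dist(n_traits - sim)
print(sim)
print(d)
```

Each object's similarity to itself equals the maximum (5), so its distance to itself is 0, as the definitions above require.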
Interval/Ratio Measures
Manhattan Distance (or City Block, 1-norm)
Euclidean Distance (2-norm) and Squared Euclidean Distance
Minkowski Distance (p-norm)
Chebyshev Distance (Maximum Distance, infinity norm)
Definitions
p         Distance
1         Manhattan
2         Euclidean
p         Minkowski
Infinity  Chebyshev, Maximum
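A minimal R sketch of how the value of p selects the distance, assuming the usual Minkowski form d(x, y) = (sum of |x_i - y_i|^p)^(1/p). The two objects and the helper name minkowski() are hypothetical; dist() is the base R function.

```r
# Two hypothetical objects measured on three variables
x <- rbind(a = c(1, 2, 3),
           b = c(4, 6, 3))

# Minkowski distance of order p between vectors u and v
minkowski <- function(u, v, p) sum(abs(u - v)^p)^(1 / p)

d_manhattan <- minkowski(x["a", ], x["b", ], p = 1)  # sum of absolute differences
d_euclidean <- minkowski(x["a", ], x["b", ], p = 2)  # square root of summed squares
d_chebyshev <- max(abs(x["a", ] - x["b", ]))         # limit as p goes to infinity

# Base R's dist() gives the same values
d_base <- c(manhattan = as.numeric(dist(x, method = "manhattan")),
            euclidean = as.numeric(dist(x, method = "euclidean")),
            maximum   = as.numeric(dist(x, method = "maximum")))
print(c(d_manhattan, d_euclidean, d_chebyshev))
print(d_base)
```

The distance shrinks from 7 to 5 to 4 as p grows, because larger p gives more weight to the single largest difference.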
Counts
Ecologists use counts of species between plots to analyze compositional changes in community structure
Bray-Curtis compares the number of specimens and number of overlapping species
Definitions
Bray-Curtis Dissimilarity
Note: If samples j and k are percentages, then the denominator becomes 200.
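The formula itself is not preserved in the slide text. The sketch below assumes the standard form of Bray-Curtis dissimilarity, the sum of absolute differences in counts divided by the sum of all counts in both samples, and uses hypothetical counts to show why the denominator is 200 for percentages. The function name bray_curtis() is illustrative.

```r
# Bray-Curtis dissimilarity between two samples of species counts
# (standard form, assumed here)
bray_curtis <- function(xj, xk) {
  sum(abs(xj - xk)) / sum(xj + xk)
}

# Hypothetical counts of four species in plots j and k
plot_j <- c(10, 0, 5, 5)
plot_k <- c(5, 5, 0, 10)
bc_counts <- bray_curtis(plot_j, plot_k)

# As percentages each sample sums to 100, so the denominator is 200
pct_j <- 100 * plot_j / sum(plot_j)
pct_k <- 100 * plot_k / sum(plot_k)
denominator <- sum(pct_j + pct_k)
bc_pct <- sum(abs(pct_j - pct_k)) / 200

print(c(counts = bc_counts, percent = bc_pct, denominator = denominator))
```

The vegan package listed later in these slides offers a packaged version through vegdist() with method = "bray".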
Ordinal Measures
Few measures exist specifically for rank data, but rank correlation coefficients (Spearman, Kendall) can be used
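A short sketch of using a rank correlation as the basis for a distance. The ranks are hypothetical, and converting the correlation with 1 - rho is one common choice assumed here, not something the slides specify.

```r
# Hypothetical ranks on five variables for three objects (rows)
ranks <- rbind(A = c(1, 2, 3, 4, 5),
               B = c(2, 1, 3, 5, 4),
               C = c(5, 4, 3, 2, 1))

# cor() correlates columns, so transpose to compare objects
rho <- cor(t(ranks), method = "spearman")

# Assumed conversion: 1 - rho runs from 0 (identical ranking) to 2 (reversed)
d_rank <- as.dist(1 - rho)
print(round(rho, 3))
print(d_rank)
```

Passing method = "kendall" to cor() gives Kendall's tau instead.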
Dichotomies
Can use interval/ratio measures
Numerous options based on 2x2 table
Many similarity measures based on weighting of presence/presence and absence/absence
Subtract from 1 to create distances
Definitions
          Present  Absent
Present   a        b
Absent    c        d
Simple Matching Coefficient: (a+d)/(a+b+c+d)
Jaccard's Coefficient (asymmetric binary): a/(a+b+c)
Phi and Yule's Q measures of association
ade4 and proxy have many different options for dichotomies
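The 2x2 table and the two coefficients can be sketched in base R. The two trait vectors are hypothetical, and the names n_a to n_d stand for the cells a to d of the table.

```r
# Two hypothetical objects scored present (1) or absent (0) on six traits
x <- c(1, 1, 0, 1, 0, 0)
y <- c(1, 0, 0, 1, 1, 0)

# Cells of the 2x2 table
n_a <- sum(x == 1 & y == 1)  # present in both
n_b <- sum(x == 1 & y == 0)  # present in x only
n_c <- sum(x == 0 & y == 1)  # present in y only
n_d <- sum(x == 0 & y == 0)  # absent in both

smc     <- (n_a + n_d) / (n_a + n_b + n_c + n_d)  # Simple Matching Coefficient
jaccard <- n_a / (n_a + n_b + n_c)                # ignores joint absences

# dist(..., method = "binary") returns the Jaccard distance, 1 - jaccard
d_binary <- as.numeric(dist(rbind(x, y), method = "binary"))
print(c(smc = smc, jaccard = jaccard, d_binary = d_binary))
```

The two coefficients differ only in whether the absent/absent cell d counts as agreement.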
Nominal Variables
Similarity can be measured with chi-square based measures
Convert to multiple dichotomies
E.g. Temper: Sand, Silt, Gravel becomes three variables: TSand, TSilt, TGravel
Then use measures for dichotomies/metric variables
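Converting a nominal variable into dichotomies might look like this in base R. The four sherds and their temper values are hypothetical; the column names follow the TSand, TSilt, TGravel pattern above.

```r
# Hypothetical temper values for four sherds
temper <- factor(c("Sand", "Silt", "Gravel", "Sand"))

# One 0/1 column per category
dummies <- sapply(levels(temper), function(lv) as.integer(temper == lv))
colnames(dummies) <- paste0("T", levels(temper))
rownames(dummies) <- paste0("sherd", seq_along(temper))
print(dummies)

# The dichotomies can now go into a binary distance measure
d_temper <- dist(dummies, method = "binary")
print(d_temper)
```

Sherds with the same temper end up at distance 0 and sherds with different tempers at distance 1.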
Multiple Types
Gower’s Index is the only one that computes a similarity index using variables with different levels of measurement. Take the mean of the per-variable similarity scores:
Presence/Absence – Jaccard
Categorical – 1 if the same, 0 if not
Interval/Ratio/Ranks – 1 minus the absolute difference divided by the range
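A hand-rolled sketch of the averaging step for one variable of each type. The data frame, its column names, and the function gower_sim() are hypothetical. The interval score is taken as 1 minus the range-scaled absolute difference so that every score is a similarity between 0 and 1.

```r
# Hypothetical mixed-type data for three artifacts
art <- data.frame(
  decorated = c(1, 1, 0),                 # presence/absence
  temper    = c("Sand", "Silt", "Sand"),  # categorical
  length    = c(10, 14, 20),              # interval/ratio
  stringsAsFactors = FALSE
)

gower_sim <- function(i, j, data) {
  # Presence/absence, Jaccard style: a joint absence is not counted at all
  both_absent <- data$decorated[i] == 0 && data$decorated[j] == 0
  s_pres <- if (both_absent) NULL else
    as.numeric(data$decorated[i] == 1 && data$decorated[j] == 1)
  # Categorical: 1 if the same, 0 if not
  s_cat <- as.numeric(data$temper[i] == data$temper[j])
  # Interval/ratio: 1 minus the absolute difference divided by the range
  s_num <- 1 - abs(data$length[i] - data$length[j]) / diff(range(data$length))
  # Gower similarity: mean of the per-variable scores that apply
  mean(c(s_pres, s_cat, s_num))
}

s12 <- gower_sim(1, 2, art)
s13 <- gower_sim(1, 3, art)
s33 <- gower_sim(3, 3, art)
print(c(s12 = s12, s13 = s13, s33 = s33))
```

For real data, daisy() in the cluster package with metric = "gower" is a packaged implementation; its handling of binary variables depends on how they are declared, so results need not match this sketch.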
Issues
Weighting – how to weight variables with different variances – standardization, weighting
Correlations between variables – how (and whether) to take correlations into account (Mahalanobis Distances)
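The weighting issue can be illustrated with a small sketch in base R. The measurements are hypothetical; scale() centers each column and divides it by its standard deviation.

```r
# Hypothetical artifacts: length varies far more than thickness
x <- cbind(length    = c(40, 55, 70, 85),
           thickness = c(5, 9, 4, 8))

d_raw <- dist(x)   # dominated by length, the high-variance variable
z     <- scale(x)  # standardize: mean 0, standard deviation 1 per column
d_std <- dist(z)   # both variables now contribute on the same scale

print(round(d_raw, 2))
print(round(d_std, 2))
```

Standardizing is itself a weighting decision: it gives every variable equal influence whether or not that is appropriate for the data.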
Distance Matrix
For simple analyses, dist() in base R provides euclidean, maximum, manhattan, canberra, binary (Jaccard), and minkowski
Many other packages provide additional measures. See ade4, amap, cluster, ecodist, labdsv, proxy, and vegan
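A brief sketch of how a dist() result feeds multidimensional scaling and clustering, the two uses named in the title. The four points are hypothetical, and the choice of average linkage is an assumption made only for illustration.

```r
# Hypothetical coordinates for four specimens forming two obvious pairs
x <- rbind(s1 = c(1, 2), s2 = c(2, 1), s3 = c(8, 9), s4 = c(9, 8))

d <- dist(x, method = "euclidean")  # lower-triangle "dist" object
m <- as.matrix(d)                   # full symmetric matrix, zeros on the diagonal

mds    <- cmdscale(d, k = 2)            # classical multidimensional scaling
hc     <- hclust(d, method = "average") # hierarchical clustering
groups <- cutree(hc, k = 2)             # cut the tree into two clusters

print(round(m, 2))
print(groups)
```

Both cmdscale() and hclust() accept the "dist" object directly, so any of the distance measures discussed here can be substituted.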
# Load Darl
# Rcmdr to create scatterplot matrix
> Euclid <- dist(Darl[,2:5])
> Euclid
          35-3043   35-2871   35-2866   36-3619   36-3520
35-2871 11.437657
35-2866  5.380520  6.542935
36-3619 14.621217  3.682391  9.570266
36-3520 15.309148  4.068169 10.163661  1.757840
36-3036  7.760155  4.442972  2.495997  7.195832  7.860662
> scatterplot(Width~Length, reg.line=lm, smooth=FALSE, spread=FALSE, pch=16, id.n=6, boxplots=FALSE, ellipse=TRUE, grid=FALSE, data=Darl)
# Squared Mahalanobis distance of each specimen from the column means
> mahalanobis(Darl[,2:3], colMeans(Darl[,2:3]), cov=cov(Darl[,2:3]))
  35-3043   35-2871   35-2866   36-3619   36-3520   36-3036
2.2577596 1.8173684 0.4641912 2.9652763 1.7527347 0.7426699
> install.packages("ecodist")
> library(ecodist)
> Mahal <- distance(Darl[,2:3], method="mahalanobis")
> Mahal
          35-3043   35-2871   35-2866   36-3619   36-3520
35-2871 4.9367446
35-2866 0.6900956 2.8905096
36-3619 8.5903617 7.5849187 4.7250487
36-3520 6.8826044 0.6084649 3.6631704 4.9720621
36-3036 2.4467510 4.8835727 0.8163226 1.9192663 4.3901066
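Pairwise Mahalanobis distances can also be sketched in base R by whitening the data with the covariance matrix and then taking ordinary Euclidean distances. The measurements below are hypothetical, not the Darl data. The sketch is not claimed to reproduce the ecodist output above, whose scaling may differ.

```r
# Hypothetical length and width measurements for six artifacts
x <- cbind(length = c(42, 55, 48, 60, 63, 50),
           width  = c(20, 24, 23, 30, 29, 22))

S <- cov(x)

# Whiten: with S = R'R (Cholesky factor R), x %*% solve(R) has identity covariance
z <- x %*% solve(chol(S))

# Euclidean distance on whitened data is the pairwise Mahalanobis distance
d_mahal <- dist(z)

# Check one pair against base R's mahalanobis(), which returns a squared distance
d12_sq <- mahalanobis(x[1, ], center = x[2, ], cov = S)
print(round(d_mahal, 3))
print(d12_sq)
```

Because the transformation removes the correlation between length and width, two specimens that differ along the main trend of the data are treated as closer than their raw Euclidean distance suggests.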
# Rcmdr
> .PC <- princomp(~Length+Weight, cor=TRUE, data=Darl)
> Darl$PC1 <- .PC$scores[,1]
> Darl$PC2 <- .PC$scores[,2]
# Typed commands
> PCDist <- dist(Darl[,6:7])
> PCDist
          35-3043   35-2871   35-2866   36-3619   36-3520
35-2871 2.5498737
35-2866 2.1968323 1.1918768
36-3619 3.7858013 1.2539806 1.9883494
36-3520 4.2220041 1.8034110 2.1957351 0.7029308
36-3036 2.6677120 0.9201698 0.5717135 1.4339465 1.6290415
> scatterplot(PC2~PC1, reg.line=FALSE, smooth=FALSE, spread=FALSE, grid=FALSE, boxplots=FALSE, pch=16, ellipse=TRUE, id.n=6, span=0.5, data=Darl)
[1] "35-3043" "35-2866" "36-3619" "36-3520" "35-2871" "36-3036"
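When every principal component is kept, as with the two scores from two variables above, the component scores are only a rotation of the standardized data. Distances between scores therefore equal Euclidean distances between standardized variables. The sketch below checks this on hypothetical measurements, not the Darl data. It relies on princomp() standardizing with divisor n rather than n - 1 when cor=TRUE.

```r
# Hypothetical length and weight measurements for six artifacts
x <- cbind(length = c(42, 55, 48, 60, 63, 50),
           weight = c(7.1, 9.8, 8.0, 12.5, 11.9, 8.4))

pc   <- princomp(x, cor = TRUE)
d_pc <- dist(pc$scores)  # distances on both component scores

# Standardize the way princomp() does: standard deviation with divisor n
n    <- nrow(x)
sd_n <- apply(x, 2, function(v) sqrt(sum((v - mean(v))^2) / n))
z    <- scale(x, center = TRUE, scale = sd_n)
d_z  <- dist(z)

print(all.equal(as.vector(d_pc), as.vector(d_z)))
```

Distances change only when some components are dropped, which discards the variation along them.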