Slide 1: Clustering

Slide 2: Clustering Preliminaries
- Log2 transformation
- Row centering and normalization
- Filtering
Slide 3: Log2 Transformation

Log2 transformation makes the noise independent of the mean, so that similar differences have the same meaning across the dynamic range of the values.

Advantage of the log2 transformation: equal fold changes map to equal differences, giving us the desired dist(100, 200) = dist(1000, 2000).
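A quick illustrative sketch (not part of the original slides) of why the log2 transform equalizes fold changes in NumPy:

```python
import numpy as np

# Two pairs of values, each with the same 2-fold change
a = np.array([100.0, 1000.0])
b = np.array([200.0, 2000.0])

# On the raw scale the distances differ by a factor of 10
raw_dist = np.abs(b - a)                     # [100., 1000.]

# After log2 transformation both pairs are exactly 1 apart
log_dist = np.abs(np.log2(b) - np.log2(a))   # [1., 1.]
```

So dist(100, 200) = dist(1000, 2000) holds on the log2 scale, as the slide requires.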
Slide 4: Row Centering & Normalization

Starting from a row x:

    y = x - mean(x)      (centering)
    z = y / stdev(y)     (normalization)
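The two steps above can be sketched in NumPy, assuming an expression matrix with genes as rows and samples as columns:

```python
import numpy as np

def center_and_normalize_rows(x: np.ndarray) -> np.ndarray:
    """Center each row to mean 0, then scale each row to standard deviation 1."""
    y = x - x.mean(axis=1, keepdims=True)   # y = x - mean(x)
    z = y / y.std(axis=1, keepdims=True)    # z = y / stdev(y)
    return z

expr = np.array([[1.0, 2.0, 3.0],
                 [10.0, 20.0, 30.0]])
z = center_and_normalize_rows(expr)
# Every row of z now has mean 0 and standard deviation 1
```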
Slide 5: Filtering Genes

Filtering is very important for unsupervised analysis, since many noisy genes may totally mask the structure in the data.

After finding a hypothesis, one can identify marker genes in a larger dataset via supervised analysis.

[Diagram: all genes → filtering → clustering; then supervised analysis and marker selection on the larger dataset]
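A simple variation filter, sketched here as a hypothetical NumPy helper (not the exact logic of any GenePattern module), keeps the genes with the highest variance across samples:

```python
import numpy as np

def top_variance_genes(expr: np.ndarray, n_keep: int) -> np.ndarray:
    """Return row indices of the n_keep genes with the highest variance across samples."""
    variances = expr.var(axis=1)
    # Sort ascending, reverse to descending, keep the top n_keep indices
    return np.argsort(variances)[::-1][:n_keep]

expr = np.array([[1.0, 1.1, 0.9, 1.0],    # nearly constant gene
                 [0.0, 5.0, -5.0, 2.0],   # highly variable gene
                 [2.0, 2.0, 2.0, 2.0]])   # constant gene
keep = top_variance_genes(expr, n_keep=1)
# keep -> array([1]): only the highly variable gene survives the filter
```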
Slide 6: Clustering/Class Discovery

Aim: Partition the data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are "more similar" to each other than to points in other clusters.

Challenge: The problem is not well defined; there is no single objective function or evaluation criterion.

[Example figure: how many clusters? The same point set can be read as 2 clusters + noise, 3 clusters + noise, or 20 clusters; hierarchical clustering suggests 2, or 3 + noise.]

One has to choose:
- A similarity/distance measure
- A clustering method
- A way to evaluate the resulting clusters
Slide 7: Clustering in GenePattern

Representative based: find representatives/centroids
- K-means: KMeansClustering
- Self-Organizing Maps (SOM): SOMClustering

Bottom-up (agglomerative): HierarchicalClustering, which hierarchically unites clusters
- Single linkage analysis
- Complete linkage analysis
- Average linkage analysis

Clustering-like:
- NMFConsensus
- PCA (Principal Components Analysis)

There is no BEST method! For easy problems, most of them work. Each algorithm has its own assumptions, strengths, and weaknesses.
Slide 8: K-means Clustering

Aim: Partition the data points into K subsets, and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.
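In symbols (a restatement of the aim above, writing the clusters as $C_1,\dots,C_K$ and their centroids as $\mu_1,\dots,\mu_K$):

```latex
\min_{C_1,\dots,C_K,\;\mu_1,\dots,\mu_K}
  \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```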
Slide 9: K-means: Algorithm

1. Initialize centroids at random positions.
2. Iterate:
   - Assign each data point to its closest centroid.
   - Move each centroid to the center of its assigned points.
3. Stop when converged.

Guaranteed to reach a local minimum.

[Figure: K-means iterations with K = 3, shown at iteration 0, 1, and 2]
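A minimal NumPy sketch of the loop above (illustrative only; GenePattern's KMeansClustering module is a separate implementation):

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs: K-means with k=2 recovers them
pts = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
centroids, labels = kmeans(pts, k=2)
```

Note the empty-cluster guard: if a centroid loses all its points in an assignment step, it simply keeps its previous position rather than producing a NaN mean.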
Slide 10: K-means: Summary

- The result depends on the initial centroid positions.
- Fast algorithm: only needs to compute distances from data points to centroids.
- Must preset the number of clusters K.
- Fails for non-spherical distributions.
Slide 11: Hierarchical Clustering

[Figure: five points (1-5) successively joined into a dendrogram; the vertical axis shows the distance between joined clusters]
Slide 12: Hierarchical Clustering

[Figure: the same dendrogram with leaves ordered 5, 2, 4, 1, 3; the vertical axis shows the distance between joined clusters]

The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).
Need to define the distance between the newly merged cluster and the other clusters:

- Single linkage: distance between the closest pair.
- Complete linkage: distance between the farthest pair.
- Average linkage: average distance between all pairs, or the distance between cluster centers.
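The three linkage rules map directly onto SciPy's hierarchical clustering (a sketch using `scipy.cluster.hierarchy`, not the GenePattern HierarchicalClustering module itself):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight groups of 1-D points
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])

for method in ("single", "complete", "average"):
    Z = linkage(pts, method=method)                   # bottom-up merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
    # On this easy example, all three linkage rules separate the two groups
```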
Slide 13: Average Linkage

[Figure: average-linkage clustering of leukemia samples and genes]

Slide 14: Single and Complete Linkage

[Figure: single-linkage and complete-linkage clusterings of the same leukemia samples and genes]
Slide 15: Similarity/Distance Measures

The distance measure decides which samples/genes should be clustered together.

- Euclidean: the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula.
- Pearson correlation: a parametric measure of the strength of linear dependence between two variables.
- Absolute Pearson correlation: the absolute value of the Pearson correlation.
- Spearman rank correlation: a non-parametric measure of dependence between two variables.
- Uncentered correlation: same as Pearson, but assumes the mean is 0.
- Absolute uncentered correlation: the absolute value of the uncentered correlation.
- Kendall's tau: a non-parametric similarity measure of the degree of correspondence between two rankings.
- City-block/Manhattan: the distance traveled to get from one point to the other if a grid-like path is followed.
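Several of these measures are available through SciPy's `pdist` (a sketch; GenePattern modules expose them as distance-measure options):

```python
import numpy as np
from scipy.spatial.distance import pdist

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])   # second row = 2 * first row

euclid = pdist(x, metric="euclidean")[0]    # sqrt(1 + 4 + 9 + 16)
manhattan = pdist(x, metric="cityblock")[0] # 1 + 2 + 3 + 4 = 10
# 'correlation' distance is 1 - Pearson r; perfectly correlated rows give 0
pearson_dist = pdist(x, metric="correlation")[0]
```

Note how the choice matters: the two rows are far apart in Euclidean terms but identical under Pearson correlation.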
Slide 16: Reasonable Distance Measure

[Figure: expression profiles of genes 1-4 across samples 1-5]

- Genes: rows that are close are correlated.
- Samples: genes 1 and 2 have similar profiles, giving them a similar contribution to the distance between sample 1 and sample 5.

Use Euclidean distance on samples and genes, on row-centered and normalized data.
Slide 17: Pitfalls in Clustering

- Elongated clusters
- Filaments
- Clusters of different sizes
Slide 18: Compact, Separated Clusters

All methods work. (Adapted from E. Domany)
Slide 19: Elongated Clusters

Single linkage succeeds in partitioning the clusters; average linkage fails.
Slide 20: Filament

Single linkage is not robust. (Adapted from E. Domany)

Slide 21: Filament with a Point Removed

Single linkage is not robust: removing a single point changes the result. (Adapted from E. Domany)
Slide 22: Two-way Clustering

Two independent cluster analyses, one on genes and one on samples, are used to reorder the data (two-way clustering).
Slide 23: Hierarchical Clustering Summary

- Results depend on the distance-update method:
  - Single linkage: elongated clusters.
  - Complete linkage: sphere-like clusters.
- Greedy iterative process, NOT robust against noise.
- No inherent measure to choose the clusters; we return to this point in cluster validation.
Slide 24: Clustering Protocol
Slide 25: Validating the Number of Clusters

How do we know how many real clusters exist in the dataset?
Slide 26: Consensus Clustering

1. Generate "perturbed" datasets D1, D2, ..., Dn from the original dataset.
2. Apply the clustering algorithm to each Di, giving Clustering 1, Clustering 2, ..., Clustering n.
3. Compute the consensus matrix over samples s1, s2, ..., sn: the proportion of times each pair of samples is clustered together.
   - 1 (red): the two samples always cluster together.
   - 0 (white): the two samples never cluster together.
4. Build a dendrogram based on the consensus matrix.

(The Broad Institute of MIT and Harvard)
Slide 27: Consensus Clustering

The consensus matrix counts the proportion of times two samples are clustered together: 1 if the two samples always cluster together, 0 if they never do.

[Figure: the consensus matrix, reordered according to the dendrogram, shows three red blocks C1, C2, C3 along the diagonal (red = always together, white = never together)]
Slide 28: Validation

Aim: Measure the agreement between clustering results on "perturbed" versions of the data.

Method: Iterate N times:
1. Generate a "perturbed" version of the original dataset by subsampling, resampling with repeats, or adding noise.
2. Cluster the perturbed dataset.

Then calculate the fraction of iterations in which different samples belong to the same cluster, and optimize the number of clusters K by choosing the value of K that yields the most consistent results.

Consistency / Robustness Analysis
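The procedure above can be sketched with subsampling as the perturbation and deterministic average-linkage clustering as the base algorithm (an illustrative sketch, not GenePattern's ConsensusClustering module):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_matrix(data: np.ndarray, k: int, n_iter: int = 50,
                     frac: float = 0.8, seed: int = 0) -> np.ndarray:
    """Proportion of subsampled runs in which each pair of samples co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(data)
    together = np.zeros((n, n))   # times a pair was clustered together
    counted = np.zeros((n, n))    # times a pair appeared in the same subsample
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = fcluster(linkage(data[idx], method="average"),
                          t=k, criterion="maxclust")
        for a_pos, a in enumerate(idx):
            for b_pos, b in enumerate(idx):
                counted[a, b] += 1
                if labels[a_pos] == labels[b_pos]:
                    together[a, b] += 1
    return together / np.maximum(counted, 1)

# Two clearly separated groups: consensus is ~1 within groups, ~0 between them
data = np.vstack([np.random.default_rng(1).normal(0, 0.1, (5, 2)),
                  np.random.default_rng(2).normal(5, 0.1, (5, 2))])
M = consensus_matrix(data, k=2)
```

With the correct K, the matrix is close to 0/1 valued; an unstable K produces many intermediate entries, which is the signal the validation step looks for.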
Slide 29: Consensus Clustering in GenePattern
Slide 30: Clustering Cookbook

1. Reduce the number of genes by variation filtering; use stricter parameters than for comparative marker selection.
2. Choose a method for cluster discovery (e.g. hierarchical clustering).
3. Select a number of clusters.
4. Check the sensitivity of the clusters to the filtering and clustering parameters.
5. Internally test the robustness of the clusters with consensus clustering.
6. Validate on independent datasets.
Slide 31: References

Brunet, J.-P., Tamayo, P., Golub, T.R., and Mesirov, J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. USA 101(12):4164-4169.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863-14868.

Kim, P.M. and Tidor, B. 2003. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Research 13:1706-1718.

MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, California. pp. 281-297.

Monti, S., Tamayo, P., Mesirov, J.P., and Golub, T. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning Journal 52(1-2):91-118.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA.