Clustering: Clustering Preliminaries - PowerPoint Presentation

Uploaded by MoonBabe on 2022-08-04.



Presentation Transcript

Slide1

Clustering

Slide2

Clustering Preliminaries

Log2 transformation

Row centering and normalization

Filtering

Slide3

Log2 Transformation

Log2 transformation makes sure that the noise is independent of the mean and that similar differences have the same meaning along the dynamic range of the values. For example, we would like dist(100, 200) = dist(1000, 2000).

Advantages of log2 transformation (figure).
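This property can be sketched numerically. In the snippet below (the matrix values are illustrative, not from a real dataset), a two-fold change contributes roughly the same distance at the low and high ends of the dynamic range after the transform:

```python
import numpy as np

# Illustrative expression values: rows = genes, columns = samples.
X = np.array([[100.0, 200.0],
              [1000.0, 2000.0]])

# Log2 transform with a pseudocount of 1 to guard against log2(0).
logX = np.log2(X + 1.0)

# A two-fold change now contributes roughly the same distance
# anywhere in the dynamic range: dist(100, 200) ~ dist(1000, 2000).
d_low = abs(logX[0, 1] - logX[0, 0])
d_high = abs(logX[1, 1] - logX[1, 0])
```

Both differences come out close to 1 (one doubling), which is exactly the dist(100, 200) = dist(1000, 2000) behavior we asked for.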

Slide4

Row Centering & Normalization

x

y = x - mean(x)

z = y / stdev(y)
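The two steps above can be written in vectorized form; variable names x, y, z follow the slide, and the example matrix is made up:

```python
import numpy as np

# Illustrative gene-expression matrix: rows = genes, columns = samples.
x = np.array([[2.0, 4.0, 6.0],
              [10.0, 10.0, 13.0]])

# Row centering: subtract each gene's mean across samples.
y = x - x.mean(axis=1, keepdims=True)

# Row normalization: divide by each gene's standard deviation,
# so every gene ends up with mean 0 and standard deviation 1.
z = y / y.std(axis=1, keepdims=True)
```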

Slide5

Filtering genes

Filtering is very important for unsupervised analysis, since many noisy genes may totally mask the structure in the data.

After finding a hypothesis one can identify marker genes in a larger dataset via supervised analysis.

(Diagram: all genes → filtering → clustering; supervised analysis → marker selection.)
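A minimal variation-filter sketch, assuming a genes-by-samples matrix; the data and the cutoff of 100 genes are made up for illustration:

```python
import numpy as np

# Illustrative matrix: 1000 genes x 20 samples (random data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

# Variation filtering: keep only the n_keep genes with the highest
# variance across samples; flat, noisy genes are dropped up front.
n_keep = 100
variances = X.var(axis=1)
top = np.argsort(variances)[::-1][:n_keep]
X_filtered = X[top, :]
```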

Slide6

Clustering/Class Discovery

Aim: Partition data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are “more similar”.

Challenge: Not well defined. No single objective function / evaluation criterion

Example: How many clusters? 2 + noise, 3 + noise, 20, or hierarchical: 2 → 3 + noise.

One has to choose:

A similarity/distance measure

A clustering method

How to evaluate the clusters

Slide7

Clustering in GenePattern

Representative based:

Find representatives/centroids

K-means: KMeansClustering

Self-Organizing Maps (SOM): SOMClustering

Bottom-up (Agglomerative): HierarchicalClustering

Hierarchically unite clusters:

single linkage analysis

complete linkage analysis

average linkage analysis

Clustering-like:

NMFConsensus

PCA (Principal Components Analysis)

No BEST method! For easy problems – most of them work.

Each algorithm has its own assumptions, strengths, and weaknesses.

Slide8

K-means Clustering

Aim: Partition the data points into K subsets and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.

Slide9

K-means: Algorithm

Initialize centroids at random positions

Iterate:

Assign each data point to its closest centroid

Move centroids to the center of their assigned points

Stop when converged

Guaranteed to reach a local minimum

(Figure: K = 3 example at iterations 0, 1, and 2.)
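The steps above can be sketched as a minimal NumPy implementation (this is an illustrative sketch, not the GenePattern KMeansClustering module; the two-blob data and all parameters are made up):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch; X has shape (n_points, n_dims)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the center of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged: a local minimum
            break
        centroids = new
    return labels, centroids

# Illustrative data: two separated blobs, clustered with K = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),
               rng.normal(5.0, 0.2, (30, 2))])
labels, centroids = kmeans(X, k=2)
```

Because the initial centroids are random, different seeds can converge to different local minima, which is exactly the sensitivity the summary slide warns about.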

Slide10

K-means: Summary

The result depends on the initial centroid positions

Fast algorithm: only needs to compute distances from data points to centroids

Must preset the number of clusters K

Fails for non-spherical distributions

Slide11

Hierarchical Clustering

(Figure: dendrogram of points 1–5; the height of each join is the distance between the joined clusters.)

Slide12

(Figure: the same dendrogram of points 1–5, with the height of each join showing the distance between the joined clusters.)

The dendrogram induces a linear ordering of the data points (up to a left/right flip in each split).

Hierarchical Clustering

Need to define the distance between the new cluster and the other clusters:

Single Linkage: distance between the closest pair.

Complete Linkage: distance between the farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.
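All three linkage rules are available in SciPy's agglomerative clustering; a small sketch on made-up 1-D points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five illustrative 1-D points: a tight triple and a tight pair.
X = np.array([[0.0], [0.5], [0.7], [5.0], [5.4]])

Z_single = linkage(X, method="single")      # distance of closest pair
Z_complete = linkage(X, method="complete")  # distance of farthest pair
Z_average = linkage(X, method="average")    # mean over all pairs

# Cut the average-linkage dendrogram into two clusters.
labels = fcluster(Z_average, t=2, criterion="maxclust")
```

For these well-separated points all three linkages agree; they differ on elongated or noisy data, as the pitfall slides below show.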

Slide13

Average Linkage

Leukemia samples and genes

Slide14

Single and Complete Linkage

Single-linkage

Complete-linkage

Leukemia samples and genes

Slide15

Similarity/Distance Measures

Decide which samples/genes should be clustered together.

Euclidean - the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula.

Pearson correlation - a parametric measure of the strength of linear dependence between two variables.

Absolute Pearson correlation - the absolute value of the Pearson correlation.

Spearman rank correlation - a non-parametric measure of dependence between two variables.

Uncentered correlation - same as Pearson but assumes the mean is 0.

Absolute uncentered correlation - the absolute value of the uncentered correlation.

Kendall's tau - a non-parametric measure of the degree of correspondence between two rankings.

City-block/Manhattan - the distance traveled to get from one point to the other if a grid-like path is followed.
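Most of these measures are available in SciPy; a short sketch on two made-up, perfectly proportional gene profiles:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine
from scipy.stats import pearsonr, spearmanr, kendalltau

g1 = np.array([1.0, 2.0, 3.0, 4.0])
g2 = 2.0 * g1  # perfectly linearly related to g1

d_euc = euclidean(g1, g2)          # "ruler" distance: sqrt(1+4+9+16)
d_man = cityblock(g1, g2)          # Manhattan distance: 1+2+3+4 = 10
r, _ = pearsonr(g1, g2)            # linear dependence: 1.0 here
rho, _ = spearmanr(g1, g2)         # rank-based dependence: 1.0 here
tau, _ = kendalltau(g1, g2)        # rank agreement: 1.0 here
uncentered = 1.0 - cosine(g1, g2)  # uncentered (cosine) similarity
```

Note that the Euclidean and Manhattan distances are large even though the correlation measures all report a perfect relationship: distance and correlation answer different questions.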

Slide16

Reasonable Distance Measure

(Figure: expression profiles of Genes 1–4 across Samples 1–5.)

Genes: close → correlated.

Samples: a similar profile gives Gene 1 and Gene 2 a similar contribution to the distance between Sample 1 and Sample 5.

Euclidean distance on samples and genes, computed on row-centered and normalized data.

Slide17

Pitfalls in Clustering

Elongated clusters

Filament

Clusters of different sizes

Slide18

Compact Separated Clusters

All methods work

Adapted from E. Domany

Slide19

Elongated Clusters

Single linkage succeeds in partitioning

Average linkage fails

Slide20

Filament

Single linkage not robust

Adapted from E. Domany

Slide21

Filament with Point Removed

Single linkage not robust

Adapted from E. Domany

Slide22

Two-way Clustering

Two independent cluster analyses, on genes and on samples, are used to reorder the data (two-way clustering):
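A two-way clustering sketch using SciPy on illustrative random data: cluster rows (genes) and columns (samples) independently, then reorder the matrix by both dendrogram leaf orders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Illustrative random data: 20 genes (rows) x 10 samples (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))

# Cluster genes and samples independently...
row_order = leaves_list(linkage(X, method="average"))
col_order = leaves_list(linkage(X.T, method="average"))

# ...then reorder the matrix by both dendrogram leaf orders,
# so correlated genes and similar samples end up adjacent.
X_reordered = X[np.ix_(row_order, col_order)]
```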

Slide23

Hierarchical Clustering: Summary

Results depend on the distance update method:

Single Linkage: elongated clusters

Complete Linkage: sphere-like clusters

A greedy iterative process; NOT robust against noise

No inherent measure to choose the clusters – we return to this point in cluster validation


Slide24

Clustering Protocol

Slide25

Validating Number of Clusters

How do we know how many real clusters exist in the dataset?

Slide26

Consensus Clustering

Generate “perturbed” datasets D1, D2, ..., Dn from the original dataset.

Apply the clustering algorithm to each Di, obtaining Clustering 1, Clustering 2, ..., Clustering n.

Compute the consensus matrix and build a dendrogram based on that matrix.

Consensus matrix: counts the proportion of times two samples are clustered together.

(1) two samples always cluster together (red)

(0) two samples never cluster together (white)

The Broad Institute of MIT and Harvard

Slide27

Consensus Clustering

Consensus matrix: counts the proportion of times two samples are clustered together: (1) two samples always cluster together, (0) two samples never cluster together.

(Figure: the consensus matrix, ordered according to the dendrogram, shows blocks C1, C2, C3 of samples that consistently cluster together.)

Slide28

Validation

Aim:

Measure agreement between clustering results on “perturbed” versions of the data.

Method:

Iterate N times:

Generate a “perturbed” version of the original dataset by subsampling, resampling with repeats, or adding noise

Cluster the perturbed dataset

Calculate fraction of iterations where different samples belong to the same cluster

Optimize the number of clusters K by choosing the value of K which yields the most consistent results

Consistency / Robustness Analysis
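The iteration above can be sketched as follows. This is a simplified consensus-matrix computation using subsampling and average-linkage hierarchical clustering, not the GenePattern ConsensusClustering module; the two-group data and all parameters are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_matrix(X, k, n_iter=50, subsample=0.8, seed=0):
    """Cluster subsampled versions of X and count how often each
    pair of samples lands in the same cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))  # times i and j clustered together
    counted = np.zeros((n, n))   # times i and j were both subsampled
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        labels = fcluster(linkage(X[idx], method="average"),
                          t=k, criterion="maxclust")
        for a in range(len(idx)):
            for b in range(len(idx)):
                counted[idx[a], idx[b]] += 1
                if labels[a] == labels[b]:
                    together[idx[a], idx[b]] += 1
    return together / np.maximum(counted, 1)

# Illustrative data: two tight, well-separated groups of 10 samples.
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (10, 2))])
M = consensus_matrix(X, k=2)
```

Entries of M near 1 for within-group pairs and near 0 for cross-group pairs indicate that k = 2 is a consistent choice; repeating this for several values of K and comparing the matrices is the optimization step described above.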

Slide29

Consensus Clustering in GenePattern

Slide30

Clustering Cookbook

Reduce number of genes by variation filtering

Use stricter parameters than for comparative marker selection

Choose a method for cluster discovery (e.g. hierarchical clustering)

Select a number of clusters

Check for sensitivity of clusters against filtering and clustering parameters

Validate on independent data sets

Internally test robustness of clusters with consensus clustering

Slide31

References

Brunet, J-P., Tamayo, P., Golub, T.R., and Mesirov, J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. USA 101(12):4164-4169.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863-14868.

Kim, P.M. and Tidor, B. 2003. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Research 13:1706-1718.

MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, California. pp. 281-297.

Monti, S., Tamayo, P., Mesirov, J.P., and Golub, T. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52(1-2):91-118.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA.