Uploaded On 2016-07-17

Presentation Transcript

Slide1

Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets

Manikandan Narayanan, Adrian Vetta, Eric E. Schadt, Jun Zhu

PLoS Computational Biology 2010

Presented by: Tal Saiag

Seminar in Algorithmic Challenges in Analyzing Big Data* in Biology and Medicine; with Prof. Ron Shamir @TAU

Slide2

Outline

Basic Terminology

Introduction

JointCluster: A simultaneous clustering algorithm

Results

Discussion

Conclusion

Slide3

Basic Terminology

Slide4

Basic Terminology

Cell – building block of life

Contains nucleus

Chromosome – genetic material

Forms the genome – DNA

Gene – stretch of DNA

Proteins – multi-functional workers

Slide5

Basic Terminology

Gene expression: from gene to protein

Transcription

RNA

Translation

Transcription factor

Slide6

Basic Terminology

DNA microarray / chips

Measures expression of genes

Condition specific

Slide7

Introduction

Slide8

Introduction

Genome-wide datasets provide different views of the biology of a cell.

Physical interactions (protein-protein) and regulatory interactions (protein-DNA) maintain and regulate the cell's processes.

Expression of molecules (proteins or transcripts of genes) provides a snapshot of the cell's state.

Researchers have exploited the complementarity of both kinds of data.

Slide9

Introduction

Integrating the physical and expression datasets.

Developing an efficient solution for combined analysis of multiple networks.

Goal: find common clusters of genes supported by all of the networks of interest.

The problem is computationally intractable for large networks.

The algorithm comes with theoretical guarantees (it reasonably approximates the optimal clustering).

Slide10

JointCluster: A simultaneous clustering algorithm

Slide11

Declarations

A cut refers to a partition of nodes in a graph into two sets.

A cut is called sparse-enough in a graph if the ratio of edges crossing the cut in the graph to the edges incident at the smaller side of the cut is smaller than a threshold specific to the graph.

Inter-cluster edges: edges with endpoints in different clusters.

Connectedness of a cluster in a graph: the cost of a set of edges is the ratio of their weight to the total edge weight in the graph.

Slide12

Algorithm outline

Approximate the sparsest cut in each input graph using a spectral method.

Choose among them any cut that is sparse-enough in its corresponding graph; this becomes the chosen cut.

Recurse on the two node sets of the chosen cut.

Stop when the node sets are well connected and have no sparse-enough cuts.
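The recursion in this outline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the spectral step is replaced by a sweep cut along the second eigenvector of the normalized Laplacian, conductance is measured with the weight incident at each side in the whole graph, and the names `conductance`, `spectral_cut`, `joint_cluster` and `two_triangles` are hypothetical.

```python
import numpy as np

def conductance(W, S, T):
    """Cut weight w(S, T) over the smaller of a(S), a(T), with a(.) taken in the whole graph."""
    cut = W[np.ix_(S, T)].sum()
    smaller_side = min(W[S].sum(), W[T].sum())
    return cut / smaller_side if smaller_side > 0 else 0.0

def spectral_cut(W, nodes):
    """Sweep cut along the second eigenvector of the normalized Laplacian (a stand-in spectral method)."""
    sub = W[np.ix_(nodes, nodes)]
    deg = sub.sum(axis=1)
    scale = np.zeros_like(deg)
    scale[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    laplacian = np.eye(len(nodes)) - scale[:, None] * sub * scale[None, :]
    _, vecs = np.linalg.eigh(laplacian)            # eigenvalues come back in ascending order
    order = np.argsort(scale * vecs[:, 1])
    best = None
    for k in range(1, len(nodes)):                 # try every prefix of the sorted order
        S = [nodes[i] for i in order[:k]]
        T = [nodes[i] for i in order[k:]]
        phi = conductance(W, S, T)
        if best is None or phi < best[0]:
            best = (phi, S, T)
    return best

def joint_cluster(graphs, thresholds, nodes):
    """Recurse on any cut that is sparse-enough in the graph it was found in."""
    if len(nodes) < 2:
        return [sorted(nodes)]
    for W, alpha in zip(graphs, thresholds):
        phi, S, T = spectral_cut(W, nodes)
        if phi <= alpha:                           # sparse-enough in this graph
            return (joint_cluster(graphs, thresholds, S)
                    + joint_cluster(graphs, thresholds, T))
    return [sorted(nodes)]                         # no sparse-enough cut in any graph

def two_triangles(inside, bridge):
    """Toy graph: triangles {0,1,2} and {3,4,5} joined by one weak edge 2-3."""
    W = np.zeros((6, 6))
    for tri in ([0, 1, 2], [3, 4, 5]):
        for u in tri:
            for v in tri:
                if u != v:
                    W[u, v] = inside
    W[2, 3] = W[3, 2] = bridge
    return W

graphs = [two_triangles(1.0, 0.1), two_triangles(0.5, 0.05)]
clusters = joint_cluster(graphs, thresholds=[0.3, 0.3], nodes=list(range(6)))
print(sorted(clusters))
```

The toy input uses two graphs that agree on the same two triangles, so the weak bridge is cut once and neither triangle is split further.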

Slide13

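Definitions

Graph $G = (V, E, w)$ with edge weight $w_{uv} \ge 0$ for any node pair $(u, v)$.

Total weight of any edge set $Y$: $w(Y) = \sum_{(u,v) \in Y} w_{uv}$

For any node sets $S, T$: $w(S, T) = \sum_{u \in S,\, v \in T} w_{uv}$

Define $a(S) = w(S, V)$

Total edge weight in the graph: $a(V)/2$

Singletons: a single node $u$ stands for the node set $\{u\}$.

Slide14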

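The Conductance

The conductance of a cut $(S, T = C \setminus S)$ in a node set $C$: $\phi(S, T) = \frac{w(S, T)}{\min(a(S), a(T))}$, where $w(S, T)$ is the total weight of edges between $S$ and $T$ and $a(S) = w(S, V)$.

The inter-cluster edges: $X = \{(u, v) \in E\}$ where $u$ and $v$ belong to different clusters.

An $(\alpha, \epsilon)$ clustering of $G$ is a partition of its nodes into clusters such that:

The conductance of the clustering is at least $\alpha$, and

The total weight of the inter-cluster edges $X$ is at most an $\epsilon$ fraction of the total edge weight in the graph; i.e., $w(X) \le \epsilon \, a(V)/2$.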
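As a rough illustration of these quantities, the sketch below computes the weight between two node sets, the weight incident at a node set, the conductance of a cut, and the fraction of edge weight lost to inter-cluster edges on a toy graph. It assumes a symmetric weight matrix stored as nested lists and assumes that the incident weight is measured in the whole graph; the function names are hypothetical.

```python
def w(W, S, T):
    """Total weight of node pairs with one node in S and the other in T."""
    return sum(W[u][v] for u in S for v in T)

def a(W, S):
    """Total weight incident at S, measured in the whole graph: a(S) = w(S, V)."""
    return w(W, S, range(len(W)))

def conductance(W, S, T):
    """w(S, T) divided by the incident weight of the smaller side."""
    return w(W, S, T) / min(a(W, S), a(W, T))

def inter_cluster_fraction(W, clusters):
    """w(X) / (a(V)/2), where X holds the edges whose endpoints lie in different clusters."""
    label = {u: i for i, cluster in enumerate(clusters) for u in cluster}
    n = len(W)
    lost = sum(W[u][v] for u in range(n) for v in range(u + 1, n)
               if label[u] != label[v])
    return lost / (a(W, range(n)) / 2)

def two_triangles(bridge):
    """Toy graph: unit-weight triangles {0,1,2} and {3,4,5} joined by edge 2-3."""
    W = [[0.0] * 6 for _ in range(6)]
    for tri in ([0, 1, 2], [3, 4, 5]):
        for u in tri:
            for v in tri:
                if u != v:
                    W[u][v] = 1.0
    W[2][3] = W[3][2] = bridge
    return W

W = two_triangles(0.1)
phi = conductance(W, [0, 1, 2], [3, 4, 5])
eps = inter_cluster_fraction(W, [[0, 1, 2], [3, 4, 5]])
print(round(phi, 4), round(eps, 4))
```

In this toy graph the two numbers coincide because each side of the cut carries exactly half of the total incident weight; in general they differ.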

 

Slide15

Approximating the sparsest cut

Since finding the sparsest cut in a graph is an NP-hard problem, an approximation algorithm for the problem is used.

Efficient spectral techniques.

Spectral Algorithm:

Find the top $k$ right singular vectors $v_1, v_2, \ldots, v_k$ of the input matrix $A$ (using SVD).

Let $C$ be the matrix whose $j$th column is given by $A v_j$.

Place row $i$ in cluster $j$ if $C_{ij}$ is the largest entry in the $i$th row of $C$.
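The three steps of the spectral algorithm translate almost directly into NumPy. The sketch below is illustrative only: `spectral_clusters` is a hypothetical name, and the sign normalization of each singular vector is an assumption added here, because an SVD determines singular vectors only up to sign and the "largest entry" rule depends on that sign.

```python
import numpy as np

def spectral_clusters(A, k):
    """Assign each row of A to one of k clusters via the top-k right singular vectors."""
    # Rows of Vt are the right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(A)
    V = Vt[:k].T.copy()
    # Assumption: flip each vector so that its largest-magnitude entry is positive.
    for j in range(k):
        if V[np.argmax(np.abs(V[:, j])), j] < 0:
            V[:, j] *= -1
    C = A @ V                      # column j of C is A v_j
    return np.argmax(C, axis=1)    # row i goes to the cluster with the largest C[i, j]

# Toy similarity matrix: rows 0-2 form one block, rows 3-4 another, with weak cross links.
A = np.array([
    [1.00, 1.00, 1.00, 0.05, 0.05],
    [1.00, 1.00, 1.00, 0.05, 0.05],
    [1.00, 1.00, 1.00, 0.05, 0.05],
    [0.05, 0.05, 0.05, 2.00, 2.00],
    [0.05, 0.05, 0.05, 2.00, 2.00],
])
labels = spectral_clusters(A, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only which rows share a label matters.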

 

Slide16

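Multiple Graphs

Consider graphs $G_i = (V, E_i, w_i)$, $i = 1, \ldots, k$.

An $(\{\alpha_i\}, \epsilon)$ simultaneous clustering:

The conductance of the clustering is at least $\alpha_i$ in graph $G_i$ for all $i$, and

The total weight of the inter-cluster edges $X$ is at most an $\epsilon$ fraction of the total edge weight in all graphs; i.e., $\sum_i w_i(X) \le \epsilon \sum_i a_i(V)/2$.

Inter-cluster edge cost: $\frac{\sum_i w_i(X)}{\sum_i a_i(V)/2}$.

A cut in $G_i$ is sparse-enough if the conductance of the cut is at most the conductance threshold set for $G_i$.

Slide17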

Scaling Heuristic

The mixture graph for each input graph at each scale has its own weight function.

The heuristic finds sparsest cuts in these mixture graphs.

The heuristic starts with the sum graph to control edges lost in all graphs, and transitions through a series of mixture graphs that approach the individual graphs to refine the clusters.

Slide18

Cut Selection Heuristic

The algorithm also incorporates a cut selection heuristic.

Choose the cut that is sparse-enough in the greatest number of input graphs.
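A minimal sketch of this selection rule follows, assuming that "sparse-enough" means conductance at most a per-graph threshold and that candidate cuts are given as pairs of node lists. The helper names (`graph`, `conductance`, `select_cut`) and the toy graphs are hypothetical.

```python
def graph(n, edges):
    """Symmetric weight matrix (nested lists) from (u, v, weight) triples."""
    W = [[0.0] * n for _ in range(n)]
    for u, v, weight in edges:
        W[u][v] = W[v][u] = weight
    return W

def conductance(W, S, T):
    """Crossing weight over the incident weight of the smaller side (measured in the whole graph)."""
    n = len(W)
    cut = sum(W[u][v] for u in S for v in T)
    a_S = sum(W[u][v] for u in S for v in range(n))
    a_T = sum(W[u][v] for u in T for v in range(n))
    return cut / min(a_S, a_T)

def select_cut(candidate_cuts, graphs, thresholds):
    """Return the candidate that is sparse-enough in the most graphs, or None if none qualifies."""
    def support(cut):
        S, T = cut
        return sum(conductance(W, S, T) <= alpha
                   for W, alpha in zip(graphs, thresholds))
    best = max(candidate_cuts, key=support)
    return best if support(best) > 0 else None

g1 = graph(4, [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 0.1)])
g2 = graph(4, [(0, 2, 1.0), (1, 3, 1.0), (0, 1, 0.1)])
g3 = graph(4, [(0, 1, 0.8), (2, 3, 0.8), (0, 3, 0.1)])

cut_a = ([0, 1], [2, 3])   # sparse-enough in g1 and g3
cut_b = ([0, 2], [1, 3])   # sparse-enough in g2 only
chosen = select_cut([cut_b, cut_a], [g1, g2, g3], thresholds=[0.3, 0.3, 0.3])
print(chosen)
```

Ties are broken by the order of the candidate list here, which is an arbitrary choice of this sketch.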

Slide19

Overview

Slide20

Overview

Slide21

Overall framework

The modularity score of a cluster in a graph is: the fraction of edges contained within the cluster minus the fraction expected by chance.

The partition of the nodes that respects the clustering tree and optimizes the min-modularity score can be found by dynamic programming.

The resulting clusters are ordered by their min-modularity scores.
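The sketch below illustrates the idea on a toy clustering tree. It rests on several assumptions that the slide does not spell out: the "fraction expected by chance" uses the usual Newman-Girvan null model, the min-modularity of a cluster is the minimum of its modularity scores over all graphs, and the dynamic program maximizes the sum of min-modularity scores over the chosen clusters. All function names are hypothetical.

```python
def modularity(W, cluster):
    """Fraction of edge weight inside the cluster minus the fraction expected by chance."""
    n = len(W)
    total = sum(W[u][v] for u in range(n) for v in range(n))       # a(V)
    inside = sum(W[u][v] for u in cluster for v in cluster)        # counts both directions
    incident = sum(W[u][v] for u in cluster for v in range(n))     # a(cluster)
    return inside / total - (incident / total) ** 2

def min_modularity(graphs, cluster):
    """Worst-case modularity of the cluster over all graphs (assumed definition)."""
    return min(modularity(W, cluster) for W in graphs)

def leaves(tree):
    """Nodes under a tree given as nested pairs with integer leaves."""
    return [tree] if isinstance(tree, int) else leaves(tree[0]) + leaves(tree[1])

def best_partition(tree, graphs):
    """Dynamic program: keep a subtree as one cluster, or combine the best partitions of its children."""
    nodes = sorted(leaves(tree))
    whole = (min_modularity(graphs, nodes), [nodes])
    if isinstance(tree, int):
        return whole
    left_score, left_clusters = best_partition(tree[0], graphs)
    right_score, right_clusters = best_partition(tree[1], graphs)
    split = (left_score + right_score, left_clusters + right_clusters)
    return whole if whole[0] >= split[0] else split

def two_triangles(bridge):
    """Toy graph: unit-weight triangles {0,1,2} and {3,4,5} joined by edge 2-3."""
    W = [[0.0] * 6 for _ in range(6)]
    for tri in ([0, 1, 2], [3, 4, 5]):
        for u in tri:
            for v in tri:
                if u != v:
                    W[u][v] = 1.0
    W[2][3] = W[3][2] = bridge
    return W

graphs = [two_triangles(0.1), two_triangles(1.0)]
tree = (((0, 1), 2), ((3, 4), 5))
score, clusters = best_partition(tree, graphs)
# Order the output clusters by decreasing min-modularity score.
clusters.sort(key=lambda c: min_modularity(graphs, c), reverse=True)
print(clusters, round(score, 4))
```

Because every cluster must be a subtree of the clustering tree, the program only ever compares "keep this node" against "split into its two children", which keeps the search linear in the size of the tree.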

 

Slide22

Parameter learning

We desire an unsupervised method for learning the conductance threshold for each network of interest.

Algorithm for each graph:

Cluster only that graph using JointCluster, with its threshold set (without loss of generality) to the maximum possible value.

Set the threshold to the minimum conductance threshold that would result in the same set of clusters.

Goal: automatically choose a threshold that is sufficiently low and sufficiently high.

Slide23

Results

Slide24

Benchmarking JointCluster on simulated data

Alternative algorithms:

Tree: Choose one of the input graphs as a reference, cluster this single graph using an efficient spectral clustering method to obtain a clustering tree, and parse this tree into clusters using the min-modularity score computed from all graphs.

Coassociation: Cluster each graph separately using a spectral method, combine the resulting clusters from different graphs into a coassociation graph, and cluster this graph using the same spectral method.

Parametr
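The combining step of the Coassociation alternative can be illustrated as follows. The slide does not say how the coassociation graph is weighted; this sketch assumes the common convention that the weight of a node pair is the fraction of per-graph clusterings that place both nodes in the same cluster. `coassociation_graph` is a hypothetical name.

```python
def coassociation_graph(clusterings, n):
    """Weight of (u, v) = fraction of clusterings in which u and v share a cluster."""
    W = [[0.0] * n for _ in range(n)]
    share = 1.0 / len(clusterings)
    for clusters in clusterings:
        for cluster in clusters:
            for u in cluster:
                for v in cluster:
                    if u != v:
                        W[u][v] += share
    return W

# One clustering per input graph, each a list of clusters over nodes 0..3.
clusterings = [
    [[0, 1], [2, 3]],
    [[0, 1, 2], [3]],
]
W = coassociation_graph(clusterings, n=4)
print(W[0][1], W[1][2], W[0][3])
```

The resulting matrix would then be clustered with the same spectral method, which is not repeated here.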

 

Slide25

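Evaluation measures

An intra-cluster pair is a pair of elements that belong to a single cluster.

Jaccard Index = $\frac{|P_1 \cap P_2|}{|P_1 \cup P_2|}$, where $P_1$ and $P_2$ are the sets of intra-cluster pairs of the two clusterings being compared.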
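A small sketch of the Jaccard index over intra-cluster pairs, assuming the standard definition (pairs that are intra-cluster in both clusterings divided by pairs that are intra-cluster in at least one). The function names are hypothetical, and returning 1.0 when neither clustering has any pair is a convention chosen for this sketch.

```python
from itertools import combinations

def intra_cluster_pairs(clusters):
    """All unordered pairs of elements that share a cluster."""
    return {frozenset(pair) for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}

def jaccard_index(clustering_a, clustering_b):
    """Shared intra-cluster pairs over pairs that are intra-cluster in either clustering."""
    pairs_a = intra_cluster_pairs(clustering_a)
    pairs_b = intra_cluster_pairs(clustering_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

true_clusters = [[0, 1, 2], [3, 4]]
inferred_clusters = [[0, 1], [2, 3, 4]]
score = jaccard_index(true_clusters, inferred_clusters)
print(round(score, 4))
```

Here the two clusterings share the pairs (0, 1) and (3, 4) out of six pairs overall, so the index is one third.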

 

Slide26

Benchmarking JointCluster on simulated data

Slide27

Systematic evaluation of the methods using diverse reference classes in yeast

Two yeast strains grown under two conditions where glucose or ethanol was the predominant carbon source.

Coexpression networks using all 4,482 profiled genes as nodes.

Edge weight: the absolute value of the Pearson correlation coefficient between the expression profiles of the two genes.

Physical gene-protein interactions (from various interaction databases): a total of 41,660 non-redundant interactions.
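The coexpression edge weights described above can be computed in a few lines of NumPy. This is a sketch on made-up toy profiles, not the yeast data; `coexpression_weights` is a hypothetical name, and a gene with a constant profile would produce undefined (NaN) correlations that this sketch does not handle.

```python
import numpy as np

def coexpression_weights(expr):
    """Edge weights |Pearson correlation| between rows of a genes-by-samples matrix."""
    W = np.abs(np.corrcoef(expr))   # np.corrcoef treats each row as one variable (gene)
    np.fill_diagonal(W, 0.0)        # no self-edges
    return W

# Toy expression matrix: 4 genes (rows) measured in 4 samples (columns).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],     # gene 0
    [2.0, 4.0, 6.0, 8.0],     # gene 1: perfectly correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],     # gene 2: perfectly anti-correlated with gene 0
    [1.0, -1.0, -1.0, 1.0],   # gene 3: uncorrelated with gene 0
])
W = coexpression_weights(expr)
print(np.round(W, 3))
```

Taking the absolute value means that strongly anti-correlated genes get the same edge weight as strongly correlated ones.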

Slide28

Different classes of yeast genes

GO Process: Genes in each reference set in this class are annotated to the same GO Biological Process term.

TF (Transcription Factor) Perturbations: Genes in each set have altered expression when a TF is deleted or overexpressed.

Compendium of Perturbations: Genes in each set have altered expression under deletions of specific genes, or chemical perturbations.

TF Binding Sites: Genes in a set have binding sites of the same TF in their upstream genomic regions, with sites predicted using ChIP binding data.

eQTL Hotspots: Certain genomic regions exhibit a significant excess of linkages of expression traits to genotypic variations.

Slide29

Evaluation measures [cont.]

An intra-cluster pair is a pair of elements that belong to a single cluster.

Jaccard Index = $\frac{|P_1 \cap P_2|}{|P_1 \cup P_2|}$, where $P_1$ and $P_2$ are the sets of intra-cluster pairs of the two clusterings being compared.

Sensitivity: the fraction of reference sets that are enriched for genes belonging to some cluster output by the method. [coverage]

Specificity: the fraction of clusters that are enriched for genes belonging to some reference set. [accuracy]
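The two measures can be sketched as below. The slide does not state how enrichment is tested, so this example assumes a one-sided hypergeometric test with a fixed p-value cutoff and no multiple-testing correction; the function names, the cutoff and the toy gene sets are all hypothetical.

```python
from math import comb

def hypergeom_pvalue(overlap, cluster_size, ref_size, universe):
    """P(X >= overlap) when cluster_size genes are drawn from the universe without replacement."""
    total = comb(universe, cluster_size)
    upper = min(cluster_size, ref_size)
    tail = sum(comb(ref_size, k) * comb(universe - ref_size, cluster_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

def is_enriched(cluster, ref, universe, cutoff=0.05):
    """Assumed enrichment criterion: non-empty overlap with hypergeometric p-value <= cutoff."""
    overlap = len(set(cluster) & set(ref))
    return overlap > 0 and hypergeom_pvalue(overlap, len(cluster), len(ref), universe) <= cutoff

def sensitivity(clusters, reference_sets, universe, cutoff=0.05):
    """Fraction of reference sets enriched in at least one cluster (coverage)."""
    hits = sum(any(is_enriched(c, r, universe, cutoff) for c in clusters)
               for r in reference_sets)
    return hits / len(reference_sets)

def specificity(clusters, reference_sets, universe, cutoff=0.05):
    """Fraction of clusters enriched for at least one reference set (accuracy)."""
    hits = sum(any(is_enriched(c, r, universe, cutoff) for r in reference_sets)
               for c in clusters)
    return hits / len(clusters)

universe_size = 20
clusters = [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
reference_sets = [[0, 1, 2, 3, 5], [15, 16, 17, 18, 19]]
sens = sensitivity(clusters, reference_sets, universe_size)
spec = specificity(clusters, reference_sets, universe_size)
print(sens, spec)
```

In the toy input only the first cluster and the first reference set overlap significantly, so each measure counts one hit out of two.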

 

Slide30

Systematic evaluation of the methods using diverse reference classes in yeast

Slide31

JointCluster yields better coverage of genes than a competing method

Comparing JointCluster against methods that integrate only a single coexpression network with a physical network.

Combined <glucose+ethanol> coexpression network and the physical network.

Comparing on fair terms for all algorithms:

Setting minimum cluster size parameter in Matisse to 10.

Size limit of 100 genes for JointCluster.

Co-clustering didn't have a parameter to directly limit cluster size.

Slide32

JointCluster yields better coverage of genes than a competing method

Slide33

Discussion

Slide34

Discussion

Heterogeneous large-scale datasets are accumulating at a rapid pace.

Efforts to integrate them are intensifying.

JointCluster provides a versatile approach to integrating any number of heterogeneous datasets.

Natural progression from clustering of single to multiple datasets.

Slide35

Discussion

Testing the JointCluster algorithm on simulated datasets.

Testing JointCluster on yeast empirical datasets.

More flexible than two-network clustering methods.

Results are consistent with known biology and extend our knowledge.

JointCluster can handle multiple heterogeneous networks.

Enables better coverage of genes, especially when knowledge of physical interactions is less complete.

Unsupervised and exploratory approach to data integration.

Slide36

Conclusion

Slide37

Conclusion

The challenge: integrating multiple datasets in order to study different aspects of biological systems.

Proposed simultaneous clustering of multiple networks.

Efficient solution that permits certain theoretical guarantees

Effective scaling heuristic

Flexibility to handle multiple heterogeneous networks

Results of JointCluster:

More robust, and can handle high false positive rates.

More consistently enriched for various reference classes.

Yielding better coverage.

Agree with known biology of yeast.

Slide38

Questions?

Slide39

Questions?

Bibliography:

Manikandan Narayanan, Adrian Vetta, Eric E. Schadt, Jun Zhu (2010), PLoS Computational Biology. Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets.

Supplementary Text for “Simultaneous clustering of multiple gene expression and physical interaction datasets”.

CPP Source code

Kannan R, Vempala S, Vetta A (2000), Proceedings Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp 367–377. On clusterings - good, bad and spectral.

Shi J, Malik J (2000), IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22: 888–905. Normalized cuts and image segmentation.

Andersen R, Lang KJ (2008), Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). pp 651–660. An algorithm for improving graph partitions.

Slide40

Thanks!