Slide1
Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets
Manikandan Narayanan, Adrian Vetta, Eric E. Schadt, Jun Zhu
PLoS Computational Biology 2010
Presented by: Tal Saiag
Seminar in Algorithmic Challenges in Analyzing Big Data* in Biology and Medicine; with Prof. Ron Shamir @TAU
Slide2
Outline
Basic Terminology
Introduction
JointCluster: A simultaneous clustering algorithm
Results
Discussion
Conclusion
Slide3
Basic Terminology
Slide4
Basic Terminology
Cell – building block of life
Contains nucleus
Chromosome – genetic material
Forms the genome – DNA
Gene – stretch of DNA
Proteins – multi-functional workers
Slide5
Basic Terminology
Gene expression: from gene to protein
Transcription
RNA
Translation
Transcription factor
Slide6
Basic Terminology
DNA microarray / chips
Measures expression of genes
Condition specific
Slide7
Introduction
Slide8
Introduction
Genome-wide datasets provide different views of the biology of a cell.
Physical interactions (protein-protein) and regulatory interactions (protein-DNA) maintain and regulate the cell's processes.
Expression of molecules (proteins or transcripts of genes) provides a snapshot of the cell's state.
Researchers have exploited the complementarity of these two types of data.
Slide9
Introduction
Integrating the physical and expression datasets.
Developing an efficient solution for combined analysis of multiple networks.
Goal: find common clusters of genes supported by all of the networks of interest.
Finding the optimal clustering is computationally intractable for large networks.
The solution comes with theoretical guarantees (it reasonably approximates the optimal clustering).
Slide10
JointCluster: A simultaneous clustering algorithm
Slide11
Declarations
A cut refers to a partition of nodes in a graph into two sets.
A cut is called sparse-enough in a graph if the ratio of edges crossing the cut in the graph to the edges incident at the smaller side of the cut is smaller than a threshold specific to the graph.
Inter-cluster edges: edges with endpoints in different clusters.
Connectedness of a cluster in a graph: the cost of a set of edges is the ratio of their weight to the total edge weight in the graph.
Slide12
Algorithm outline
Approximate the sparsest cut in each input graph using a spectral method.
Choose among them any cut that is sparse-enough in the corresponding graph; this is the chosen cut.
Recurse on the two node sets of the chosen cut.
Stop when well-connected node sets with no sparse-enough cuts are obtained.
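The recursion in this outline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the graph representation (a dict from node pairs to weights), the function names, and the toy graphs are all assumptions, and an exhaustive search stands in for the spectral approximation of the sparsest cut, so it is only usable on tiny graphs.

```python
from itertools import combinations

def degree(graph, nodes):
    """Total weight of edge ends at `nodes`, measured in the whole graph."""
    return sum(w for (u, v), w in graph.items() for x in (u, v) if x in nodes)

def conductance(graph, side, cluster):
    """Weight crossing the cut (side, cluster - side) over the smaller side's degree."""
    other = cluster - side
    crossing = sum(w for (u, v), w in graph.items()
                   if (u in side and v in other) or (u in other and v in side))
    smaller = min(degree(graph, side), degree(graph, other))
    return crossing / smaller if smaller > 0 else 0.0

def sparsest_cut(graph, cluster):
    """Exhaustive search (assumption: a stand-in for the spectral method)."""
    nodes = sorted(cluster)
    best = None
    for size in range(1, len(nodes) // 2 + 1):
        for side in combinations(nodes, size):
            phi = conductance(graph, set(side), cluster)
            if best is None or phi < best[0]:
                best = (phi, set(side))
    return best

def joint_cluster(graphs, thresholds, cluster):
    """Recurse until no input graph has a sparse-enough cut in `cluster`."""
    if len(cluster) < 2:
        return [cluster]
    for graph, threshold in zip(graphs, thresholds):
        phi, side = sparsest_cut(graph, cluster)
        if phi < threshold:  # sparse-enough in this graph: split and recurse
            return (joint_cluster(graphs, thresholds, side)
                    + joint_cluster(graphs, thresholds, cluster - side))
    return [cluster]

# Toy input: a complete graph (no sparse cut) and two triangles joined by one edge.
complete = {(u, v): 1.0 for u, v in combinations(range(6), 2)}
triangles = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
             (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (2, 3): 1.0}
clusters = joint_cluster([complete, triangles], [0.2, 0.2], set(range(6)))
print(clusters)
```

On this toy input the complete graph offers no sparse-enough cut, the second graph does, and the recursion stops at the two triangles because both are well connected in both graphs.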
Slide13
Definitions
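Graph $G = (V, E, w)$ with edge weight $w(u, v) \ge 0$ for any node pair $(u, v)$.
Total weight of any edge set $X$: $w(X) = \sum_{(u,v) \in X} w(u, v)$.
For any node sets $S, T$: $w(S, T) = \sum_{u \in S,\, v \in T} w(u, v)$.
Define $a(S) = w(S, V)$.
Total edge weight in the graph: $a(V)/2$.
Singletons:
Slide14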
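The Conductance
The conductance of a cut $(S, C \setminus S)$ in a node set $C$: $\phi(S) = \frac{w(S, C \setminus S)}{\min(a(S),\, a(C \setminus S))}$, i.e., the weight of the edges crossing the cut divided by the total weight of the edges incident at the smaller side.
The inter-cluster edges: $X = \{(u, v) \in E\}$ where $u$ and $v$ belong to different clusters.
An $(\alpha, \varepsilon)$ clustering of $G$ is a partition of its nodes into clusters such that:
The conductance of the clustering is at least $\alpha$, and
The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the total edge weight in the graph; i.e., $w(X) \le \varepsilon \, a(V)/2$.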
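A small worked example may help fix these quantities. The toy graph is an assumption made for illustration: six nodes, unit-weight edges forming the triangles {1,2,3} and {4,5,6}, plus one unit-weight edge (3,4). Here $a(S)$ denotes the total weight of the edges incident at $S$, with an edge inside $S$ counted twice, so that the total edge weight is $a(V)/2$.

```latex
\begin{align*}
S &= \{1,2,3\}, \qquad C = V = \{1,\dots,6\} \\
w(S, C \setminus S) &= 1, \qquad a(S) = 2 + 2 + 3 = 7, \qquad a(C \setminus S) = 7 \\
\phi(S) &= \frac{1}{\min(7, 7)} = \frac{1}{7} \approx 0.143 \\
w(X) &= 1, \qquad \frac{a(V)}{2} = \frac{14}{2} = 7, \qquad \frac{w(X)}{a(V)/2} = \frac{1}{7}
\end{align*}
```

Inside the triangle {1,2,3}, the sparsest cut separates node 3 from {1,2} and has conductance $2 / \min(3, 4) = 2/3$, so splitting this graph into its two triangles satisfies both conditions for any $\alpha \le 2/3$ and any $\varepsilon \ge 1/7$.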
Slide15
Approximating the sparsest cut
Since finding the sparsest cut in a graph is an NP-hard problem, an approximation algorithm for the problem is used.
Efficient spectral techniques.
Spectral Algorithm:
Find the top $k$ right singular vectors $v_1, \dots, v_k$ of the input matrix $A$ (using SVD).
Let $C$ be the matrix whose $j$th column is given by $A v_j$.
Place row $i$ in cluster $j$ if $C_{ij}$ is the largest entry in the $i$th row of $C$.
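The three steps of the spectral algorithm can be sketched as follows. This is a hedged illustration only: to stay dependency-free it approximates the singular vectors by power iteration with deflation on $A^T A$ rather than calling a library SVD, it compares entries by magnitude because the sign of a singular vector is arbitrary, and the function names and the toy matrix are assumptions.

```python
import math
import random

def top_right_singular_vectors(A, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of A.

    Assumption: power iteration with deflation on A^T A replaces a library SVD.
    """
    n = len(A[0])
    # B = A^T A; its eigenvectors are the right singular vectors of A.
    B = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    vectors = []
    for _ in range(k):
        x = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(t * t for t in y))
            if norm == 0.0:
                break
            x = [t / norm for t in y]
        eigenvalue = sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))
        vectors.append(x)
        # Deflate: subtract the component that was just found.
        B = [[B[i][j] - eigenvalue * x[i] * x[j] for j in range(n)] for i in range(n)]
    return vectors

def spectral_cluster_rows(A, k):
    """Place row i in cluster j when |(A v_j)_i| is the largest entry of row i of C."""
    vectors = top_right_singular_vectors(A, k)
    labels = []
    for row in A:
        scores = [abs(sum(a * v for a, v in zip(row, vec))) for vec in vectors]
        labels.append(scores.index(max(scores)))
    return labels

# Toy matrix with two obvious blocks of rows.
A = [[1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1]]
labels = spectral_cluster_rows(A, 2)
print(labels)
```

With `k = 2` the same routine yields a two-way split, which is how a single cut could be obtained from it; for real data a library SVD would be the natural replacement for the power iteration.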
Slide16
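Multiple Graphs
Consider graphs $G_1, \dots, G_m$ on a common node set $V$, where graph $G_i$ has weight function $w_i$ and total edge weight $a_i(V)/2$.
An $(\alpha_1, \dots, \alpha_m, \varepsilon)$ simultaneous clustering:
The conductance of the clustering is at least $\alpha_i$ in graph $G_i$ for all $i$, and
The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the total edge weight in all graphs; i.e., $\sum_i w_i(X) \le \varepsilon \sum_i a_i(V)/2$.
Inter-cluster edge cost:
A cut in $G_i$ is sparse enough if the conductance of the cut is at most the threshold specific to $G_i$.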
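The second condition of a simultaneous clustering can be checked directly. The sketch below computes the fraction of the total edge weight, summed over all graphs, that falls on inter-cluster edges; a clustering meets the condition when this fraction is at most the allowed bound. The graph representation (a dict from node pairs to weights, one entry per undirected edge), the function name, and the toy graphs are assumptions for illustration.

```python
def inter_cluster_fraction(graphs, clustering):
    """Inter-cluster edge weight over total edge weight, summed across all graphs."""
    label = {node: idx for idx, cluster in enumerate(clustering) for node in cluster}
    cut_weight = 0.0
    total_weight = 0.0
    for graph in graphs:
        for (u, v), w in graph.items():
            total_weight += w
            if label[u] != label[v]:  # endpoints in different clusters
                cut_weight += w
    return cut_weight / total_weight

# Two toy graphs on the same four nodes.
g1 = {(0, 1): 1.0, (2, 3): 1.0, (1, 2): 1.0}
g2 = {(0, 1): 2.0, (2, 3): 2.0, (0, 3): 1.0}
fraction = inter_cluster_fraction([g1, g2], [{0, 1}, {2, 3}])
print(fraction)
```

Here the inter-cluster edges weigh 1 in each graph and the total weight is 3 + 5 = 8, so the clustering loses a quarter of the total edge weight.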
Slide17
Scaling Heuristic
A mixture graph is defined for each input graph at each scale, with its own weight function.
The heuristic finds sparsest cuts in mixture graphs.
The heuristic starts with the sum graph to control edges lost in all graphs, and transitions through a series of mixture graphs that approach the individual graphs to refine the clusters.
Slide18
Cut Selection Heuristic
The scaling heuristic is combined with a cut selection heuristic.
Choose the cut that is sparse-enough in the largest number of input graphs.
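A minimal sketch of this selection rule, assuming the candidate cuts have already been produced (for example, one approximate sparsest cut per graph) and that each graph has its own conductance threshold. The function names, the dict-based graph representation, and the toy graphs are illustrative assumptions.

```python
def degree(graph, nodes):
    """Total weight of edge ends at `nodes`, measured in the whole graph."""
    return sum(w for (u, v), w in graph.items() for x in (u, v) if x in nodes)

def conductance(graph, side, cluster):
    """Weight crossing the cut (side, cluster - side) over the smaller side's degree."""
    other = cluster - side
    crossing = sum(w for (u, v), w in graph.items()
                   if (u in side and v in other) or (u in other and v in side))
    smaller = min(degree(graph, side), degree(graph, other))
    return crossing / smaller if smaller > 0 else 0.0

def select_cut(graphs, thresholds, candidate_cuts, cluster):
    """Pick the candidate that is sparse-enough in the largest number of graphs."""
    def votes(side):
        return sum(conductance(g, side, cluster) < t
                   for g, t in zip(graphs, thresholds))
    best = max(candidate_cuts, key=votes)
    return best, votes(best)

# Three toy graphs on four nodes; two of them agree on the split {0,1} | {2,3}.
nodes = {0, 1, 2, 3}
g1 = {(0, 1): 1.0, (2, 3): 1.0, (1, 2): 0.1}
g2 = {(0, 1): 1.0, (2, 3): 1.0, (0, 3): 0.1}
g3 = {(0, 2): 1.0, (1, 3): 1.0, (0, 1): 0.1}
chosen, support = select_cut([g1, g2, g3], [0.2, 0.2, 0.2], [{0, 2}, {0, 1}], nodes)
print(chosen, support)
```

In this example the cut {0,1} is sparse-enough in two graphs and the cut {0,2} in only one, so the former is selected.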
Slide19
Overview
Slide20
Overview
Slide21
Overall framework
The modularity score of a cluster in a graph is the fraction of edges contained within the cluster minus the fraction expected by chance.
The partition of the nodes that respects the clustering tree and optimizes the min-modularity score can be found by dynamic programming.
The resulting clusters are ordered by their min-modularity scores.
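The following sketch shows one way the modularity score and the dynamic program over the clustering tree could fit together. Several choices here are assumptions, not taken from the slides: the chance term uses the standard degree-based null model, the min-modularity of a cluster is taken as its minimum modularity over all graphs, the objective is the sum of min-modularity scores over the chosen clusters, and the tree is a nested `(nodes, children)` tuple.

```python
def modularity(graph, cluster):
    """Fraction of edge weight inside `cluster` minus the degree-based expectation."""
    total = sum(graph.values())  # total edge weight
    inside = sum(w for (u, v), w in graph.items() if u in cluster and v in cluster)
    ends = sum(w for (u, v), w in graph.items() for x in (u, v) if x in cluster)
    return inside / total - (ends / (2 * total)) ** 2

def min_modularity(graphs, cluster):
    """Worst modularity of `cluster` over all input graphs (assumed definition)."""
    return min(modularity(g, cluster) for g in graphs)

def best_partition(tree, graphs):
    """Dynamic program: keep a tree node whole or use the best split of its children."""
    nodes, children = tree
    own = min_modularity(graphs, nodes)
    if not children:
        return own, [nodes]
    child_score, child_parts = 0.0, []
    for child in children:
        score, parts = best_partition(child, graphs)
        child_score += score
        child_parts += parts
    if own >= child_score:
        return own, [nodes]
    return child_score, child_parts

# Two toy graphs: the same two triangles, joined by a different bridge edge in each.
g_a = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
       (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (2, 3): 1.0}
g_b = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
       (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (0, 5): 1.0}
tree = (set(range(6)), [({0, 1, 2}, []), ({3, 4, 5}, [])])
score, parts = best_partition(tree, [g_a, g_b])
print(score, parts)
```

The whole node set always scores zero under this definition, while each triangle scores 3/7 - 1/4 in both graphs, so the program returns the two triangles.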
Slide22
Parameter learning
We desire an unsupervised method for learning the conductance threshold for each network of interest.
Algorithm for each graph $G_i$:
Cluster only $G_i$ using JointCluster, setting its conductance threshold, without loss of generality, to the maximum possible value.
Set the threshold of $G_i$ to the minimum conductance threshold that would result in the same set of clusters.
Goal: automatically choose a threshold that is both sufficiently low and sufficiently high.
Slide23
Results
Slide24
Benchmarking JointCluster on simulated data
Alternative algorithms:
Tree: choose one of the input graphs as a reference, cluster this single graph using an efficient spectral clustering method to obtain a clustering tree, and parse this tree into clusters using the min-modularity score computed from all graphs.
Coassociation: cluster each graph separately using a spectral method, combine the resulting clusters from different graphs into a coassociation graph, and cluster this graph using the same spectral method.
Parametr
Slide25
Evaluation measures
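An intra-cluster pair is a pair of elements that belong to a single cluster.
Jaccard Index = $\frac{|P_1 \cap P_2|}{|P_1 \cup P_2|}$, where $P_1$ and $P_2$ are the sets of intra-cluster pairs of the two clusterings being compared.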
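As a sketch of how this measure could be computed, the code below builds the set of intra-cluster pairs of each clustering and takes the ratio of shared pairs to pairs appearing in either clustering. This is the standard pair-counting form of the Jaccard index, assumed here to be the one intended; the function names and the toy clusterings are illustrative.

```python
from itertools import combinations

def intra_cluster_pairs(clustering):
    """All unordered pairs of elements that share a cluster."""
    return {frozenset(pair)
            for cluster in clustering
            for pair in combinations(sorted(cluster), 2)}

def jaccard_index(clustering_a, clustering_b):
    """Shared intra-cluster pairs over pairs that are intra-cluster in either clustering."""
    pairs_a = intra_cluster_pairs(clustering_a)
    pairs_b = intra_cluster_pairs(clustering_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

reference = [{0, 1, 2}, {3, 4, 5}]
predicted = [{0, 1}, {2, 3}, {4, 5}]
index = jaccard_index(reference, predicted)
print(index)
```

The reference clustering has six intra-cluster pairs and the predicted one has three, two of which are shared, giving 2/7.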
Slide26
Benchmarking JointCluster on simulated data
Slide27
Systematic evaluation of the methods using diverse reference classes in yeast
Two yeast strains grown under two conditions, where glucose or ethanol was the predominant carbon source.
Coexpression networks using all 4,482 profiled genes as nodes.
Weight of an edge: the absolute value of the Pearson correlation coefficient between the expression profiles of the two genes.
Physical gene-protein interactions (from various interaction databases): a total of 41,660 non-redundant interactions.
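The edge-weight rule for the coexpression networks is simple to sketch. In the code below the gene names and expression profiles are made up for illustration, and the network is stored as a dict from gene pairs to weights; only the weight definition (absolute Pearson correlation) comes from the slide. A constant profile would have zero variance and is not handled.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coexpression_network(profiles):
    """Complete weighted graph: weight = |Pearson correlation| of the two profiles."""
    genes = sorted(profiles)
    return {(g, h): abs(pearson(profiles[g], profiles[h]))
            for i, g in enumerate(genes) for h in genes[i + 1:]}

# Hypothetical expression profiles over four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with geneA
    "geneD": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with geneA
}
network = coexpression_network(profiles)
print(network[("geneA", "geneC")])
```

Taking the absolute value means that strongly anti-correlated genes, such as geneA and geneC here, are connected as strongly as co-regulated ones.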
Slide28
Different classes of yeast genes
GO Process: genes in each reference set in this class are annotated to the same GO Biological Process term.
TF (Transcription Factor) Perturbations: genes in each set have altered expression when a TF is deleted or overexpressed.
Compendium of Perturbations: genes in each set have altered expression under deletions of specific genes, or chemical perturbations.
TF Binding Sites: genes in a set have binding sites of the same TF in their upstream genomic regions, with sites predicted using ChIP binding data.
eQTL Hotspots: certain genomic regions exhibit a significant excess of linkages of expression traits to genotypic variations.
Slide29
Evaluation measures [cont.]
An intra-cluster pair is a pair of elements that belong to a single cluster.
Jaccard Index = $\frac{|P_1 \cap P_2|}{|P_1 \cup P_2|}$, where $P_1$ and $P_2$ are the sets of intra-cluster pairs of the two clusterings being compared.
Sensitivity: the fraction of reference sets that are enriched for genes belonging to some cluster output by the method. [coverage]
Specificity: the fraction of clusters that are enriched for genes belonging to some reference set. [accuracy]
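A rough sketch of these two measures follows. The slides do not say how enrichment is tested, so the code assumes a one-sided hypergeometric test with a fixed p-value cutoff and no multiple-testing correction; the cutoff value, function names, and toy gene sets are all illustrative assumptions.

```python
from math import comb

def hypergeom_pvalue(overlap, cluster_size, ref_size, universe):
    """P(X >= overlap) when cluster_size genes are drawn from `universe` genes,
    of which ref_size belong to the reference set."""
    upper = min(cluster_size, ref_size)
    total = comb(universe, cluster_size)
    tail = sum(comb(ref_size, k) * comb(universe - ref_size, cluster_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

def enriched(cluster, ref, universe, cutoff):
    """Assumed enrichment rule: hypergeometric p-value at or below the cutoff."""
    return hypergeom_pvalue(len(cluster & ref), len(cluster), len(ref), universe) <= cutoff

def sensitivity(clusters, refs, universe, cutoff=0.05):
    """Fraction of reference sets enriched in at least one cluster."""
    hits = sum(any(enriched(c, r, universe, cutoff) for c in clusters) for r in refs)
    return hits / len(refs)

def specificity(clusters, refs, universe, cutoff=0.05):
    """Fraction of clusters enriched for at least one reference set."""
    hits = sum(any(enriched(c, r, universe, cutoff) for r in refs) for c in clusters)
    return hits / len(clusters)

# Toy universe of 20 genes, three clusters, two reference sets.
clusters = [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}, {10, 11, 12, 13, 14}]
refs = [{0, 1, 2, 3, 5}, {15, 16, 17, 18, 19}]
sens = sensitivity(clusters, refs, universe=20)
spec = specificity(clusters, refs, universe=20)
print(sens, spec)
```

Only the first cluster overlaps a reference set strongly enough to pass the cutoff, so one of two reference sets is covered and one of three clusters is accurate.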
Slide30
Systematic evaluation of the methods using diverse reference classes in yeast
Slide31
JointCluster yields better coverage of genes than a competing method
Comparing JointCluster against methods that integrate only a single coexpression network with a physical network.
Combined <glucose+ethanol> coexpression network and the physical network.
Comparing on fair terms for all algorithms:
Setting the minimum cluster size parameter in Matisse to 10.
Size limit of 100 genes for JointCluster.
Co-clustering didn't have a parameter to directly limit cluster size.
Slide32
JointCluster yields better coverage of genes than a competing method
Slide33
Discussion
Slide34
Discussion
Heterogeneous large-scale datasets are accumulating at a rapid pace.
Efforts to integrate them are intensifying.
JointCluster provides a versatile approach to integrating any number of heterogeneous datasets.
Natural progression from clustering of single to multiple datasets.
Slide35
Discussion
Testing the JointCluster algorithm on simulated datasets.
Testing JointCluster on yeast empirical datasets.
More flexible than two-network clustering methods.
Consistent with known biology, and extends our knowledge.
JointCluster can handle multiple heterogeneous networks.
Enables better coverage of genes, especially when knowledge of physical interactions is less complete.
Unsupervised and exploratory approach to data integration.
Slide36
Conclusion
Slide37
Conclusion
The challenge: integrating multiple datasets in order to study different aspects of biological systems.
Proposed simultaneous clustering of multiple networks.
Efficient solution that permits certain theoretical guarantees
Effective scaling heuristic
Flexibility to handle multiple heterogeneous networks
Results of JointCluster:
More robust, and can handle high false positive rates.
More consistently enriched for various reference classes.
Yielding better coverage.
Agree with known biology of yeast.
Slide38
Questions?
Slide39
Bibliography:
Narayanan M, Vetta A, Schadt EE, Zhu J (2010). Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets. PLoS Computational Biology.
Supplementary Text for “Simultaneous clustering of multiple gene expression and physical interaction datasets”.
CPP Source code
Kannan R, Vempala S, Vetta A (2000). On clusterings - good, bad and spectral. Proceedings Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp 367–377.
Shi J, Malik J (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22: 888–905.
Andersen R, Lang KJ (2008). An algorithm for improving graph partitions. Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). pp 651–660.
Slide40
Thanks!