Slide1
Discovering Combinatorial Biomarkers
Vipin Kumar
kumar@cs.umn.edu
http://www.cs.umn.edu/~kumar
Department of Computer Science and Engineering
ICCABS, Feb 2012
Slide2
High-throughput technologies
DNA
Methylation
Proteins
Gene Expression & non-coding RNA
SNP
Structural Variation
Metabolites
Clinical Data, e.g. brain imaging
Adapted from E. Schadt
Data mining offers a potential solution for the analysis of these large-scale datasets
Novel associations between genotypes and phenotypes
Biomarker discovery for complex diseases
Personalized Medicine: automated analysis of patients' histories for customized treatment
Slide3
Biomarker Discovery and its Impact
Biomarkers:
Genes: BRCA1 (breast cancer)
Protein variants: IVS5-13insC (type 2 diabetes)
Pathways/networks: P53 (cancers)
Clinical Impact:
Diagnosis
Prognosis
Treatment
fMRI: Schizophrenia vs controls (Lim et al.)
Miki et al. 1994
Chiefari et al. 2011
Oren et al. 2010
Slide4
SNP as an illustration
NHGRI GWA Catalog
www.genome.gov/GWAStudies
Published Genome-wide Associations through 06/2010
904 published GWA at p ≤ 5×10^-8 for 165 traits
Slide5
SNP as an illustration
NHGRI GWA Catalog
www.genome.gov/GWAStudies
Published Genome-wide Associations through 06/2011
1,449 published GWA at p ≤ 5×10^-8 for 237 traits
50% increase in one year
Slide6
Challenge: Limitations of Single-locus Association Test
High coverage but low odds ratio (1.2)
High odds ratio (15.9) but low coverage (7%)
No significant associations
Many other studies
Slide7
An Example where Single-locus Test Led to No Significant Associations
Given a SNP data set of Myeloma patients, find SNPs that are associated with short vs. long survival.
3404 SNPs selected from various regions of the chromosome
70 cases (patients who survived less than 1 year)
73 controls (patients who survived longer than 3 years)
[Figure: data matrix of cases and controls by 3404 SNPs]
Top ranked SNP: -log10 P-value = 3.8; Odds Ratio = 3.7
Van Ness et al. 2008
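A rough check of why a top-ranked SNP with -log10 P-value = 3.8 does not count as significant among 3404 tests, as a minimal sketch. It assumes a Bonferroni correction at alpha = 0.05, which the slides do not state, so the threshold below is illustrative only. The function names are hypothetical, and the odds-ratio helper is the standard 2x2 definition rather than anything taken from the talk.

```python
import math


def bonferroni_neg_log10_threshold(num_tests, alpha=0.05):
    """-log10 of the per-test P-value cutoff under Bonferroni correction."""
    return -math.log10(alpha / num_tests)


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Standard odds ratio of a 2x2 case/control table."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


# Numbers taken from the slide: 3404 SNPs, top SNP at -log10 P = 3.8.
num_snps = 3404
top_snp_neg_log10_p = 3.8

# Assumed correction (not stated in the slides): Bonferroni at alpha = 0.05.
threshold = bonferroni_neg_log10_threshold(num_snps)
is_significant = top_snp_neg_log10_p >= threshold

print(f"Bonferroni threshold (-log10 P): {threshold:.2f}")
print(f"Top SNP significant after correction: {is_significant}")
```

Under this assumed correction the cutoff is about 4.83 on the -log10 scale, so the top SNP falls short even though its odds ratio is well above 1.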
Myeloma SNP data has signal, hence the need to discover combinations of SNPs
Myeloma Survival Data
Slide8
Single-locus Tests Ignore Genetic Interaction
Non-additive effect: “Genetic Interaction”
Extensively observed in model organisms, e.g. yeast, C. elegans, fly.
Ripke et al. 2011
Costanzo et al. 2010
Scholl et al. 2009
Ruzankina et al. 2009
Kamath, 2003
Slide9
The focus of this talk: Higher-order Combinatorial Biomarker
Complex biological system
Complex human diseases
Higher-order genetic buffering
A synthetic pattern
Disease
Control
Triple mutations only exist in disease subjects
Slide10
Discovering High-order Combinatorial Biomarkers
Challenge I: Computational Efficiency
Given n features, there are 2^n candidates!
How to effectively handle the combinatorial search space?
Brute-force search, e.g. MDR, can only handle 10~100 SNPs. [Ritchie et al. 2001]
Millions of users, thousands of items
Disqualified: prune all the supersets
The Apriori framework for efficient search of exponential space [Agrawal et al. 1994]
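The pruning idea can be sketched as follows. This is a minimal, generic level-wise search in the spirit of the Apriori framework, not the discriminative-pattern method presented in the talk. It assumes each subject is represented as a set of binary features; the feature names (snpA, snpB, ...) and the toy data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        candidates = set()
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) != k:
                    continue
                # Prune step: a candidate is disqualified as soon as one of its
                # (k-1)-subsets is infrequent, so its supersets are never built.
                if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count support only for the surviving candidates.
        frequent = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                frequent[cand] = support
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data: each set lists the features present in one subject.
subjects = [
    {"snpA", "snpB", "snpC"},
    {"snpA", "snpB", "snpC"},
    {"snpA", "snpB"},
    {"snpA", "snpD"},
    {"snpB", "snpC"},
]
frequent = apriori(subjects, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy run snpD appears in only one subject, so no combination containing it is ever generated or counted; that is where the savings over a brute-force scan of all 2^n candidates come from.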
Support-based pruning
Slide11
Discovering High-order Combinatorial Biomarkers
Challenge I: Computational Efficiency
Traditional Apriori-based pattern mining techniques
Designed for sparse data
Unique challenges of genomic datasets
High density
A SNP dataset has a density of 33.33%
Three binary columns per SNP, one for each of the three genotypes
High dimensionality
Makes the search more challenging
Disease heterogeneity
Each combination supported by a small fraction of subjects
A novel anti-monotonic objective function designed for mining low-support discriminative patterns from dense and high-dimensional data [Fang et al. TKDE 2010]
Slide12
Discovering High-order Combinatorial Biomarkers
Challenge II: Statistical Power
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness, Kumar, PLoS ONE, 2012]
Computational challenges can be addressed by
Better algorithm design, e.g. Apriori-based
High-performance computing
Statistical challenges call for additional efforts
Limited sample size
Huge number of hypothesis tests
Subsets having higher association
Subsets having lower association
Many combinations are trivial extensions of their subsets
Targeting patterns with better association than their subsets reduces # of hypothesis tests
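One way to make "better association than their subsets" concrete is sketched below. It rests on two assumptions that are not confirmed by the slides: association is measured as the difference in relative support between cases and controls, and the Jump of a pattern is its score minus the best score among its immediate (one feature smaller) subsets. The definitions in the cited PLoS ONE paper may differ, and the gene labels and data are hypothetical.

```python
from itertools import combinations


def support(pattern, subjects):
    """Fraction of subjects that carry every feature in the pattern."""
    if not subjects:
        return 0.0
    return sum(1 for s in subjects if pattern <= s) / len(subjects)


def diff_sup(pattern, cases, controls):
    """Assumed association measure: support in cases minus support in controls."""
    return support(pattern, cases) - support(pattern, controls)


def jump(pattern, cases, controls):
    """Assumed Jump: score of the pattern minus the best score of its size-(k-1) subsets."""
    pattern = frozenset(pattern)
    if len(pattern) < 2:
        raise ValueError("jump needs a pattern with at least two features")
    best_subset = max(
        diff_sup(frozenset(sub), cases, controls)
        for sub in combinations(pattern, len(pattern) - 1)
    )
    return diff_sup(pattern, cases, controls) - best_subset


# Hypothetical toy data: the triple g1+g2+g3 occurs only in cases.
cases = [{"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1"}]
controls = [{"g1", "g2"}, {"g2", "g3"}, {"g1", "g3"}, {"g1", "g2"}]

triple_jump = jump({"g1", "g2", "g3"}, cases, controls)
pair_jump = jump({"g1", "g2"}, cases, controls)
print("Jump of {g1, g2, g3}:", triple_jump)
print("Jump of {g1, g2}:", pair_jump)
```

With a filter of this kind, a pattern such as {g1, g2} that adds nothing over its best subset has a Jump of zero and would not be tested, while the triple that only exists in cases has a positive Jump and is kept.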
Myeloma Survival Data
Kidney Rejection Data
Lung Cancer Data
Slide13
High-order Combinatorial Biomarkers: an example
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]
[Fang, Pandey, Wang, Gupta, Steinbach and Kumar, IEEE TKDE, 2012]
Lung Cancer Data (data from Church et al. 2010)
All heavy smokers
[Figure: patients vs. control for the best size-1, size-2, size-3 and size-4 patterns and a size-5 pattern; label: Jump]
The five genes are functionally related
www.ingenuity.com
Slide14
Insights on High-order Functional Interactions
Patterns with positive Jump are functionally more coherent
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]
[Figure: lung cancer dataset, lung cancer vs. control for the best size-1, size-2, size-3 and size-4 patterns and a size-5 pattern; label: Jump]
[Figure panels: Kidney Rejection Data, Lung Cancer Data, Combined]
Slide15
High-order Combinations Discovered from Different Types of Data
[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]
[Fang, Pandey, Wang, Gupta, Steinbach and Kumar, IEEE TKDE, 2012]
mRNA: Breast Cancer (Survived (5-year) vs. Control; data from Vijver et al. 2002)
Metabolites: COPD (AE COPD vs. Stable COPD; data from Wendt et al. 2010)
SNP: acute kidney rejection (Rejection vs. No-rejection; data from Oetting et al. 2008)
The proposed framework is general enough to handle different types of data
Slide16
Biomarker Discovery using Error-tolerant Patterns
[Figure: example binary (0/1) data matrices, marked √ and X]
True patterns are fragmented due to noise and variability
Possible solution: Error-tolerant patterns
These patterns differ in the way errors/noise in the data are tolerated
[Yang et al. 2001]; [Pei et al. 2001]; [Seppanen et al. 2004]; [Liu et al. 2006]; [Cheng et al. 2006]; [Gupta et al., KDD 2008]; [Poernomo et al. 2009]
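The effect of noise on pattern support can be illustrated with a small sketch. It assumes one simple notion of error tolerance, namely that a row still supports a pattern if it misses at most a fixed fraction of the pattern's columns. The cited papers define tolerance in several different ways, so this is only one possible reading, and the 4x4 matrix and function names are hypothetical.

```python
def exact_support(pattern, rows):
    """Number of rows that contain a 1 in every column of the pattern."""
    return sum(1 for row in rows if all(row[i] == 1 for i in pattern))


def tolerant_support(pattern, rows, max_error_fraction):
    """Number of rows missing at most max_error_fraction of the pattern's columns."""
    allowed_missing = int(max_error_fraction * len(pattern))
    count = 0
    for row in rows:
        missing = sum(1 for i in pattern if row[i] == 0)
        if missing <= allowed_missing:
            count += 1
    return count


# Hypothetical noisy block: an all-ones 4x4 pattern with one flipped value per row.
rows = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
pattern = (0, 1, 2, 3)  # column indices

strict = exact_support(pattern, rows)
tolerant = tolerant_support(pattern, rows, max_error_fraction=0.25)
print("Exact support:", strict)
print("Error-tolerant support (25% errors allowed):", tolerant)
```

Under exact matching the four-column pattern has no support at all and would only surface as smaller fragments; allowing one missing value per row recovers all four rows.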
See Gupta et al. KDD 2008 for a survey
Slide17
Error-tolerant pattern vs. Traditional association patterns
Four breast cancer gene-expression data sets are used for experiments: GSE7390 + GSE6532 + GSE3494 + GSE1456
158 cases: patients with metastasis within 5 years of follow-up
433 controls: patients with no metastasis within 8 years of follow-up
Discriminative error-tolerant and traditional association patterns (case/control) are discovered and evaluated by enrichment analysis using MSigDB gene sets
A greater fraction of error-tolerant patterns enrich at least one gene set (higher precision)
A greater fraction of gene sets are enriched by at least one error-tolerant pattern (higher recall)
[Figure: two panels, each comparing error-tolerant patterns with traditional patterns]
Gupta et al. BICoB 2010; Gupta et al. BMC Bioinformatics 2011
Slide18
Differential Coexpression Patterns
Differential Expression (DE)
Traditional analysis targets changes of expression level
[Eisen et al. 1999], [Golub et al., 1999], [Pan 2002], [Cui and Churchill, 2003] etc.
Differential Coexpression (DC)
Changes of the coherence of gene expression
[Silva et al., 1995], [Li, 2002], [Kostka & Spang, 2005], [Rosemary et al., 2008], [Cho et al. 2009] etc.
Combinatorial Search
Genetic Heterogeneity calls for subspace analysis
Slide19
Subspace Differential Coexpression Analysis
[Figure labels: ≈ 60%, ≈ 10%]
Enriched with the TNF-α/NFκB signaling pathway (6/10 overlap with the pathway, corrected p value: 1.4×10^-3)
Suggests that the dysregulation of the TNF-α/NFκB pathway may be related to lung cancer
Selected for highlight talk, RECOMB SB 2010
Best Network Model award, Sage Congress, 2010
[Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]
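The difference between a change in expression level and a change in coexpression can be shown with a toy gene pair. This is a minimal pairwise sketch only: it compares the Pearson correlation of two genes across the two classes, and it is not the subspace method of the cited PSB 2010 paper, which works on gene sets and subsets of samples. The expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values of two genes in four cases and four controls.
gene_a_cases, gene_b_cases = [1, 2, 3, 4], [1, 2, 3, 4]
gene_a_controls, gene_b_controls = [1, 2, 3, 4], [4, 1, 3, 2]

# Both genes have the same mean in both classes, so neither is differentially expressed.
mean_shift_a = sum(gene_a_cases) / 4 - sum(gene_a_controls) / 4
mean_shift_b = sum(gene_b_cases) / 4 - sum(gene_b_controls) / 4

# The coherence between the two genes, however, differs between the classes.
r_cases = pearson(gene_a_cases, gene_b_cases)
r_controls = pearson(gene_a_controls, gene_b_controls)
differential_coexpression = r_cases - r_controls

print("Mean shift (gene A, gene B):", mean_shift_a, mean_shift_b)
print("Correlation in cases:", round(r_cases, 3))
print("Correlation in controls:", round(r_controls, 3))
print("Difference in correlation:", round(differential_coexpression, 3))
```

A differential expression test would find nothing for this pair, while the correlation drops from perfectly coherent in cases to weakly negative in controls.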
Three lung cancer datasets [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007]
Slide20
Combinatorial Biomarkers: Summary
Higher-order combinations
Important for understanding complex human diseases
A novel framework
Improved computational efficiency
Enhanced statistical power
Naturally handles disease heterogeneity
Error-tolerance
Different types of differentiation: coexpression
General to handle different types of data
SNP
Gene expression
Metabolomic data
Brain imaging data (e.g. fMRI)
Slide21
References
G. Fang, R. Kuang, G. Pandey, M. Steinbach, C.L. Myers, and V. Kumar. Subspace differential coexpression analysis: problem definition and a general approach. Pacific Symposium on Biocomputing, 15:145-156, 2010.
G. Fang, G. Pandey, W. Wang, M. Gupta, M. Steinbach, and V. Kumar. Mining low-support discriminative patterns from dense and high-dimensional data. IEEE TKDE, 24(2):279-294, 2012.
G. Fang, M. Haznadar, W. Wang, H. Yu, M. Steinbach, T. Church, W. Oetting, B. Van Ness, and V. Kumar. High-order SNP Combinations Associated with Complex Diseases: Efficient Discovery, Statistical Power and Functional Interactions. PLoS ONE, in press, 2012.
R. Gupta, N. Rao, and V. Kumar. Discovery of error-tolerant biclusters from noisy gene expression data. BMC Bioinformatics, 12(S12):S1, 2011.
R. Gupta, S. Agrawal, N. Rao, Z. Tian, R. Kuang, and V. Kumar. Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining. Proc. of the International Conference on Bioinformatics and Computational Biology (BICoB), 2010.
G. Atluri, R. Gupta, G. Fang, G. Pandey, M. Steinbach, and V. Kumar. Association Analysis Techniques for Bioinformatics Problems. Proceedings of the 1st International Conference on Bioinformatics and Computational Biology (BICoB), pp 1-13, 2009.
S. Landman, V. Kumar, M. Steinbach, and H. Yu. Identification of Co-occurring Insertions in Cancer Genomes Using Association Analysis. International Journal of Data Mining and Bioinformatics, in press, 2012.
M. Steinbach, H. Yu, G. Fang, and V. Kumar. Using constraints to generate and explore higher order discriminative patterns. Advances in Knowledge Discovery and Data Mining, pages 338-350, 2011.
S. Dey, G. Atluri, M. Steinbach, A. MacDonald, K. Lim, and V. Kumar. A pattern mining based integrative framework for biomarker discovery. Tech report, Department of Computer Science, University of Minnesota, (002), 2012.
G. Pandey, C. Myers, and V. Kumar. Incorporating functional inter-relationships into protein function prediction algorithms. BMC Bioinformatics, 10(1):142, 2009.
G. Pandey, B. Zhang, A.N. Chang, C.L. Myers, J. Zhu, V. Kumar, and E.E. Schadt. An integrative multi-network and multi-classifier approach to predict genetic interactions. PLoS Computational Biology, 6(9):e1000928, 2010 (cited as one of the major computational biology breakthroughs of 2010 by a Nature Biotechnology feature article).
J. Bellay, G. Atluri, T.L. Sing, K. Toufighi, M. Costanzo, P.S.M. Ribeiro, G. Pandey, J. Baller, B. VanderSluis, M. Michaut, et al. Putting genetic interactions in context through a global modular decomposition. Genome Research, 21(8):1375-1387, 2011.
Slide22
Acknowledgement
Kumar Lab, Data Mining: Gang Fang, Wen Wang, Vanja Paunic, Yi Yang, Benjamin Oatley, Xiaoye Liu, Sanjoy Dey, Gowtham Atluri, Gaurav Pandey, Michael Steinbach
Myers Lab, FuncGenomics: Jeremy Bellay, Chad Myers
Kuang Lab, Compbio: TaeHyun Hwang, Rui Kuang
Wendt Lab, Lung Disease: Chris Wendt
Masonic Cancer Center: Tim Church, Bill Oetting
McDonald Lab, Behavior: Angus McDonald
Van Ness Lab, Myeloma: Brian Van Ness
Lim Lab, Brain Imaging: Kelvin Lim
Mayo Clinic-IBM-UMR fellowship, Walter Barnes Lang fellowship, NSF: #IIS0916439, UMII seed grant, BICB seed grant. Computations enabled by the Minnesota Supercomputing Institute. BioMedical Genomics Center at University of Minnesota, International Myeloma Foundation. Etiology and Early Marker Study program of the Prostate Lung Colorectal and Ovarian Cancer Screening Trial
Slide23
Thanks!