Slide1
Microarray Gene Expression Analysis
Differential expression, clustering, networks, and functional enrichment
STEMREM 201, Fall 2012
Aaron Newman, Ph.D.
10/17/12
Slide2
A genomics approach to biology involves…
- Finding significant patterns in high-throughput data
- Interpreting these patterns in the context of prior knowledge
- Generating new hypotheses and predictions
A plethora of techniques and tools exists. My goal is to introduce some practical, powerful, and freely available methods for gene expression analysis.
Slide3
Example workflow
Gene expression profiling
DEGs
Cluster analysis
GSEA
Functional enrichment
Network analysis
Biological meaning
Slide4
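Materials and Methods
Gene expression profiling: GSE22651 (ESCs and differentiated cells)
DEGs: Excel & GenePattern
Cluster analysis: Hierarchical & AutoSOME
GSEA: GSEA
Functional enrichment: DAVID, Toppfun
Network analysis: STRING & Cytoscape
Biological meaning
Slide5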
Microarray normalization
Probe/array-level normalization (reduce inter-array technical errors and noise)
- Raw CEL files are merged into a normalized text file
- Robust Multi-Chip Averaging (RMA) for Affymetrix arrays
  - Affymetrix Gene Expression Console software
- Quantile normalization for Agilent, Illumina, and others
  - Sets all arrays to the same distribution (mean, median, sd, etc.)
Analysis-level normalization
- Log2 transformation
  - Improves statistical properties for analysis (log-normal)
- Median/mean centering
  - Useful for reducing the impact of transcript abundance on identifying/visualizing co-expressed genes
- Unit variance
  - Standardize each column (array) to a mean of 0 and a standard deviation of 1
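The analysis-level steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a small genes-by-arrays matrix of positive intensities, and the function names and toy values are hypothetical, not taken from the course materials.

```python
import math
import statistics


def log2_transform(matrix):
    """Log2-transform every value (all values must be positive)."""
    return [[math.log2(v) for v in row] for row in matrix]


def median_center_rows(matrix):
    """Subtract each gene's (row's) median so its values are centered on 0."""
    centered = []
    for row in matrix:
        med = statistics.median(row)
        centered.append([v - med for v in row])
    return centered


def unit_variance_columns(matrix):
    """Standardize each array (column) to mean 0 and standard deviation 1."""
    scaled_columns = []
    for col in zip(*matrix):
        mean = statistics.mean(col)
        sd = statistics.pstdev(col)  # population SD; sample SD is an equally valid choice
        scaled_columns.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled_columns)]


# Toy matrix (hypothetical values): rows are genes, columns are arrays.
raw = [
    [8.0, 16.0, 32.0],
    [2.0, 4.0, 8.0],
    [64.0, 64.0, 128.0],
]
logged = log2_transform(raw)
centered = median_center_rows(logged)
standardized = unit_variance_columns(logged)
```

A column whose values are all identical has a standard deviation of 0 and would need special handling before the unit-variance step.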
Slide6
Differentially expressed genes (DEGs)
Goal: given known phenotypes, determine which genes exhibit significant differential expression.
Many genes will be tested, so a multiple hypothesis testing correction should be applied to control false positives, typically the false discovery rate (FDR).
- Bonferroni (α / n; controls the family-wise error rate and is more conservative than FDR methods)
- Benjamini-Hochberg (largest k s.t. P(k) ≤ (αk) / n)
- Storey Q-value (p-value-specific FDR)
  - Q-value software: http://genomics.princeton.edu/storeylab/qvalue/
Sample permutations can improve p-value accuracy
- *Only apply with ≥ 10 samples per class
- Implemented in GenePattern (ComparativeMarkerSelection module), SAM
Slide7
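The Bonferroni and Benjamini-Hochberg rules listed above can be sketched as follows. This is a minimal illustration, not the implementation used by GenePattern or SAM; the function names and the toy p-values are assumptions made for the example.

```python
def bonferroni_threshold(alpha, n):
    """Per-test p-value cutoff under Bonferroni: alpha / n."""
    return alpha / n


def benjamini_hochberg(p_values, alpha):
    """Return one boolean per input p-value: True if significant at FDR alpha.

    Finds the largest rank k with P(k) <= alpha * k / n and calls every
    p-value with rank <= k significant (the step-up rule).
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            largest_k = rank
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            significant[idx] = True
    return significant


# Toy p-values (hypothetical).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh_calls = benjamini_hochberg(p_values, alpha=0.05)
cutoff = bonferroni_threshold(0.05, len(p_values))
bonf_calls = [p <= cutoff for p in p_values]
```

In this toy example Benjamini-Hochberg keeps the two smallest p-values while Bonferroni keeps only the smallest, which reflects the more conservative nature of Bonferroni.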
Supervised DEG identification tools (i.e., classes are known)
Likely familiar
- Excel (T-test) coupled with FDR assessment
  - Watch for conversion to dates! (e.g., MARCH6 → 3/6)
Basic to intermediate
- Significance Analysis of Microarrays (SAM; Windows only)
- GenePattern (Broad Institute)
- p.adjust (R)
Advanced
- Bioconductor packages in R (e.g., limma)
Slide8
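The date-conversion warning above can be checked programmatically after a file has passed through Excel. The sketch below is a rough heuristic, assuming mangled symbols appear as strings such as "6-Mar" or "3/6"; the exact form depends on Excel's locale settings, and the function name and example symbols are illustrative only.

```python
import re

# Matches strings such as "6-Mar", "2-Sep", or "3/6": forms a gene symbol
# like MARCH6 may take after a spreadsheet converts it into a date.
DATE_LIKE = re.compile(
    r"(\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
    r"|\d{1,2}/\d{1,2}(/\d{2,4})?)",
    re.IGNORECASE,
)


def find_date_like_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.fullmatch(s.strip())]


# Illustrative gene column after a round trip through a spreadsheet.
symbols = ["POU5F1", "6-Mar", "NANOG", "2-Sep", "SOX2"]
suspicious = find_date_like_symbols(symbols)
```

Any hits should be restored from the original, unconverted file, since the original symbol cannot always be recovered from the date.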
Tutorial 1: Identifying DEGs in Excel using FDR cutoff of 1% (BH method)
- Open “DEG_example_large.txt” in Excel (2 classes, 25151 genes).
- In column L, use the T-test function to test for significant differential expression between ESCs and non-ESCs.
  =TTEST(“ESCs”, “non-ESCs”, 2, 3)
  2-tailed, unpaired with unequal variance; “ESCs” and “non-ESCs” stand for the cell ranges of each class.
- Sort p-values in column L in ascending order.
- In column M, input the p-value rank, going from 1, 2, 3 … n.
- Input the following formula into column N to test for an FDR of 1%.
  In general: = (0.01 * rank) / n
  In our case: = (0.01 * M2) / 25151
- Autocomplete column N.
- In column O, test for p-values that do not exceed the FDR threshold of 1%.
  = IF(L2 <= N2, 1, 0)
That’s it! All genes with a 1 in column O are significant at an FDR of 1%.
Note: strictly, the BH procedure calls every gene ranked at or above the last row flagged 1 significant, including any earlier rows flagged 0.
There should be 3237 significant genes.
Slide9
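For readers curious about what the unequal-variance T-test in the Excel tutorial above computes, here is a minimal sketch of the Welch t statistic and its Welch-Satterthwaite degrees of freedom. The function name and the toy expression values are hypothetical; converting the statistic into a two-tailed p-value additionally needs the t-distribution, which this sketch leaves out.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for an unequal-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    se_sq = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Toy log2 expression values for one gene (hypothetical).
escs = [10.0, 11.0, 12.0, 11.0, 11.0]
non_escs = [7.0, 8.0, 6.0, 7.0, 7.0]
t_stat, dof = welch_t(escs, non_escs)
```

If SciPy is available, `scipy.stats.ttest_ind(escs, non_escs, equal_var=False)` returns the same statistic together with the two-tailed p-value.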
Cleaning up result and creating a heat map
- Calculate log2 fold change in column P (the difference of class means, assuming the values are already log2-transformed):
  = AVERAGE(B2:F2) - AVERAGE(G2:K2)
- Autocomplete.
- Use a filter to isolate significant genes with absolute log2 fold change ≥ 5.
- Copy and paste into a new tab.
- Sort by fold change in descending order.
- Save as a new file (e.g., DEG_example_sorted_fdr01_fold5.txt).
- Center genes in Cluster 3.0:
  - Open the file.
  - Go to “Adjust Data”.
  - Check “Center genes” and select “Median”.
  - Press “Apply”, then save the file (e.g., DEG_example_sorted_fdr01_fold5_medcen.txt).
- Open in Java TreeView. To customize heat map display and text, use “Settings>Pixel Settings” and “Settings>Font Settings”.
- Export the heat map image using “Export>Save Tree Image”.
Slide10
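The fold-change filter and sort described above can be sketched outside Excel as well. This is a minimal illustration, assuming log2-scale values and a simple gene-to-values mapping; the function names, gene labels, and numbers are made up for the example.

```python
import statistics


def log2_fold_change(values, class_a_idx, class_b_idx):
    """Difference of class means; a log2 fold change if values are log2-scale."""
    mean_a = statistics.mean([values[i] for i in class_a_idx])
    mean_b = statistics.mean([values[i] for i in class_b_idx])
    return mean_a - mean_b


def filter_and_sort(expression, class_a_idx, class_b_idx, min_abs_lfc):
    """Keep genes with |log2 fold change| >= cutoff, sorted in descending order."""
    scored = [
        (gene, log2_fold_change(values, class_a_idx, class_b_idx))
        for gene, values in expression.items()
    ]
    kept = [(gene, lfc) for gene, lfc in scored if abs(lfc) >= min_abs_lfc]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two samples per class, values already log2-scale.
expression = {
    "geneA": [12.0, 12.0, 6.0, 6.0],
    "geneB": [5.0, 5.0, 5.5, 5.5],
    "geneC": [2.0, 3.0, 8.0, 9.0],
}
result = filter_and_sort(expression, [0, 1], [2, 3], min_abs_lfc=5)
```

In the tutorial the filter is applied only to genes already flagged as significant; the sketch omits that step for brevity.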
Expected result
FDR ≤ 1%
Absolute fold change ≥ 2^5 (log2 fold change of 5): 79 genes
Slide11
Tutorial 2: Identifying DEGs in GenePattern
- Create an account (if you have not already) and log in to GenePattern.
- Select “Differential Expression Analysis”.
- Skip to ComparativeMarkerSelection (step 2) and click “Open module”.
- Enter the following fields:
  - input* file = GSEA_example_expression_large.gct
  - cls* file = GSEA_example_classes.cls
  - number of permutations* = 0
- Select “Run”.
- When processing is finished, select “ComparativeMarkerSelectionViewer” from the pulldown menu next to “GSEA_example_expression_large.comp.marker.odf”.
- Select “Run”.
- Select “Open Visualizer” and “Allow”.
- A table will appear showing all genes and various statistics, including:
  - Benjamini-Hochberg corrected p-value: FDR(BH)
  - Storey q-value: Q Value
  - Bonferroni p-value
  - Fold change
Slide12
Extracting data and displaying as a heat map
- Pipe the .odf file to the “ExtractComparativeMarkerResults” module.
- Select genes with BH FDR ≤ 1%.
- Filtered data are available as a new .gct file.
- To display a heat map, pipe the .gct file to the “HeatMapViewer” module.
- Select “Run”.
Slide13
Example workflow
Gene expression profiling
DEGs
Cluster analysis
GSEA
Functional enrichment
Network analysis
Biological meaning
Slide14
The “Clustering Problem” for Large Data Sets
Widespread importance
- Genomics
- Phylogenetics
- Disease
- Galaxy clusters
- etc.
Slide15
Common clustering methods
Figure 1, D’haeseleer, Nat Biotechnol. 2005 (panels: toy data set; hierarchical (Eisen); K-means; SOM)
Slide16
Feature Comparison
Columns: Method | Handle Large Datasets | Diverse Cluster Shapes | Detect Cluster Number | Identify Outliers | Low Output Variance
Hierarchical: √, √*, √
K-Means: √, √*
Self-Organizing Map (SOM): √
Slide17
Tutorial 3: Hierarchical Clustering
- Open “GSE22651_filtered.txt” in Cluster 3.0.
- Normalize data (Adjust Data tab):
  - Check “Log transform data”.
  - Check “Center genes” and select “Median”.
  - Press the “Apply” button.
- Cluster data (Hierarchical tab):
  - Check “Cluster” under Genes and under Arrays.
  - Leave “Similarity Metric” at Uncentered Correlation.
  - Press “Centroid linkage” under “Cluster method”.
    - Centroid = distance between cluster centers
    - Single = closest distance between clusters
    - Complete = farthest distance between clusters
    - Average = mean of all pairwise distances between clusters
- Open the clustered data table file (*.cdt) in Java TreeView
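The four linkage definitions above can be made concrete with a short sketch. It assumes a distance of 1 minus the uncentered correlation (the cosine of the angle between two profiles), and it is not Cluster 3.0's own code; the function names and toy profiles are illustrative.

```python
import math


def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm


def distance(x, y):
    """Distance derived from the similarity metric: 1 - uncentered correlation."""
    return 1.0 - uncentered_correlation(x, y)


def centroid(cluster):
    """Per-dimension mean of all profiles in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under each linkage rule."""
    pairwise = [distance(x, y) for x in cluster_a for y in cluster_b]
    return {
        "single": min(pairwise),                    # closest pair
        "complete": max(pairwise),                  # farthest pair
        "average": sum(pairwise) / len(pairwise),   # mean of all pairs
        "centroid": distance(centroid(cluster_a), centroid(cluster_b)),
    }


# Toy expression profiles (hypothetical).
cluster_a = [[1.0, 0.0], [1.0, 1.0]]
cluster_b = [[0.0, 1.0]]
links = linkage_distances(cluster_a, cluster_b)
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, so the choice of rule changes how readily clusters merge.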
Slide18
Java TreeView
- Navigate the cluster tree to highlight genes with distinct expression patterns in particular samples.
- Use Export>“Save List” to copy or save gene lists of interest for further analysis.
Slide19
Expected result
[Clustered heat map: genes (rows) by samples (columns)]
Slide20
AutoSOME
Automatic clustering of Self-Organizing Map Ensembles
Serial application of:
- SOM
- Density Equalization
- Minimum Spanning Tree
- Ensemble Averaging
Newman and Cooper (2010) BMC Bioinformatics, 11:117
Slide21
AutoSOME
Columns: Method | Handle Large Datasets | Diverse Cluster Shapes | Detect Cluster Number | Identify Outliers | Low Output Variance
Hierarchical: √, √*, √
K-Means: √, √*
Self-Organizing Map (SOM): √
Affinity Propagation: √*, √, √, √
Spectral Clustering: √*, √, √
nNMF: √, √, √
AutoSOME: √, √, √, √, √
Slide22
AutoSOME Webstart
http://jimcooperlab.mcdb.ucsb.edu/autosome
Slide23
Tutorial 4: Identifying discrete clusters without prior knowledge of cluster number
- Launch the AutoSOME GUI via the large launch button.
- Open “GSE22651_filtered.txt”.
- Skip filtering.
- Show “Basic Fields”.
  - Set p-value = 0.05.
- Show “Input Adjustment”.
  - Check Log2 Scaling, Unit Variance, Median Center, and Sum Squares = 1.
- Press the large “Run” button on the left.
- From the menu bar, select View>heat map>green red.
- Select cluster 1 in the cluster list.
- The data are rendered as a normalized heat map. To change the display, go to View>settings>image settings.
  - Under the Normalization tab, check “Display Original Data”, “Log2 Scaling”, and “Median Center”.
  - Check “Manually adjust range for contrast” and set minimum to -2 and maximum to 2. Press “Update” (lower left corner).
Slide24
AutoSOME continued…
- Select clusters 1 to 5 (hold down shift).
- Right-click in the heat map window to resize.
- Set “Zoom Factor” to 40 and press “Save”.
- See the website for further tutorials and documentation.
Representative heat map (View>heat map>rainbow)
Slide25
Example workflow
Gene expression profiling
DEGs
Cluster analysis
GSEA
Functional enrichment
Network analysis
Biological meaning
Slide26
How to interpret clustering results
Gene set functional annotation
- DAVID: http://david.abcc.ncifcrf.gov/
- Toppfun: http://toppgene.cchmc.org/enrichment.jsp
- MSigDB: http://www.broadinstitute.org/gsea/msigdb/
Network analysis
- STRING: http://string-db.org/
Slide27
Gene sets: where do they come from?
- Gene ontology
  - An attempt to semantically organize genes and their functional relationships.
  - Data are arranged in a graph structure, from broad to specific.
  - Ontologies:
    - Biological process (BP): series of ordered events
    - Molecular function (MF): activities that occur at the molecular level
    - Cellular component (CC): part of a cell
- Biocarta/KEGG pathways (curated wiring diagrams)
- High-throughput studies
  - e.g., MSigDB
Slide28
Tutorial 5: Functional annotation of a large cluster using DAVID
- In AutoSOME, find the cluster of genes with higher expression in stem cells.
- Go to View>raw data. Highlight all genes in the cluster and copy the list.
- Open the DAVID homepage (http://david.abcc.ncifcrf.gov/) and press “Start Analysis”.
- Select the “Upload” tab and paste in the gene list.
- Under “Select Identifier,” select “OFFICIAL_GENE_SYMBOL”.
- Select “Gene List” for “List Type”.
- Submit the list.
- Press OK at the multi-species warning message.
- Select “Homo sapiens” and press “Select Species”.
- Select “Functional Annotation Tool”.
- Press the “Functional Annotation Clustering” button.
Slide29
DAVID Output
Pathways
Gene sets
Similar gene sets are clustered together, eliminating redundancy and facilitating interpretation.
Slide30
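The idea behind the functional enrichment output above is to ask whether a gene set is over-represented in a gene list. A generic way to express this is a one-sided hypergeometric test, sketched below. This is an assumption-laden illustration, not DAVID's own statistic (DAVID reports a modified Fisher exact p-value), and the function name and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, list_size, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: number of genes in the background
    annotated:  background genes that belong to the gene set
    list_size:  number of genes in the submitted list
    overlap:    submitted genes that belong to the gene set
    """
    upper = min(annotated, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 4 in the gene set,
# a list of 3 genes of which 2 fall in the gene set.
p_value = hypergeom_enrichment_p(population=10, annotated=4, list_size=3, overlap=2)
```

Because many gene sets are tested at once, the resulting p-values still need a multiple-testing correction such as Benjamini-Hochberg.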
Tutorial 6: Protein-protein associations with STRING
Top 50 non-ESC genes
Slide31
Protein-protein association network among top 50 non-ESC genes
Evidence types
Slide32
Top 100 non-ESC genes
Slide33
Example workflow
Gene expression profiling
DEGs
Cluster analysis
GSEA
Functional enrichment
Network analysis
Biological meaning
Slide34
Gene set enrichment in a ranked list
Gene Set Enrichment Analysis (GSEA)
- “Threshold-less”: arbitrary DEG cutoffs are avoided
- Two modes of operation to rank input genes:
  - Rank by differential expression between phenotypes
    - Default metric is signal to noise, defined as: (avg[class 1] - avg[class 2]) / (sd[class 1] + sd[class 2])
  - Pre-ranked according to user-defined criteria
- Evaluates statistical bias in the distribution of each defined gene set over the list of ranked input genes.
Slide35
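The ranking step and the notion of "bias in the distribution of a gene set over the ranked list" described above can be sketched as follows. The signal-to-noise function follows the formula given above; the enrichment score is a simplified, unweighted running sum, whereas the GSEA software uses a weighted statistic and permutation testing, so treat this as an illustration only. All names and values are hypothetical.

```python
import statistics


def signal_to_noise(class1, class2):
    """(avg[class 1] - avg[class 2]) / (sd[class 1] + sd[class 2])."""
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2)
    )


def rank_genes(expression):
    """Order genes from most class-1-biased to most class-2-biased."""
    scores = {g: signal_to_noise(c1, c2) for g, (c1, c2) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: step up at hits, down at misses.

    Requires at least one hit and at least one miss in the ranked list.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy data (hypothetical): gene -> (class 1 values, class 2 values).
expression = {
    "g1": ([10.0, 11.0, 12.0], [1.0, 2.0, 3.0]),
    "g2": ([5.0, 6.0, 7.0], [4.0, 5.0, 6.0]),
    "g3": ([1.0, 2.0, 3.0], [10.0, 11.0, 12.0]),
    "g4": ([4.0, 5.0, 6.0], [5.0, 6.0, 7.0]),
}
ranked = rank_genes(expression)
es = enrichment_score(ranked, {"g1", "g2"})
```

A gene set concentrated at the top of the ranked list yields a score near 1, one concentrated at the bottom yields a score near -1, and a randomly scattered set stays close to 0.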
Input format
Input:
- Expression data (gene cluster text file format, *.gct)
- Classes (*.cls)
File extensions matter!
- If using Notepad in Windows, set “Save as type:” to “All Files (*.*)”.
- If using TextEdit on a Mac, go to Preferences > Open and Save and uncheck ‘Add “.txt” extension to plain text files’.
Formatting instructions: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
Slide36
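The two input files named above are plain tab- or space-delimited text, so they can also be generated programmatically. The sketch below reflects my understanding of the GCT (version 1.2) and categorical CLS layouts in the formatting instructions linked above; verify against that page before relying on it. Function names, sample names, and values are hypothetical.

```python
def to_gct(sample_names, rows):
    """Build GCT text. rows is a list of (name, description, values) tuples."""
    lines = [
        "#1.2",                                   # format version line
        f"{len(rows)}\t{len(sample_names)}",      # number of genes, number of samples
        "\t".join(["NAME", "Description"] + list(sample_names)),
    ]
    for name, description, values in rows:
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


def to_cls(labels):
    """Build CLS text from one class label per sample, in sample order."""
    class_names = []
    for label in labels:
        if label not in class_names:
            class_names.append(label)
    indices = [str(class_names.index(label)) for label in labels]
    header = f"{len(labels)} {len(class_names)} 1"
    return f"{header}\n# {' '.join(class_names)}\n{' '.join(indices)}\n"


# Toy example (hypothetical samples and values).
gct_text = to_gct(["s1", "s2", "s3", "s4"], [("geneA", "na", [7.1, 6.9, 2.0, 2.2])])
cls_text = to_cls(["ESC", "ESC", "nonESC", "nonESC"])
```

When writing these strings to disk, the file names must end in .gct and .cls, as the slide above emphasizes.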
Tutorial 7: GSEA
Load data
Run GSEA
Input files: GSEA_example_expression_large.gct, GSEA_example_classes.cls
Slide37
Output
Slide38
Summary
Gene expression profiling
DEGs
Cluster analysis
GSEA
Functional enrichment
Network analysis
Biological meaning