Microarray Gene Expression Analysis

Uploaded On 2017-08-10

Differential expression clustering networks and functional enrichment STEMREM 201 Fall 2012 Aaron Newman PhD 101712 A genomics approach to biology involves A plethora of ID: 577418

Presentation Transcript

Slide1

Microarray Gene Expression Analysis

Differential expression, clustering, networks, and functional enrichment

STEMREM 201 Fall 2012

Aaron Newman, Ph.D.

10/17/12

Slide2

A genomics approach to biology involves…

Finding significant patterns in high-throughput data

Interpreting these patterns in the context of prior knowledge

Generating new hypotheses and predictions

A plethora of techniques and tools exist…

My goal is to introduce some practical, powerful, and freely available methods for gene expression analysis

Slide3

Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning

Slide4

Materials and Methods

GSE22651 (ESCs and differentiated cells)

Excel & GenePattern

Hierarchical & AutoSOME

GSEA

DAVID, Toppfun

STRING & Cytoscape

Biological meaning

Slide5

Microarray normalization

Probe/array-level normalization (reduce inter-array technical errors and noise)

Raw CEL files → merged and normalized text file

Robust Multi-Chip Averaging (RMA) for Affymetrix arrays

Affymetrix Gene Expression Console software

Quantile Normalization for Agilent, Illumina, and others.

Sets all arrays to the same distribution (mean, median, sd, etc.)

Analysis-level normalization

Log2 Transformation

Improves statistical properties for analysis (log-normal)

Median/mean centering

Useful for reducing impact of transcript abundance on identifying/visualizing co-expressed genes

Unit variance

Standardize each column (array) to mean of 0 and standard deviation of 1
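The normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used by RMA, Cluster 3.0, or AutoSOME. It assumes the expression matrix is a list of rows (genes) by columns (arrays), that all values are positive before the log transform, and that ties in quantile normalization can be broken by position; the function names and toy values are hypothetical.

```python
import math
from statistics import mean, median, pstdev


def log2_transform(matrix):
    """Log2-transform every value (all values must be positive)."""
    return [[math.log2(v) for v in row] for row in matrix]


def median_center_genes(matrix):
    """Subtract each gene's (row's) median so its profile is centered on 0."""
    centered = []
    for row in matrix:
        m = median(row)
        centered.append([v - m for v in row])
    return centered


def unit_variance_arrays(matrix):
    """Standardize each array (column) to mean 0 and standard deviation 1.

    Assumption: population standard deviation; a constant column would
    divide by zero and is not handled here.
    """
    scaled_cols = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def quantile_normalize(matrix):
    """Give every array (column) the same distribution.

    The reference distribution is the mean of the sorted columns; each value
    is replaced by the reference value at its rank within its own column.
    """
    columns = list(zip(*matrix))
    n_rows = len(matrix)
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[i] for col in sorted_cols) for i in range(n_rows)]
    normalized_cols = []
    for col in columns:
        order = sorted(range(n_rows), key=lambda i: col[i])
        new_col = [0.0] * n_rows
        for rank, i in enumerate(order):
            new_col[i] = reference[rank]
        normalized_cols.append(new_col)
    return [list(row) for row in zip(*normalized_cols)]


# Toy matrix: 3 genes (rows) x 3 arrays (columns), hypothetical values.
raw = [[8.0, 16.0, 32.0], [4.0, 4.0, 16.0], [2.0, 8.0, 2.0]]
print(median_center_genes(log2_transform(raw)))
```

Log transformation followed by median centering of genes mirrors the "Adjust Data" options used in the Cluster 3.0 tutorials; unit variance and quantile normalization operate on columns instead.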

Slide6

Differentially expressed genes (DEGs)

Goal: given known phenotypes, determine which genes exhibit significant differential expression.

Many genes will be tested, so a multiple hypothesis testing correction should be applied, typically to control the false discovery rate (FDR).

Bonferroni (α / n); controls the family-wise error rate and is the most conservative option

Benjamini-Hochberg (largest k s.t. P(k) ≤ (αk) / n)

Storey Q-value (p-value specific FDR)

Q-value software - http://genomics.princeton.edu/storeylab/qvalue/

Sample permutations can improve p-value accuracy*

*Only apply with ≥ 10 samples per class

Implemented in GenePattern (ComparativeMarkerSelection module), SAM

Slide7

Supervised DEG identification tools (i.e., classes are known)

Likely familiar

Excel (T-test) coupled with FDR assessment

Watch for conversion to dates! (e.g., MARCH6 → 3/6)

Basic to Intermediate

Significance Analysis of Microarrays (SAM; Windows only)

GenePattern (Broad Institute)

p.adjust (R)

Advanced

Bioconductor packages in R (e.g., limma)

Slide8

Tutorial 1: Identifying DEGs in Excel using FDR cutoff of 1% (BH method)

Open “DEG_example_large.txt” in Excel (2 classes, 25151 genes).

In column L, use T-test function to test for significant differential expression between ESCs and non-ESCs.

=TTEST(“ESCs”, “non-ESCs”, 2, 3), where “ESCs” and “non-ESCs” stand for the cell ranges of each class

2-tailed, unpaired with unequal variance

Sort p-values in column L in ascending order.

In column M, input p-value rank, going from 1,2,3…n.

Input following formula into column N to test for FDR of 1%

In general: = (0.01 * rank) / n

In our case: = (0.01 * M2) / 25151

Autocomplete column N

In column O, test for p-values that do not exceed FDR of 1%

= if(L2 <= N2, 1, 0)

That’s it! All genes with a 1 in column O are significant at an FDR of 1%.
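The rank-and-threshold procedure from the Excel tutorial can also be sketched in Python. This is a minimal illustration of the Benjamini-Hochberg rule quoted earlier (largest k such that P(k) ≤ (αk) / n), with Bonferroni alongside for comparison; the function names and the toy p-values are hypothetical, and the p-values are assumed to have been computed already (e.g., by a t-test).

```python
def benjamini_hochberg(p_values, fdr=0.01):
    """Return one boolean per p-value: True if significant at the given FDR."""
    n = len(p_values)
    # Indices of the p-values in ascending order (rank 1 = smallest p-value).
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k with P(k) <= (fdr * k) / n.
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / n:
            largest_k = rank
    # Every p-value ranked at or below k is declared significant.
    significant = [False] * n
    for i in order[:largest_k]:
        significant[i] = True
    return significant


def bonferroni(p_values, alpha=0.01):
    """Return one boolean per p-value: True if p <= alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


# Toy p-values (hypothetical), tested at a 5% level so the example is visible.
p = [0.01, 0.04, 0.03, 0.04]
print(benjamini_hochberg(p, fdr=0.05))
print(bonferroni(p, alpha=0.05))
```

One difference from the spreadsheet is worth noting: the column O formula checks each row against its own threshold, whereas the step-up rule above also accepts every p-value ranked below the largest passing rank, so the two counts can differ slightly.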

There should be 3237 significant genes.

Slide9

Cleaning up result and creating a heat map

Calculate log2 fold change in column P

= AVERAGE(B2:F2) – AVERAGE(G2:K2)

Autocomplete

Use filter to isolate significant genes with absolute log2 fold change ≥ 5

Copy and paste into new tab

Sort by fold change in descending order

Save as new file (e.g., DEG_example_sorted_fdr01_fold5.txt)

Center genes in Cluster 3.0

Open file

Go to “Adjust Data”

Check “Center genes” and select “Median”

Press “Apply”, then save file (e.g., DEG_example_sorted_fdr01_fold5_medcen.txt)

Open in Java TreeView. To customize heat map display and text, use “Settings>Pixel Settings”, and “Settings>Font Settings”.

Export heat map image using “Export>Save Tree Image”.

Slide10

Expected result

FDR ≤ 1%

Absolute fold change ≥ 2^5 (log2 fold change of ±5): 79 genes

Slide11

Tutorial 2: Identifying DEGs in GenePattern

Create account (if you have not already) and log in to GenePattern

Select “Differential Expression Analysis”.

Skip to ComparativeMarkerSelection (step 2) and click “Open module”.

Enter following fields:

input* file = GSEA_example_expression_large.gct

cls* file = GSEA_example_classes.cls

number of permutations* = 0

Select “Run”

When processing is finished, select “ComparativeMarkerSelectionViewer” from the pulldown menu next to “GSEA_example_expression_large.comp.marker.odf”

Select “Run”

Select “Open Visualizer” and “Allow”

A table will appear showing all genes and various statistics, including:

Benjamini_Hochberg corrected p-value; FDR(BH)

Storey q-value; Q Value

Bonferroni p-value

Fold change

Slide12

Extracting data and displaying as a heat map

Pipe .odf file to “ExtractComparativeMarkerResults” module.

Select genes with BH FDR ≤ 1%.

Filtered data are available as a new .gct file.

To display a heat map, pipe .gct file to “HeatMapViewer” module.

Select “Run”

Slide13

Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning

Slide14

The “Clustering Problem” for Large Data Sets

Widespread Importance

Genomics

Phylogenetics

Disease

Galaxy Clusters

etc., etc., etc.

Slide15

Common clustering methods

Figure 1. D’haeseleer, Nat Biotechnol. 2005

Toy data set

Hierarchical (Eisen)

K-means

SOM

Slide16

Feature Comparison

Method, Handle Large Datasets, Diverse Cluster Shapes, Detect Cluster Number, Identify Outliers, Low Output Variance

Hierarchical √*

K-Means √*

Self-Organizing Map (SOM) √

Slide17

Tutorial 3: Hierarchical Clustering

Open “GSE22651_filtered.txt” in Cluster 3.0

Normalize data (Adjust Data tab)

Check “Log transform data”

Check “Center genes” and select “Median”

Press the “Apply” button

Cluster data (Hierarchical tab)

Check “Cluster” under Genes and under Arrays

Leave “Similarity Metric” at Uncentered Correlation

Press “Centroid linkage” under “Cluster method”

Centroid = distance between cluster centers

Single = closest distance between clusters

Complete = farthest distance between clusters

Average = mean of all pairwise distances between clusters
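The four linkage definitions above can be made concrete with a short Python sketch. It is an illustration only, not Cluster 3.0's code: clusters are assumed to be lists of numeric tuples, Euclidean distance is used for simplicity (the tutorial itself leaves the similarity metric at uncentered correlation), and the function names and toy points are hypothetical. An uncentered correlation helper is included, computed as the sum of products divided by the product of the root sums of squares.

```python
import math
from itertools import product
from statistics import mean


def centroid(cluster):
    """Coordinate-wise mean of all points in a cluster."""
    return tuple(mean(axis) for axis in zip(*cluster))


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the named linkage method."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":      # closest distance between clusters
        return min(pairwise)
    if method == "complete":    # farthest distance between clusters
        return max(pairwise)
    if method == "average":     # mean of all pairwise distances
        return mean(pairwise)
    if method == "centroid":    # distance between cluster centers
        return math.dist(centroid(cluster_a), centroid(cluster_b))
    raise ValueError(f"unknown linkage method: {method}")


def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm


# Two toy clusters of 2-D points (hypothetical values).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
for method in ("single", "complete", "average", "centroid"):
    print(method, round(linkage_distance(a, b, method), 3))
```

With single linkage the two clusters look closest and with complete linkage farthest, which is why the choice of linkage can change the shape of the resulting tree.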

Open clustered data table file (*.cdt) in Java TreeView.

Slide18

Java TreeView

Navigate the cluster tree to highlight genes with distinct expression patterns in particular samples

Export>“Save List” to copy or save gene lists of interest for further analysis.

Slide19

Expected result

Genes

Samples

Slide20

Automatic clustering of Self-Organizing Map Ensembles

AutoSOME

Serial application of

- SOM

- Density Equalization

- Minimum Spanning Tree

- Ensemble Averaging

Newman and Cooper (2010) BMC Bioinformatics, 11:117

Slide21

AutoSOME

Method, Handle Large Datasets, Diverse Cluster Shapes, Detect Cluster Number, Identify Outliers, Low Output Variance

Hierarchical √*

K-Means √*

Self-Organizing Map (SOM)

Affinity Propagation √*

Spectral Clustering √*

nNMF

AutoSOME √

Slide22

AutoSOME Webstart

http://jimcooperlab.mcdb.ucsb.edu/autosome

Slide23

Tutorial 4: Identifying discrete clusters without prior knowledge of cluster number

Launch the AutoSOME GUI via the large launch button.

Open “GSE22651_filtered.txt”

Skip filtering

Show “Basic Fields”

Set p-value = 0.05

Show “Input Adjustment”

Check Log2 Scaling, Unit Variance, Median Center, and Sum Squares = 1

Press the large “Run” button on the left.

From the menubar, select View>heat map>green red.

Select cluster 1 in the cluster list.

The data are rendered as a normalized heat map. To change the display, go to View>settings>image settings.

Under the Normalization tab, check “Display Original Data”, “Log2 Scaling”, and “Median Center”.

Check “Manually adjust range for contrast” and set minimum to -2 and maximum to 2. Press “Update” (lower left corner).

Slide24

AutoSOME continued…

Select clusters 1 to 5 (hold down shift).

Right click mouse in heat map window to resize.

Set “Zoom Factor” to 40 and press “Save”.

See website for further tutorials and documentation.

Representative heat map

View>heat map>rainbow

Slide25

Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning

Slide26

How to interpret clustering results

Gene set functional annotation

DAVID: http://david.abcc.ncifcrf.gov/

Toppfun: http://toppgene.cchmc.org/enrichment.jsp

MSigDB: http://www.broadinstitute.org/gsea/msigdb/

Network analysis

STRING: http://string-db.org/

Slide27

Gene sets: where do they come from?

Gene ontology

An attempt to semantically organize genes and their functional relationships.

Data are arranged in a graph structure, from broad to specific

Ontologies:

Biological process (BP): series of ordered events

Molecular function (MF): activities that occur at the molecular level

Cellular component (CC): part of a cell

Biocarta/KEGG pathways (curated wiring diagrams)

High-throughput studies

e.g., MSigDB

Slide28

Tutorial 5: Functional annotation of a large cluster using DAVID

In AutoSOME, find the cluster of genes with higher expression in stem cells.

Go to View>raw data. Highlight all genes in the cluster and copy the list.

Open the DAVID homepage and press “Start Analysis”

Select “Upload” tab and paste in gene list.

Under “Select Identifier,” select “OFFICIAL_GENE_SYMBOL”.

Select “Gene List” for “List Type”.

Submit the list.

Press OK at multi-species warning message.

Select “Homo sapiens” and press “Select Species”.

Select “Functional Annotation Tool”

Press “Functional Annotation Clustering” button.

http://david.abcc.ncifcrf.gov/

Slide29

DAVID Output

Pathways

Gene sets

Similar gene sets are clustered together, eliminating redundancy and facilitating interpretation

Slide30

Tutorial 6: Protein-protein associations with STRING

Top 50 non-ESC genes

Slide31

Protein-protein association network among top 50 non-ESC genes

Evidence types

Slide32

Top 100 non-ESC genes

Slide33

Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning

Slide34

Gene set enrichment in a ranked list

Gene Set Enrichment Analysis (GSEA)

“Threshold-less”

Arbitrary DEG cutoffs are avoided

Two modes of operation to rank input genes:

Rank by differential expression between phenotypes

Default metric is signal to noise, defined as: (avg[class 1] – avg[class 2]) / (sd[class 1] + sd[class 2])

Pre-ranked according to user-defined criteria

Evaluates statistical bias in the distribution of each defined gene set over the list of ranked input genes.

Slide35

Input format

Input:

Expression data (gene cluster text file format, *.gct)

Classes (*.cls)

File extensions matter!

If using Notepad in Windows, set “Save as type:” to “All Files (*.*)”

If using TextEdit in Mac, go to Preferences > Open and Save and uncheck ‘Add “.txt” extension to plain text files’.

Formatting instructions: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

Slide36

Tutorial 7: GSEA

Load data

Run GSEA

Input files: GSEA_example_expression_large.gct,

GSEA_example_classes.cls
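Before running the tutorial, it may help to see how the default signal-to-noise ranking metric, (avg[class 1] – avg[class 2]) / (sd[class 1] + sd[class 2]), orders genes. The sketch below is an illustration under stated assumptions, not the GSEA software's implementation: it uses the sample standard deviation, does not reproduce any special handling GSEA applies to very small standard deviations, and the gene names, class labels, and values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """(avg[class 1] - avg[class 2]) / (sd[class 1] + sd[class 2])."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


def rank_genes(expression, labels, class1, class2):
    """Rank genes from most class1-associated to most class2-associated.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of class labels, one per sample, in the same order
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class1]
        b = [v for v, lab in zip(values, labels) if lab == class2]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 2 genes measured in 3 ESC and 3 non-ESC samples.
labels = ["ESC"] * 3 + ["non-ESC"] * 3
expression = {
    "GENE_A": [10.0, 12.0, 14.0, 4.0, 5.0, 6.0],
    "GENE_B": [1.0, 2.0, 3.0, 7.0, 9.0, 11.0],
}
for gene, score in rank_genes(expression, labels, "ESC", "non-ESC"):
    print(gene, round(score, 3))
```

Genes at the top of the list are higher in class 1 and genes at the bottom are higher in class 2; GSEA then asks whether the members of each gene set cluster toward either end of this ranking.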

Slide37

Output

Slide38

Summary

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning