Slide1
Part1: Large-scale gene expression (transcriptomic) data analysis
Ståle Nygård, Bioinformatics core facility, OUS/UiO
staaln@ifi.uio.no
Slide2
Gene expression
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product.
Slide3
Transcriptomic data
- “Genome-wide” measurements of gene expression (several thousand gene transcripts)
- Are often used to find differentially expressed genes
  - Between groups of individuals (with different phenotypes, e.g. disease/healthy, long/short survival etc.)
  - Over time (e.g. as disease develops, as tissue develops)
Slide4
Development of transcriptomics
1977: Multiple Northern blots
1987: Macroarrays
1995: cDNA microarrays
1996: Oligonucleotide microarrays
2003: High density arrays
2005: High throughput sequencing (RNA sequencing)
Future: Next-next generation sequencing: true single molecule sequencing, e.g. NanoPore technology (http://www.nanoporetech.com)
Slide5
Alternative splicing (example)
Slide6
Microarrays vs RNA-Seq
Slide7
Microarray pipeline (simplified)
Sample → Nucleic acid purification → RNA/DNA → Amplification and labelling → Labeled RNA/DNA → Hybridisation, washing → Scan, quantitate → Raw data → Pre-processing → Bioinformatic analysis
Slide8
RNA sequencing
Slide9
RNA sequencing
Slide10
RNA sequencing
Slide11
Bioinformatic analysis of RNA sequencing data - main steps
- Alignment to transcriptome
- Assembly (finding isoforms)
- Count reads (per isoform or gene)
- Normalization
- Differential expression (per isoform or gene)
- Functional analysis
Slide12
Normalization
Goal: remove technical artifacts, which can be due to
- Different amounts of input material
- Different degrees of degradation
- Dust, scratches etc. on the arrays
- and more
Most normalization methods assume that the overall intensity is the same for different samples (e.g. quantile normalization).
Slide13
Quantile normalization
Enforce equal distribution between the microarrays. Procedure:
- Sort the expression values for each microarray from highest to lowest
- Calculate the mean value for each rank
- For every array
  - Let the highest ranked gene have the mean value of the highest ranked genes (of all arrays)
  - Let the second highest ranked gene have the mean value of the second highest ranked genes (of all arrays)
  - And so on for all ranks
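The procedure above can be sketched in a few lines of Python. This is a minimal illustration and not the code of any particular package: the function name `quantile_normalize` and the toy values are assumptions made for the example, and ties are simply broken by original gene order.

```python
def quantile_normalize(arrays):
    """Quantile-normalize expression vectors (one list of values per array)."""
    n_genes = len(arrays[0])
    if any(len(a) != n_genes for a in arrays):
        raise ValueError("all arrays must contain the same number of genes")

    # Step 1: sort the values of each array from highest to lowest.
    sorted_arrays = [sorted(a, reverse=True) for a in arrays]

    # Step 2: mean value for each rank, taken across all arrays.
    rank_means = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]

    # Step 3: in every array, the gene at rank r receives the mean of rank r.
    normalized = []
    for a in arrays:
        # Gene indices ordered from highest to lowest value (stable for ties).
        order = sorted(range(n_genes), key=lambda i: a[i], reverse=True)
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: three arrays, four genes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for row in normalized:
    print(row)
```

After normalization every array contains exactly the same set of values; only the assignment of values to genes differs between arrays.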
Slide14
Normalization using TMM (Trimmed Mean of M-values)
Highly expressed genes have a big influence on library size.
In TMM the genes with the smallest and largest expression ratios between samples (i.e. 40% of the genes) are not used in the normalization.
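The trimming idea can be illustrated with a simplified sketch. It is not the full TMM method: the published method also trims on absolute expression level and weights the remaining genes, both of which are left out here. The function name `tmm_factor`, the toy counts, and the reading of 40% as 20% from each tail are assumptions made for this example.

```python
import math


def tmm_factor(sample, reference, trim_fraction=0.4):
    """Simplified (unweighted) TMM scaling factor of `sample` vs `reference`.

    trim_fraction is the total share of genes discarded, half from each tail
    of the sorted M-values (assumption: 0.4 means 20% low and 20% high).
    """
    n_sample, n_reference = sum(sample), sum(reference)
    # M-value: log2 ratio of library-size-scaled counts.
    # Genes with a zero count in either library are skipped.
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_reference))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no genes with non-zero counts in both libraries")
    cut = int(len(m_values) * trim_fraction / 2)
    kept = m_values[cut:len(m_values) - cut]
    # The trimmed mean of the M-values, transformed back from the log2 scale.
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: nine unchanged genes and one gene that is strongly
# up-regulated in the sample and doubles its library size.
reference = [100] * 10
sample = [100] * 9 + [1100]

factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
print(factor, effective_library_size)
```

In this toy case the single highly expressed gene inflates the sample's library size; trimming removes its ratio, so the nine unchanged genes determine the scaling factor.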
Slide15
Testing for differential expression (microarrays)
- Ordinary t-test (difference in group means divided by its standard error)
- Variance estimates can be improved by “borrowing strength” across genes, a technique called variance shrinkage. Many methods use this technique, e.g. SAM.
- Non-parametric methods (e.g. rank product)
NB! The ordinary t-test works well for large sample sizes.
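A small sketch may help show why the denominator matters for small samples. It computes the ordinary pooled-variance t statistic and, optionally, adds a constant `s0` to the standard error in the spirit of SAM. The fixed value of `s0` and the two toy genes are assumptions for illustration; SAM itself estimates this constant from the data across all genes.

```python
import math
from statistics import mean, variance


def t_statistic(group1, group2, s0=0.0):
    """Two-sample t statistic with pooled variance.

    With s0 > 0 a constant is added to the standard error, which damps
    statistics that are large only because the variance estimate is tiny.
    """
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / (std_err + s0)


# Hypothetical log-expression values, three samples per group.
gene_a = ([5.00, 5.01, 4.99], [4.90, 4.91, 4.89])  # tiny change, tiny variance
gene_b = ([8.0, 9.0, 10.0], [5.0, 6.0, 7.0])       # large change, larger variance

t_a, t_b = t_statistic(*gene_a), t_statistic(*gene_b)
d_a, d_b = t_statistic(*gene_a, s0=0.1), t_statistic(*gene_b, s0=0.1)
print(t_a, t_b)
print(d_a, d_b)
```

With the ordinary t statistic gene A ranks above gene B although its change is very small; with the added constant the ranking is reversed.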
Slide16
(RNA-seq data)
Slide17
Slide18
Transcriptomic data analysis - summary
Slide19
Microarray vs RNA-Seq
Advantages RNA-Seq
- Can handle alternative splicing
- Claimed to be more robust to degradation
- Now also cheaper
Advantages microarrays
- Claimed higher accuracy for lowly expressed genes
- Analysis tools are more mature
From: Differential analysis of gene regulation at transcript resolution with RNA-seq (Trapnell et al., Nature Biotechnology, 2013).
Slide20
Correction for multiple testing
In ordinary microarray studies (looking at all genes), use false discovery rates instead of ordinary p-values.
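The slide does not name a specific procedure; one widely used way to control the false discovery rate is the Benjamini-Hochberg adjustment, sketched below. The function name and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Gene indices sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never decrease with increasing rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(p_values)
for p, q in zip(p_values, q_values):
    print(p, q)
```

Genes whose adjusted value is below a chosen threshold (for example 0.05) are then reported, with the expectation that at most that fraction of the reported genes are false discoveries.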
Slide21
Hierarchical clustering
Genes and samples can be clustered at the same time
- Agglomerative: start with each element as its own cluster (bottom-up). Most common
- Divisive: start with all elements in one large cluster (top-down)
- Dendrogram: a cluster tree
Why cluster genes?
- Reduce complexity
- Generate hypotheses, e.g. hypothesize that a group of genes with similar expression profiles interact or are involved in the same process
Why cluster samples?
- Identify known subgroups
- Find new or more detailed subgroups
- Quality check (detect outliers)
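The agglomerative (bottom-up) approach can be sketched as follows. Euclidean distance and average linkage are choices made for this example, since the slide does not specify a distance measure or linkage rule; the function names and the toy profiles are also assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(profiles):
    """Average-linkage agglomerative clustering.

    Returns the merges in order as (left_members, right_members, distance);
    read from first to last, the merges describe the dendrogram bottom-up.
    """
    clusters = [[i] for i in range(len(profiles))]  # every element on its own
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all pairs of members.
                d = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles for four genes across three samples.
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [8.0, 7.0, 1.0],
    [7.9, 7.2, 1.1],
]
merges = agglomerative(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The same function clusters samples instead of genes if each profile holds the values of one sample across genes.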
Slide22
Functional analysis
- Over-representation analysis (ORA): finding pre-defined gene sets in which regulated genes are over-represented. The gene sets can be
  - Gene Ontology categories (molecular functions, biological processes, cellular components)
  - Pathways (signalling, metabolic)
- Map (pair-wise) molecular interactions onto the set of regulated genes using e.g.
  - Protein-protein interactions
  - Transcription factor binding information
Slide23
GO structure
- Terms are related within a hierarchy
- Describes multiple levels of detail of gene function
- Terms can have more than one parent or child
Slide24
Pathway analysis - example
Slide25
Fisher’s exact test
Background population: 500 black genes (diff. expr. genes), 5000 red genes (not diff. expr. genes)
Gene group (GO term, pathway): Gene A, Gene B, Gene C, Gene D, Gene E
P-value
Null distribution
Answer = 4.6 x 10^-4
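The one-sided p-value of Fisher's exact test for over-representation is the upper tail of a hypergeometric distribution, which can be sketched with the standard library alone. The background counts below follow the slide's text; how many of the five genes in the group are differentially expressed is not recoverable from the text, so the loop simply shows every possible value. The function and parameter names are assumptions made for this example.

```python
from math import comb


def enrichment_p_value(n_background, n_de_background, n_group, n_de_group):
    """P(X >= n_de_group) when n_group genes are drawn at random, without
    replacement, from a background containing n_de_background DE genes."""
    total = comb(n_background, n_group)
    upper = min(n_group, n_de_background)
    tail = sum(
        comb(n_de_background, k)
        * comb(n_background - n_de_background, n_group - k)
        for k in range(n_de_group, upper + 1)
    )
    return tail / total


# Background as read from the slide: 500 DE genes plus 5000 non-DE genes.
n_de_background = 500
n_background = 500 + 5000
n_group = 5  # Gene A to Gene E

# The number of DE genes within the group is a hypothetical input here.
p_by_hits = {
    hits: enrichment_p_value(n_background, n_de_background, n_group, hits)
    for hits in range(n_group + 1)
}
for hits, p in p_by_hits.items():
    print(hits, p)
```

The more differentially expressed genes the group contains, the smaller the p-value, i.e. the less likely such a group would arise by drawing genes at random from the background.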
Slide26
Network construction based on microarray data
- Network construction from genomic data is difficult. Many possible combinations of interactions.
- Network construction could be guided by including external information about interactions.
Examples
- Seeded Bayesian networks (Djebbari et al., 2008)
- Bioconductor package BioNet
- BioNet example