Slide1
Introduction to RNA-seq
Joel Parker, Ph.D.
Slide2
LCCC Biomedical Informatics
UNCseq: Cancer genome analysis of 1000+ UNC Hospital patients
TCGA: Processed and distributed 8K+ cancer transcriptomes (>1PB)
Cancer Survivorship Cohort: Recruit, track, and follow up 4k+ patients
Genome and transcriptome analytics in multiple clinical trials
Management and analytics supporting 10+ clinical programs and 20+ faculty labs
Contributing authors to 150+ manuscripts
Slide3
Introduction to RNA-seq
Advantages and challenges of RNA-seq
Experimental Design
Raw data
Mapping
Quantification
Differential Expression
Slide4
Why mRNAseq?
There are at least four compelling reasons for choosing mRNA-seq instead of microarray-based technologies
Specificity of what is being measured
Reduced technical (batch) bias
Increased dynamic range and log ratio (FC) estimates
More sensitive detection of genes, transcripts, and differential expression
Other reasons
Detection of expressed SNVs
Detection of fusions and other structural variations
No transcriptome definition is needed
No probes need to be designed or manufactured
Cost (will soon be equivalent on a per assay basis with microarray)
Slide5
Slide6
Why mRNAseq? – Reduced Bias
Cell types separate biologically
[Figure: samples labeled by cell type: CD19, CD8, CD14, CD4]
Slide7
Why mRNAseq? – Reduced Processing Bias
Client’s miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times over several months, with no apparent bias in the top principal components
[Figure: samples labeled by machine: GAIIx, HS-01, HS-02, HS-IL]
Slide8
Slide9
Slide10
Library preparation
Others: Stranded, Exome or target enrichment
Blood, MT, etc.
Slide11
Slide12
Slide13
Sequencing parameters
Read Length
Trapnell et al., Nature Biotechnology 31, 46–53 (2013)
Slide14
Detection is Dependent on Depth
Detection in this case is defined as at least 10 fragments per million (FPM) assigned to the gene or isoform in at least 20% of samples.
As the number of clusters increases, so does the number of detected genes (left) or isoforms (right), but the gain is small beyond 10M clusters: 50M is five times 10M, yet it yields only about a 15% increase in detection level.
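The detection rule above can be written down directly. The following is a minimal sketch, assuming per-sample fragment counts held in plain dictionaries; the function names, gene names, and counts are invented for illustration and are not taken from the analysis shown.

```python
def fragments_per_million(counts):
    """Scale one sample's per-gene fragment counts to fragments per million (FPM)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def detected_genes(samples, min_fpm=10.0, min_fraction=0.20):
    """Return the genes with FPM >= min_fpm in at least min_fraction of samples."""
    fpm_per_sample = [fragments_per_million(s) for s in samples]
    genes = set()
    for sample in samples:
        genes.update(sample)
    detected = set()
    for gene in genes:
        hits = sum(1 for fpm in fpm_per_sample if fpm.get(gene, 0.0) >= min_fpm)
        if hits / len(samples) >= min_fraction:
            detected.add(gene)
    return detected


# Made-up counts for five samples, each with one million fragments in total.
samples = [{"GENE_A": 999_980, "GENE_B": 15, "GENE_C": 5}]
samples += [{"GENE_A": 999_990, "GENE_B": 5, "GENE_C": 5} for _ in range(4)]

detected = detected_genes(samples)
```

In this toy input GENE_B reaches 10 FPM in exactly one of five samples (20%), so it counts as detected, while GENE_C never reaches the threshold.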
[Figure panels: Genes (left), Isoforms (right)]
Slide15
Liu et al., Bioinformatics (2014) 30 (3): 301-304.
Slide16
Computational Processing
Technical variation (batch effects) from library preparation and sequencing is small, and the sequencing strategy, especially depth, directs the level of repeatability and detection
The raw results of sequencing require significant computational processing
Alignment: Maximizing unambiguous alignments; Alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat
Abundance estimation: Gene or transcript; Handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, IsoEM, IsoInfer, Rseq, . . .
Normalization of read counts: Minimizing bias due to variation in number of clusters available; Ex: Total count (RPM), Upper quartile, quantile, density
Different algorithmic and computational strategies, especially the transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp.
Slide17
Slide18
Alignment
Reference-based spliced aligners: TopHat, MapSplice, STAR
De novo assemblers: Trinity, Trans-ABySS
Slide19
[Figure: a read r placed against exons e1, e2, e3. Panels: incorrect mapping (non-gapped alignment) and correct mapping (spliced alignment). Legend: Read, Exon, Intron, Gene, Processed pseudogene]
(1) Read r may be incorrectly mapped to the intron between exons e1 and e2.
(2) Here, the read shown in red, which spans a splice junction, can be aligned end-to-end to a processed pseudogene.
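Case (1) can be illustrated with a toy example. This is a sketch under simplifying assumptions, not how any of the aligners named in this deck work: the sequences are invented, the spliced search only accepts exact matches of two anchors, and all function names are hypothetical.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped(read, genome):
    """Best end-to-end placement of the read with no gap: (position, mismatches)."""
    placements = range(len(genome) - len(read) + 1)
    scores = [(mismatches(read, genome[p:p + len(read)]), p) for p in placements]
    n_mismatch, position = min(scores)
    return position, n_mismatch


def first_spliced(read, genome, min_anchor=3):
    """First exact two-part placement found: (left_start, split, right_start), or None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            # The right anchor must start at least one base after the left one ends.
            right_start = genome.find(right, left_start + split + 1)
            if right_start != -1:
                return left_start, split, right_start
            left_start = genome.find(left, left_start + 1)
    return None


# Toy sequences, invented for illustration.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGCCTTTAG", "CCATGGAT"
genome = exon1 + intron + exon2
read = exon1[-5:] + exon2[:5]  # spans the e1/e2 junction

ungapped_position, ungapped_mismatches = best_ungapped(read, genome)
spliced = first_spliced(read, genome)
```

The non-gapped placement cannot match the read without mismatches, because the read continues into e2 while the genome continues into the intron. The spliced placement skips exactly the intron.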
Slide20
Slide21
1) Transcriptome mapping, which is used only when annotation is provided
2) Genome mapping
3) Split-read alignment of the reads left unmapped in step 2: novel splice sites are differentiated from indels and fusions using known junction signals (GT-AG, GC-AG, and AT-AC) supported by ‘islands’ and spliced alignments
4) Remapping of unaligned reads and previously poor mappings
5) Statistical assessment to assign the most likely alignment of multi-mappers
Slide22
Slide23
Slide24
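The known junction signals named in step 3 of the mapping procedure (GT-AG, GC-AG, and AT-AC) can be checked for a candidate intron with a few lines of code. This is a minimal sketch, assuming forward-strand coordinates and an invented toy sequence; reverse-strand introns, which would show the reverse-complemented signals, are not handled, and the function names are hypothetical.

```python
# Donor/acceptor dinucleotide pairs listed in step 3.
KNOWN_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def junction_signal(genome, intron_start, intron_end):
    """Return the (donor, acceptor) dinucleotides of a forward-strand intron.

    intron_start is the 0-based index of the first intronic base and
    intron_end is the index just past the last intronic base.
    """
    donor = genome[intron_start:intron_start + 2].upper()
    acceptor = genome[intron_end - 2:intron_end].upper()
    return donor, acceptor


def has_known_signal(genome, intron_start, intron_end):
    """True if the candidate intron starts and ends with a known signal pair."""
    return junction_signal(genome, intron_start, intron_end) in KNOWN_SIGNALS


# Toy sequence: 6-base exon, 12-base intron, 6-base exon.
genome = "ACGTAC" + "GTAAGTTTTCAG" + "GGCTTA"

true_gap = has_known_signal(genome, 6, 18)      # the real intron
shifted_gap = has_known_signal(genome, 7, 19)   # same length, shifted by one base
```

A gap of the same length shifted by a single base no longer carries a known signal, which is one way a candidate junction can be told apart from an indel-like gap.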
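Example: Concordant
[Figure: alignment view; labels: Gene, V2, V1]
http://www.broadinstitute.org/igv/
Slide25
Example: Discordant 1
[Figure: alignment view; labels: Gene, V2, V1]
Slide26
Example: Discordant 2
[Figure: alignment view; labels: Gene, V2, V1]
Slide27
Slide28
Slide29
Slide30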
Alignment Comparison
Engström et al., Nature Methods 10, 1185-1191 (2013)
Slide31
Alignment Comparison: Splice Junction Accuracy
Engström et al., Nature Methods 10, 1185-1191 (2013)
Slide32
Many RNAseq Aligners
Shown is the percentage of sequenced or simulated read pairs (fragments) mapped by each protocol. Protocols are grouped by the underlying alignment program (gray shading). Protocol names contain the suffix “ann” if annotation was used. The suffix “cons” distinguishes more conservative protocols from others based on the same aligner. The K562 data set comprises six samples, and the metrics presented here were averaged over them.
Engström et al., Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods 2013
Slide33
Slide34
Multireads: Reads Mapping to Multiple Genes/Transcripts
[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short, with Unique reads and Multireads marked; numeric labels: 200, 100, 50; 350, 300, 200, 150; N]
Relative abundance for these genes: f_1, f_2, f_3
Slide35
Approach 1: Ignore Multireads
[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short; numeric labels: 200, 100, 50; 350, 300, 200, 150; N]
Relative abundance for these genes: f_1, f_2, f_3
Nagalakshmi et al., Science 2008
Marioni et al., Genome Research 2008
Slide36
Approach 1: Ignore Multireads
[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short; numeric labels: 200, 100, 50; 350, 300, 200, 150; N]
Over-estimates the abundance of genes with unique reads
Under-estimates the abundance of genes with multireads
Not an option at all if interested in isoform expression
Slide37
Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques
[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short; numeric labels: 200, 100, 50; 350, 300, 200, 150]
Relative abundance for these genes: f_1, f_2, f_3
Ali Mortazavi et al., Nature Methods 2008
Sailfish, RSEM, Cufflinks
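The difference between the two approaches can be sketched in a few lines. This is an illustration only: the gene names, lengths, and read counts are invented (they are not the values from the figure), and weighting by unique reads per kilobase is an assumption made here for simplicity rather than the exact rule of any cited tool.

```python
def unique_density(unique_counts, lengths_kb):
    """Unique reads per kilobase for each gene."""
    return {g: unique_counts[g] / lengths_kb[g] for g in unique_counts}


def allocate_multireads(unique_counts, lengths_kb, multireads):
    """Add each multiread group to its candidate genes, weighted by unique-read density."""
    density = unique_density(unique_counts, lengths_kb)
    totals = {g: float(n) for g, n in unique_counts.items()}
    for genes, n_reads in multireads:
        weight_sum = sum(density[g] for g in genes)
        for g in genes:
            # Fall back to an even split when no candidate has unique reads.
            share = density[g] / weight_sum if weight_sum > 0 else 1.0 / len(genes)
            totals[g] += n_reads * share
    return totals


# Invented example: 80 reads map equally well to gene1 and gene2.
unique_counts = {"gene1": 200, "gene2": 300, "gene3": 50}
lengths_kb = {"gene1": 2.0, "gene2": 1.0, "gene3": 0.5}
multireads = [(("gene1", "gene2"), 80)]

ignored = {g: float(n) for g, n in unique_counts.items()}             # Approach 1
rescued = allocate_multireads(unique_counts, lengths_kb, multireads)  # Approach 2
```

Under Approach 1 the 80 shared reads are simply lost, so gene1 and gene2 are under-counted relative to gene3. Under Approach 2 they are split 20/60 because gene2 has three times the unique-read density of gene1.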
Slide38
Multireads: Reads Mapping to Multiple Genes/Transcripts
[Figure labels: Long, Medium, N]
Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. PubMed PMID: 21155027.
Slide39
Cufflinks
PMID: 20436464
Slide40
RSEM
Li and Dewey, 2011
PMID: 21816040
A) PE isoform; B) PE gene; C) SE isoform; D) SE gene
θ_i represents the probability that a fragment is derived from transcript i
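To make the role of θ_i concrete, the following is a heavily simplified EM sketch. It assumes every transcript has the same effective length and that each fragment is described only by the set of transcripts it is compatible with; RSEM's actual model includes further terms, so this should be read as an illustration of the idea, not as RSEM. All names and counts are invented.

```python
def em_theta(compatibility, n_transcripts, n_iter=200):
    """Estimate theta, where theta[i] is the probability that a fragment
    is derived from transcript i.

    compatibility holds one tuple of transcript indices per fragment.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each fragment among its transcripts in proportion to theta.
        for transcripts in compatibility:
            denom = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += theta[t] / denom
        # M-step: theta is the expected share of fragments per transcript.
        theta = [count / len(compatibility) for count in expected]
    return theta


# Invented data: 60 fragments unique to transcript 0, 20 unique to
# transcript 1, 20 compatible with both, none for transcript 2.
compatibility = [(0,)] * 60 + [(1,)] * 20 + [(0, 1)] * 20

theta = em_theta(compatibility, n_transcripts=3)
```

With these counts the ambiguous fragments end up split 3:1 in favor of transcript 0, mirroring the ratio of the unique evidence, and transcript 2 receives no probability.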
Slide41
Current Methods
RSEM – demonstrated high accuracy as part of MAQC experiment, but run time scales exponentially with read count and transcript definitions
eXpress – EM similar to RSEM, but processes reads in a streaming fashion
Sailfish – EM similar to RSEM, but replaces approximate alignment of reads with exact alignment of k-mers
Kallisto – similar to Sailfish, but with further reduced alignment complexity (very fast!)
Initial evidence was that eXpress / Sailfish / Kallisto demonstrated a drop in accuracy relative to RSEM.
Then . . .
Salmon – combines alignment-informed k-mer matching with a combination of the Kallisto variational Bayes approach and a ‘local’ EM approach
Slide42
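The idea of replacing read alignment with exact k-mer matching can be shown with a toy index. This sketch is not the implementation of Sailfish, Kallisto, or Salmon; it only illustrates how exact k-mer lookups can narrow a read down to the transcripts it is compatible with. The isoform names, sequences, and the choice of k = 5 are invented.

```python
def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read.

    Returns an empty set if any k-mer is absent from the index or the
    read is shorter than k.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible if compatible is not None else set()


# Two invented isoforms: A contains exons 1-2-3, B skips exon 2.
exon1, exon2, exon3 = "ACGTACGGTC", "TTGACCAGTA", "GGCATTCGAA"
transcripts = {"isoform_A": exon1 + exon2 + exon3, "isoform_B": exon1 + exon3}
index = build_kmer_index(transcripts, k=5)

shared = compatible_transcripts("ACGTACGG", index, k=5)        # inside exon 1
only_a = compatible_transcripts("CGGTCTTGAC", index, k=5)      # exon 1 / exon 2 junction
only_b = compatible_transcripts("CGGTCGGCAT", index, k=5)      # exon 1 / exon 3 junction
```

The read inside the shared exon stays ambiguous, which is exactly the kind of fragment an EM or variational Bayes step then has to apportion.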
Salmon Novelties
Streaming variational Bayes (VB) inference combined with batched VB or EM
Lightweight alignment through maximal exact matches
Transcript / gene abundance inference is abstracted from the alignment step [RSEM also permits this; sam-xlate in https://github.com/mozack/ubu/wiki]
Slide43
Slide44
Slide45
Slide46
Slide47
Normalization
Slide48
Typical measures of expression
FPM – fragments per million
Statistical tools like DESeq and SAMseq do not utilize these transformations
Quartile normalization (preferred) uses arbitrary units
Slide49
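The measures listed under "Typical measures of expression" can be compared on a small example. This is a sketch under stated assumptions: the upper quartile is taken over nonzero counts, the scale of 1000 is an arbitrary choice made here (the units are arbitrary either way), and the gene names and counts are invented.

```python
import statistics


def total_count_fpm(counts):
    """Total-count normalization: fragments per million."""
    total = sum(counts.values())
    return {g: 1e6 * n / total for g, n in counts.items()}


def upper_quartile_normalize(counts, scale=1000.0):
    """Divide by the 75th percentile of nonzero counts, then rescale."""
    nonzero = [n for n in counts.values() if n > 0]
    upper_quartile = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {g: scale * n / upper_quartile for g, n in counts.items()}


# Invented counts; in the second sample one gene dominates the library.
baseline = {"g1": 10, "g2": 20, "g3": 30, "g4": 40, "g5": 50, "g6": 0}
dominated = dict(baseline, g5=5000)

fpm_baseline = total_count_fpm(baseline)
fpm_dominated = total_count_fpm(dominated)
uq_baseline = upper_quartile_normalize(baseline)
uq_dominated = upper_quartile_normalize(dominated)
```

When a single gene takes up most of the library, the FPM of every other gene drops even though its count is unchanged, whereas the upper-quartile values of those genes stay the same. That robustness is one plausible reading of why the slide marks quartile normalization as preferred.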
Repeatability & Detection by Isoform Database
Larger reference transcriptomes result in reduced repeatability (left), but increased detection (right)
Detection: 73% of RefSeq, 66% of UCSC, and 52% of Ensembl
Slide50
Differential Expression
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010 Feb 18;11:94. doi: 10.1186/1471-2105-11-94. PubMed PMID: 20167110; PubMed Central PMCID: PMC2838869
Slide51
Slide52
Summary
Consideration of what is needed versus wanted
Relative expression of genes or isoforms
SNPs / indels
Transcript discovery
Fusion detection
Determines optimal
library prep
seq length
seq quantity
SE or PE
number of replicates
processing protocol
Slide53
Slide54
Fusion Genes
High expression of ALK kinase domain
Near-zero expression of upstream ALK
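The expression pattern described here (kinase domain high, upstream exons near zero) suggests a simple screening statistic. The sketch below is an assumption-laden illustration, not a method from this deck: the per-exon coverage values are invented, the pseudocount is arbitrary, and exon 20 is used as the breakpoint only because the spanning alignment view in this deck shows ALK exon 20.

```python
def expression_imbalance(exon_coverage, breakpoint_exon, pseudocount=1.0):
    """Ratio of mean coverage from breakpoint_exon onward to mean coverage before it."""
    upstream = [c for exon, c in exon_coverage.items() if exon < breakpoint_exon]
    downstream = [c for exon, c in exon_coverage.items() if exon >= breakpoint_exon]
    if not upstream or not downstream:
        raise ValueError("need exons on both sides of the breakpoint")
    mean_up = sum(upstream) / len(upstream)
    mean_down = sum(downstream) / len(downstream)
    # The pseudocount keeps the ratio finite when upstream coverage is zero.
    return (mean_down + pseudocount) / (mean_up + pseudocount)


# Invented mean coverage per exon number for two hypothetical samples.
rearranged = {17: 0, 18: 1, 19: 0, 20: 120, 21: 150, 22: 135}
intact = {17: 90, 18: 110, 19: 100, 20: 95, 21: 105, 22: 100}

ratio_rearranged = expression_imbalance(rearranged, breakpoint_exon=20)
ratio_intact = expression_imbalance(intact, breakpoint_exon=20)
```

A ratio far above 1 flags a sample for closer inspection; it does not by itself establish a fusion, which still needs read-level support.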
[Figure label: Gene orientation]
Slide55
Fusions
[Figure: Gene 1 and Gene 2 joined into a Gene1-Gene2 fusion, with Spanning Reads and Bridging Reads marked]
Slide56
Spanning Alignment View
[Figure: fusion spanning reads aligned across EML4 exon 13 and ALK exon 20]
Slide57
Fusion Genes
TCGA, LUAD, submitted
Slide58
Virus Detection
[Figure: HPV16 in HNSCC; samples #1, #2, #3, #4; viral genes labeled E7, E2, L1, E6, E1, E4, L2, E5]
Slide59
Detecting tumor mutations by integration of DNA and RNA sequencing via UNCeqR
Matthew Wilkerson, UNC Chapel Hill
Weighted combination of DNA whole-exome sequencing and RNA sequencing
RNAseq can help increase somatic mutation detection, particularly in low purity tumors
http://lbg.med.unc.edu/~mwilkers/unceqr
Slide60
UNCeqR
PIK3CA H1047R
TCGA-AR-A252
Luminal A subtype
http://lbg.med.unc.edu/~mwilkers/unceqr
Slide61
RNA often provides a greater mutation signal than DNA
Slide62
RNA integration boosts performance in tumors that have low purity
Slide63
Novel mutations relative to published profiles