Introduction to RNA-seq

Uploaded On 2018-10-09


Presentation Transcript

Slide1

Introduction to RNA-seq

Joel Parker, Ph.D.

Slide2

LCCC Biomedical Informatics

UNCseq: Cancer genome analysis of 1000+ UNC Hospital patients

TCGA: Processed and distributed 8K+ cancer transcriptomes (>1PB)

Cancer Survivorship Cohort: Recruit, track, and follow up 4k+ patients

Genome and transcriptome analytics in multiple clinical trials

Management and analytics supporting 10+ clinical programs and 20+ faculty labs

Contributing authors to 150+ manuscripts

Slide3

Introduction to RNA-seq

Advantages and challenges of RNA-seq

Experimental Design

Raw data

Mapping

Quantification

Differential Expression

Slide4

Why mRNAseq?

There are at least four compelling reasons for choosing mRNA-seq instead of microarray-based technologies

Specificity of what is being measured

Reduced technical (batch) bias

Increased dynamic range and log ratio (FC) estimates

More sensitive detection of genes, transcripts, and differential expression

Other reasons

Detection of expressed SNVs

Detection of fusions and other structural variations

No transcriptome definition is needed

No probes need to be designed or manufactured

Cost (will soon be equivalent on a per assay basis with microarray)

Slide5
Slide6

Why mRNAseq? – Reduced Bias

Cell types separate biologically

[Figure labels: CD19, CD8, CD14, CD4]

Slide7

Why mRNAseq? – Reduced Processing Bias

A client’s miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times over several months, with no apparent bias in the top principal components

[Figure labels: GAIIx, HS-01, HS-02, HS-IL]

Slide8
Slide9
Slide10

Others: Stranded, Exome or target enrichment

Blood, MT, etc.

Library preparation

Slide11
Slide12
Slide13

Sequencing parameters

Trapnell et al., Nature Biotechnology 31, 46–53 (2013)

Read Length

Slide14

Detection is Dependent on Depth

Detection in this case is defined as at least 10 fragments per million (FPM) assigned to the gene or isoform in at least 20% of samples.

As the number of clusters increases, so does the number of detected genes (left) or isoforms (right), but the gain is small beyond 10M. For example, 50M is five times 10M but yields only about a 15% increase in detection level.
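The detection rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy counts are hypothetical, and only the two thresholds (10 FPM, 20% of samples) come from the slide.

```python
def fragments_per_million(counts):
    """Scale one sample's raw fragment counts to fragments per million (FPM)."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: 1e6 * n / total for feature, n in counts.items()}


def detected_features(samples, min_fpm=10.0, min_fraction=0.20):
    """Return the features with >= min_fpm in at least min_fraction of samples."""
    fpm_per_sample = [fragments_per_million(s) for s in samples]
    features = set().union(*(s.keys() for s in samples))
    needed = min_fraction * len(samples)
    detected = set()
    for feature in features:
        hits = sum(1 for fpm in fpm_per_sample if fpm.get(feature, 0.0) >= min_fpm)
        if hits >= needed:
            detected.add(feature)
    return detected


# Hypothetical toy data: five samples of one million fragments each,
# so the raw counts equal the FPM values.
samples = [{"A": 999_985, "B": 15, "C": 0}] + [{"A": 999_995, "B": 2, "C": 3}] * 4
detected = detected_features(samples)
```

In this toy input, feature B reaches 10 FPM in exactly one of five samples (20%), so it counts as detected, while C never reaches the threshold.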

[Figure panels: Genes, Isoforms]

Slide15

Liu et al., Bioinformatics (2014) 30 (3): 301-304.

Slide16

Computational Processing

Technical variation (batch effects) from library preparation and sequencing is small, and the sequencing strategy, especially depth, directs the level of repeatability and detection

The raw results of sequencing require significant computational processing

Alignment: Maximizing unambiguous alignments; Alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat

Abundance estimation: Gene or transcript; Handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, IsoEM, IsoInfer, Rseq, . . .

Normalization of read counts: Minimizing bias due to variation in number of clusters available; Ex: Total count (RPM), Upper quartile, quantile, density

Different algorithmic and computational strategies, especially the transcriptome definition, impact performance much more than SE vs. PE, 50 bp vs. 100 bp.

Slide17
Slide18

Alignment

TopHat, MapSplice, STAR

Trinity, Trans-ABySS

Slide19

[Figure: a read r aligned against exons e1, e2, e3 and the intervening introns, contrasting incorrect mapping (non-gapped alignment) with correct mapping (spliced alignment); a second panel shows a read mapped to a gene and to its processed pseudogene.]

(1) Read r may be incorrectly mapped to the intron between exons e1 and e2.

(2) Here, the read shown in red, which spans a splice junction, can be aligned end-to-end to a processed pseudogene.

Slide20
Slide21

1) Transcriptome mapping, which is used only when annotation is provided

2) Genome mapping

3) Split-read alignment of the reads left unmapped by step 2: novel splice sites are differentiated from indels and fusions using known junction signals (GT-AG, GC-AG, and AT-AC) supported by ‘islands’ and spliced alignments

4) Remapping of unaligned and previously poor mappings

5) Statistical assessment to assign the most likely alignment of multi-mappers

Slide22
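As an illustration of the junction-signal check named in step 3 above, the following Python sketch tests whether a candidate intron is flanked by one of the three signals listed there. It is not the aligner's implementation: the function names, the 0-based end-exclusive coordinates, and the toy sequence are all assumptions made for this example.

```python
# The three junction signals named on the slide, as (donor, acceptor) pairs.
CANONICAL_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_junction(genome, intron_start, intron_end):
    """Return (signal, strand) if the intron genome[intron_start:intron_end]
    carries a known junction signal on either strand, otherwise None."""
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        return None
    for strand, seq in (("+", intron), ("-", reverse_complement(intron))):
        donor, acceptor = seq[:2], seq[-2:]
        if (donor, acceptor) in CANONICAL_SIGNALS:
            return f"{donor}-{acceptor}", strand
    return None


# Hypothetical toy sequence: exon, intron (GT...AG), exon.
genome = "ACGTACG" + "GTAAGTTTTTCAG" + "GATCCA"
plus_call = classify_junction(genome, 7, 20)    # exact intron boundaries
shifted_call = classify_junction(genome, 8, 20)  # boundary off by one base
```

A gap in a read alignment that fails this check is a weaker splice-site candidate and, following the slide, would more plausibly be explained as an indel or fusion.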
Slide23
Slide24

Example: Concordant

[Figure: tracks labeled Gene, V2, V1]

http://www.broadinstitute.org/igv/

Slide25

Example: Discordant 1

[Figure: tracks labeled Gene, V2, V1]

Slide26

Example: Discordant 2

[Figure: tracks labeled Gene, V2, V1]

Slide27
Slide28
Slide29
Slide30

Alignment Comparison

Engström et al., Nature Methods 10, 1185-1191 (2013)

Slide31

Alignment Comparison

Engström et al., Nature Methods 10, 1185-1191 (2013)

Splice Junction Accuracy

Slide32

Many RNAseq Aligners

Shown is the percentage of sequenced or simulated read pairs (fragments) mapped by each protocol. Protocols are grouped by the underlying alignment program (gray shading). Protocol names contain the suffix “ann” if annotation was used. The suffix “cons” distinguishes more conservative protocols from others based on the same aligner. The K562 data set comprises six samples, and the metrics presented here were averaged over them.

Systematic evaluation of spliced alignment programs for RNA-seq data. Engström et al., Nature Methods 2013

Slide33

Computational Processing

Slide34

Multireads: Reads Mapping to Multiple Genes/Transcripts

[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short, with unique reads and multireads; relative abundance for these genes: f1, f2, f3.]

Slide35

Approach 1: Ignore Multireads

[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short; relative abundance for these genes: f1, f2, f3.]

Nagalakshmi et al., Science 2008

Marioni et al., Genome Research 2008

Slide36

Approach 1: Ignore Multireads

[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short.]

Over-estimates the abundance of genes with unique reads

Under-estimates the abundance of genes with multireads

Not an option at all if interested in isoform expression
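To make the bias concrete, here is a small Python sketch of Approach 1. The gene names, lengths, and read counts are hypothetical and are not read off the slide's figure; the function name is also an assumption.

```python
def unique_only_abundance(reads, lengths):
    """Estimate relative abundance from uniquely mapped reads only.

    reads   : list of (candidate_genes, n_reads); candidate_genes is a tuple
    lengths : dict mapping gene -> length in bases
    Returns (relative abundance per gene, number of discarded multireads).
    """
    unique = {gene: 0 for gene in lengths}
    discarded = 0
    for genes, n in reads:
        if len(genes) == 1:
            unique[genes[0]] += n
        else:
            discarded += n  # multireads are thrown away entirely
    # Reads per base, then rescaled so the abundances sum to 1.
    density = {gene: unique[gene] / lengths[gene] for gene in lengths}
    total = sum(density.values())
    abundance = {g: (d / total if total else 0.0) for g, d in density.items()}
    return abundance, discarded


# Hypothetical toy input: g2 and g3 share 200 reads that map to both.
lengths = {"g1": 200, "g2": 100, "g3": 50}
reads = [(("g1",), 300), (("g2",), 100), (("g3",), 50), (("g2", "g3"), 200)]
abundance, discarded = unique_only_abundance(reads, lengths)
```

The 200 shared reads never count toward g2 or g3, so both are under-estimated relative to g1, which has no multireads. With isoforms of one gene, most reads are shared, which is why the slide rules this approach out for isoform expression.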

Slide37

Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques

[Figure: three genes (1, 2, 3) labeled Long, Medium, and Short; relative abundance for these genes: f1, f2, f3.]

Ali Mortazavi et al., Nature Methods 2008

Sailfish, RSEM, Cufflinks
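A minimal Python sketch of Approach 2 follows. It assumes that the "estimate from uniques" is the unique-read density (unique reads per base) and that multireads are split in proportion to it; the cited tools use more elaborate models, and the function name and toy numbers are hypothetical.

```python
def rescue_multireads(unique_counts, lengths, multiread_groups):
    """Allocate multireads in proportion to each gene's unique-read density.

    unique_counts    : dict gene -> number of uniquely mapped reads
    lengths          : dict gene -> length in bases
    multiread_groups : dict (gene, gene, ...) -> number of reads shared by them
    Returns a dict gene -> fractional read count after allocation.
    """
    density = {g: unique_counts[g] / lengths[g] for g in unique_counts}
    counts = {g: float(n) for g, n in unique_counts.items()}
    for genes, n in multiread_groups.items():
        total = sum(density[g] for g in genes)
        for g in genes:
            # Fall back to an even split if no gene in the group has unique reads.
            share = density[g] / total if total > 0 else 1.0 / len(genes)
            counts[g] += n * share
    return counts


# Hypothetical toy input: 200 reads map equally well to g2 and g3.
unique_counts = {"g1": 300, "g2": 100, "g3": 50}
lengths = {"g1": 200, "g2": 100, "g3": 50}
rescued = rescue_multireads(unique_counts, lengths, {("g2", "g3"): 200})
```

Here g2 and g3 have the same unique-read density (one read per base), so each receives half of the shared reads. This is a single pass; iterating the allocation until the estimates stop changing leads to the EM-style methods discussed on later slides.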

Slide38

Multireads: Reads Mapping to Multiple Genes/Transcripts

[Figure labels: Long, Medium]

Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. PubMed PMID: 21155027.

Slide39

Cufflinks

PMID: 20436464

Slide40

RSEM

Li and Dewey, 2011

PMID: 21816040

A) PE isoform; B) PE gene; C) SE isoform; D) SE gene

θ_i represents the probability that a fragment is derived from transcript i
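The role of θ_i can be illustrated with a stripped-down expectation-maximization loop in Python. This is a sketch under strong simplifying assumptions, not the RSEM implementation: a fragment is taken to be equally likely to start anywhere on a compatible transcript, and the other terms of the RSEM model are omitted. The function name, input layout, and toy counts are hypothetical.

```python
def em_theta(compat_counts, lengths, n_iter=200, tol=1e-10):
    """Estimate theta_i, the probability that a fragment derives from transcript i.

    compat_counts : dict (transcript, ...) -> number of fragments compatible
                    with exactly that set of transcripts
    lengths       : dict transcript -> (effective) length
    """
    transcripts = sorted(lengths)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each fragment group among its compatible transcripts
        # in proportion to theta_i / length_i.
        for candidates, n in compat_counts.items():
            weights = {t: theta[t] / lengths[t] for t in candidates}
            z = sum(weights.values())
            for t, w in weights.items():
                expected[t] += n * w / z
        # M-step: theta_i is the expected share of fragments from transcript i.
        total = sum(expected.values())
        new_theta = {t: expected[t] / total for t in transcripts}
        delta = max(abs(new_theta[t] - theta[t]) for t in transcripts)
        theta = new_theta
        if delta < tol:
            break
    return theta


# Hypothetical toy input: two equal-length isoforms sharing 20 of 100 fragments.
theta = em_theta({("t1",): 60, ("t2",): 20, ("t1", "t2"): 20},
                 {"t1": 100, "t2": 100})
```

With equal lengths, the update reduces to θ_t1 ← 0.6 + 0.2·θ_t1, whose fixed point is 0.75, so the shared fragments end up split 3:1 in favor of the isoform with more unique evidence.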

 Slide41

Current Methods

RSEM – demonstrated high accuracy as part of MAQC experiment, but run time scales exponentially with read count and transcript definitions

eXpress – EM similar to RSEM, but processes reads in a streaming fashion

Sailfish – EM similar to RSEM, but replaces approximate alignment of reads with exact alignment of k-mers

Kallisto – similar to Sailfish, but further reduced alignment complexity (very fast!)

Initial evidence was that eXpress / Sailfish / Kallisto demonstrated a drop in accuracy relative to RSEM.

Then . . .

Salmon – uses a combination of alignment-informed k-mer matching and a combination of the Kallisto variational Bayes approach with a ‘local’ EM approach

Slide42

Salmon Novelties

Streaming variational Bayes (VB) inference combined with batched VB or EM

Lightweight alignment through maximal exact matches

Transcript / gene abundance inference is abstracted from the alignment step [RSEM also permits this; sam-xlate in https://github.com/mozack/ubu/wiki]

Slide43
Slide44
Slide45
Slide46

Computational Processing

Slide47

Normalization

Slide48

FPM – fragments per million

Statistical tools like DESeq and SAMseq do not utilize these transformations

Quartile normalization (preferred) uses arbitrary units
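A short Python sketch of quartile normalization, assuming the slide means the upper-quartile scaling listed earlier among the normalization options: each sample is rescaled so that the 75th percentile of its nonzero counts equals a fixed target. The target of 1000, the function name, and the toy counts are assumptions for illustration; because the target is arbitrary, so are the resulting units.

```python
from statistics import quantiles


def upper_quartile_normalize(counts, target=1000.0):
    """Rescale one sample so the upper quartile of its nonzero counts equals target.

    counts : dict feature -> raw count for a single sample
    target : arbitrary value assigned to the upper quartile (an assumption here)
    """
    nonzero = [c for c in counts.values() if c > 0]
    if len(nonzero) < 2:
        raise ValueError("need at least two expressed features")
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    q3 = quantiles(nonzero, n=4, method="inclusive")[2]
    scale = target / q3
    return {feature: c * scale for feature, c in counts.items()}


# Hypothetical toy samples: the second is the first sequenced twice as deeply.
shallow = {"g1": 10, "g2": 20, "g3": 30, "g4": 40, "g5": 50, "g6": 0}
deep = {feature: 2 * c for feature, c in shallow.items()}
norm_shallow = upper_quartile_normalize(shallow)
norm_deep = upper_quartile_normalize(deep)
```

The two toy samples differ only in depth, so they normalize to identical values. Unlike total-count scaling such as FPM, the scale factor here does not depend on the few most highly expressed features.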

Typical measures of expression

Slide49

Repeatability & Detection by Isoform Database

Larger reference transcriptomes result in reduced repeatability (left), but increased detection (right)

Detection - 73% of RefSeq, 66% of UCSC, and 52% of Ensembl

Slide50

Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010 Feb 18;11:94. doi: 10.1186/1471-2105-11-94. PubMed PMID: 20167110; PubMed Central PMCID: PMC2838869

Differential Expression

Slide51
Slide52

Summary

Consideration of what is needed versus wanted

Relative expression of genes or isoforms

SNPs / indels

Transcript discovery

Fusion detection

Determines optimal

library prep

seq length

seq quantity

SE or PE

number of replicates

processing protocol

Slide53

Why mRNAseq?

Slide54

Fusion Genes

High expression of ALK kinase domain

~Zero expression of upstream ALK

Gene orientation

Slide55

Fusions

Gene 1

Gene 2

Gene1-Gene2 Fusion

Spanning Reads

Bridging Reads

Slide56

Spanning Alignment View

EML4 exon 13 ALK exon 20

Fusion spanning reads

Slide57

Fusion Genes

TCGA, LUAD, submitted

Slide58

Virus Detection

HNSCC

HPV16

[Figure: samples #1 to #4 across HPV16 genes E7, E2, L1, E6, E1, E4, L2, E5]

Slide59

Detecting tumor mutations by integration of DNA and RNA sequencing via UNCeqR

Matthew Wilkerson, UNC Chapel Hill

Weighted combination of DNA whole-exome sequencing and RNA sequencing

RNAseq can help increase somatic mutation detection, particularly in low-purity tumors

http://lbg.med.unc.edu/~mwilkers/unceqr

Slide60

UNCeqR

PIK3CA H1047R

TCGA-AR-A252

Luminal A subtype

http://lbg.med.unc.edu/~mwilkers/unceqr

Slide61

RNA often provides a greater mutation signal than DNA

Slide62

RNA integration boosts performance in tumors that have low purity

Slide63

Novel mutations relative to published profiles