Uploaded On 2016-08-01

Slide1

RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Adrian Caciula
Department of Computer Science, Georgia State University

Joint work with Serghei Mangul (GSU), Ion Mandoiu (UCONN), Alex Zelikovsky (GSU)

CAME 2011, Atlanta, GA

Slide2

Genome-Guided RNA-Seq Protocol

[Figure: protocol pipeline. Exon strings shown: A B C D E; A B C; A C D E.]

- Make cDNA & shatter into fragments
- Sequence fragment ends
- Map reads
- Analyses: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)

Slide3

Using Partial Annotation

- Existing tools for the genome-guided RNA-Seq protocol
  - Cufflinks [Trapnell et al. 2010] reports a minimal set of transcripts
  - Scripture [Guttman et al. 2010] reports all possible transcripts
- Besides genomes, partial annotation exists for many species
  - Almost all human genes/exons and the majority of transcripts are already known (genome libraries - UCSC)
- How can one incorporate it into the genome-guided RNA-Seq protocol?
  - Cufflinks RABT [Roberts et al. 2011]: added fake reads uniformly distributed among transcripts
- This talk:
  - Subtract reads (exon counts) explained by known transcripts
  - Use Virtual String EM (VSEM) [Mangul et al. 2011]
  - Reconstruct transcripts containing unexplained exons

Slide4

Outline

- EM for Isoform Expression Estimation
- Virtual Transcript EM Algorithm
- DRUT: Detection and Reconstruction of Unannotated Transcripts
- Experimental Results
- Conclusions

Slide5

Max Likelihood Model

Input data of EM is a panel: a bipartite graph.

- LEFT: transcripts (T1, T2, T3), unknown frequencies
- RIGHT: reads (R1, R2, R3, R4), observed frequencies
- EDGES: weight ~ probability of the read to be emitted by the transcript
  - Weights are calculated based on the mapping of the reads to the transcripts

Given: annotations (transcripts) and frequencies of the reads

Find: ML estimate of transcript frequencies

Slide6

Generic EM algorithm

- Initialization: uniform transcript frequencies f(j)
- E-step: compute the expected number n(j) of reads sampled from transcript j, assuming current transcript frequencies f(j)
- M-step: for each transcript j, set f(j) = portion of reads emitted by transcript j among all reads in the sample
  - ML estimates for f(j) = n(j)/(n(1) + . . . + n(N))

Slide7
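The generic EM algorithm (uniform initialization, E-step, M-step) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `em_transcript_frequencies`, the dense `weights[j][r]` layout, the stopping tolerance, and the toy panel are all assumptions introduced here.

```python
def em_transcript_frequencies(weights, read_counts, max_iter=1000, tol=1e-12):
    """Estimate transcript frequencies f(j) by EM (illustrative sketch).

    weights[j][r] is the edge weight between transcript j and read r
    (0.0 when there is no edge); read_counts[r] is the observed count of read r.
    """
    n_transcripts = len(weights)
    f = [1.0 / n_transcripts] * n_transcripts  # initialization: uniform f(j)
    for _ in range(max_iter):
        # E-step: expected number n(j) of reads sampled from transcript j
        n = [0.0] * n_transcripts
        for r, count in enumerate(read_counts):
            denom = sum(f[j] * weights[j][r] for j in range(n_transcripts))
            if denom == 0.0:
                continue  # read r is not explained by any transcript
            for j in range(n_transcripts):
                n[j] += count * f[j] * weights[j][r] / denom
        total = sum(n)
        if total == 0.0:
            raise ValueError("no read is compatible with any transcript")
        # M-step: f(j) = n(j) / (n(1) + ... + n(N))
        new_f = [x / total for x in n]
        change = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if change < tol:
            break
    return f


# Hypothetical toy panel: R1 maps only to T1, R2 only to T2, R3 to both.
weights = [
    [1.0, 0.0, 1.0],  # T1
    [0.0, 1.0, 1.0],  # T2
]
read_counts = [30, 10, 20]
freqs = em_transcript_frequencies(weights, read_counts)
```

In this toy panel the shared read R3 carries no information about the split, so the estimate is driven by the uniquely mapped reads R1 and R2.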

ML Model Quality

How well does the ML model explain the reads?

- Deviation between expected (e_j) and observed (o_j) read frequency: [equation not preserved in transcript]
- Expected read frequency: [equation not preserved in transcript]
- If the deviation is high, we expect that some transcripts are missing from the panel.

Notation:

- |R|: number of reads
- h_{s_i,j}: weighted match based on mapping read r_j
- maximum-likelihood frequency of transcript t_i (symbol not preserved in transcript)

Slide8
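The quality check from the "ML Model Quality" slide can be illustrated as follows. The formulas themselves did not survive in the transcript, so this sketch rests on two assumptions that are not taken from the slide: the expected frequency of read j is the sum over transcripts of the ML frequency times the row-normalized weight h[i][j], and the deviation is the mean absolute difference between observed and expected frequencies. Function names and toy values are hypothetical.

```python
def expected_read_frequencies(h, f):
    """ASSUMED form: e_j = sum_i f[i] * h[i][j] / sum_l h[i][l]."""
    n_reads = len(h[0])
    e = [0.0] * n_reads
    for i, row in enumerate(h):
        row_sum = sum(row)
        if row_sum == 0.0:
            continue  # transcript i matches no read
        for j, h_ij in enumerate(row):
            e[j] += f[i] * h_ij / row_sum
    return e


def deviation(observed, expected):
    """ASSUMED form: mean absolute difference between o_j and e_j."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# Hypothetical panel: t1 matches r1 and r2, t2 matches all four reads.
h = [
    [1.0, 1.0, 0.0, 0.0],  # t1
    [1.0, 1.0, 1.0, 1.0],  # t2
]
f_ml = [0.4, 0.6]
observed = [0.25, 0.25, 0.25, 0.25]
expected = expected_read_frequencies(h, f_ml)
d = deviation(observed, expected)
```

A deviation well above zero, as in this toy case, is the signal the slide describes: the panel may be missing a transcript.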

Outline

- EM for Isoform Expression Estimation
- Virtual Transcript EM Algorithm
- DRUT: Detection and Reconstruction of Unannotated Transcripts
- Experimental Results
- Conclusions

Slide9

Virtual Transcript Expectation Maximization (VTEM)

- VTEM is based on a modification of the Virtual String Expectation Maximization (VSEM) algorithm [Mangul et al. 2011].
  - The difference is that the panel contains exons instead of reads.
- Calculate observed exon counts based on read mapping.
  - Each read contributes to the count of either one exon or two exons (depending on whether it is an unspliced or a spliced read).

[Figure: the panel of transcripts T1, T2 and reads R1, R2, R3, R4 is converted into a panel of transcripts T1, T2 and exons E1, E2, E3, with exon counts 1, 3, 3.]

Slide10
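The exon-counting step of VTEM (each read adds one to the count of one exon if unspliced, or of two exons if spliced) might look like the sketch below. The coordinate representation, the helper names, and the toy reads are assumptions made for illustration; a real pipeline would take the blocks from an aligner's output.

```python
from collections import Counter


def find_exon(block, exons):
    """Return the name of the exon that fully contains the aligned block, if any."""
    start, end = block
    for name, (exon_start, exon_end) in exons.items():
        if exon_start <= start and end <= exon_end:
            return name
    return None


def exon_counts(reads, exons):
    """Count reads per exon.

    reads: one list of aligned (start, end) blocks per read; an unspliced read
    has one block, a spliced read has two. exons: name -> (start, end).
    """
    counts = Counter()
    for blocks in reads:
        touched = {find_exon(block, exons) for block in blocks}
        touched.discard(None)  # ignore blocks that fall outside known exons
        for name in touched:
            counts[name] += 1
    return counts


# Hypothetical coordinates and reads.
exons = {"E1": (0, 100), "E2": (200, 300), "E3": (400, 500)}
reads = [
    [(90, 100), (200, 210)],   # R1: spliced, E1-E2
    [(290, 300), (400, 410)],  # R2: spliced, E2-E3
    [(280, 300), (400, 420)],  # R3: spliced, E2-E3
    [(450, 480)],              # R4: unspliced, E3
]
counts = exon_counts(reads, exons)
```

Dividing each count by the total gives the observed exon frequencies that the panel uses in place of read frequencies.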

Input: Complete vs Partial Annotations

[Figure: two bipartite panels of transcripts and exons E1, E2, E3, E4, each annotated with observed exon frequencies (O).]

- Complete Annotations: transcripts T1, T2, T3; O = 0.25, 0.25, 0.25, 0.25
- Partial Annotations: transcripts T1, T2 (transcript T3 is missing from annotations); O = 0.25, 0.25, 0.25, 0.25

Slide11

Adding Virtual Transcript

[Figure: the same two panels, each extended with a virtual transcript (VT) connected to exons E1, E2, E3, E4 by dashed edges.]

- Complete Annotations: transcripts T1, T2, T3 + VT; O = 0.25, 0.25, 0.25, 0.25
- Partial Annotations: transcripts T1, T2 + VT; O = 0.25, 0.25, 0.25, 0.25
- Virtual transcript: the weighted probability of VT to emit exon j equals 0 (i.e., dashed edges)

Slide12

After 1st EM Run

[Figure: both panels after the first EM run, showing ML-estimated transcript frequencies (ML) and, per exon, observed (O) and expected (E) exon frequencies.]

- Complete Annotations: ML = .25, .5, .25 for T1, T2, T3 and 0 for VT; O = .25, .25, .25, .25; E = .25, .25, .25, .25
- Partial Annotations: ML = .34, .66 for T1, T2 and 0 for VT; O = .25, .25, .25, .25; E = .32, .32, .16, .16

Slide13

Updating Weights From Virtual Transcript

[Figure: same panels and values as after the first EM run.]

- Complete Annotations: ML = .25, .5, .25 for T1, T2, T3 and 0 for VT; O = .25, .25, .25, .25; E = .25, .25, .25, .25
  - Observed = Expected: nothing to update
- Partial Annotations: ML = .34, .66 for T1, T2 and 0 for VT; O = .25, .25, .25, .25; E = .32, .32, .16, .16
  - O > E: increase VT weights
  - O < E: decrease VT weights (to 0)

Slide14
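The three update cases from the "Updating Weights From Virtual Transcript" slide (O = E, O > E, O < E) can be written as a single rule. The slide does not state the size of the adjustment, so the additive step `o - e` with a floor at 0 and the tolerance `tol` are assumptions of this sketch, and `update_vt_weights` is a hypothetical name.

```python
def update_vt_weights(vt_weights, observed, expected, tol=1e-9):
    """Adjust the virtual transcript's exon weights (illustrative rule only)."""
    updated = []
    for w, o, e in zip(vt_weights, observed, expected):
        diff = o - e
        if abs(diff) <= tol:
            updated.append(w)  # Observed = Expected: nothing to update
        else:
            # O > E: the weight increases; O < E: it decreases, floored at 0.
            updated.append(max(0.0, w + diff))
    return updated


# Observed and expected exon frequencies for the partial-annotation panel
# as listed on the slide; VT weights start at 0.
new_weights = update_vt_weights(
    [0.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.32, 0.32, 0.16, 0.16],
)
```

With these inputs only the last two exons, whose observed frequency exceeds the expected one, receive a positive VT weight.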

After 2nd EM Run

[Figure: both panels after the second EM run.]

- Complete Annotations: ML = .25, .5, .25 for T1, T2, T3 and 0 for VT; O = .25, .25, .25, .25; E = .25, .25, .25, .25
  - VT frequency stays 0
- Partial Annotations: ML = .32, .65 for T1, T2 and .03 for VT; O = .25, .25, .25, .25; E = .3, .3, .15, .15
  - VT frequency increases!
  - Deviation of expected from observed decreases!

Slide15

After the Last EM Run

[Figure: both panels after the last EM run.]

- Complete Annotations: ML = .25, .50, .25 for T1, T2, T3 and 0 for VT; O = .25, .25, .25, .25; E = .25, .25, .25, .25
  - VT frequency is 0
  - No false positives
- Partial Annotations: ML = .20, .60 for T1, T2 and .20 for VT; O = .25, .25, .25, .25; E = .25, .25, .25, .25
  - VT frequency (.2) ≈ T3 frequency (.25)
  - VT's exons (E3, E4) = T3's exons (E3, E4): the overexpressed exons

Slide16

Virtual Transcript Expectation Maximization (VTEM)

[Flowchart]

- Input: (partially) annotated genome + virtual transcript with 0-weights in virtual transcript
- EM: ML estimates of transcript frequencies
- Virtual transcript frequency change > ε?
  - YES: compute expected exon frequencies, update weights of reads in virtual transcript, then run EM again
  - NO: output overexpressed exons (expressed by virtual transcripts)
- Overexpressed exons belong to unknown transcripts

Slide17
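The VTEM flowchart (EM, test the change in VT frequency against ε, update VT weights, repeat) can be put together as the loop below. It is a sketch under several assumptions not stated in the slides: rows of the panel are normalized before use, EM restarts from uniform frequencies in every round, and VT weights change by `o - e` with a floor at 0. The toy panel is hypothetical and deliberately simpler than the one in the figures, so it is not expected to reproduce the slide values.

```python
def normalize(row):
    s = sum(row)
    return [x / s for x in row] if s > 0 else list(row)


def em(panel, observed, max_iter=1000, tol=1e-12):
    """ML transcript frequencies for a transcripts-by-exons weight panel."""
    rows = [normalize(r) for r in panel]
    f = [1.0 / len(rows)] * len(rows)  # uniform start in every round (assumption)
    for _ in range(max_iter):
        n = [0.0] * len(rows)
        for j, o in enumerate(observed):
            denom = sum(f[i] * rows[i][j] for i in range(len(rows)))
            if denom > 0:
                for i in range(len(rows)):
                    n[i] += o * f[i] * rows[i][j] / denom
        total = sum(n)
        if total == 0.0:
            raise ValueError("no exon is explained by any transcript")
        new_f = [x / total for x in n]
        done = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if done:
            break
    return f


def expected_exon_frequencies(panel, f):
    rows = [normalize(r) for r in panel]
    return [sum(f[i] * rows[i][j] for i in range(len(rows)))
            for j in range(len(panel[0]))]


def vtem(annotated, observed, eps=1e-6, max_rounds=50, tol=1e-9):
    """Return (VT frequency, indices of overexpressed exons)."""
    vt = [0.0] * len(observed)  # virtual transcript starts with 0-weights
    prev_vt_freq = None
    for _ in range(max_rounds):
        panel = annotated + [vt]
        f = em(panel, observed)
        if prev_vt_freq is not None and abs(f[-1] - prev_vt_freq) <= eps:
            break  # VT frequency change <= eps: stop and report
        prev_vt_freq = f[-1]
        expected = expected_exon_frequencies(panel, f)
        # ASSUMED update rule: add (o - e), never dropping below 0.
        vt = [w if abs(o - e) <= tol else max(0.0, w + o - e)
              for w, o, e in zip(vt, observed, expected)]
    overexpressed = [j for j, w in enumerate(vt) if w > 0]
    return f[-1], overexpressed


# Hypothetical toy: T1 covers exons 0-1, T3 covers exons 2-3.
observed = [0.25, 0.25, 0.25, 0.25]
T1 = [1.0, 1.0, 0.0, 0.0]
T3 = [0.0, 0.0, 1.0, 1.0]
vt_freq_partial, over_partial = vtem([T1], observed)        # T3 unannotated
vt_freq_complete, over_complete = vtem([T1, T3], observed)  # control
```

In the control run the virtual transcript never gains weight, which mirrors the "no false positives" observation for complete annotations.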

Outline

- EM for Isoform Expression Estimation
- Virtual Transcript EM Algorithm
- DRUT: Detection and Reconstruction of Unannotated Transcripts
- Experimental Results
- Conclusions

Slide18

Detection and Reconstruction of Unannotated Transcripts

a) Map reads to annotated transcripts (using Bowtie)
b) VTEM: identify overexpressed exons (possibly from unannotated transcripts)
c) Assemble transcripts (e.g., Cufflinks) using reads from "overexpressed" exons and unmapped reads
d) Output: annotated transcripts + novel transcripts

Slide19
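Step c) of DRUT passes two groups of reads to the assembler: reads from overexpressed exons and reads that did not map to annotated transcripts. A minimal sketch of that selection is shown below; the data layout (a dict from read id to the set of exons it maps to) and the name `reads_for_assembly` are assumptions, and the calls to Bowtie and Cufflinks themselves are intentionally left out.

```python
def reads_for_assembly(mapped, unmapped, overexpressed):
    """Select the reads handed to the assembler in step c).

    mapped: read id -> set of exon names the read maps to.
    unmapped: ids of reads that did not map to annotated transcripts.
    overexpressed: set of exon names reported by VTEM.
    """
    from_overexpressed = [read_id for read_id, exons in mapped.items()
                          if exons & overexpressed]
    return from_overexpressed + list(unmapped)


# Hypothetical read ids and exon names.
mapped = {
    "r1": {"E1"},
    "r2": {"E3", "E4"},
    "r3": {"E2", "E3"},
}
unmapped = ["r4"]
selected = reads_for_assembly(mapped, unmapped, {"E3", "E4"})
```

Reads such as `r1`, which are fully explained by annotated exons, are left out, which is what keeps the assembly focused on candidate novel transcripts.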

Outline

- EM for Isoform Expression Estimation
- Virtual Transcript EM Algorithm
- DRUT: Detection and Reconstruction of Unannotated Transcripts
- Experimental Results
- Conclusions

Slide20

Simulation Setup

- Human genome data (UCSC hg18)
  - UCSC database: 66,803 isoforms, 19,372 genes
- Single error-free reads: 60M reads of length 100 bp
- For the partially annotated genome: remove exactly one isoform from every gene

Slide21
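The partial annotation used in the simulation setup (exactly one isoform removed from every gene) could be generated as sketched below. The slides do not say how the removed isoform is chosen, so the uniform random choice, the fixed seed, and the name `remove_one_isoform_per_gene` are assumptions; isoform ids are assumed to be unique within a gene.

```python
import random


def remove_one_isoform_per_gene(annotation, seed=0):
    """Return (partial annotation, removed isoform per gene).

    annotation: gene name -> list of isoform ids.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    partial, removed = {}, {}
    for gene, isoforms in annotation.items():
        dropped = rng.choice(isoforms)  # ASSUMED: uniform random choice
        removed[gene] = dropped
        partial[gene] = [iso for iso in isoforms if iso != dropped]
    return partial, removed


# Hypothetical gene and isoform names.
annotation = {
    "geneA": ["A.1", "A.2", "A.3"],
    "geneB": ["B.1", "B.2"],
}
partial, removed = remove_one_isoform_per_gene(annotation)
```

The `removed` mapping doubles as the ground truth against which reconstructed novel transcripts can be scored.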

Distribution of isoform lengths and gene cluster sizes in the UCSC dataset

Slide22

Comparison Between Methods

Slide23

Conclusions

- We proposed DRUT, a novel annotation-guided method for transcriptome discovery and reconstruction in partially annotated genomes.
- DRUT outperforms existing genome-guided transcriptome assemblers (i.e., Cufflinks).
- DRUT shows similar or better performance compared with existing annotation-guided assemblers (Cufflinks RABT).

Slide24

Thanks!
