Slide1
Adrian Caciula
Department of Computer Science, Georgia State University
Joint work with Serghei Mangul (GSU), Ion Mandoiu (UCONN), Alex Zelikovsky (GSU)
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes
CAME 2011, Atlanta, GA
Slide2
Genome-Guided RNA-Seq Protocol
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
Gene Expression (GE)
Isoform Expression (IE)
Isoform Discovery (ID)
[Figure: reads mapped to a gene with exons A, B, C, D, E; isoform labels A B C and A C D E]
Slide3
Using Partial Annotation
Existing tools for the genome-guided RNA-Seq protocol:
  Cufflinks [Trapnell et al. 2010] reports a minimal set of transcripts
  Scripture [Guttman et al. 2010] reports all possible transcripts
Besides genomes, partial annotation exists for many species
  Almost all human genes/exons and the majority of transcripts are already known (genome libraries, e.g., UCSC)
How can one incorporate it into the genome-guided RNA-Seq protocol?
  Cufflinks-RABT [Roberts et al. 2011]: adds fake reads uniformly distributed among transcripts
This talk: subtract reads (exon counts) explained by known transcripts
  Use Virtual String EM (VSEM) [Mangul et al. 2011]
  Reconstruct transcripts containing unexplained exons
Slide4
Outline
EM for Isoform Expression Estimation
Virtual Transcript EM Algorithm
DRUT: Detection and Reconstruction of Unannotated Transcripts
Experimental Results
Conclusions
Slide5
Max Likelihood Model
Input data of EM is a panel: a bipartite graph.
LEFT: transcripts (unknown frequencies)
RIGHT: reads (observed frequencies)
EDGES: weight ~ probability of the read being emitted by the transcript; weights are calculated based on the mapping of the reads to the transcripts
Given: annotations (transcripts) and frequencies of the reads
Find: ML estimate of transcript frequencies
[Figure: bipartite graph with transcripts T1, T2, T3 and reads R1, R2, R3, R4]
Slide6
Generic EM algorithm
Initialization: uniform transcript frequencies f(j)
E-step: Compute the expected number n(j) of reads sampled from transcript j, assuming current transcript frequencies f(j)
M-step: For each transcript j, set f(j) = portion of reads emitted by transcript j among all reads in the sample
ML estimates for f(j) = n(j)/(n(1) + . . . + n(N))
Slide7
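The initialization, E-step, and M-step of the generic EM algorithm can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `em_transcript_frequencies`, the dense `weights` matrix, and the stopping parameters `n_iter` and `tol` are assumptions made for the example.

```python
def em_transcript_frequencies(weights, read_counts, n_iter=1000, tol=1e-12):
    """Estimate transcript frequencies f(j) by EM (illustrative sketch).

    weights[i][j]: edge weight between read i and transcript j (0 if no edge).
    read_counts[i]: observed count (or frequency) of read i.
    """
    n_transcripts = len(weights[0])
    f = [1.0 / n_transcripts] * n_transcripts  # Initialization: uniform f(j)
    for _ in range(n_iter):
        # E-step: expected number n(j) of reads sampled from transcript j.
        n = [0.0] * n_transcripts
        for row, count in zip(weights, read_counts):
            denom = sum(w * fj for w, fj in zip(row, f))
            if denom == 0:
                continue  # read is compatible with no transcript in the panel
            for j, w in enumerate(row):
                n[j] += count * w * f[j] / denom
        total = sum(n)
        if total == 0:
            break
        # M-step: f(j) = n(j) / (n(1) + ... + n(N)).
        new_f = [nj / total for nj in n]
        converged = max(abs(a - b) for a, b in zip(new_f, f)) < tol
        f = new_f
        if converged:
            break
    return f


# Hypothetical toy panel: read 0 maps only to transcript 0, read 1 only to transcript 1.
freqs = em_transcript_frequencies([[1.0, 0.0], [0.0, 1.0]], [30, 10])
print(freqs)  # [0.75, 0.25]
```

With no shared reads, the estimate is simply the fraction of reads per transcript; shared reads are split in proportion to weight times current frequency.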
ML Model Quality
How well does the ML model explain the reads?
Deviation between expected (e_j) and observed (o_j) read frequency: [equation not recoverable from the extracted text]
Expected read frequency: [equation not recoverable from the extracted text]
|R|: number of reads
h_{s_i,j}: weighted match based on mapping read r_j
[symbol not recoverable]: maximum-likelihood frequency of transcript t_i
If the deviation is high, we expect that some transcripts are missing from the panel.
Slide8
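For the ML Model Quality slide, one plausible form of the two quantities, consistent with the symbols it lists (|R|, h_{s_i,j}, o_j, e_j, and the ML frequency of transcript t_i), is given below. This is an assumption, not the slide's own equations, which did not survive extraction. The symbol \hat{f}_{t_i} for the ML frequency, the row normalization of the weights, and averaging the deviation over the reads are all assumed.

```latex
D = \frac{1}{|R|} \sum_{j} \left| o_j - e_j \right| ,
\qquad
e_j = \sum_{i} \frac{h_{s_i,j}}{\sum_{l} h_{s_i,l}} \, \hat{f}_{t_i}
```

Under this reading, D = 0 means the panel explains the observed read frequencies exactly, and a large D suggests that transcripts are missing from the panel.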
Outline
EM for Isoform Expression Estimation
Virtual Transcript EM Algorithm
DRUT: Detection and Reconstruction of Unannotated Transcripts
Experimental Results
Conclusions
Slide9
Virtual Transcript Expectation Maximization (VTEM)
VTEM is based on a modification of the Virtual String Expectation Maximization (VSEM) algorithm [Mangul et al. 2011]; the difference is that the panel contains exons instead of reads.
Calculate observed exon counts based on read mapping: each read contributes to the count of either one exon or two exons (depending on whether it is an unspliced or a spliced read).
[Figure: panel of transcripts T1, T2 over reads R1, R2, R3, R4, converted into a panel of transcripts T1, T2 over exons E1, E2, E3 with exon counts 1, 3, 3]
Slide10
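The exon-count step of VTEM can be sketched as follows. This is a minimal illustration under stated assumptions: each read is represented as a tuple of the exons it maps to, a spliced read adds one full count to each of its two exons, and the function name `exon_counts` and the read-to-exon mapping in the example are hypothetical (chosen so that the counts come out as 1, 3, 3).

```python
from collections import Counter


def exon_counts(read_mappings):
    """Count how many reads touch each exon.

    read_mappings: iterable of tuples of exon ids; one exon for an
    unspliced read, two exons for a spliced read.
    """
    counts = Counter()
    for exons in read_mappings:
        if not 1 <= len(exons) <= 2:
            raise ValueError("a read must map to one or two exons")
        for exon in exons:
            counts[exon] += 1  # assumed: one full count per touched exon
    return counts


# Hypothetical mapping of four reads R1..R4 onto exons E1..E3.
reads = [("E1", "E2"), ("E2", "E3"), ("E2", "E3"), ("E3",)]
counts = exon_counts(reads)
print(counts["E1"], counts["E2"], counts["E3"])  # 1 3 3
```

Dividing each count by the sum of all counts gives the observed exon frequencies used as input to the EM runs.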
Input: Complete vs Partial Annotations
[Figure: two panels over exons E1, E2, E3, E4, each exon with observed frequency (O) 0.25]
Complete Annotations: transcripts T1, T2, T3
Partial Annotations: transcripts T1, T2; transcript T3 is missing from annotations
Slide11
Adding Virtual Transcript
[Figure: two panels over exons E1, E2, E3, E4, each exon with observed frequency (O) 0.25, and each panel extended with a virtual transcript (VT)]
Complete Annotations: transcripts T1, T2, T3 + VT
Partial Annotations: transcripts T1, T2 + VT
Virtual transcript: the weighted probability of VT emitting exon j equals 0 (i.e., dashed edges)
Slide12
After 1st EM Run
O = observed exon frequencies, E = expected exon frequencies, ML = ML-estimated transcript frequencies
Complete Annotations: O = E = .25 for E1, E2, E3, E4; ML: T1 = .25, T2 = .5, T3 = .25, VT = 0
Partial Annotations: O = .25 for E1, E2, E3, E4; E: E1 = .32, E2 = .32, E3 = .16, E4 = .16; ML: T1 = .34, T2 = .66, VT = 0
Slide13
Updating Weights From Virtual Transcript
Complete Annotations: O = E = .25 for E1, E2, E3, E4; ML: T1 = .25, T2 = .5, T3 = .25, VT = 0
  Observed = Expected: nothing to update
Partial Annotations: O = .25 for E1, E2, E3, E4; E: E1 = .32, E2 = .32, E3 = .16, E4 = .16; ML: T1 = .34, T2 = .66, VT = 0
  O > E (E3, E4): increase VT weights
  O < E (E1, E2): decrease VT weights (to 0)
Slide14
After 2nd EM Run
Complete Annotations: O = E = .25 for E1, E2, E3, E4; ML: T1 = .25, T2 = .5, T3 = .25, VT = 0
  VT frequency stays 0
Partial Annotations: O = .25 for E1, E2, E3, E4; E: E1 = .3, E2 = .3, E3 = .15, E4 = .15; ML: T1 = .32, T2 = .65, VT = .03
  VT frequency increases!
  Deviation of expected from observed decreases!
Slide15
After the Last EM Run
Complete Annotations: O = E = .25 for E1, E2, E3, E4; ML: T1 = .25, T2 = .50, T3 = .25, VT = 0
  VT frequency is 0
  No false positives
Partial Annotations: O = E = .25 for E1, E2, E3, E4; ML: T1 = .20, T2 = .60, VT = .20
  VT frequency (.2) ≈ T3 frequency (.25)
  VT's exons (E3, E4) = T3's exons (E3, E4)
  Overexpressed exons: E3, E4
Slide16
Virtual Transcript Expectation Maximization (VTEM)
Input: (partially) annotated genome + virtual transcript with 0-weights
EM: ML estimates of transcript frequencies
Compute expected exon frequencies
Update weights of exons in the virtual transcript
Virtual transcript frequency change > ε?
  YES: run EM again
  NO: output overexpressed exons (expressed by virtual transcripts)
Overexpressed exons belong to unknown transcripts
Slide17
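The VTEM loop (EM, expected exon frequencies, virtual-transcript weight update, convergence test on the VT frequency) can be sketched as follows. This is a rough illustration under several assumptions that the slides do not specify: each transcript's weights are normalized into exon emission probabilities, a VT weight grows by O - E when O > E and is reset to 0 when O < E, differences below `delta` count as "Observed = Expected", and at least two EM runs are made before the convergence test. All function and parameter names are hypothetical.

```python
def normalize(row):
    s = sum(row)
    return [w / s for w in row] if s > 0 else [0.0] * len(row)


def em(weights, observed, n_iter=1000, tol=1e-12):
    """ML transcript frequencies for a transcripts-by-exons weight panel."""
    p = [normalize(r) for r in weights]  # assumed emission probabilities
    f = [1.0 / len(p)] * len(p)          # uniform initialization
    for _ in range(n_iter):
        n = [0.0] * len(p)
        for j, o in enumerate(observed):
            denom = sum(f[i] * p[i][j] for i in range(len(p)))
            if denom > 0:
                for i in range(len(p)):
                    n[i] += o * f[i] * p[i][j] / denom
        s = sum(n)
        if s == 0:
            break
        new_f = [x / s for x in n]
        done = max(abs(a - b) for a, b in zip(new_f, f)) < tol
        f = new_f
        if done:
            break
    return f


def expected_exon_freqs(weights, f):
    p = [normalize(r) for r in weights]
    return [sum(f[i] * p[i][j] for i in range(len(p)))
            for j in range(len(weights[0]))]


def vtem(weights, observed, eps=1e-4, delta=1e-6, max_rounds=100):
    """Return (frequencies with VT last, indices of overexpressed exons)."""
    vt = [0.0] * len(observed)  # virtual transcript starts with 0-weights
    prev = None
    for _ in range(max_rounds):
        panel = weights + [vt]
        f = em(panel, observed)
        e = expected_exon_freqs(panel, f)
        if prev is not None and abs(f[-1] - prev) <= eps:
            break  # VT frequency change is at most eps: stop
        prev = f[-1]
        # Assumed update rule: O > E raises the weight, O < E resets it to 0.
        vt = [w + (o - x) if o - x > delta else (0.0 if x - o > delta else w)
              for w, o, x in zip(vt, observed, e)]
    return f, [j for j, w in enumerate(vt) if w > 0]


# Hypothetical panel: one annotated transcript over exons 0 and 1; exon 2 is unexplained.
f, overexpressed = vtem([[1.0, 1.0, 0.0]], [0.4, 0.4, 0.2])
print(f, overexpressed)
```

In this toy panel the virtual transcript absorbs the unexplained exon, while a panel that already explains every exon leaves the VT frequency at 0.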
Outline
EM for Isoform Expression Estimation
Virtual Transcript EM Algorithm
DRUT: Detection and Reconstruction of Unannotated Transcripts
Experimental Results
Conclusions
Slide18
Detection and Reconstruction of Unannotated Transcripts
a) Map reads to annotated transcripts (using Bowtie)
b) VTEM: Identify overexpressed exons (possibly from unannotated transcripts)
c) Assemble transcripts (e.g., Cufflinks) using reads from "overexpressed" exons and unmapped reads
d) Output: annotated transcripts + novel transcripts
Slide19
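Step c) of DRUT passes only part of the reads to the assembler. A minimal sketch of that selection is shown below, assuming reads are held in a dictionary from read id to the exons they map to, with an empty list meaning unmapped. The function name `reads_for_assembly` and the data layout are hypothetical, and the actual mapping and assembly are left to tools such as Bowtie and Cufflinks.

```python
def reads_for_assembly(read_to_exons, overexpressed_exons):
    """Select reads from overexpressed exons plus unmapped reads.

    read_to_exons: dict mapping read id -> list of exon ids
                   (an empty list means the read is unmapped).
    overexpressed_exons: iterable of exon ids reported by VTEM.
    """
    over = set(overexpressed_exons)
    selected = []
    for read_id, exons in read_to_exons.items():
        # Keep the read if it is unmapped or touches an overexpressed exon.
        if not exons or over.intersection(exons):
            selected.append(read_id)
    return selected


# Hypothetical reads; E3 and E4 play the role of overexpressed exons.
reads = {"r1": ["E1"], "r2": ["E3", "E4"], "r3": [], "r4": ["E2", "E3"]}
print(reads_for_assembly(reads, ["E3", "E4"]))  # ['r2', 'r3', 'r4']
```

Reads fully explained by annotated transcripts (here `r1`) are left out, so the assembler only sees evidence for potentially novel transcripts.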
Outline
EM for Isoform Expression Estimation
Virtual Transcript EM Algorithm
DRUT: Detection and Reconstruction of Unannotated Transcripts
Experimental Results
Conclusions
Slide20
Simulation Setup
Human genome data (UCSC hg18)
UCSC database: 66,803 isoforms, 19,372 genes
Single error-free reads: 60M of length 100bp
For the partially annotated genome: remove exactly one isoform from every gene
Slide21
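The partial annotation used in the simulation (one isoform removed from every gene) can be sketched as follows. The slide does not say how the removed isoform is chosen, so the seeded random choice here is an assumption, as are the function name `remove_one_isoform_per_gene` and the dictionary layout.

```python
import random


def remove_one_isoform_per_gene(gene_to_isoforms, seed=0):
    """Build a partial annotation by dropping exactly one isoform per gene.

    gene_to_isoforms: dict mapping gene id -> non-empty list of isoform ids.
    Returns (partial_annotation, removed), both keyed by gene id.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    partial, removed = {}, {}
    for gene, isoforms in gene_to_isoforms.items():
        dropped = rng.choice(isoforms)  # assumed: uniform random choice
        removed[gene] = dropped
        partial[gene] = [t for t in isoforms if t != dropped]
    return partial, removed


# Hypothetical annotation with two genes.
annotation = {"geneA": ["A.1", "A.2", "A.3"], "geneB": ["B.1", "B.2"]}
partial, removed = remove_one_isoform_per_gene(annotation)
print(partial, removed)
```

Note that a gene with a single isoform ends up with no annotated isoforms under this scheme; the removed isoforms serve as the ground truth that a discovery method should recover.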
Distribution of isoform lengths and gene cluster sizes in the UCSC dataset
Slide22
Comparison Between Methods
Slide23
Conclusions
We proposed DRUT, a novel annotation-guided method for transcriptome discovery and reconstruction in partially annotated genomes.
DRUT outperforms existing genome-guided transcriptome assemblers (i.e., Cufflinks).
DRUT shows similar or better performance compared with existing annotation-guided assemblers (Cufflinks-RABT).
Slide24
Thanks!