RNA-Seq Assembly: Fundamental Limits, Algorithms and Software

David Tse, Stanford University. Symposium on Turbo Codes and Iterative Information Processing, Bremen, Germany, August 20, 2014.

Presentation Transcript

1. RNA-Seq Assembly: Fundamental Limits, Algorithms and Software
David Tse, Stanford University
Symposium on Turbo Codes and Iterative Information Processing, Bremen, Germany, August 20, 2014
Joint work with Sreeram Kannan and Lior Pachter.
Research supported by NSF Center for Science of Information.

2. Communication system design
1) Establish fundamental limits.
2) Design codes and algorithms to approach the limit.
3) Implement a system.
We apply this methodology to the RNA-Seq assembly problem.

3. DNA sequencing
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

4. High throughput sequencing revolution

5. Shotgun sequencing
(Figure label: read)

6. Sequencing Technologies

| Sequencer | Mechanism | Read length | Error rate | Output data (per run) |
|---|---|---|---|---|
| Sanger 3730xl | Dideoxy chain termination | 400-900 bp | 0.001% | 100 KB |
| 454 GS | Pyrosequencing | 700 bp | 0.1% | 1 GB |
| Ion Torrent | Detection of hydrogen ion | ~400 bp | 2% | 100 GB |
| SOLiDv4 | Ligation and two-base coding | 50 + 50 bp | 0.1% | 100 GB |
| Illumina HiSeq 2000 | Reversible nucleotides | 100 bp PE | 2% | 1 TB |
| Pac Bio | Single molecule real time | 1000~10000 bp | 10-15% | 10 GB |

7. High throughput sequencing: Microscope in the big data era
Assembly, genomic variations, 3-D structures, transcription, translation, protein interaction, etc.
Today's focus: RNA sequencing.

8. Central dogma of molecular biology
DNA → (transcription) → RNA → (translation) → Protein
RNA transcripts and their abundances capture the dynamic state of a cell at a given time.

9. From DNA to RNA
(Figure labels: DNA, RNA Transcript 1, RNA Transcript 2, Intron, Exon; sequence fragments shown: ATCGATCATTCGATCCATTCGGATTCG, ACTGAAAGC.)
Alternative splicing yields different isoforms.
1000's to 10,000's of symbols long.

10. Transcriptome
(Figure: transcript sequences ATCCATTCGGATTCG, labeled "20 copies in cell" and "30 copies in cell".)
Different transcripts are present at different abundances.
Transcriptome is the mixture of transcripts from all the genes.
Human transcriptome has 10,000's of transcripts from 20,000 genes.

11. RNA-Seq (Mortazavi et al., Nature Methods 2008)

12. RNA-Seq assembly
(Figure: reads such as ATCC, ATTCG, GATTCG, TTCG sampled from the transcripts.)
From the reads, the assembler reconstructs the transcriptome.
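
To make the input of the assembly problem concrete, here is a minimal sketch of how reads arise from a transcriptome. It assumes error-free reads with start positions drawn uniformly along each transcript copy. The function name `sample_reads`, the two transcript strings and the seed are illustrative assumptions; only the copy counts 20 and 30 echo slide 10.

```python
import random

def sample_reads(transcripts, copies, read_len, n_reads, seed=0):
    """Sample noiseless reads from a mixture of transcripts (illustrative model)."""
    rng = random.Random(seed)
    # A transcript with more copies, and more possible start positions,
    # contributes proportionally more reads.
    weights = [c * (len(t) - read_len + 1) for t, c in zip(transcripts, copies)]
    reads = []
    for _ in range(n_reads):
        t = rng.choices(transcripts, weights=weights, k=1)[0]
        start = rng.randrange(len(t) - read_len + 1)
        reads.append(t[start:start + read_len])
    return reads

# Hypothetical two-transcript transcriptome (sequences are made up).
transcripts = ["ATCCATTCGGATTCG", "ATCCGGATTCG"]
copies = [20, 30]
reads = sample_reads(transcripts, copies, read_len=5, n_reads=200)
print(reads[:5])
```

The assembler sees only `reads`; the task is to recover `transcripts` (and ideally `copies`) from them.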

13. RNA assembly: state-of-the-art
(Figure: comparison of the transcripts reported by IsoLasso, Scripture and Cufflinks; the numeric labels are not recoverable.)
Popular assemblers diverge significantly when fed the same input.
Source: Wei Li et al., JCB 2011. Data from ENCODE project.

14. Assembly as a software engineering problem
A single sequencing experiment can generate 100's of millions of reads, 10's to 100's of gigabytes of data.
Primary concerns are to minimize time and memory requirements.
No guarantee on optimality of assembly quality and in fact no optimality criterion at all.

15. A new approach
Establish information theoretic limits under simplifying assumptions.
Design an assembly algorithm that achieves close to the limits.
Build software and test on simulated and real data.

16. Information theoretic limits
Basic question: What is the length, number and error rate of the reads needed for reliable reconstruction of a transcriptome?
A simplified question: What is the minimum read length Lcritical needed, assuming infinite noiseless reads?
(cf. earlier work on DNA assembly: Bresler, Bresler and Tse 2013 BMC Bioinformatics, Motahari et al. 2013 ISIT)

17. Sequencing Technologies

| Sequencer | Mechanism | Read length | Error rate | Output data (per run) |
|---|---|---|---|---|
| Sanger 3730xl | Dideoxy chain termination | 400-900 bp | 0.001% | 100 KB |
| 454 GS | Pyrosequencing | 700 bp | 0.1% | 1 GB |
| Ion Torrent | Detection of hydrogen ion | ~400 bp | 2% | 100 GB |
| SOLiDv4 | Ligation and two-base coding | 50 + 50 bp | 0.1% | 100 GB |
| Illumina HiSeq 2000 | Reversible nucleotides | 100 bp PE | 2% | 1 TB |
| Pac Bio | Single molecule real time | 1000~10000 bp | 10-15% | 10 GB |

18. Lcritical depends on repeats
Lcritical is a measure of repeat complexity of the transcriptome from the point of view of assembly.

19. What is Lcritical for a transcriptome?
Lcritical depends on the intra-transcript repeats and the inter-transcript repeats in the transcriptome.

20. Intra-transcript repeats: interleaved repeats
(Figure: a single transcript containing interleaved repeats, each marked L-1.)
Lcritical is lower bounded by the length of the longest intra-transcript interleaved repeat.
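
A small sketch of why interleaved repeats limit reconstruction. The two sequences below are hypothetical: both contain the repeats `x` and `y` in interleaved order (x ... y ... x ... y), and they differ only in which single base follows each copy of `x`. With reads one symbol longer than the repeats, the two sequences produce exactly the same noiseless reads; one more symbol of read length separates them. The helper name `read_spectrum` is an assumption made for illustration.

```python
from collections import Counter

def read_spectrum(seq, read_len):
    """Multiset of all length-read_len substrings (noiseless reads) of seq."""
    return Counter(seq[i:i + read_len] for i in range(len(seq) - read_len + 1))

# Hypothetical transcript with interleaved repeats x ... y ... x ... y.
x, y = "ACG", "TTC"                          # two repeats, each of length 3
original = x + "G" + y + "A" + x + "T" + y
swapped = x + "T" + y + "A" + x + "G" + y    # bases after the two copies of x exchanged

# Reads of length 4 cannot bridge a repeat of length 3: identical read sets.
same_at_4 = read_spectrum(original, 4) == read_spectrum(swapped, 4)
# Reads of length 5 bridge the repeats and tell the two sequences apart.
same_at_5 = read_spectrum(original, 5) == read_spectrum(swapped, 5)
print(same_at_4, same_at_5)  # True False
```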

21. Inter-transcript repeats
Lcritical is typically much larger due to inter-transcript repeats of exons across isoforms.
(Figure: isoforms sharing exon sequences; label: 100's of symbols.)

22. Ambiguity due to inter-transcript repeats
(Figure: transcript 1 = s1 s3 s4 and transcript 2 = s2 s3 s5 share the segment s3, marked L-1. L = read length.)

23. Ambiguity due to inter-transcript repeats
(Figure: four candidate transcripts, labeled transcript 1 to transcript 4, assembled from segments s1 to s5 around the shared segment s3, marked L-1. L = read length.)
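
The ambiguity on slides 22 and 23 can be reproduced with a short sketch. All segment strings are made up, and the alternative transcriptome {s1 s3 s5, s2 s3 s4} is an assumption about what the figure shows. With every transcript present in one copy (equal abundance) and a shared segment s3 of length L-1, both transcriptomes yield the same multiset of noiseless reads of length L.

```python
from collections import Counter

def transcriptome_reads(transcripts, read_len):
    """Multiset of noiseless reads of length read_len, one copy per transcript."""
    counts = Counter()
    for t in transcripts:
        counts.update(t[i:i + read_len] for i in range(len(t) - read_len + 1))
    return counts

# Hypothetical segments; s3 is the shared exon.
s1, s2, s3, s4, s5 = "AAC", "GGT", "CTAG", "TCA", "ATG"
true_set = [s1 + s3 + s4, s2 + s3 + s5]
alt_set = [s1 + s3 + s5, s2 + s3 + s4]

L = len(s3) + 1  # s3 has length L-1, so no read bridges it
ambiguous = transcriptome_reads(true_set, L) == transcriptome_reads(alt_set, L)
resolved = transcriptome_reads(true_set, L + 1) != transcriptome_reads(alt_set, L + 1)
print(ambiguous, resolved)  # True True
```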

24. Abundance diversity
(Figure: transcript abundances in a lymphoblastoid cell line, Geuvadis dataset.)

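25. Equal abundance vs. generic abundances
(Figure, two panels titled "Equal abundance" and "Generic abundances": transcripts built from segments s1 to s5 around the shared segment s3, marked L-1, with abundance labels a, b, c.)
Unique generic solution, also sparse.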
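
A minimal sketch of how generic (pairwise distinct) abundances can resolve the pairing at a shared segment: each segment entering s3 is matched to the segment leaving s3 that carries the same flow. This covers only the simplest case, where every entering and leaving segment belongs to exactly one transcript. The function name `pair_by_abundance` and the read counts are illustrative assumptions, not the talk's algorithm.

```python
def pair_by_abundance(in_flow, out_flow):
    """Match segments entering a shared segment to segments leaving it by equal flow.

    Returns a dict {entering: leaving}, or None when the flows do not
    determine a unique pairing (for example, equal abundances).
    """
    if len(set(in_flow.values())) < len(in_flow):
        return None  # two entering segments carry the same flow
    by_flow = {f: seg for seg, f in out_flow.items()}
    if len(by_flow) < len(out_flow):
        return None  # two leaving segments carry the same flow
    if set(by_flow) != set(in_flow.values()):
        return None  # flows do not line up one-to-one
    return {seg: by_flow[f] for seg, f in in_flow.items()}

# Generic abundances: s1 s3 s4 with count 3, s2 s3 s5 with count 7.
generic = pair_by_abundance({"s1": 3, "s2": 7}, {"s4": 3, "s5": 7})
# Equal abundances: both transcripts with count 5.
equal = pair_by_abundance({"s1": 5, "s2": 5}, {"s4": 5, "s5": 5})
print(generic, equal)
```

With distinct counts the pairing s1 with s4 and s2 with s5 is forced; with equal counts the function reports that no unique pairing exists.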

26. Unresolvable intra-transcript repeats with generic abundances
Transcriptome: s1 s3 s4 (abundance a), s2 s3 s4 (abundance b), s2 s3 s5 (abundance c).
Alternative solution: s1 s3 s4 (a-c), s2 s3 s4 (b+c), s1 s3 s5 (c).
Yields a lower bound for Lcritical for a given transcriptome.
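
The alternative solution on slide 26 can be checked by comparing the flow on each edge into and out of the shared segment s3. The assignment of abundances to transcripts below is reconstructed from flow conservation and is an assumption, since the figure layout is lost; it also requires a > c so that all abundances stay positive.

```latex
\begin{align*}
\text{original: } & s_1 s_3 s_4 : a, \quad s_2 s_3 s_4 : b, \quad s_2 s_3 s_5 : c \\
\text{alternative: } & s_1 s_3 s_4 : a - c, \quad s_2 s_3 s_4 : b + c, \quad s_1 s_3 s_5 : c \\[4pt]
f(s_1 \to s_3) &= a = (a - c) + c \\
f(s_2 \to s_3) &= b + c \\
f(s_3 \to s_4) &= a + b = (a - c) + (b + c) \\
f(s_3 \to s_5) &= c
\end{align*}
```

Both transcriptomes induce the same four edge flows, so reads that cannot bridge s3 do not distinguish them even when a, b, c are generic.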

27. Algorithm: reduction to sparsest flow
Create a splice graph where each node is an exon.
Read copy counts give edge flows.
Transcripts are extracted via solving a sparsest flow problem.
(Figure: splice graph on s1, s2, s3, s4, s5 with edge flows 0.12 and 0.88, decomposed into the paths s1 s3 s4 and s2 s3 s5 with flows 0.12 and 0.88.)
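
The steps above can be sketched on the slide's small example. The code uses greedy path peeling, a common heuristic for flow decomposition, and is not the algorithm proposed in the talk; it does not guarantee the sparsest decomposition in general. It assumes an acyclic splice graph with conserved flows, and uses integer read counts (12 and 88, standing in for 0.12 and 0.88) to avoid floating-point residue. The function name `greedy_decompose` is hypothetical.

```python
def greedy_decompose(edge_flow):
    """Peel source-to-sink paths off a flow, heaviest edge first.

    edge_flow maps (u, v) exon pairs to integer flows on an acyclic
    splice graph with conserved flow. Returns a list of (path, flow).
    """
    flow = {e: f for e, f in edge_flow.items() if f > 0}
    paths = []
    while flow:
        heads = {v for (_, v) in flow}
        # Start from the heaviest edge leaving a node with no incoming flow.
        start = max((e for e in flow if e[0] not in heads), key=lambda e: flow[e])
        path_edges = [start]
        while True:
            node = path_edges[-1][1]
            out = [e for e in flow if e[0] == node]
            if not out:
                break  # reached a node with no outgoing flow
            path_edges.append(max(out, key=lambda e: flow[e]))
        bottleneck = min(flow[e] for e in path_edges)
        for e in path_edges:
            flow[e] -= bottleneck
            if flow[e] == 0:
                del flow[e]
        nodes = [path_edges[0][0]] + [v for _, v in path_edges]
        paths.append((nodes, bottleneck))
    return paths

# Edge flows from the slide's example, scaled to integer counts.
splice_graph = {("s1", "s3"): 12, ("s2", "s3"): 88,
                ("s3", "s4"): 12, ("s3", "s5"): 88}
decomposition = greedy_decompose(splice_graph)
print(decomposition)
```

On this input the heuristic recovers two paths, s2 s3 s5 with flow 88 and s1 s3 s4 with flow 12, matching the decomposition shown on the slide.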

28. Sparsest Flow Decomposition
Problem is NP-Hard. [Vatinlen et al. '08, Hartman et al. '12]
Closer look at hard instances: most paths have same flow.
Equivalent to: most transcripts have same abundance (!)
This is not characteristic of the biological problem.
Our result:
Assume that abundances are generic.
Propose a provably correct algorithm that reconstructs when L > Lsuff.
Algorithm is linear time under this condition.

29. Informational limits: summary
Lcritical of a transcriptome, on the read length axis L (starting from 0):
Below Lcritical: no algo. can reconstruct.
Above Lcritical: proposed algo. can reconstruct in linear time.
On many reference transcriptomes, these two bounds match, establishing Lcritical!

30. From theory to software

31. ShannonRNA: simulated reads
Chr 15 Gencode reference transcriptome, 1700 transcripts.
L = 100, 1M reads, 1% error rate.
(Figure axes: coverage depth of transcripts; sensitivity (fraction of transcripts recovered); specificity (false positive rate).)

32. Performance on real reads
RNA sample from Human Embryonic Stem Cell.
Simultaneously sequenced using long Pacbio reads and short Illumina reads.
Long reads are fewer in number.
Read length = 50, 20 million reads.
Long read assembly as a proxy for ground truth [Au et al., PNAS 2013].

33. ShannonRNA: real reads
Running time: Trinity: 3 hrs; ShannonRNA: 5 hrs.
No. transcripts: 800; ShannonRNA: 527; Trinity: 476.
(Figure axes: coverage depth of transcripts; sensitivity (fraction of transcripts recovered).)
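
A quick arithmetic sketch of the sensitivity implied by these counts. It assumes, which the slide does not state explicitly, that 800 is the number of reference transcripts and that 527 and 476 are the numbers recovered by each assembler; the variable names are illustrative.

```python
# Assumption: 800 reference transcripts; 527 and 476 are the recovered counts.
reference_transcripts = 800
recovered = {"ShannonRNA": 527, "Trinity": 476}

# Sensitivity = fraction of reference transcripts recovered.
sensitivity = {name: n / reference_transcripts for name, n in recovered.items()}
for name, value in sensitivity.items():
    print(f"{name}: {value:.1%}")
```

Under that reading, ShannonRNA recovers about 66% of the transcripts and Trinity about 60%.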

34. Abundance of Transcripts Reconstructed (Segregated by Number of Isoforms)
Zooming in.

35. Summary
An approach to RNA assembly design based on principles of information theory.
Driven by and tested on transcriptomics data.
Goal is to build robust, scalable software with performance guarantees.

36. Acknowledgements
Sreeram Kannan (Berkeley), Lior Pachter (Berkeley), Joseph Hui (Berkeley), Kayvon Mazooji (Berkeley)