
A.A. 2020-2021 CORSO DI METODI MOLECOLARI E - PowerPoint Presentation

Uploaded On 2024-02-02



Presentation Transcript

1. Academic year 2020-2021
Course in Molecular Methods and Bioinformatics for the Master's degree (CLM) in Evolutionary Biology
School of Science, Università di Padova
Prof. Stefania Bortoluzzi

2. Outline
Transcriptomics today
RNA-seq features and advantages
Transcriptome sequencing goals
Analysis pipelines

3. Transcriptome definition and dynamics
The transcriptome is the set of all RNA molecules produced in one cell, or in a population of cells, at a given time
Amount/concentration information is needed in addition to molecule identification:
  QUALITY: identification of expressed RNA molecules
  QUANTITY: expression level
The transcriptome continuously changes with:
  Development
  Differentiation
  Response to environment
  Disease
Transcriptome variation depends on:
  Genetic/epigenetic changes of the genome
  Emergent properties of the transcriptome itself as a complex system
  Many regulatory RNAs
  Many layers of gene expression regulation

4. Goals of sequencing analysis
[Diagram: TRANSCRIPTOME divided into LONG and SHORT RNAs; CODING and NON-CODING (miRNA, other)]
Advantages:
  Large amount of data produced
  Identification of the causes of unknown or poorly understood diseases
  Application of the acquired knowledge (pharmacogenomics)
Limitations:
  Need for high-capacity servers
  Construction of complex bioinformatics infrastructures
  Need for integrated databases
  Diverse biological questions

5. Genome
< 3% of the human genome codes for proteins (coding RNA, mRNA)
Recent evidence obtained with genomic tiling arrays and transcriptome sequencing has shown that >70% of the genome is pervasively transcribed into RNA
A very large number of transcriptional products are RNAs, both small and long, with very low coding potential
Most of the transcribed eukaryotic DNA is non-coding
The ENCODE production phase showed that >80% of the genome is biologically active and functional (with a regulatory role for most of the sequences)

6. The non-coding transcriptome
RNA molecules can simultaneously carry sequence information and possess structural plasticity
RNAs can both interact with DNA and other RNAs through complementary base pairing and provide binding sites for proteins
Non-coding RNAs known for a long time:
  rRNA and tRNA in translation
  snRNA and snoRNA in mRNA processing
  ribozymes
The "RNA world" hypothesis

7. [Diagram: DNA -> (transcription) -> RNA -> (processing) -> mRNA -> (translation) -> proteins; ncRNA shown as a separate product]
<3% of the genome is important since transcribed/coding
Abundant "junk DNA"

8. [Diagram: DNA -> (transcription) -> RNA transcripts/precursors -> (processing) -> mRNA -> (translation) -> proteins and ncRNA, annotated with regulatory steps: alternative TSSs, splicing, trans-splicing, editing, polyadenylation, nuclear export, sequestration, turn-over, silencing. ncRNA classes shown: miRNAs, siRNAs, piRNAs, snoRNAs, snRNAs, lncRNAs, circRNAs, tRFs (?)]
>70% of the DNA is transcribed in "dark matter"
Diverse functional roles for ncRNAs uncovered

9. RNA-seq: cDNA sequencing using NGS
Cells/biosamples → library preparation → sequencing → computational analysis
To get information about a sample's RNA content
For reverse engineering the genome state

10. From RNA -> sequence data
[Figure annotation: DNA (fragmented) enters here]
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

11. Raw sequencing data are only the starting point for a long journey
Sequencing raw data processing, cleaning and filtering
Reads mapping to reference sequence (genome or transcriptome)
Assembly
Post-processing: differential expression; gene/transcript discovery; variants; sRNAs; annotation by similarity; …

12. Advantages of RNA-seq: wide expression range
[Figure: array-based versus RNA-seq-based (read count) expression, log2 scale; different samples, same classes (CTR-BM, PMF); mean expression values per class and per technique; Spearman correlation 0.59]
Many miRNAs are weakly expressed
Arrays: larger sample sizes
RNA-seq: larger range in expression estimates
Expression range (log10 scale): 7 orders of magnitude versus 4 orders of magnitude

13. Advantages of RNA-seq: discovery power
[Figure: a typical read mapping result after integration with gene annotations; RNA-seq sample ≈130,000,000 reads]
New exon? This is not possible using microarrays

14. NGS for transcriptomics: features
Many data: Illumina HiSeq → 400 million reads/lane
Short reads (≅100 nt): much shorter than most elements of interest
High error rate: rate and type depend on the technology
Several protocols for library preparation: with different purposes and possibly introducing biases
Strand specificity: if available
Uneven sequencing coverage: compared to genome-seq

15. Experimental Design
RNA-seq comes in different flavours:
  Poly(A) enrichment or ribosomal RNA depletion?
  Single-end or paired-end?
  Stranded or not?
  How much sequencing data to collect?

16. Poly(A) enrichment or ribosomal RNA depletion?
It depends on which RNA entities you are interested in
Transcriptome assembly: remove all ribosomal RNA (and maybe enrich for only polyA+ transcripts)
Differential gene expression: enrich for poly(A)
  EXCEPTION: if you are aiming to obtain information about long non-coding RNAs or circular RNAs
Metatranscriptomics: remove all the host materials
  Remove rRNA by molecular methods prior to sequencing
  Remove host mRNA by computational methods post-sequencing

17. Technical replicates
  Illumina has low technical variation, unlike microarrays
  Technical replicates are unnecessary
Batch effects
  Best to sequence everything for an experiment at the same time
  If you are preparing the libraries, be consistent and make them simultaneously
Biological replicates
  These are essential for your experiment to have any statistical power
  At least 3, but the more the better
For transcriptome assembly
  RNA can be pooled from various sources to ensure the most robust transcriptome
  Pooling can also be done after sequencing, but before assembly
For differential gene expression
  Pooling RNA from multiple biological replicates is usually not advisable
  Only do so if you have multiple pools from each experimental condition

18. Transcriptome Analysis - Quality Checks
FastQC is a great tool for quality assessment
[Figure: FastQC per-base quality plots, one labelled "Good quality!" and one "Poor quality!"]
Poor quality at the ends → use "quality trimmers" such as Trimmomatic, fastx-toolkit, …
Left-over adapter sequences in the reads must be removed
After trimming: rerun the data through FastQC to check the result
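As a rough illustration of what a quality trimmer does with poor-quality read ends, the sketch below cuts a read at the first sliding window whose mean Phred score falls below a threshold. It is a simplified stand-in written for this transcript, not the algorithm of Trimmomatic or fastx-toolkit; the function names, the window size of 4 and the threshold of 20 are all assumptions.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(char) - 33 for char in qual_string]


def trim_3prime(seq, quals, min_q=20, window=4):
    """Cut the read where the mean quality of a sliding window first drops below min_q.

    Hypothetical helper: scans windows from the 5' end and keeps only the bases
    that precede the first failing window. Reads shorter than the window are
    returned unchanged.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have the same length")
    for start in range(0, len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_q:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read whose quality collapses towards the 3' end
read = "ACGTACGTAC"
scores = [38, 38, 38, 38, 38, 38, 10, 8, 5, 2]
trimmed_seq, trimmed_scores = trim_3prime(read, scores)
print(trimmed_seq, trimmed_scores)
```

After trimming, the read would go back through a quality report, as the slide recommends.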

19. Transcriptome Analysis - Quality Checks
[Figure: quality profiles before and after quality trimming]

20. Transcriptome sequencing goals
RNA classes:
  CODING: mRNAs
  NON-CODING: ncRNAs
  Long: >200 nt; Medium: 200-30 nt; Short: <30 nt
QUALITY and QUANTITY goals:
  Gene identification/discovery and quantification
  Differential expression (DE)
  Identification/discovery of expressed transcript isoforms
  Isoform quantification and DE
  Alternative splicing (AS) events
  Small RNAs (sRNAs): quantification, discovery and characterisation, DE
  Identification of chimeric transcripts from gene fusion events
  Trans-splicing
  RNA editing / chemical modifications
  SNP and variant calling

21.

22. Three RNA-seq mapping strategies
De novo assembly
Align to transcriptome
Align to reference genome
Diagrams from Cloonan & Grimmond, Nature Methods 2010

23. RNA-seq read to genome mapping is complex
Mapping: arranging and aligning the reads to a reference sequence (the reference is not from the same individual)
RNA-seq reads can be spliced; spliced reads are informative
E.g. TopHat, BWA, GSNAP, STAR
[Figure: reads from a polyadenylated RNA (AAAAAAAA) aligned across exons of the genome]

24. Repetitive sequences: the "multiple mapping" problem
50% of the human genome is repetitive
Large gene families share similar sequences
Evolutionarily recent paralogs
Segmental duplications
NGS cleaned read length <100 bp
Split reads
The higher the similarity within the repeats, the lower the read mapping confidence
Different strategies for the treatment of multi-mapping reads deeply change the final result

25. Paired-end sequencing
Single-end read: Read1 ATGTTCCATAAGC…
Paired-end reads: Read1 ATGTTCCATAAGC…, Read2 CCGTAATGGCATG…
Paired-end reads are a pair of reads derived from the same molecule, with known distance between the reads
[Figure: a 200-800 nt fragment with a 100 nt read at each end]

26. Paired-end sequencing
Paired-end reads are a pair of reads derived from the same molecule, with known distance between the reads
[Figure: a 200-800 nt fragment with a 100 nt read at each end, aligned to a reference containing repeats]
Since it is improbable that both reads align to repetitive regions, paired-end reads reduce the "multiple mapping" problem

27. Paired-end sequencing
Paired-end reads are a pair of reads derived from the same molecule, with known distance between the reads
[Figure: read pairs P1, P2, P3 aligned to Isoform 1, Isoform 2 and Isoform 3; a 200-800 nt fragment with a 100 nt read at each end]
Paired ends increase isoform deconvolution confidence
P1 originates from isoform 1 or 2 but not 3
P2 and P3 originate from isoform 1
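The reasoning on this slide can be sketched as a set-compatibility check: a read pair can only come from an isoform that contains every exon touched by its two mates. The exon layout below is hypothetical (the slide's figure is not recoverable from the transcript) and was chosen only so that the outcome matches the slide's statements; real tools also check splice junctions and fragment length, which this sketch ignores.

```python
# Hypothetical exon composition of three isoforms of one gene
ISOFORMS = {
    "isoform1": ["E1", "E2", "E3", "E4"],
    "isoform2": ["E1", "E2", "E4"],
    "isoform3": ["E1", "E3", "E4"],
}

# Hypothetical exons covered by the two mates of each read pair
READ_PAIRS = {
    "P1": {"E1", "E2"},
    "P2": {"E2", "E3"},
    "P3": {"E2", "E3", "E4"},
}


def compatible_isoforms(pair_exons, isoforms):
    """Return the isoforms that contain every exon covered by a read pair."""
    return sorted(
        name for name, exons in isoforms.items() if set(pair_exons) <= set(exons)
    )


for pair_name, exons in READ_PAIRS.items():
    print(pair_name, compatible_isoforms(exons, ISOFORMS))
```

A single-end read covering only E1 would be compatible with all three isoforms, which is why the second mate adds information.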

28. Reference-based transcriptome assembly
De novo transcriptome assembly
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

29. Reference-based transcriptome assembly
Used when the genome sequence is known and:
  Transcriptome data are not available
  Transcriptome information is available but not good enough, i.e. missing isoforms of genes, or unknown non-coding regions
  The existing transcriptome information is for a different tissue type
Cufflinks and Scripture are two reference-based transcriptome assemblers

30. Reference-based transcriptome assembly
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

31. Cufflinks
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression in RNA-Seq samples
Aligned RNA-Seq reads are assembled into a parsimonious set of transcripts
The relative abundances of transcripts are estimated based on how many reads support each one
Sequence-specific biases introduced during library preparation, which challenge the assumption of uniform coverage, are corrected
Multiple mapping of reads is taken into account

32. De novo transcriptome assembly
Used when very little information is available for the genome
Often the first step in putting together information about an unknown genome
The amount of data needed for a good de novo assembly is higher than what is needed for a reference-based assembly
Can be used for genome annotation, once the genome is assembled
Trinity, Oases and TransABySS are examples of well-regarded transcriptome assemblers

33. De novo transcriptome assembly (De Bruijn graph construction)
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

34. De novo transcriptome assembly
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

35.

36. In 1946, the Dutch mathematician Nicolaas de Bruijn became interested in the Superstring Problem: find a shortest circular superstring that contains all possible substrings of length k (k-mers) in a given alphabet. There exist n^k k-mers in an alphabet containing n symbols: for example, given the alphabet {A,T,G,C}, there are 4^3 = 64 trinucleotides. If our alphabet is instead {0,1}, then all possible 3-mers are simply given by all 3-digit binary numbers: 000, 001, 010, 011, 100, 101, 110, 111. The circular superstring "00011101" not only contains all 3-mers but is also as short as possible, since it contains each 3-mer exactly once. But how can one construct such a superstring for all k-mers in the case of an arbitrary k and an arbitrary alphabet? De Bruijn answered this question by borrowing Euler's solution of the Bridges of Königsberg Problem.
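The construction hinted at here can be sketched as follows: take every (k-1)-mer as a node, draw one edge per k-mer, and walk an Eulerian cycle that uses each edge exactly once. The code below is a minimal sketch of that idea using Hierholzer's algorithm; the function names are illustrative, and the particular sequence returned depends on the order in which edges are visited.

```python
from itertools import product


def de_bruijn_sequence(alphabet, k):
    """Build a shortest circular string containing every k-mer over `alphabet` once.

    Nodes are (k-1)-mers, edges are k-mers; an Eulerian cycle through the graph
    spells the sequence (Hierholzer's algorithm with an explicit stack).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if k == 1:
        return "".join(alphabet)
    # Each node has one outgoing edge per symbol of the alphabet
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    stack = [alphabet[0] * (k - 1)]
    cycle = []
    while stack:
        node = stack[-1]
        if unused[node]:
            symbol = unused[node].pop()
            stack.append(node[1:] + symbol)  # follow the edge node -> next node
        else:
            cycle.append(stack.pop())  # dead end: node is finished
    cycle.reverse()
    # The first and last node coincide; each step contributes its last symbol
    return "".join(node[-1] for node in cycle[1:])


def circular_kmers(sequence, k):
    """List all k-mers read around a circular string."""
    extended = sequence + sequence[:k - 1]
    return [extended[i:i + k] for i in range(len(sequence))]


binary = de_bruijn_sequence("01", 3)
print(binary, sorted(circular_kmers(binary, 3)))
```

De novo assemblers reuse the same graph idea with the k-mers observed in the reads, rather than all possible k-mers, as edges.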

37.

38. Differential expression
[Figure: expression in Condition 1 versus Condition 2]
The best methods to discover DE are coupled with sophisticated approaches to normalization, e.g. DESeq, edgeR, Cufflinks
Ignoring very lowly expressed genes is recommended: RPKM<1

39. Count-based gene expression quantification
Longer transcripts have higher counts → length bias, an intra-sample bias
Different experiments have different yields → inter-sample bias
Normalization is required: Reads Per Kilobase of exonic sequence per Million mapped reads (RPKM; Mortazavi et al., Nature Methods 2008)
Other issues:
  Non-uniform coverage in different exons
  Read sharing among transcripts
  Multiple mapping problem
  Overlapping genes
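The normalization named on this slide amounts to dividing the read count by the exonic length in kilobases and by the number of mapped reads in millions. A minimal sketch, with made-up counts and lengths and a hypothetical function name:

```python
def rpkm(read_count, exonic_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exonic sequence per Million mapped reads."""
    if exonic_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (exonic_length_bp * total_mapped_reads)


# Made-up example: same read count, different transcript lengths
library_size = 10_000_000
short_gene = rpkm(500, 2_000, library_size)
long_gene = rpkm(500, 4_000, library_size)
print(short_gene, long_gene)
```

With equal counts, the longer gene gets half the RPKM of the shorter one, which is the length correction the slide describes; dividing by library size handles the difference in yield between experiments.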

40. Differential expression
What genes/transcripts are differentially expressed in various test conditions?
The first step is proper normalization of the data
In general, RNA-seq count data follow neither a normal nor a simple Poisson distribution: they are overdispersed and are better described by a negative binomial distribution -> a statistical program that makes the correct assumptions is needed
edgeR, DESeq and other R/Bioconductor packages are available for DE detection
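To make the distributional point concrete: under a Poisson model the variance of a gene's counts equals its mean, whereas a common negative binomial parameterisation has variance = mean + dispersion × mean². The sketch below estimates the dispersion of one gene from replicate counts by the method of moments. The counts are made up and the function name is an assumption; packages such as DESeq and edgeR use more elaborate estimators that share information across genes.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion: solve var = mu + alpha * mu**2 for alpha.

    Returns 0.0 when the counts are no more variable than a Poisson model
    (variance <= mean) or when the mean is zero.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    return max(0.0, (var - mu) / (mu * mu))


# Made-up replicate counts for two genes with the same mean
overdispersed = [80, 100, 120]   # variance 400, well above the mean of 100
poisson_like = [98, 100, 102]    # variance 4, below the mean of 100
print(moment_dispersion(overdispersed), moment_dispersion(poisson_like))
```

A gene with dispersion well above zero would get overly optimistic p-values from a Poisson test, which is the motivation for the negative binomial model.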

41. CuffDiff
Uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for significant differences
The observed log fold change (LgFC) in gene/transcript expression is tested against the null hypothesis of no change
The null distribution of LgFC is estimated using the beta negative binomial model for each transcript in each condition

42. CuffDiff
Differential expression tests:
  isoform_exp.diff: transcript differential FPKM
  gene_exp.diff: gene differential FPKM (using summed FPKM of transcripts of the same gene)
  tss_group_exp.diff: primary transcript differential FPKM (tests differences in the summed FPKM of transcripts sharing a TSS)
  cds_exp.diff: coding sequence differential FPKM (using summed FPKM of transcripts sharing protein_id)
Differential splicing tests:
  splicing.diff: for each primary transcript, how much differential splicing exists between isoforms processed from a single primary transcript
Differential coding output
Differential use of promoters

43. Gene fusion events
Relevant for cancer pathogenesis, and as diagnostic and therapeutic targets (e.g. imatinib)
Mitelman DB 2015, Chromosome Aberrations and Gene Fusions in Cancer: cases = 65,388; gene fusions = 2,327
Specific fusions in thyroid tumorigenesis
The genome can be highly rearranged in tumours
Only a fraction of rearrangements might alter transcription

44. RNA-seq can identify expressed fusion transcripts likely to be functional or causal of disease
Identification of gene fusion events beyond annotated exons, in new genes, is also possible
E.g. TopHat-Fusion, deFuse
Normally, splice aligners require reads to map to the same chromosome, within a fixed maximal distance

45. The detection of fusion events relies on unmapped reads
Read to genome mapping → Initially Unmapped (IUM) reads
Identify gene fusion candidates:
  Reads encompassing the fusion junction
  Reads spanning the fusion junction
Define the exact boundary sequence:
  IUM reads are split into fragments
  Fragments are mapped separately
  Two terminal fragments map to different chromosomes
  The unmapped central fragment is used to find the precise fusion point
Abate F et al., Bioinformatics 2012; Kim & Salzberg, Genome Biol 2011
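The split-and-map idea can be sketched on toy sequences: anchor the two terminal fragments of an unmapped read by exact matching and, if they land on different chromosomes, extend the left anchor base by base to locate the junction. This is a deliberately simplified illustration, not the procedure of TopHat-Fusion or deFuse; the chromosome strings, the fragment length of 8 and all function names are assumptions, and mismatches, strands and repeated anchors are ignored.

```python
# Toy "chromosomes" invented for this example
CHROMOSOMES = {
    "chrA": "ACGTTGCAAGCTTAGGCATCGATTACAGG",
    "chrB": "TTCCGGAATCGTGACCTGAAGTCCATGAT",
}


def locate(fragment, chromosomes):
    """Return (name, position) if the fragment occurs exactly once, else None."""
    hits = []
    for name, seq in chromosomes.items():
        pos = seq.find(fragment)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(fragment, pos + 1)
    return hits[0] if len(hits) == 1 else None


def find_fusion(read, chromosomes, frag_len=8):
    """Report ((donor_chrom, end), (acceptor_chrom, start)) for a fusion read, else None."""
    if len(read) < 2 * frag_len:
        return None
    left = locate(read[:frag_len], chromosomes)
    right = locate(read[-frag_len:], chromosomes)
    if left is None or right is None or left[0] == right[0]:
        return None  # unanchored, ambiguous, or both ends on one chromosome
    (a_name, a_pos), (b_name, b_pos) = left, right
    seq_a, seq_b = chromosomes[a_name], chromosomes[b_name]
    # Extend the left anchor through the central part of the read
    i = frag_len
    while i < len(read) and a_pos + i < len(seq_a) and seq_a[a_pos + i] == read[i]:
        i += 1
    # The rest of the read must continue into the right anchor on the other chromosome
    b_start = b_pos - (len(read) - frag_len - i)
    if b_start < 0 or seq_b[b_start:b_start + len(read) - i] != read[i:]:
        return None
    return (a_name, a_pos + i), (b_name, b_start)


# A chimeric read built from 12 nt of chrA followed by 12 nt of chrB
fusion_read = CHROMOSOMES["chrA"][5:17] + CHROMOSOMES["chrB"][10:22]
print(find_fusion(fusion_read, CHROMOSOMES))
```

Positions are 0-based; the donor coordinate is the first base after the junction on the left chromosome, and the acceptor coordinate is the first base of the right-hand part.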

46. Small RNA-seq enables the discovery and profiling of microRNAs and other small RNAs
[Workflow diagram; elements:]
  Biosample → RNA preparation → size selection (< 40 nt) → deep sequencing (25 nt; Illumina, SOLiD, IonTorrent)
  Map reads to genome (Bowtie, SHRiMP); unmappable reads
  Filter reads mapping to repetitive DNA (repeat sequences)
  miR&moRe, using known miRNA precursors: known miRNAs quantification, new miRNA*s, new moRNAs
  miRDeep2: miRNA precursor prediction, new miRNAs

47. Small RNA-seq: main results
[Figure (I-BFM2013): reads to hairpin locus alignment; x-axis: nt of hairpin locus; y-axis: expression (log10 of read count); sister miRNA pair (miR/miR*); known miRNA expression quantification; new miRNA* discovered and quantified]
Discovery of hundreds of new miRs -> 2042 matures in miRBase v.19
Importance of miR*
Discovery of miRNA sequence variability (isomiRs)
Evidence of non-canonical miRNA biogenesis

48. A large-scale small RNA-seq study recently doubled the miRNA repertoire
Support of novel miRNAs by a Dicer KO experiment
Other validations

49. Number of miRNAs
2013: ≈2000; 2015: ≈4000
Thousands of new miRNAs validated
Many are seed-paralogues of known miRBase miRNAs
Many are specific to the Hominidae family of Primates
Weakly expressed
Expressed with tissue-, development- and differentiation-specific patterns

50. moRNA (miRNA-offset RNA) discovery
~20 nt long RNAs derived from the ends of pre-miRNAs
Novel type of miRNA-related small RNAs, discovered by RNA-seq in the ascidian Ciona intestinalis, then found in human cells, in solid tumours and in herpesvirus-infected cells
Some are enriched in the nuclear fraction of RNAs; function unknown
One locus -> 4 products: two moRNAs and two miRNAs

51. Circular RNAs
Covalently closed RNA molecules deriving from backsplicing
Relaunched recently by RNA-seq projects
Thousands in human cells discovered by RNA-seq
Bonizzato et al., 2016.

52. CircRNAs play important functions and have distinctive features
Highly stable
Detectable in body fluids
Disease and cancer markers
Evolutionarily conserved
Ubiquitous and highly expressed
Competitors of mRNA splicing
Tissue and developmental stage specificity
Bonizzato et al., 2016.

53. Chiastic mapping on the genome of the reads derived from the backsplice junction
[Figure: co-linear mapping versus chiastic mapping]
Bonizzato et al., 2016.
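One way to picture the difference between the two mapping patterns: when a read is aligned in two segments, a linear splice places the read's 3' segment downstream of its 5' segment on the genome, while a backsplice junction places it upstream. The sketch below classifies such a split alignment; the interval representation (chromosome, start, end on the forward strand, end exclusive) and the function name are assumptions made for illustration, and real circRNA detection tools apply many further filters.

```python
def classify_split_alignment(first_segment, second_segment):
    """Classify a two-segment read alignment on the forward strand.

    Each segment is (chromosome, start, end), with `first_segment` holding the
    5' part of the read and `second_segment` the 3' part.
    """
    chrom_1, start_1, end_1 = first_segment
    chrom_2, start_2, end_2 = second_segment
    if chrom_1 != chrom_2:
        return "different chromosomes"
    if start_2 >= end_1:
        return "co-linear"  # genomic order matches read order (linear splicing)
    if end_2 <= start_1:
        return "chiastic"   # genomic order inverted (candidate backsplice junction)
    return "overlapping"


# Made-up coordinates
linear_read = (("chr1", 1000, 1040), ("chr1", 5000, 5060))
backsplice_read = (("chr1", 5000, 5040), ("chr1", 1000, 1060))
print(classify_split_alignment(*linear_read))
print(classify_split_alignment(*backsplice_read))
```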