Differential expression analysis with RNA-Seq

Uploaded On 2023-12-30

Presentation Transcript

1. Differential expression analysis with RNA-Seq
G-OnRamp Beta Users Workshop
Wilson Leung, 07/2017

2. Outline
- Design considerations for RNA-Seq experiments
- Interpret FastQC results
- Optimize alignment parameters for HISAT
- Assess alignment statistics with CollectRnaSeqMetrics
- Assemble transcripts with StringTie
- Differential expression analysis with DESeq2

3. RNA-Seq overview
Griffith M et al. Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud. PLoS Comput Biol. 2015 Aug 6;11(8):e1004393.

4. Use the RNA Integrity Number (RIN) to assess the quality of the RNA sample
- Electropherograms produced by the Agilent Bioanalyzer
- Expects 2:1 ratio between 28S and 18S rRNA
- RIN values range from 1 (most degraded) to 10 (least degraded)
https://en.wikipedia.org/wiki/RNA_integrity_number
[Figure: electropherograms (fluorescence vs. length) for RIN = 10, with the marker, 5S, fast region, 18S, and 28S peaks labeled, and for RIN = 1]
Gallego Romero I et al. RNA-Seq: impact of RNA degradation on transcript quantification. BMC Biol. 2014 May 30;12:42.

5. Common applications of RNA-Seq
- Transcriptome profiling
  - Identify novel transcripts (e.g., gene annotations)
  - Quantify expression levels
- Differential expression
  - Different developmental stages; treatment versus control
- Alternative splicing
- Visualization and integration with other datasets
  - Correlate with epigenomic landscape
  - Histone modifications, DNA methylation, etc.
Conesa A et al. A survey of best practices for RNA-Seq data analysis. Genome Biol. 2016 Jan 26;17:13.

6. Using RNA-Seq to identify chimeric transcripts
Maher CA et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A. 2009 Jul 28;106(30):12353-8.

7. The optimal RNA-Seq sequencing and analysis protocols depend on the goals of the study

8. Design considerations for RNA-Seq
- Experimental design
  - Number of samples, number of biological and technical replicates
- Sequencing design
  - Spike-in controls, randomization of library prep and sequencing
- Quality control
  - Sequencing quality, mapping bias
Conesa A et al. A survey of best practices for RNA-Seq data analysis. Genome Biol. 2016 Jan 26;17:13.

9. Biological and technical replicates
- Biological replicates
  - RNA from independent growth of cells and tissues
  - Account for biological variations
- Technical replicates
  - Different library preparations of the same RNA-Seq sample
  - Account for batch effects from library preparations
  - Sample loading, cluster amplifications, etc.
ENCODE long RNA-Seq standards: https://www.encodeproject.org/data-standards/rna-seq/long-rnas/
Blainey P et al. Points of significance: replication. Nat Methods. 2014 Sep;11(9):879-80.

10. Recommended RNA-Seq sequencing depth based on genome size
Genome size | Differential expression analysis (# reads) | Detect rare transcripts / de novo assembly (# reads) | Read length
Small (bacteria / fungi) | 5 M | 30 – 65 M | 50 bp
Intermediate (D. melanogaster) | 10 M | 70 – 130 M | 50 – 100 bp
Large (Human) | 15 – 25 M | 100 – 200 M | > 100 bp
https://genohub.com/next-generation-sequencing-guide/#depth2

11. How many biological replicates?
- As many as possible…
- Analysis of 48 biological replicates in two conditions
  - Requires 20 biological replicates to detect > 85% of all differentially expressed genes
- Recommend at least six biological replicates per condition
- Twelve biological replicates needed to detect smaller fold changes (≥ 0.3-fold difference in expression)
- Three biological replicates per condition can usually detect genes with ≥ 2-fold difference in expression
Schurch NJ et al. How many biological replicates are needed in an RNA-Seq experiment and which differential expression tool should you use? RNA. 2016 Jun;22(6):839-51.

12. Power curves for number of biological replicates in each condition
Web interface for RnaSeqSampleSize: https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/
[Figure: power vs. number of samples in each condition, with curves for FDR = 0.05 and FDR = 0.01]

13. Using Galaxy to perform RNA-Seq analysis
- Quality control with FastQC
- Trim low quality bases with Trimmomatic
- Read mapping with HISAT
- Transcriptome assembly with StringTie
- Differential expression analysis with DESeq2
Based on RNA-Seq tutorial developed by Mo Heydarian and Mallory Freeberg at Johns Hopkins University: https://galaxyproject.org/tutorials/nt_rnaseq/

14. Study design for the differential expression (DE) analysis walkthrough
- G1E mouse cell line
  - ES cells hemizygous for Gata1 knockout
- Megakaryocytes (Mk)
  - Large bone marrow cell
  - Produces platelets for blood clotting
- Goal: Identify transcripts regulated by GATA1
[Figure: panels labeled G1E and Mk]
Wu W et al. Genome Res. 2014 Dec;24(12):1945-62.

15. Quality control with FastQC
- Determine quality encoding of fastq files
- Assess quality of sequencing sample
- Identify overrepresented sequences
  - Adapters, potential contamination, etc.
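As an illustration of the first bullet, the quality encoding of a FASTQ file can often be guessed from the range of characters in its quality lines. The sketch below is a heuristic, not FastQC's own implementation: the function name and the cut-offs (characters below ASCII 59 imply Phred+33, characters above ASCII 74 imply Phred+64) are assumptions based on the commonly documented encoding ranges.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings.

    Heuristic sketch: returns None when the observed characters are
    compatible with both encodings (or when no characters were given).
    """
    codes = [ord(ch) for q in quality_strings for ch in q.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters this low are only expected in Phred+33 files.
        return 33
    if highest > 74:
        # Characters this high are only expected in Phred+64 files.
        return 64
    return None  # ambiguous range


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#!"]))   # low characters present
    print(guess_phred_offset(["hhhhfe"]))   # only high characters
```

In practice the quality strings would be every fourth line of the FASTQ file; sampling the first few thousand reads is usually enough for this kind of check.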

16. FastQC: Per base sequence quality (G1E_R1_f_ds_SRR549355)
van Gurp TP et al. Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One. 2013 Dec 30;8(12):e85583.

17. Sequence bias at 5’ end caused by random hexamer priming
[Figure: sequence content across all bases (%T, %C, %A, %G)]
Hansen KD et al. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010 Jul;38(12):e131.

18. Use peaks in the Per sequence GC content panel to identify potential contamination
[Figure: GC distribution over all sequences for G1E_R1_f_ds_SRR549355, showing GC count per read against the theoretical distribution; multiple peaks are annotated]

19. Investigate overrepresented sequences
[Figure: Fth1 in G1E and Mk]

20. High RNA-Seq coverage at 5’ UTR of Fth1 overlaps with an STS marker
Sequence | %A | %C | %G | %T
RH94403 | 15.3% | 45.9% | 23.0% | 15.8%

21. RNA-Seq read mapping with HISAT
- Many alignment parameters available…
- Which parameters should be changed?

22. Strand-specific RNA-Seq libraries
- Standard libraries do not preserve strand information
- Prefer strand-specific RNA-Seq
  - Transcript reconstruction and quantification
  - Detect antisense transcripts
- Library prep kits available from Illumina:
  - TruSeq Stranded Total RNA Sample Prep Kit
  - TruSeq Stranded mRNA Sample Prep Kit
Zhao S et al. Comparison of stranded and non-stranded RNA-Seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015 Sep 3;16:675.

23. Orientations of RNA-Seq reads depend on the paired-end protocol
- TruSeq Strand-Specific Total RNA: First Strand (R/RF); fr-firststrand (F2R1)
- NuGEN Encore: Second Strand (F/FR); fr-secondstrand (F1R2)
- NuGEN Ovation V2: FR Unstranded; fr-unstranded
http://sailfish.readthedocs.io/en/master/library_type.html
See Supplemental Table S5 in Griffith M et al. PLoS Comput Biol. 2015 Aug 6;11(8):e1004393.

24. Use infer_experiment.py to infer the design of the RNA-Seq experiment
- Part of the RSeQC package: http://rseqc.sourceforge.net/
- Map RNA-Seq reads using default parameters
- Run infer_experiment.py to infer experimental design
- Map RNA-Seq reads using the inferred parameters
- Available through the Galaxy Tool Shed (rseqc)
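For paired-end data, infer_experiment.py reports the fraction of reads explained by each of two orientation patterns, "1++,1--,2+-,2-+" and "1+-,1-+,2++,2--". The sketch below shows one way to turn those two fractions into a library type. The function name, the 0.8 cut-off, and the 0.4 to 0.6 window for unstranded data are illustrative assumptions, not values taken from RSeQC.

```python
def infer_library_type(frac_read1_same, frac_read1_opposite, threshold=0.8):
    """Map infer_experiment.py fractions to a library type label.

    frac_read1_same:     fraction explained by "1++,1--,2+-,2-+"
                         (read 1 on the same strand as the gene)
    frac_read1_opposite: fraction explained by "1+-,1-+,2++,2--"
                         (read 1 on the opposite strand from the gene)
    The threshold values are assumptions chosen for illustration.
    """
    if frac_read1_same >= threshold:
        return "fr-secondstrand"
    if frac_read1_opposite >= threshold:
        return "fr-firststrand"
    if 0.4 <= frac_read1_same <= 0.6 and 0.4 <= frac_read1_opposite <= 0.6:
        return "fr-unstranded"
    return "undetermined"


if __name__ == "__main__":
    # Made-up fractions resembling a first-strand (dUTP-style) library.
    print(infer_library_type(0.02, 0.96))
```

An "undetermined" result is a signal to inspect the library prep protocol or the alignments by hand before picking a strand setting.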

25. Common changes to HISAT alignment parameters
- Minimum and maximum intron lengths
- Specify strand-specific information
- GTF file with known splice sites
  - Use known gene annotations to guide read mapping if available
- Transcriptome assembly reporting
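Outside of Galaxy, these settings correspond to command-line options. The sketch below only assembles an argument list and does not run the aligner. It assumes the HISAT2 command-line interface (`--min-intronlen`, `--max-intronlen`, `--rna-strandness`, `--known-splicesite-infile`, `--dta`), so check the option names against the manual of the installed version. The helper name and all file names are hypothetical.

```python
def build_hisat2_command(index, reads_1, reads_2, output_sam,
                         strandness=None, min_intron=None, max_intron=None,
                         known_splice_sites=None, for_assembly=False):
    """Assemble a paired-end HISAT2 argument list (assumed option names)."""
    cmd = ["hisat2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", output_sam]
    if strandness:
        # e.g. "RF" for fr-firststrand or "FR" for fr-secondstrand
        cmd += ["--rna-strandness", strandness]
    if min_intron is not None:
        cmd += ["--min-intronlen", str(min_intron)]
    if max_intron is not None:
        cmd += ["--max-intronlen", str(max_intron)]
    if known_splice_sites:
        # Splice sites extracted beforehand from a GTF annotation.
        cmd += ["--known-splicesite-infile", known_splice_sites]
    if for_assembly:
        # Report alignments tailored for transcript assemblers.
        cmd.append("--dta")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_hisat2_command(
        "mm10_index", "sample_R1.fastq", "sample_R2.fastq", "sample.sam",
        strandness="RF", min_intron=20, max_intron=500000,
        known_splice_sites="splicesites.txt", for_assembly=True)))
```

Returning a list instead of a shell string keeps the command safe to pass to `subprocess.run` without quoting issues.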

26. Use splice site information during read mapping to improve alignment accuracy
Kim D et al. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015 Apr;12(4):357-60.

27. RNA-Seq alignment strategy for multiple samples
- Map reads from each RNA-Seq sample separately
  - Use the --novel-splicesite-outfile option to report splice sites identified in each sample
- Combine splice junctions from all samples under all conditions into a single splice junction file
  - Filter splice junctions by score
  - Retain junctions that appear in multiple biological replicates
- Map reads from each RNA-Seq sample using the combined splice junctions file
  - Use the --novel-splicesite-infile option
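The combining step can be sketched as follows. This assumes each per-sample splice site file is tab-delimited with four columns (chromosome, left flanking position, right flanking position, strand), which is the layout described for HISAT2's novel splice site output. The function name and the `min_samples` parameter are hypothetical. Because that layout has no score column, only the "appears in multiple samples" filter is shown.

```python
from collections import Counter


def combine_splice_sites(per_sample_lines, min_samples=2):
    """Keep junctions that were reported in at least `min_samples` samples.

    per_sample_lines: one iterable of text lines per sample, each line
    formatted as chrom<TAB>left<TAB>right<TAB>strand (assumed layout).
    Returns the retained junctions as sorted, tab-delimited lines.
    """
    counts = Counter()
    for lines in per_sample_lines:
        seen = set()  # count each junction at most once per sample
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, left, right, strand = fields[:4]
            seen.add((chrom, int(left), int(right), strand))
        counts.update(seen)
    kept = sorted(j for j, n in counts.items() if n >= min_samples)
    return ["\t".join(str(value) for value in junction) for junction in kept]


if __name__ == "__main__":
    # Made-up junctions for three samples.
    sample_1 = ["chr1\t100\t200\t+", "chr1\t300\t400\t-"]
    sample_2 = ["chr1\t100\t200\t+"]
    sample_3 = ["chr2\t5\t50\t+"]
    print(combine_splice_sites([sample_1, sample_2, sample_3]))
```

To require support from replicates of the same condition instead of any two samples, the function could be called once per condition and the results merged.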

28. PCR amplification biases in Illumina data
- Major contributors to bias:
  - Fragment size
  - Base composition
[Figure: bridge amplification cycles] http://www.illumina.com/technology/next-generation-sequencing.html
[Figure: relative abundance vs. GC content of amplicon, comparing Illumina and qPCR]
Aird D et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12(2):R18.

29. Identify duplicate reads based on sequence alignments
- Assumption:
  - Rare for different sequenced fragments to have the same start and end positions
  - The Picard tool MarkDuplicates only considers the start position
- RNA-Seq data violates this assumption:
  - Reads map to a smaller portion of the genome
  - Probability of reads with the same start position depends on gene length and expression levels
- Recommendation:
  - Mark potential duplicates to verify that all samples have similar levels of “duplication”
  - Retain duplicate reads in differential expression analyses
Williams AG et al. RNA-Seq Data: Challenges in and Recommendations for Experimental Design and Analysis. Curr Protoc Hum Genet. 2014 Oct 1;83:11.13.1-20.

30. DEMO: Mark optical and PCR duplicates using the Picard tool MarkDuplicates

31. Assess RNA-Seq read alignments with CollectRnaSeqMetrics
- Requires gene annotations:
  - Gene annotations in GTF / GFF format
  - UCSC Genes in refFlat format (from the UCSC Table Browser)
- BAM dataset collection
- Reference genome (mm10)
- Gene annotations in refFlat format
- Second read transcription strand
- rRNAs in interval list format

32. DEMO: Use CollectRnaSeqMetrics to assess RNA-Seq alignments

33. Identify coverage bias along the transcript
- cDNA fragmentation
  - DNase I treatment or sonication
- RNA fragmentation
  - RNA hydrolysis or nebulization
[Figure: RNA-Seq read count along the gene span (5,099 genes), 5’ to 3’]
Wang Z et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.
[Figure: RNA-Seq coverage vs. transcript position (G1E_R1); normalized coverage vs. normalized distance along transcript]

34. Two common approaches to RNA-Seq assembly
- Reference-based assembly
  - Map RNA-Seq reads against a reference genome
    - Examples: TopHat2, HISAT
  - Assemble transcripts from mapped RNA-Seq reads
    - Examples: Cufflinks, StringTie
- De novo transcriptome assembly
  - Assemble transcripts from RNA-Seq reads
    - Examples: Oases, Trinity
  - More computationally expensive
  - Merge assemblies produced by different parameters

35. Augment mapped RNA-Seq reads with pre-assembled super-reads (SR)
Pertea M et al. StringTie enables improved reconstruction of a transcriptome from RNA-Seq reads. Nat Biotechnol. 2015 Mar;33(3):290-5.

36. Transcriptome assembly remains an active area of research
Korf I. Genomics: the state of the art in RNA-Seq analysis. Nat Methods. 2013 Dec;10(12):1165-6.
Steijger T et al. Assessment of transcript reconstruction methods for RNA-Seq. Nat Methods. 2013 Dec;10(12):1177-84.

37. DEMO: Assemble transcripts from mapped RNA-Seq reads using StringTie

38. Metrics for quantifying gene expression levels
- RPKM
  - Reads Per Kilobase per Million mapped reads
  - Normalize relative to sequencing depth and gene length
- FPKM
  - Similar to RPKM but count DNA fragments instead of reads
  - Used in paired end RNA-Seq experiments to avoid bias
- TPM
  - Transcripts Per Million
  - Normalize for gene length, then normalize by sequencing depth
Wagner GP et al. Measurement of mRNA abundance using RNA-Seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012 Dec;131(4):281-5.
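The difference in normalization order can be made concrete with a short sketch. The function names and the toy counts are illustrative. The formulas follow the definitions above: RPKM divides by total mapped reads (in millions) and gene length (in kilobases), while TPM first divides each count by gene length and then rescales so the values sum to one million. FPKM would use the same formula as RPKM with fragment counts in place of read counts.

```python
def rpkm(counts, lengths_bp):
    """Reads Per Kilobase per Million mapped reads for each gene."""
    total_reads = sum(counts.values())
    return {gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
            for gene in counts}


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then scale."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate * 1e6 / total_rate for gene, rate in rates.items()}


if __name__ == "__main__":
    # Toy values: read counts proportional to gene length.
    counts = {"geneA": 10, "geneB": 20, "geneC": 70}
    lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}
    print(rpkm(counts, lengths))
    print(tpm(counts, lengths))
```

Because TPM values always sum to one million within a sample, they are easier to compare across samples than RPKM values, whose per-sample total varies. Both sketches assume at least one mapped read and non-zero gene lengths.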

39. RPKM, FPKM, and TPM compare relative gene expression levels within a sample
- Most differential expression analysis tools use read counts as input

40. Create a transcriptome library with StringTie merge
- Combine transcripts from multiple samples into a single transcriptome database
[Figure: assemblies from Sample 1 through Sample 4, the reference annotation, and the merged assemblies]
Pertea M et al. Transcript-level expression analysis of RNA-Seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016 Sep;11(9):1650-67.

41. DEMO: Use StringTie merge to construct a transcriptome database for G1E and Mk

42. Differential expression analysis tools
- Compare gene expression levels:
  - DESeq2 (http://bioconductor.org/packages/release/bioc/html/DESeq2.html)
  - edgeR (http://bioconductor.org/packages/release/bioc/html/edgeR.html)
- Compare transcript expression levels:
  - Cuffdiff (http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/)
  - Ballgown (http://bioconductor.org/packages/release/bioc/html/ballgown.html)
  - MISO (http://genes.mit.edu/burgelab/miso/index.html)
  - Salmon (https://combine-lab.github.io/salmon/)
- Compare both gene and transcript expression levels:
  - RSEM + EBSeq (http://deweylab.biostat.wisc.edu/rsem/README.html)

43. Count the number of reads that overlap with each gene using htseq-count
- Three modes of overlap resolution:
  - union
  - intersection_strict
  - intersection_nonempty
- htseq-count ignores multi-mapped reads
http://www-huber.embl.de/HTSeq/doc/overview.html
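The three modes differ in how they combine the sets of features found at each position of a read. The sketch below is a simplified reading of the documented behavior, not htseq-count's source code. The function name and the input representation (one set of gene IDs per aligned base) are assumptions made for illustration.

```python
def resolve_overlap(position_features, mode="union"):
    """Assign a read to a gene from the gene sets seen at each aligned base.

    position_features: list of sets, one per aligned base, holding the IDs
    of the genes that overlap that base (an assumed input representation).
    Returns a gene ID, "__no_feature", or "__ambiguous".
    """
    if mode == "union":
        # Every gene touched anywhere along the read.
        candidates = set().union(*position_features)
    elif mode == "intersection_strict":
        # Genes that cover every base of the read.
        candidates = (set.intersection(*position_features)
                      if position_features else set())
    elif mode == "intersection_nonempty":
        # Genes that cover every base that is covered by any gene.
        nonempty = [s for s in position_features if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "__no_feature"
    if len(candidates) > 1:
        return "__ambiguous"
    return next(iter(candidates))


if __name__ == "__main__":
    # A read that hangs off the end of gene_A (last two bases uncovered).
    read = [{"gene_A"}, {"gene_A"}, set(), set()]
    for mode in ("union", "intersection_strict", "intersection_nonempty"):
        print(mode, resolve_overlap(read, mode))
```

The partially overlapping read in the usage example is the case where the modes disagree most visibly: only intersection_strict refuses to assign it.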

44. featureCounts is a faster alternative to htseq-count
Liao Y et al. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014 Apr 1;30(7):923-30.

45. Issues with the GTF files produced by the UCSC Table Browser
- The gene_id and transcript_id fields in the GTF file have the same values
- GTF format stipulates that different isoforms derived from the same gene should have the same gene_id
- Two potential solutions:
  - Download genePred file from the UCSC Genome Browser web site, then use genePredToGtf to create the GTF file: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format
  - Use the GTF files from Ensembl: http://www.ensembl.org/info/data/ftp/index.html
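A quick way to spot this problem before counting reads is to check whether every record carries identical gene_id and transcript_id values. The sketch below parses the ninth GTF column with a simple regular expression. The function name is hypothetical, and the attribute parsing assumes the usual `key "value";` layout.

```python
import re

# Matches attribute pairs such as: gene_id "abc";
_ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def gene_ids_mirror_transcript_ids(gtf_lines):
    """Return True if gene_id equals transcript_id on every checked record.

    Records without both attributes, comment lines, and malformed lines
    are skipped. Returns False if no record could be checked.
    """
    checked = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attributes = dict(_ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attributes and "transcript_id" in attributes:
            checked += 1
            if attributes["gene_id"] != attributes["transcript_id"]:
                return False
    return checked > 0


if __name__ == "__main__":
    # Made-up record with identical gene_id and transcript_id values.
    record = ('chr1\texample\texon\t1\t100\t.\t+\t.\t'
              'gene_id "tx0001"; transcript_id "tx0001";')
    print(gene_ids_mirror_transcript_ids([record]))
```

A True result suggests that isoforms will be treated as separate genes by gene-level counting tools, which is the situation the two solutions above are meant to avoid.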

46. DEMO: Calculate the read count for each transcript using featureCounts

47. Use Poisson distribution to model the distribution of read counts
- Probability of the number of “events” in a fixed amount of time or space
- Events = RNA-Seq reads
- Mean = variance = λ
- Assumptions:
  - Events are independent
  - Rate of events is constant
https://en.wikipedia.org/wiki/Poisson_distribution
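For reference, the standard form of the Poisson distribution is written out below, where K denotes the read count for a gene (a symbol introduced here for illustration) and λ is the rate from the slide.

```latex
P(K = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
\qquad\text{with}\qquad
\mathrm{E}[K] = \mathrm{Var}(K) = \lambda
```

The equality of mean and variance is the property that real RNA-Seq counts tend to violate.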

48. Overdispersion in RNA-Seq data
- More highly expressed genes show higher variance
[Figure: log(variance) vs. log(mean), comparing the Poisson expectation with the actual data; the gap between them is labeled overdispersion]
- Negative binomial distribution: [equation not preserved in the transcript]
Zhou YH et al. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011 Oct 1;27(19):2672-8.
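The equation from the original slide is not available in this transcript. A commonly used mean-dispersion form of the negative binomial variance is given below as a stand-in. The symbols μ (mean count) and α (dispersion) are assumptions and may not match the notation on the slide.

```latex
\mathrm{Var}(K) = \mu + \alpha \mu^{2}, \qquad \alpha \ge 0
```

Setting α = 0 recovers the Poisson case (variance equal to the mean), while α > 0 lets the variance grow faster than the mean, which is the overdispersion pattern described above.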

49. Use biological replicates to control Type I errors
- Use biological replicates to estimate variance within a condition
- Identify differentially expressed genes under different conditions
[Figure: two plots of log2 fold change vs. mean expression]
Huber W. Differential Expression for RNA-Seq. https://www.ebi.ac.uk/training/online/course/embo-practical-course-analysis-high-throughput-seq

50. DEMO: Differential expression analysis using DESeq2

51. Verify results using multiple differential expression analysis tools
[Figure: impact of read depth on differential expression analysis (5M, 10M, 20M, and 30M reads)]
- Use the intersection of differentially expressed genes identified by multiple tools in downstream analyses
Zhang ZH et al. A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS One. 2014 Aug 13;9(8):e103207.
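Taking the intersection is a simple set operation once each tool's significant genes have been exported. In the sketch below the function name, tool labels, and gene identifiers are placeholders, not results from the walkthrough.

```python
def consensus_de_genes(results_by_tool):
    """Return the genes called differentially expressed by every tool.

    results_by_tool: dict mapping a tool name to an iterable of gene IDs
    that passed that tool's significance threshold.
    """
    gene_sets = [set(genes) for genes in results_by_tool.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


if __name__ == "__main__":
    # Placeholder gene IDs for illustration only.
    calls = {
        "DESeq2": ["geneA", "geneB", "geneC"],
        "edgeR": ["geneA", "geneC", "geneD"],
    }
    print(sorted(consensus_de_genes(calls)))
```

The intersection is conservative: it lowers the number of false positives at the cost of dropping genes that only one tool detects, so the size of each tool's list is worth reporting alongside the consensus.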

52. Tools for analyzing differentially expressed genes
- Gene Ontology (GO) terms enrichment:
  - topGO (https://bioconductor.org/packages/release/bioc/html/topGO.html)
  - goSTAG (https://bioconductor.org/packages/release/bioc/html/goSTAG.html)
  - DAVID (https://david.ncifcrf.gov/)
- Pathway analysis:
  - GAGE (http://bioconductor.org/packages/release/bioc/html/gage.html)
  - Reactome (http://www.reactome.org/)
- Sample walkthrough:
  - From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline: https://www.bioconductor.org/help/workflows/RnaSeqGeneEdgeRQL/

53. Additional resources
- Analysis of RNA-Seq data: gene-level exploratory analysis and differential expression: http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html
- Informatics for RNA-Seq: A web resource for analysis on the cloud: https://github.com/griffithlab/rnaseq_tutorial/wiki
- UC Davis Bioinformatics Core training course: http://bioinformatics.ucdavis.edu/training/documentation/
- So you want to do a: RNA-Seq experiment, Differential Gene Expression Analysis: https://github.com/msettles/Workshop_RNAseq

54. Summary
- Statistical power of RNA-Seq depends on the number of biological replicates and sequencing depth
- Assess quality of RNA-Seq reads with FastQC
- Assess RNA-Seq alignments with Picard tools
- Perform transcriptome assembly with StringTie
- Perform differential expression analyses with DESeq2

55. Questions?
https://flic.kr/p/bhyT8B

57. Power curves for number of biological replicates in each condition
Web interface for RnaSeqSampleSize: https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/
[Figure: power vs. number of samples in each condition, with curves for FDR = 0.05 and FDR = 0.01]
Ching T et al. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014 Nov;20(11):1684-96.