Learning to count: quantifying signal


ash
ash . @ash
Uploaded On 2023-05-23




Presentation Transcript

1. Learning to count: quantifying signal

2. The two steps of abundance estimation
- Read counting
- Count normalization
However, this process is non-trivial. Many features of transcription can complicate it:
- 5’ and 3’ UTR boundaries
- Alternative splicing
- Cryptic exons and TSSs

3. The two steps of abundance estimation
Read counting:
- Requires that the reads have been confidently mapped to locations in the genome.
- Requires prior knowledge of well-defined regions over which one wants to count, i.e. features (exons, genes, isoforms, promoters, etc.). We call this information the genome “annotation.”
Count normalizing:
- There are numerous methods of normalizing, all of which have trade-offs.
- Normalization is critical for drawing any biologically meaningful conclusions from one’s data.

4. Summarizing read coverage (basics):
The goal: to determine the expression level of a particular genomic feature (gene/exon/transcript), using a set of reads that have been mapped to a reference genome.
The (simplest) answer: define the boundaries of your genomic feature and count the total number of reads that overlap that region.
[Figure labels: genomic feature, genome] The reads in green map to genomic locations that overlap the genomic feature, while those in red do not.
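The "simplest answer" above can be sketched in a few lines of Python. This is only an illustration: the coordinates are made up, intervals are assumed to be half-open (start, end) positions on a single chromosome, and the helper names `overlaps` and `count_overlapping` are hypothetical rather than taken from any library.

```python
# Reads and the feature are half-open (start, end) intervals on one chromosome.
# All coordinates below are invented for illustration.

def overlaps(read, feature):
    """Return True if the read shares at least one base with the feature."""
    return read[0] < feature[1] and feature[0] < read[1]

def count_overlapping(reads, feature):
    """Count the reads that overlap the feature."""
    return sum(overlaps(r, feature) for r in reads)

feature = (100, 200)
reads = [(50, 90), (80, 120), (150, 180), (190, 240), (200, 250)]

print(count_overlapping(reads, feature))  # 3
```

The reads at (50, 90) and (200, 250) play the role of the "red" reads in the figure: they map to the genome but share no base with the feature.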

5. Comparing read coverage (expression):
- Gene 1 coverage: 100 reads
- Gene 2 coverage: 250 reads
Is the expression of Gene 1 < expression of Gene 2?
The number of reads is (roughly) proportional to:
- the length of the gene
- the total number of reads in the library
- AND the expression level of the gene

6. Reads versus Fragments:
- For single-end data there is one read per fragment.
- For paired-end data there are two reads per fragment.
[Figure labels: Read 1, Read 1, cDNA fragment]
Fragments are pieces of cDNA generated from your original RNA sample; they are a direct reflection of the biological expression of your sample. Reads are the sequence of bases read from a fragment and recorded in your fastq file.
We want our summarization of coverage to reflect the number of fragments present in our sample. This means that, for paired-end data, the two reads (forward and reverse) are redundant pieces of information and should be counted as one fragment.
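A minimal sketch of the read-versus-fragment distinction, assuming each mapped read is represented as a (fragment name, mate number) pair. The names and the `count_fragments` helper are hypothetical; real pipelines take this information from the alignment records.

```python
# Hypothetical mapped reads: (fragment_name, mate_number).
# frag1 and frag2 have both mates mapped; frag3 has only one mate mapped.
mapped_reads = [
    ("frag1", 1), ("frag1", 2),
    ("frag2", 1), ("frag2", 2),
    ("frag3", 1),
]

def count_fragments(reads):
    """Count distinct fragments, so that two mates contribute one count."""
    return len({name for name, _mate in reads})

print(len(mapped_reads))              # 5 reads
print(count_fragments(mapped_reads))  # 3 fragments
```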

7. Basic schema for normalizing counts:
Reads Per Million mapped (RPM/CPM):
- The counts for each feature are divided by the total number of millions of mapped reads. This adjusts for differences in sequencing depth.
Reads Per Kilobase per Million mapped (RPKM/FPKM):
1) Count the total number of reads in a feature.
2) Divide by the total number of millions mapped in the sample (RPM).
3) Divide the RPM value by the length of the gene/feature, in kilobases.
- Can be used to compare between different genes within the same sample.
Transcripts Per Million mapped (TPM):
1) Count the total number of reads in a feature.
2) Divide the count by the length of the gene/feature, in kilobases. This results in reads per kilobase (RPK).
3) Sum up all RPK values within the sample and divide by 1,000,000. This is your “per million” scaling factor.
4) Divide each RPK value by the “per million” scaling factor. This gives you TPM for each feature.
- The sum of TPMs is the same across samples, which is good for comparing different genes between different samples.
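The three schemes above can be written out as a short Python sketch. The counts, gene lengths, and library size are invented for illustration (the first two genes reuse the 100 and 250 read counts from the earlier slide, with assumed lengths of 1 kb and 2.5 kb), and the function names are hypothetical.

```python
import math

def cpm(counts, total_mapped):
    """Counts per million: divide each count by millions of mapped reads."""
    per_million = total_mapped / 1e6
    return [c / per_million for c in counts]

def rpkm(counts, lengths_bp, total_mapped):
    """RPKM: CPM divided by feature length in kilobases."""
    return [x / (length / 1e3)
            for x, length in zip(cpm(counts, total_mapped), lengths_bp)]

def tpm(counts, lengths_bp):
    """TPM: length-normalize first (RPK), then scale so the sample sums to 1e6."""
    rpk = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1e6  # the "per million" scaling factor
    return [r / scale for r in rpk]

# Invented example values.
counts = [100, 250, 650]
lengths_bp = [1_000, 2_500, 500]
total_mapped = 2_000_000

print(rpkm(counts, lengths_bp, total_mapped))
print(tpm(counts, lengths_bp))
print(math.isclose(sum(tpm(counts, lengths_bp)), 1e6))
```

With these assumed lengths, the first two genes end up with the same RPKM even though their raw counts differ by a factor of 2.5, because the second gene is 2.5 times longer. The TPM values sum to one million (up to floating-point error) regardless of library size.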

8. Effect of RPKM normalization
RPKM normalization:
- Comparing low and high read count for the same gene
- Comparing low and high read count for different genes
Does transcript #4 actually have the same abundance as transcript #2 and greater abundance than transcript #3?
https://www.nature.com/articles/nmeth.1613

9. Gene-wise vs. Isoform abundance
However, these simple methods of summarizing coverage don’t apply if we’re interested in the abundance of gene isoforms (i.e., transcripts).
[Figure labels: Exon 1, Exon 2, Exon 3; Isoform 1, Isoform 2, Isoform 3; Length, # Reads, Abundance]
RPKM/TPM combine the read coverage across all exons and thus are not sensitive to the contributions of the various isoforms. In other words, for a given read, it is not clear from which isoform it originated.

10. One method for read summarizing – featureCounts()
featureCounts is part of the Rsubread package for R, found in the Bioconductor repository:
http://master.bioconductor.org/packages/release/bioc/html/Rsubread.html
https://bioconductor.org/packages/devel/bioc/manuals/Rsubread/man/Rsubread.pdf
featureCounts input:
- SAM or BAM file(s) containing mapped and sorted reads
- Annotation file containing features and meta-features (GTF or SAF)
featureCounts output:
- A count table (an R matrix) with rows corresponding to features and columns corresponding to samples
featureCounts is a counting method that performs this kind of gene-level summarizing of reads, which is then used in differential expression analysis (e.g. as input to DESeq, DESeq2, edgeR, etc.)
https://www.ncbi.nlm.nih.gov/pubmed/24227677

11. Defining “feature” and “overlap”:
Features and meta-features
- featureCounts performs read summarization at feature level or meta-feature level.
- A feature is a continuous region in the genome, such as an exon.
- A meta-feature is an aggregation of one or more features, such as a gene or transcript.
- Features and meta-features must be provided to featureCounts (in the form of a GTF or SAF file).
Overlap between reads and features
- A read is said to overlap with a feature if there is at least 1 base overlap found between them.
- A read is said to overlap with a meta-feature if it overlaps with at least one of its features.
Multi-overlapping
- A read is a multi-overlapping read if it overlaps with more than one feature when summarization is performed at feature level, or if it overlaps with more than one meta-feature when summarization is performed at meta-feature level.
https://www.ncbi.nlm.nih.gov/pubmed/24227677
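The definitions above can be made concrete with a small Python sketch. This is not the featureCounts implementation: the gene names and coordinates are invented, intervals are assumed to be half-open, and the `allow_multi_overlap` argument only loosely mirrors the idea of letting a multi-overlapping read count for every meta-feature it hits.

```python
# Hypothetical annotation: meta-feature (gene) -> features (exons),
# as half-open (start, end) intervals on one chromosome.
genes = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(380, 500)],
}

def overlap_len(a, b):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def overlapping_genes(read, genes, min_overlap=1):
    """Meta-features for which the read overlaps at least one feature."""
    return [gene for gene, exons in genes.items()
            if any(overlap_len(read, exon) >= min_overlap for exon in exons)]

def count_reads(reads, genes, allow_multi_overlap=False):
    """Meta-feature-level counts; multi-overlapping reads are skipped by default."""
    counts = {gene: 0 for gene in genes}
    for read in reads:
        hits = overlapping_genes(read, genes)
        if len(hits) == 1 or (hits and allow_multi_overlap):
            for gene in hits:
                counts[gene] += 1
    return counts

reads = [(150, 180), (190, 310), (385, 420), (600, 650)]
print(count_reads(reads, genes))
print(count_reads(reads, genes, allow_multi_overlap=True))
```

The read at (190, 310) touches two exons of geneA but only one meta-feature, so it is counted once. The read at (385, 420) overlaps both geneA and geneB, which makes it multi-overlapping at the meta-feature level.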

12. featureCounts algorithm:
- Features are sorted by their 5’ coordinate and arranged into a 2-level hierarchy: bins and blocks.
- The reference sequence is divided into non-overlapping 128 kb bins.
- Within each bin, equal numbers of consecutive features are grouped into blocks.
- The number of blocks in a bin is the square root of the number of features in the bin.
- Thus the number of blocks in a bin roughly equals the number of features per block, which is optimal for hierarchical search.
- Reads are first assigned to their bins, then within each bin the reads are assigned to their blocks.
- Finally, the reads in each block are assigned to any feature with which they overlap.
https://www.ncbi.nlm.nih.gov/pubmed/24227677
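The bin-and-block idea can be illustrated with a simplified Python sketch. It is not the actual featureCounts code: only the 128 kb bin width comes from the slide, while the data layout, the handling of features that cross bin boundaries, and the function names `build_index` and `query` are assumptions made for this example.

```python
import math

BIN_SIZE = 128_000  # bin width from the slide; the rest is illustrative

def build_index(features):
    """features: list of half-open (start, end, name). Returns {bin_id: blocks}."""
    bins = {}
    for f in sorted(features):
        # Assumption: register a feature in every bin it touches.
        for b in range(f[0] // BIN_SIZE, (f[1] - 1) // BIN_SIZE + 1):
            bins.setdefault(b, []).append(f)
    index = {}
    for b, feats in bins.items():
        n_blocks = max(1, math.isqrt(len(feats)))   # about sqrt(n) blocks
        size = math.ceil(len(feats) / n_blocks)     # about sqrt(n) features each
        blocks = []
        for i in range(0, len(feats), size):
            chunk = feats[i:i + size]
            span = (min(f[0] for f in chunk), max(f[1] for f in chunk))
            blocks.append((span, chunk))
        index[b] = blocks
    return index

def query(index, read):
    """Names of features overlapping a half-open (start, end) read."""
    start, end = read
    hits = set()
    for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
        for (block_start, block_end), chunk in index.get(b, []):
            if start < block_end and block_start < end:   # block may contain hits
                for f_start, f_end, name in chunk:
                    if start < f_end and f_start < end:   # feature overlaps read
                        hits.add(name)
    return sorted(hits)

# Invented annotation: 300 features of 500 bp, one every 1 kb.
features = [(i * 1000, i * 1000 + 500, f"f{i}") for i in range(300)]
index = build_index(features)
print(query(index, (1200, 1600)))      # ['f1']
print(query(index, (127900, 128100)))  # ['f128']
```

With n features in a bin, a read is compared against roughly sqrt(n) block spans and then roughly sqrt(n) features inside each matching block, rather than against all n features.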

13. featureCounts inputs:
As inputs, featureCounts requires, at minimum, 1) an annotation file (either GTF or SAF) containing information about your features and 2) a list of SAM or BAM files to summarize.
Gene Transfer Format (GTF) file:
- Refinement of the General Feature Format (GFF) file.
- File format: <seq name> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
Simplified Annotation Format (SAF) file:
- Similar to the BED file format
- Required fields (5): <GeneID> <Chr> <Start> <End> <Strand>
- Contains header line: “GeneID Chr Start End Strand”
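To make the SAF layout concrete, here is a small Python sketch that reads a SAF-style table and groups features by GeneID. It assumes tab-separated columns with the header line given above; the gene names and coordinates are invented, and `read_saf` is a hypothetical helper, not part of Rsubread.

```python
import csv
import io
from collections import defaultdict

# Invented SAF content: header line plus one row per feature (assumed tab-separated).
SAF_TEXT = (
    "GeneID\tChr\tStart\tEnd\tStrand\n"
    "geneA\tchr1\t100\t200\t+\n"
    "geneA\tchr1\t300\t400\t+\n"
    "geneB\tchr1\t380\t500\t-\n"
)

def read_saf(handle):
    """Group SAF rows by GeneID: each row is a feature, each GeneID a meta-feature."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        genes[row["GeneID"]].append(
            (row["Chr"], int(row["Start"]), int(row["End"]), row["Strand"])
        )
    return dict(genes)

annotation = read_saf(io.StringIO(SAF_TEXT))
for gene, feats in annotation.items():
    print(gene, len(feats), "feature(s)")
```

Rows that share a GeneID are collected together, which matches the feature and meta-feature relationship described earlier.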

14. featureCounts options:
featureCounts has many options for tailoring how it summarizes reads. Some of the most relevant ones are as follows:
- files= a character vector giving names of input files containing read mapping results.
- annot.ext= a character string giving the name of a user-provided annotation file or data frame.
- isGTFAnnotationFile= logical indicating whether the annotation (annot.ext) is in GTF format (default: FALSE)
- useMetaFeatures= logical indicating whether the read summarization should be performed at the meta-feature level
- allowMultiOverlap= logical indicating if a read is allowed to be assigned to more than one feature it overlaps
- largestOverlap= if TRUE, the read (or pair) is assigned to the feature with the largest overlap
- minOverlap= integer giving the minimum number of overlapped bases required for a read to be assigned to a feature
- isPairedEnd= logical indicating if input files contain paired-end reads
- requireBothEndsMapped= logical indicating if both ends of the same read pair are required to be successfully aligned
- nthreads= integer giving the number of threads used to run this function
You will need to specify at least files, annot.ext, isGTFAnnotationFile, useMetaFeatures, and isPairedEnd.

15. featureCounts execution:
In this example we have 10 BAM files to summarize, for which we’re specifying sample IDs (“PH01” through “PH10”). The variable bamfilelist is a vector of character strings, each string being the full path to the corresponding BAM file. The sampleIDs are then categorized as factors, for later use in differential expression analysis (sampleIDs must be in the same order as the files in bamfilelist).
We call featureCounts with an SAF annotation file, for paired-end data, summarizing by meta-features, etc.
Finally, we assign the sampleIDs to the columns of the output counts matrix (by default featureCounts will assign the path character strings in bamfilelist to the columns).

16. featureCounts output:
Here we see the summary of the featureCounts output:
The primary component of the output we’re interested in is the one named counts.
If we look at the contents of output$counts, we see it is a matrix of integer values, where the column names are the samples (in this case PH01 through PH10) and the row names are the features over which we summarized the reads (in this case gene IDs).
This count matrix is (at minimum) the necessary input for downstream differential expression analysis programs, such as DESeq or edgeR.

17. featureCounts performance:
featureCounts is both faster and more memory efficient than other common read summarizing programs.
https://www.ncbi.nlm.nih.gov/pubmed/24227677

18. featureCounts algorithm complexity:
featureCounts algorithm complexity relative to other common read summarizing methods.
https://www.ncbi.nlm.nih.gov/pubmed/24227677