Dynamic Programming II. Gene Prediction: Similarity-Based Approaches


finley
Uploaded On 2024-02-09



Presentation Transcript

1. Dynamic Programming II. Gene Prediction: Similarity-Based Approaches
Topics: the idea of the similarity-based approach to gene prediction, the Exon Chaining Problem, and the Spliced Alignment Problem.

2. Gene Prediction
Gene prediction is the computational problem of locating the genes in a genome given only the genomic DNA sequence. A gene is broken into pieces called exons, which are separated by junk DNA (introns).
Two approaches:
Statistical approaches look for signals such as the start codon ATG, the stop codons TAA, TAG, and TGA, and the dinucleotides AG and GT, which are highly conserved on the left- and right-hand sides of an exon. Splice sites are modeled with a profile describing the propensities of different nucleotides to occur at different positions. These methods have had limited success because such profiles also match frequently at non-splice sites in the genome.
Similarity-based approaches use previously sequenced genes and their protein products as templates for recognizing unknown genes in newly sequenced DNA fragments:
(1) Find all local similarities between the genomic sequence and the target protein sequence, using a local alignment algorithm.
(2) Shorten or extend the substrings with high similarity so that they start at AG and end at GT. The resulting set may contain overlapping substrings.
(3) Choose the best subset of non-overlapping substrings as a putative exon structure.

3. Similarity-Based Approach to Gene Prediction
Genes in different organisms are similar.
The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome.
Problem: Given a known gene and an unannotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene.

4. Comparing Genes in Two Genomes
Small islands of similarity correspond to similarities between exons.

5. Reverse Translation
Given a known protein, find a gene in the genome that codes for it.
One might infer the coding DNA of the given protein by reversing the translation process.
This is inexact because most amino acids map to more than one codon. For example, UUA, UUG, CUU, CUC, CUA, and CUG all code for leucine.
The problem is therefore essentially reduced to an alignment problem.
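To make the ambiguity of reverse translation concrete, the sketch below counts how many distinct coding sequences could produce a given peptide under the standard genetic code. The function name `count_reverse_translations` and the one-letter amino acid input format are illustrative choices, not part of the slides.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; stop codons are not included).
CODONS_PER_AA = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2,
    "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}

def count_reverse_translations(protein):
    """Count the coding sequences that translate to `protein`."""
    return prod(CODONS_PER_AA[aa] for aa in protein)

# A four-residue peptide already has dozens of possible codings.
print(count_reverse_translations("MKLS"))  # 1 * 2 * 6 * 6 = 72
```

The count grows exponentially with protein length, which is why the slides treat the task as an alignment problem rather than enumerating candidate coding sequences.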

6. Comparing Genomic DNA Against mRNA
[Figure: a portion of the genome containing exon1, intron1, exon2, intron2, and exon3, compared against the mRNA (codon sequence).]

7. Using Similarities to Find the Exon Structure
The known frog gene is aligned to different locations in the human genome.
Find the "best" path to reveal the exon structure of the human gene.
[Figure: frog gene (known) plotted against the human genome.]

8. Finding Local Alignments
Use local alignments to find all islands of similarity.
[Figure: frog genes (known) plotted against the human genome.]

9. Chaining Local Alignments
Find substrings that match a given gene sequence (candidate exons).
Define a candidate exon as a triple (l, r, w): left position, right position, and weight, where the weight is the score of the local alignment.
Look for a maximum chain of substrings.
Chain: a set of non-overlapping, nonadjacent intervals.

10. Exon Chaining Problem
Locate the beginning and end of each interval (2n points).
Find the "best" path.
[Figure: weighted intervals drawn on a line with their 2n endpoints marked.]

11. Exon Chaining Problem
Given a set of putative exons, where each exon is represented by (l, r, w), l and r are the left- and right-hand positions, and w is the weight reflecting the likelihood that this interval is an exon (e.g., the local alignment score), find a maximum-weight set of non-overlapping putative exons.
Input: A set of weighted intervals (putative exons).
Output: A maximum chain of intervals from this set.
[Figure: weighted putative exons drawn over positions 1 to 18.]

12. Exon Chaining Algorithm
ExonChaining(G, n)
    s[0] := 0
    for i := 1 to 2n
        s[i] := 0; l[i] := 0
    for i := 1 to 2n
        if vertex v[i] in G corresponds to the right end of an interval I
            j := index of the vertex for the left end of the interval I
            w := weight of the interval I
            s[i] := max{ s[j] + w, s[i-1] }
            if s[i] = s[j] + w then l[i] := j
        else
            s[i] := s[i-1]
    return s[2n]

PrintChain(l, m)
    if m = 0
        return
    else if l[m] ≠ 0
        PrintChain(l, l[m])
        print "(" l[m] ", " m ")"
    else
        PrintChain(l, m-1)

[Figure: the intervals over positions 1 to 18, with the tables s and l filled in.]
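The pseudocode above can be sketched in Python as follows. This is one possible rendering, assuming all 2n interval endpoints are distinct (as in the slides). The function name `exon_chaining` and the interval data are made up for illustration and are not the values from the slide figure.

```python
def exon_chaining(intervals):
    """Return (best score, chain) for a list of (left, right, weight) triples.

    Assumes every endpoint is distinct, as in the 2n-vertex formulation.
    """
    points = sorted(p for l, r, _ in intervals for p in (l, r))
    index = {p: i + 1 for i, p in enumerate(points)}  # 1-based vertex ids
    # For each right-end vertex, remember its left-end vertex and weight.
    ends_at = {index[r]: (index[l], w) for l, r, w in intervals}

    size = len(points)
    s = [0] * (size + 1)     # s[i]: best chain weight using vertices 1..i
    back = [0] * (size + 1)  # back[i]: left vertex of the interval chosen at i
    for i in range(1, size + 1):
        s[i] = s[i - 1]
        if i in ends_at:
            j, w = ends_at[i]
            if s[j] + w > s[i]:
                s[i] = s[j] + w
                back[i] = j

    # Iterative version of PrintChain: walk back from the last vertex.
    chain, m = [], size
    while m > 0:
        if back[m]:
            chain.append((points[back[m] - 1], points[m - 1]))
            m = back[m]
        else:
            m -= 1
    chain.reverse()
    return s[size], chain

# Hypothetical input: nine weighted intervals over positions 1..18.
example = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10), (7, 17, 12),
           (9, 10, 1), (11, 15, 7), (13, 14, 0), (16, 18, 4)]
print(exon_chaining(example))
```

Unlike the pseudocode, this sketch records a back pointer only when taking the interval strictly improves the score, so ties are resolved in favor of skipping the interval. Both choices yield a chain of the same total weight.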

13. Exon Chaining: Deficiencies
The putative exon endpoints are poorly defined.
The optimal chain of intervals may not correspond to any valid alignment.
The first interval may correspond to a suffix of the gene, whereas the second interval may correspond to a prefix.
A combination of such intervals is not a valid alignment.

14. Infeasible Chains
Red local similarities form two non-overlapping intervals but do not form a valid global alignment.
[Figure: frog genes (known) plotted against the human genome.]

15. Gene Prediction Analogy: Selecting Putative Exons
The cell carries DNA as a blueprint for producing proteins, like a manufacturer carries a blueprint for producing a car.

16. Using Blueprint

17. Assembling Putative Exons

18. Still Assembling Putative Exons

19. Spliced Alignment
Mikhail Gelfand and colleagues proposed the spliced alignment approach, which uses a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.
It begins by selecting either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem).
This set is then filtered in a way that attempts to retain all true exons, along with some false ones.
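The first selection strategy can be illustrated with a small sketch that lists every span lying between an AG (potential acceptor site) and a later GT (potential donor site). The function name `putative_exons`, the half-open coordinate convention, and the toy sequence are assumptions for illustration. A real tool would also handle first and last exons, reading frames, and the filtering step described above.

```python
def putative_exons(genome, min_len=1):
    """Return half-open (start, end) spans between an AG and a later GT."""
    # A candidate exon starts right after an acceptor dinucleotide AG ...
    starts = [i + 2 for i in range(len(genome) - 1) if genome[i:i + 2] == "AG"]
    # ... and ends right before a donor dinucleotide GT.
    ends = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]

# Toy sequence (made up): two AG sites and two GT sites.
toy = "CAGATGGTAAGCCGTT"
for start, end in putative_exons(toy):
    print(start, end, toy[start:end])
```

The number of candidates grows with the product of the acceptor and donor counts, which is one reason the set is filtered before alignment.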

20. Spliced Alignment Problem: Formulation
Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence.
Input: Genomic sequence G, target sequence T, and a set of candidate exons B.
Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.
Γ* denotes the concatenation of all exons from the chain Γ.

21. Lewis Carroll Example
[Figure]

22. Spliced Alignment: Idea
Compute the best alignment between the i-prefix of the genomic sequence G and the j-prefix of the target T: S(i, j).
But what is the "i-prefix" of G? There may be several i-prefixes of G, depending on which block B we are in.
So compute the best alignment between the i-prefix of G and the j-prefix of T under the assumption that the alignment uses block B at position i: S(i, j, B).

23. Spliced Alignment Recurrence
If i is not the starting vertex of block B:
S(i, j, B) = max {
    S(i - 1, j, B) - indel penalty,
    S(i, j - 1, B) - indel penalty,
    S(i - 1, j - 1, B) + δ(g_i, t_j)
}
If i is the starting vertex of block B:
S(i, j, B) = max {
    S(i, j - 1, B) - indel penalty,
    max over all blocks B' preceding block B of S(end(B'), j, B') - indel penalty,
    max over all blocks B' preceding block B of S(end(B'), j - 1, B') + δ(g_i, t_j)
}
[Figure: alignment grid of G against T showing the edges into S(i, j, B) from rows i - 1 and end(B') and from columns j - 1 and j.]

24. Spliced Alignment Solution
After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is:
max over all blocks B of S(end(B), length(T), B)
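A compact sketch of this computation follows, assuming blocks are given as inclusive 0-based (start, end) positions in G and that a block B' precedes B when it ends strictly before B starts. The scoring values (match +1, mismatch -1, indel penalty 1) and the function name `spliced_alignment` are illustrative assumptions. The sketch returns only the score, not the chain.

```python
def spliced_alignment(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and a chain of blocks of G.

    `blocks` must be non-empty; each block is an inclusive (start, end) pair.
    """
    m = len(T)
    last_row = {}  # block -> [S(end(B), j, B) for j in 0..m]
    # Sorting by start guarantees that preceding blocks are finished first.
    for start, end in sorted(blocks):
        # Row "before" the block: the empty chain, or the best block
        # that ends strictly before this one starts.
        entry = [-indel * j for j in range(m + 1)]
        for (_, e2), row in last_row.items():
            if e2 < start:
                entry = [max(a, b) for a, b in zip(entry, row)]
        prev = entry
        for i in range(start, end + 1):
            cur = [prev[0] - indel] + [0] * m
            for j in range(1, m + 1):
                delta = match if G[i] == T[j - 1] else mismatch
                cur[j] = max(cur[j - 1] - indel,   # gap in G
                             prev[j] - indel,      # gap in T
                             prev[j - 1] + delta)  # match or mismatch
            prev = cur
        last_row[(start, end)] = prev
    return max(row[m] for row in last_row.values())

# Toy data (made up): blocks "ATG", "GGG", "CCA" inside G.
G = "TTATGGGGCCATT"
blocks = [(2, 4), (5, 7), (8, 10)]
print(spliced_alignment(G, "ATGCCA", blocks))
```

Using the `entry` row as the row preceding a block's first position covers both cases of the recurrence with one inner loop. The row before a starting vertex is the maximum over preceding blocks, and inside the block it is simply the previous position.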

25. Spliced Alignment: Complications
Considering multiple i-prefixes leads to a slowdown. Running time: O(mn²|B|), where m is the target length, n is the genomic sequence length, and |B| is the number of blocks.
A mosaic effect: short exons are easily combined to fit any target protein.

26. Spliced Alignment: Speedup

27. Spliced Alignment: Speedup

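28. Spliced Alignment: Speedup
P(i, j) = max over all blocks B preceding position i of S(end(B), j, B)
With P precomputed, the case where i is the starting vertex of block B becomes:
S(i, j, B) = max { S(i, j - 1, B) - indel penalty, P(i, j) - indel penalty, P(i, j - 1) + δ(g_i, t_j) }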

29. Exon Chaining vs Spliced Alignment
In Spliced Alignment, every path spells out the string obtained by concatenating the labels of its edges. The weight of the path is defined as the optimal alignment score between the concatenated labels (blocks) and the target sequence.
This defines the weight of an entire path in the graph, but not the weights of individual edges.
Exon Chaining assumes the positions and weights of exons are pre-defined.

30. Gene Prediction Tools
GENSCAN / GenomeScan
TwinScan
Glimmer
GeneMark

31. Gene Prediction: Aligning Genome vs. Genome
Align the entire human and mouse genomes.
Predict genes in both sequences simultaneously as chains of aligned blocks (exons).
This approach does not assume any annotation of either human or mouse genes.

32. The GENSCAN Algorithm
The algorithm is based on a probabilistic model of gene structure similar to Hidden Markov Models (HMMs).
GENSCAN uses a training set to estimate the HMM parameters, then returns the exon structure using the maximum likelihood approach standard to many HMM algorithms (the Viterbi algorithm).
Biological input: codon bias in coding regions and gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc.).
It covers cases where the input sequence contains no gene, a partial gene, a complete gene, or multiple genes.

33. GENSCAN Limitations
Does not use similarity search to predict genes.
Does not address alternative splicing.
Could combine two exons from consecutive genes together.

34. GenomeScan
Incorporates similarity information into GENSCAN: predicts the gene structure that corresponds to the maximum probability conditional on similarity information.
The algorithm combines two sources of information:
Probabilistic models of exons and introns.
Sequence similarity information.

35. TwinScan
Aligns two sequences and marks each base as gap (-), mismatch (:), or match (|), resulting in a new alphabet of 12 letters: Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.
Run the Viterbi algorithm using emissions e_k(b), where b ∈ {A-, A:, A|, …, T|}.
http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf

36. TwinScan (cont'd)
The emission probabilities are estimated from human/mouse gene pairs.
Example: e_I(x|) < e_E(x|), since matches are favored in exons, and e_I(x-) > e_E(x-), since gaps (as well as mismatches) are favored in introns.
Compensates for the dominant occurrence of poly-A regions in introns.
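The role of these emissions can be illustrated with a toy two-state Viterbi decoder over the 12-letter alphabet. This is not TwinScan's actual model. The states (E for exon, I for intron), the transition probabilities, and the emission values below are all made-up numbers, chosen only to satisfy the inequalities e_I(x|) < e_E(x|) and e_I(x-) > e_E(x-).

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable state path for `obs`, computed in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            best = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

states = ["E", "I"]
start = {"E": 0.5, "I": 0.5}
trans = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
# Made-up conservation preferences; bases are treated as uniform (0.25 each).
cons = {"E": {"|": 0.8, ":": 0.15, "-": 0.05},
        "I": {"|": 0.3, ":": 0.3, "-": 0.4}}
emit = {s: {b + c: 0.25 * p for b in "ACGT" for c, p in cons[s].items()}
        for s in states}

obs = ["A|", "C|", "G|", "T-", "A-", "C:", "G-"]
print("".join(viterbi(obs, states, start, trans, emit)))  # EEEIIII
```

With these numbers, the run of matches is labeled as exon and the run of gaps and mismatches as intron, which is the behavior the emission inequalities are meant to encourage.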

37. Glimmer
Gene Locator and Interpolated Markov ModelER.
Finds genes in bacterial DNA.
Uses interpolated Markov models.

38. The Glimmer Algorithm
Made of 2 programs:
BuildIMM: takes sequences as input and outputs the Interpolated Markov Models (IMMs).
Glimmer: takes the IMMs and outputs all candidate genes. It automatically resolves overlapping genes by choosing one, hence it is limited. It marks genes "suspected to truly overlap" for closer inspection by the user.

39. GeneMark
Based on non-stationary Markov chain models.
Results are displayed graphically, with coding vs. noncoding probability dependent on position in the nucleotide sequence.

40. Homework Assignment
(1) Solve the Exon Chaining Problem in slide 11 when the weight for interval (6,12) changes to 18 and the weight for interval (13,14) changes to 4.
(2) Implement the Exon Chaining Algorithm in slide 12. Print out the tables s and l and the exon chain for the problem in slide 11.