Combinational Pattern Matching - PowerPoint Presentation

Uploaded by summer (@summer) on 2023-06-22. Presented by Shashank Kadaveru.

Summary: In the Motif Finding problem no particular pattern is given to search for; we infer it from the sample. In Combinational Pattern Matching we look for exact or approximate occurrences of given patterns in a long text.



Presentation Transcript

1. Combinational Pattern Matching
Shashank Kadaveru

2. Introduction
In the Motif Finding problem, no particular pattern is given to search for; we infer it from the sample. In Combinational Pattern Matching, we look for exact or approximate occurrences of given patterns in a long text.

3. Pattern Matching
Pattern matching might seem simpler than motif finding because we know what we are looking for, but the large size of genomes makes it difficult in practice. Here we develop a number of ways to perform pattern matching in a long string.

4. Repeat Finding
Genetic diseases are associated with deletions, duplications, and rearrangements of long chromosomal regions. For example, DiGeorge syndrome, which results in an impaired immune system and heart defects, is associated with a 3 Mb deletion on human chromosome 22. The deletion removes important genes, which leads to the disease.

5. Repeat Finding
Repeats in DNA hold many evolutionary secrets. A striking and still unexplained phenomenon is the large number of repeats in many genomes. For example, repeats account for about 50% of the human genome.

6. Duplicate Removal Problem
Find all unique entries in a list of integers. For example, the list (8, 1, 5, 1, 0, 4, 5, 10, 1) has the elements 1 and 5 repeated multiple times, so the resulting list would be (8, 1, 5, 0, 4, 10).

7. Duplicate Removal Algorithm
DUPLICATEREMOVAL(a, n)
  m ← largest element of a
  for i ← 0 to m
    b[i] ← 0
  for i ← 0 to n
    b[a[i]] ← 1
  for i ← 0 to m
    if b[i] = 1
      output i
  return
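The pseudocode above translates almost line for line into Python; a minimal sketch, assuming the entries are small non-negative integers so the bin array can hold one slot per possible value:

```python
def duplicate_removal(a):
    """Output each distinct value of `a` once, using the bin-marking
    scheme above (entries assumed to be small non-negative integers)."""
    m = max(a)                       # largest element of a
    b = [0] * (m + 1)                # one bin per possible value
    for x in a:                      # mark every value that occurs
        b[x] = 1
    return [i for i in range(m + 1) if b[i] == 1]

print(duplicate_removal([8, 1, 5, 1, 0, 4, 5, 10, 1]))
# [0, 1, 4, 5, 8, 10]
```

Note that the final scan over the bins reports values in increasing order, rather than in first-occurrence order as in the slide's example output.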

8. Duplicate Removal Algorithm
An alternative approach, sorting, requires O(n log n) time for a list of size n. The bin-marking algorithm above is faster, but difficult to use when the entries are arbitrarily large: creating a list b with so many values is not efficient and may not even be possible.

9. Duplicate Removal Algorithm
If we can map entries of any size to numbers between 1 and 1000, then we can use the binning strategy efficiently. The function used for this mapping is called a hash function, and the second list is called a hash table.

10. Hash Function
A hash function should be easy to calculate and integer-valued. A simple hash function takes h(x) to be the integer remainder of x divided by |b|, i.e., h(x) = x mod |b|. We want different integers in the array a to map to different integers in b, but this is not possible when the size of the array a is larger than the size of b.

11. Hash Function
Example: if h(x) = x mod 1000, then 7, 1007, and 2007 all fall into the same bin. This is called a collision. To deal with collisions, we use chaining.

12. Chaining
Elements hashing to the same bin are organized into a linked list. Elements falling into the same bin also reveal duplicate elements, and hence chaining solves the Duplicate Removal problem.
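As a sketch of the chaining idea, the hypothetical helper below hashes with h(x) = x mod 1000 and keeps a Python list as each bin's chain; walking a chain before appending is what reveals duplicates:

```python
def find_duplicates_chaining(a, num_bins=1000):
    """Illustrative chaining: h(x) = x mod num_bins, collisions kept
    in per-bin lists. Returns the set of duplicated values."""
    bins = [[] for _ in range(num_bins)]
    duplicates = set()
    for x in a:
        chain = bins[x % num_bins]   # the bin this value hashes to
        if x in chain:               # walking the chain finds a repeat
            duplicates.add(x)
        else:
            chain.append(x)
    return duplicates

print(find_duplicates_chaining([7, 1007, 2007, 7, 42]))
# {7}
```

Here 7, 1007, and 2007 all land in bin 7 (a collision), but only the repeated 7 is reported as a duplicate.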

13. Chaining
This concept applies to many other problems, e.g., finding exact repeats of an l-mer in a genome. To find exact repeats for large values such as l = 40, we need not construct a table with 4^40 entries; we can design a suitable hash function and use a table of much smaller size.
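To illustrate the l-mer application, the sketch below uses Python's built-in dict as the hash table (its hash function standing in for a hand-designed one) and reports every repeated l-mer with its 0-based start positions:

```python
from collections import defaultdict

def repeated_lmers(genome, l):
    """Find every l-mer occurring more than once in `genome`.
    The dict plays the role of the (chained) hash table."""
    table = defaultdict(list)
    for i in range(len(genome) - l + 1):
        table[genome[i:i + l]].append(i)      # chain of positions per l-mer
    return {lmer: pos for lmer, pos in table.items() if len(pos) > 1}

print(repeated_lmers("ATGGTCGGT", 3))
# {'GGT': [2, 6]}
```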

14. Exact Pattern Matching
Given a pattern p = p1 p2 . . . pn and a text t = t1 t2 . . . tm, the Pattern Matching problem is to find any or all occurrences of the pattern p in the text t. A brute-force algorithm can solve the Pattern Matching problem.

15. Exact Pattern Matching
PATTERNMATCHING(p, t)
  n ← length of pattern p
  m ← length of text t
  for i ← 1 to m − n + 1
    if ti ti+1 . . . ti+n−1 = p
      output i
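A direct Python rendering of the brute-force scan, reporting 1-based positions as in the pseudocode:

```python
def pattern_matching(p, t):
    """Brute force: compare p against every length-n window of t.
    Positions are reported 1-based, matching the pseudocode above."""
    n, m = len(p), len(t)
    return [i + 1 for i in range(m - n + 1) if t[i:i + n] == p]

print(pattern_matching("GGT", "ATGGTCGGT"))
# [3, 7]
```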

16. Exact Pattern Matching
The worst-case running time is O(nm). Such a worst-case scenario can be seen by searching for p = AAAAT in t = AAAA. . .AAAAA. Peter Weiner invented a data structure called the suffix tree that solves the Pattern Matching problem in linear time, O(m), for any text and pattern. To understand it, we first need to learn about keyword trees.

17. Keyword Trees
Multiple Pattern Matching Problem: given a set of patterns and a text, find all occurrences of any of the patterns in the text. For example, if t = ATGGTCGGT and p1 = GGT, p2 = GGG, p3 = ATG, and p4 = CG, then the result of an algorithm that solves the Multiple Pattern Matching problem would be positions 1, 3, 6, and 7.

18. Keyword Trees
The Multiple Pattern Matching problem with k patterns can be reduced to k individual Pattern Matching problems and solved in O(knm) time. However, there exists an even faster way to solve this problem, in O(n + m) time, where n is the total length of the patterns p1, p2, . . . , pk.

19. Keyword Trees
The keyword tree for the set of patterns apple, apropos, banana, bandana, and orange is shown in the figure.

20. Keyword Trees
The keyword tree has at most N vertices, where N is the total length of all patterns, but may have fewer. One can construct the keyword tree in O(N) time by progressively extending it with one pattern at a time.
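As an illustration of a keyword tree, the sketch below builds a nested-dict trie (with "$" marking a pattern's end) and threads the text through it from every start position. Note this naive threading is O(nm) in the worst case, not the O(n + m) Aho-Corasick-style bound mentioned above, and it reports only the first pattern completed at each position:

```python
def build_keyword_tree(patterns):
    """Nested-dict trie; '$' marks where a pattern ends."""
    root = {}
    for p in patterns:
        node = root
        for c in p:
            node = node.setdefault(c, {})
        node["$"] = True
    return root

def multiple_pattern_matching(patterns, t):
    """Thread the text through the keyword tree from every start position,
    reporting 1-based positions where some pattern begins."""
    root = build_keyword_tree(patterns)
    hits = []
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            if c not in node:
                break
            node = node[c]
            if "$" in node:            # a pattern ends here
                hits.append(i + 1)
                break
    return hits

print(multiple_pattern_matching(["GGT", "GGG", "ATG", "CG"], "ATGGTCGGT"))
# [1, 3, 6, 7]
```

This reproduces the example from slide 17.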

21. Suffix Trees
Suffix trees allow one to preprocess a text in such a way that, for any pattern of length n, one can answer whether or not it occurs in the text using only O(n) time, independent of the length of the text.

22. Suffix Trees
Patterns: proud, perfect, muggle, ugly, rivet
t = "mr and mrs dursley of number 4 privet drive were proud to say that they were perfectly normal thank you very much"

23. Suffix Trees

24. Suffix Trees

25. Suffix Trees
A suffix tree for a text of length m can be constructed in O(m) time, which immediately yields a linear O(n + m) algorithm for the Pattern Matching problem: construct the suffix tree for t in O(m) time, then check whether p occurs in the tree, which requires O(n) time.

26. Suffix Trees
The suffix tree for a text t = t1 . . . tm is a rooted labeled tree with m leaves (numbered from 1 to m) satisfying the following conditions: each edge is labeled with a substring of the text; each internal vertex (except possibly the root) has at least 2 children; any two edges out of the same vertex start with a different letter; and every suffix of text t is spelled out on a path from the root to some leaf.

27. Suffix Trees

28. Suffix Trees
SUFFIXTREEPATTERNMATCHING(p, t)
  Build the suffix tree for text t
  Thread pattern p through the suffix tree
  if threading is complete
    output positions of every p-matching leaf in the tree
  else
    output "pattern does not appear anywhere in the text"
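A compressed suffix tree with O(m) construction is beyond a short sketch, so the illustration below builds a naive suffix trie instead (O(m^2) construction, every suffix inserted character by character) and threads the pattern through it; the threading step itself is the O(n) occurrence test described above:

```python
def build_suffix_trie(t):
    """Naive suffix trie: insert every suffix of t into a nested dict.
    A stand-in for a real suffix tree, which compresses these paths."""
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root

def occurs(trie, p):
    """Thread p from the root; complete threading means p occurs in t."""
    node = trie
    for c in p:
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_suffix_trie("ATGGTCGGT")
print(occurs(trie, "GTC"), occurs(trie, "GTA"))
# True False
```

Every substring of t is a prefix of some suffix, so threading succeeds exactly when p occurs in t.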

29. Heuristic Similarity Search Algorithms
The suffix tree algorithm above, while fast, can only find exact, rather than approximate, occurrences of a gene in a database. When we are trying to find an approximate match to a gene of length 10^3 in a database of size 10^10, dynamic programming algorithms (like the Smith-Waterman local alignment algorithm) may be too slow.

30. Approximate Pattern Matching
The Approximate Pattern Matching problem is to find all approximate occurrences of a pattern in a text. The naive brute-force algorithm for approximate pattern matching runs in O(nm) time. The following algorithm outputs all locations in t where p occurs with no more than k mismatches.

31. Approximate Pattern Matching
APPROXIMATEPATTERNMATCHING(p, t, k)
  n ← length of pattern p
  m ← length of text t
  for i ← 1 to m − n + 1
    dist ← 0
    for j ← 1 to n
      if ti+j−1 ≠ pj
        dist ← dist + 1
    if dist ≤ k
      output i
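The same mismatch-counting scan in Python, again with 1-based output positions:

```python
def approximate_pattern_matching(p, t, k):
    """Report every 1-based position where p matches t
    with at most k mismatches (Hamming distance)."""
    n, m = len(p), len(t)
    hits = []
    for i in range(m - n + 1):
        dist = sum(1 for j in range(n) if t[i + j] != p[j])
        if dist <= k:
            hits.append(i + 1)
    return hits

print(approximate_pattern_matching("GGT", "ATGGTCGGT", 1))
# [3, 7]
```

With k = 0 this degenerates to exact pattern matching; raising k admits more windows (k = 2 on the same inputs also accepts TGG, GTC, and CGG).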

32. Approximate Pattern Matching
In 1985 Gadi Landau and Udi Vishkin found an algorithm for approximate string matching with O(km) worst-case running time. Although this algorithm yields the best known worst-case performance, it is not necessarily the best in practice.

33. Heuristic Similarity Search Algorithms
Many heuristics for fast database search in molecular biology use the same filtration idea. In 1973 Donald Knuth suggested a method for pattern matching with one mismatch based on the observation that strings differing by a single mismatch must match exactly in either their first or their second half.
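Knuth's observation turns into a small filter-and-verify sketch (an assumed helper, not a quoted algorithm): only windows whose first or second half matches exactly are verified by full mismatch counting, with 1-based output positions:

```python
def one_mismatch_candidates(p, t):
    """Filtration: if p occurs in t with at most one mismatch, then
    either its first half or its second half occurs exactly.
    Filter on the halves, then verify each surviving window."""
    n = len(p)
    half = n // 2
    first, second = p[:half], p[half:]
    hits = set()
    for i in range(len(t) - n + 1):
        window = t[i:i + n]
        # filtration step: cheap exact test on either half
        if window[:half] == first or window[half:] == second:
            # verification step: count mismatches in the full window
            if sum(a != b for a, b in zip(window, p)) <= 1:
                hits.add(i + 1)
    return sorted(hits)

print(one_mismatch_candidates("GGTT", "ATGGTCGGT"))
# [3]
```

The filter discards most windows before the (more expensive) verification runs, which is the essence of filtration-based search.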

34. Heuristic Similarity Search Algorithms
In 1985 the idea of filtration in computational molecular biology was used by David Lipman and Bill Pearson in their FASTA algorithm. It was further developed in BLAST, now the dominant database search tool in molecular biology.

35. Heuristic Similarity Search Algorithms
Biologists frequently depict similarities between two sequences in the form of dot matrices. A dot matrix is simply a matrix with each entry either 0 or 1, where a 1 at position (i, j) indicates some similarity between the ith position of the first sequence and the jth position of the second sequence.
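A dot matrix is straightforward to compute; a minimal sketch using character equality as the similarity criterion:

```python
def dot_matrix(s, u):
    """0/1 dot matrix: entry (i, j) is 1 when s[i] == u[j]."""
    return [[1 if a == b else 0 for b in u] for a in s]

for row in dot_matrix("ATG", "ATGA"):
    print(row)
# [1, 0, 0, 1]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
```

Diagonal runs of 1s correspond to shared substrings, which is why seed-based heuristics talk about matches "on the same diagonal".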

36. Heuristic Similarity Search Algorithms

37. BLAST: Comparing a Sequence against a Database
Using shared l-mers to find similarities, as FASTA does, has some disadvantages. For example, two proteins can have different amino acid sequences but still be biologically similar. A common construct in many heuristic similarity search algorithms, including BLAST, is a scoring matrix similar to the scoring matrices introduced in chapter 6. These scoring matrices reveal similarities between proteins even if they do not share a single l-mer.

38. BLAST: Comparing a Sequence against a Database
BLAST uses scoring matrices to improve the efficiency of filtration and to introduce more accurate rules for locating potential matches. Another powerful feature of BLAST is its use of Altschul-Dembo-Karlin statistics for estimating the statistical significance of found matches.

39. BLAST: Comparing a Sequence against a Database
A segment pair is just a pair of l-mers, one from each sequence. The maximal segment pair is a segment pair with the best score over all segment pairs in the two sequences. A segment pair is locally maximal if its score cannot be improved either by extending or by shortening both segments.

40. BLAST: Comparing a Sequence against a Database
BLAST attempts to find all locally maximal segment pairs in the query and database sequences with scores above some set threshold. A fast algorithm for finding such l-mers is the key ingredient of BLAST.

41. BLAST: Comparing a Sequence against a Database
An important observation is that if the threshold is high enough, then the set of all l-mers that have scores above the threshold is not too large. In this case the database can be searched for exact occurrences of the strings from this set using a multiple pattern matching algorithm. After the potential matches are located, BLAST attempts to extend them to see whether the resulting score is above the threshold.
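A toy sketch in the spirit of this seed-and-extend strategy, not BLAST itself: it seeds only on exact shared l-mers, extends in both directions while characters still match (no scoring matrix, no X-drop heuristic), and keeps segment pairs whose length clears a threshold; every parameter below is illustrative:

```python
def seed_and_extend(query, db, l=3, threshold=4):
    """Toy seed-and-extend: index the query's l-mers, seed on exact hits
    in db, extend both ways while characters match, keep long segments.
    Returns sorted (query_start, db_start, segment) triples, 0-based."""
    seeds = {}
    for i in range(len(query) - l + 1):
        seeds.setdefault(query[i:i + l], []).append(i)
    hits = set()
    for j in range(len(db) - l + 1):
        for i in seeds.get(db[j:j + l], []):         # exact seed hit
            qs, ds, qe, de = i, j, i + l, j + l
            while qs > 0 and ds > 0 and query[qs - 1] == db[ds - 1]:
                qs, ds = qs - 1, ds - 1              # extend left
            while qe < len(query) and de < len(db) and query[qe] == db[de]:
                qe, de = qe + 1, de + 1              # extend right
            if qe - qs >= threshold:
                hits.add((qs, ds, query[qs:qe]))
    return sorted(hits)

print(seed_and_extend("ACGTACGT", "TTACGTAGG"))
# [(0, 2, 'ACGTA'), (3, 1, 'TACGT')]
```

Real BLAST seeds on l-mers whose scoring-matrix score clears a threshold (not only exact matches) and extends using the scoring matrix; the control flow above is only the skeleton of that idea.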

42. BLAST: Comparing a Sequence against a Database
In recent years BLAST has been further improved by allowing insertions and deletions and by combining matches on the same and close diagonals. BLAST reports matches to sequences that either have one segment score above the threshold or that have several closely located segment pairs that are statistically significant when combined.

43. BLAST: Comparing a Sequence against a Database
According to the Altschul-Dembo-Karlin statistics, the number of matches with scores above θ is approximately Poisson-distributed, with mean

  E(θ) = K m n e^(−λθ)

where m and n are the lengths of the compared sequences and K is a constant. The parameter λ is the positive root of the equation

  Σ_{x,y ∈ A} p_x p_y e^(λ δ(x,y)) = 1

44. BLAST: Comparing a Sequence against a Database
where p_x and p_y are the frequencies of amino acids x and y from the twenty-letter alphabet A, and δ is the scoring matrix. The probability that there is a match of score greater than θ between two "random" sequences of length n and m is computed as 1 − e^(−E(θ)).

45. Any Questions?
Thank You