Introduction to Dynamic Programming - PowerPoint Presentation

Uploaded On 2023-07-08

Presentation Transcript

1. Introduction to Dynamic Programming
The sequence alignment problem
Wilson Leung, 08/2015

2. Outline
- Overview of the sequence alignment problem
- Calculate the optimal global alignment
- Characteristics of dynamic programming algorithms
- Calculate the optimal local alignment

3. Learning objectives
- Understand the theory behind sequence alignment
- Become a better-informed user of NCBI BLAST
- This presentation will not cover:
  - The BLAST algorithm
  - Parameter optimizations
  - Statistics for similarity searches (Karlin-Altschul theory)
- Reference: Korf, I., Yandell, M., and Bedell, J. (2003). BLAST. O’Reilly Media, Inc.

4. Design goals
- Generate an alignment between two sequences
- Identify the “best” (most parsimonious) alignment
- Generate the best alignment “quickly”

5. Strategy #1: Visual inspection
- Sequences must have high percent identity
- Applications:
  - PAM scoring matrix (align sequences with >= 85% identity)
  - Align mononucleotide runs during sequence improvement

    Query:   ATTACCAG
             || |||||
    Subject: ATCACCAG

6. Strategy #2: Enumerate all alignments
- Guaranteed to find the best alignment
- Does not scale
  - Combinatorial explosion
  - Two 300 bp sequences have ~10^179 possible alignments (Eddy 2004)
- Brute-force algorithm
  - Establish baseline performance and test cases
  - Identify patterns in the problem space

7. Apply the brute-force algorithm to a single column of the alignment
- Three possible alignments for two 1 bp sequences
  - Query length (M) = 1; Subject length (N) = 1
- Only two biological interpretations:
  - A in the query is homologous to A in the subject
  - A in the query is not homologous to A in the subject

                 Homologous    Not homologous
    Query:       A             A-    -A
    Subject:     A             -A    A-

8. Six possible relationships between the query and subject for M=2, N=2
[Figure: the 13 possible alignments of query AT with subject AT, grouped by the number of aligned bases: one alignment with 2 aligned bases, six with 1 aligned base, and six with 0 aligned bases. Each color denotes a different evolutionary relationship.]

9. Observations from the brute-force alignment strategy
- Many of the possible alignments are redundant
  - Imply the same evolutionary relationship
- Large number of possible alignments
  - 13 possible alignments for sequences of length 2
- Can ignore many possible alignments
  - Many are suboptimal compared to the best alignment
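The counts quoted in the slides (three alignments for two 1 bp sequences, 13 for two 2 bp sequences) can be reproduced with a short recursion. This is a minimal sketch that assumes every alignment column is one of three types (aligned pair, gap in query, gap in subject); the function name `count_alignments` is hypothetical and does not come from the presentation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count alignments of a length-m query with a length-n subject.

    Assumption: each column is an aligned pair, a gap in the query,
    or a gap in the subject (illustrative helper, not from the slides).
    """
    if m == 0 or n == 0:
        # Only one way to finish: pair every remaining base with a gap.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column aligns two bases
        + count_alignments(m - 1, n)    # last column: query base over a gap
        + count_alignments(m, n - 1)    # last column: gap over a subject base
    )


if __name__ == "__main__":
    for length in (1, 2, 3):
        print(length, count_alignments(length, length))
```

Even this redundant count grows very quickly with sequence length, which is why the later slides look for ways to prune the search space.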

10. Strategy #3: Dot plot
- Cell position (i,j):
  - i = Query position (x-axis)
  - j = Subject position (y-axis)
- Draw a dot at (i,j) if the two bases are identical
- Connect the dots to make a line (alignment)
- Level of noise depends on repeat density
  - Use longer words and higher cutoff scores to reduce noise
[Figure: dot plot with the query on the x-axis and the subject on the y-axis; line segments are labeled Align, Insertion in subject, and Deletion in subject.]
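The dot plot rule (draw a dot at (i,j) when the two bases are identical) can be sketched in a few lines. This example assumes a word size of 1 and no cutoff score, so it shows the noisiest form of the plot; the function name `dot_plot` is hypothetical, and the two sequences are the ones used in the worked alignment example in the slides.

```python
def dot_plot(query: str, subject: str) -> list:
    """Return one text row per subject base; '*' marks identical bases.

    Columns follow the query (x-axis) and rows follow the subject (y-axis).
    Assumes word size 1 and no score cutoff (illustrative only).
    """
    return [
        "".join("*" if q == s else "." for q in query)
        for s in subject
    ]


if __name__ == "__main__":
    query, subject = "TGCTCGTA", "TTCATA"
    print("  " + query)
    for base, row in zip(subject, dot_plot(query, subject)):
        print(base + " " + row)
```

Diagonal runs of `*` correspond to stretches that could be aligned without gaps; isolated dots are the noise that longer words and higher cutoffs would suppress.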

11. Assessment of the three sequence alignment strategies
- Infeasible to examine all possible alignments
  - Need to reduce the search space
- Only a small subset of alignments are “interesting”
  - Many alignments are redundant
- Connect the dots in the dot plot to create an alignment
  - Consider the cumulative levels of similarity

12. The optimal alignment is composed of smaller optimal alignments
- Only the best alignment at each position could be part of the final optimal alignment
[Figure: alignment matrix for query AT and subject AT showing the three possible steps (Align, Insertion in subject, Deletion in subject) and the partial alignments they produce.]

13. Partition the alignment problem into smaller subproblems
- Assume the query and subject sequences are the same
[Figure: alignment matrix with the query on the x-axis and the subject on the y-axis.]

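14. Three different ways to reach cell (i,j) in the alignment matrix
- From (i-1, j-1): align with subject (column: A over A)
- From (i, j-1): gap in query (column: - over A)
- From (i-1, j): gap in subject (column: A over -)
[Figure: alignment matrix with the query on the x-axis and the subject on the y-axis; each arrow into cell (i,j) represents an alignment step.]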

15. Construct a scoring system to measure similarity between two sequences
- Scoring system for the aligned state: 𝛔
  - 𝛔(a, b) = Score for aligning a in query with b in subject
  - 𝛔(A, A) = Bonus for aligning A in query with A in subject
  - 𝛔(A, T) = Penalty for aligning A in query with T in subject
- Penalty for adding a gap: 𝛾
- More sophisticated scoring systems take transitions, transversions, and affine gap penalties into account
- Reference: Pearson WR. Selecting the Right Similarity-Scoring Matrix. Curr Protoc Bioinformatics. 2013;43:3.5.1-3.5.9.
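To make the scoring system concrete, the sketch below scores an alignment that has already been written out as two gapped rows. It assumes the values used in the worked example in the slides (Match = +5, Mismatch = -2, Gap = -6) and a simple linear gap penalty; the names `sigma` and `score_alignment` are hypothetical.

```python
MATCH, MISMATCH, GAP = 5, -2, -6  # values from the worked example in the slides


def sigma(a: str, b: str) -> int:
    """Score for aligning base a in the query with base b in the subject."""
    return MATCH if a == b else MISMATCH


def score_alignment(query_row: str, subject_row: str) -> int:
    """Sum the column scores of a gapped alignment ('-' denotes a gap)."""
    if len(query_row) != len(subject_row):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(query_row, subject_row):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        # Linear gap penalty: every gap column costs the same amount.
        total += GAP if "-" in (a, b) else sigma(a, b)
    return total


if __name__ == "__main__":
    print(score_alignment("TGCTCGTA", "T--TCATA"))
```

An affine gap penalty, as mentioned in the slide, would instead charge a different cost for opening a gap than for extending it; that variant is not shown here.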

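16. Recursive definition for the optimal cumulative alignment score S(i,j)
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) & \text{(align)} \\
S(i-1,j) + \gamma & \text{(gap in subject)} \\
S(i,j-1) + \gamma & \text{(gap in query)}
\end{cases}
$$
where a is the query base at position i and b is the subject base at position j.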

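17. Determine the best way to reach cell (i,j) if it were part of the optimal alignment
[Figure: the three candidate steps into cell (i,j) of the query/subject matrix, extending the optimal alignment so far: Align (a over b), Gap in subject (a over -), and Gap in query (- over b). S(i,j) is the maximum over these three options.]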

18. Use the maximum score at each cell to eliminate entire branches of suboptimal alignments
[Figure: paths into cell (i,j), labeled Align, Gap in query, and Gap in subject.]

19. Cumulative score S(i,j) encapsulates the alignment decisions up to position (i,j)
- All potential optimal alignments that go through cell (i,j) have the same ancestry
  - Re-use the cumulative alignment score (memoization)
- Gaps are described by the cumulative score
  - Do not affect the coordinates of the alignment matrix
- Do not know the optimal alignment until we complete the entire alignment matrix
  - Optimal alignment has the highest cumulative score

20. Needleman-Wunsch algorithm (global alignment)
(Query length: M; Subject length: N)
- Construct an (M+1) x (N+1) matrix
  - Extra column and row = gaps at the beginning of the alignment
- Fill in the cells in the first row and first column with the cumulative gap costs
- Calculate the maximum score for subsequent cells (i,j)
- Keep track of the decision that leads to the maximum score (S)
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) \\
S(i-1,j) + \gamma \\
S(i,j-1) + \gamma
\end{cases}
$$
- Reference: Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 Mar;48(3):443-53.
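The matrix-filling steps listed above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a linear gap penalty and the scoring values from the worked example (Match = +5, Mismatch = -2, Gap = -6), indexes the matrix as `S[i][j]` with i for the query and j for the subject, and uses the hypothetical function name `needleman_wunsch_matrix`.

```python
def needleman_wunsch_matrix(query, subject, match=5, mismatch=-2, gap=-6):
    """Fill the (M+1) x (N+1) global alignment matrix S[i][j].

    i indexes the query (0..M) and j indexes the subject (0..N).
    """
    m, n = len(query), len(subject)
    S = [[0] * (n + 1) for _ in range(m + 1)]

    # First row and column: cumulative cost of leading gaps.
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,  # align
                S[i - 1][j] + gap,      # gap in subject
                S[i][j - 1] + gap,      # gap in query
            )
    return S


if __name__ == "__main__":
    S = needleman_wunsch_matrix("TGCTCGTA", "TTCATA")
    print("optimal global score:", S[-1][-1])
```

This version records only the scores. To recover the alignment itself, the decision taken at each cell also has to be stored or recomputed during traceback.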

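21. Initialize the alignment matrix (Match = +5; Mismatch = -2; Gap = -6)
Columns: query TGCTCGTA (positions 0-8); rows: subject TTCATA (positions 0-6). Cells marked "." are not yet filled. (Eddy, 2004)

             -    T    G    C    T    C    G    T    A
    -        0   -6  -12  -18  -24  -30  -36  -42  -48
    T       -6    .    .    .    .    .    .    .    .
    T      -12    .    .    .    .    .    .    .    .
    C      -18    .    .    .    .    .    .    .    .
    A      -24    .    .    .    .    .    .    .    .
    T      -30    .    .    .    .    .    .    .    .
    A      -36    .    .    .    .    .    .    .    .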

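22. Calculate the possible scores for the cell at position (1,1)
- Query base at position 1: T; subject base at position 1: T
- Neighboring cells: S(0,0) = 0; S(1,0) = -6; S(0,1) = -6
$$
S(1,1) = \max
\begin{cases}
S(0,0) + \sigma(\mathrm{T},\mathrm{T}) & \text{(align)} \\
S(1,0) + \gamma & \text{(gap in query)} \\
S(0,1) + \gamma & \text{(gap in subject)}
\end{cases}
$$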

23. Calculate the optimal score for the cell at position (1,1)
(Match = +5; Mismatch = -2; Gap = -6)
- Align: S(0,0) + 𝛔(T,T) = 0 + (+5) = 5
- Gap in query: S(1,0) + 𝛾 = -6 + (-6) = -12
- Gap in subject: S(0,1) + 𝛾 = -6 + (-6) = -12
- S(1,1) = max {5, -12, -12} = 5

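24. Calculate the possible scores for the cell at position (2,1)
- Query base at position 2: G; subject base at position 1: T
- Neighboring cells: S(1,0) = -6; S(2,0) = -12; S(1,1) = 5
$$
S(2,1) = \max
\begin{cases}
S(1,0) + \sigma(\mathrm{G},\mathrm{T}) & \text{(align)} \\
S(2,0) + \gamma & \text{(gap in query)} \\
S(1,1) + \gamma & \text{(gap in subject)}
\end{cases}
$$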

25. Calculate the optimal score for the cell at position (2,1)
(Match = +5; Mismatch = -2; Gap = -6)
- Align: S(1,0) + 𝛔(G,T) = -6 + (-2) = -8
- Gap in query: S(2,0) + 𝛾 = -12 + (-6) = -18
- Gap in subject: S(1,1) + 𝛾 = 5 + (-6) = -1
- S(2,1) = max {-8, -18, -1} = -1

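26. Alignment matrix after two iterations (Match = +5; Mismatch = -2; Gap = -6)
Columns: query TGCTCGTA; rows: subject TTCATA. S(1,1) = 5 was reached by an align step and S(2,1) = -1 by a gap in subject.

             -    T    G    C    T    C    G    T    A
    -        0   -6  -12  -18  -24  -30  -36  -42  -48
    T       -6    5   -1    .    .    .    .    .    .
    T      -12    .    .    .    .    .    .    .    .
    C      -18    .    .    .    .    .    .    .    .
    A      -24    .    .    .    .    .    .    .    .
    T      -30    .    .    .    .    .    .    .    .
    A      -36    .    .    .    .    .    .    .    .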

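27. Calculate the optimal score for the cell at position (3,1)
(Match = +5; Mismatch = -2; Gap = -6)
- Query base at position 3: C; subject base at position 1: T
- Align: S(2,0) + 𝛔(C,T) = -12 + (-2) = -14
- Gap in query: S(3,0) + 𝛾 = -18 + (-6) = -24
- Gap in subject: S(2,1) + 𝛾 = -1 + (-6) = -7
- S(3,1) = max {-14, -24, -7} = -7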

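28. Matrix after three iterations (Match = +5; Mismatch = -2; Gap = -6)
Columns: query TGCTCGTA; rows: subject TTCATA. S(3,1) = -7 was reached by a gap in subject.

             -    T    G    C    T    C    G    T    A
    -        0   -6  -12  -18  -24  -30  -36  -42  -48
    T       -6    5   -1   -7    .    .    .    .    .
    T      -12    .    .    .    .    .    .    .    .
    C      -18    .    .    .    .    .    .    .    .
    A      -24    .    .    .    .    .    .    .    .
    T      -30    .    .    .    .    .    .    .    .
    A      -36    .    .    .    .    .    .    .    .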

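29. Calculate the optimal score for the cell at position (1,2)
(Match = +5; Mismatch = -2; Gap = -6)
- Query base at position 1: T; subject base at position 2: T
- Align: S(0,1) + 𝛔(T,T) = -6 + (+5) = -1
- Gap in query: S(1,1) + 𝛾 = 5 + (-6) = -1
- Gap in subject: S(0,2) + 𝛾 = -12 + (-6) = -18
- S(1,2) = max {-1, -1, -18} = -1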

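30. Complete alignment matrix (Match = +5; Mismatch = -2; Gap = -6)
Columns: query TGCTCGTA (positions 0-8); rows: subject TTCATA (positions 0-6). In the original figure, each cell is colored by the step that produced it (Align, Gap in query, Gap in subject).

             -    T    G    C    T    C    G    T    A
    -        0   -6  -12  -18  -24  -30  -36  -42  -48
    T       -6    5   -1   -7  -13  -19  -25  -31  -37
    T      -12   -1    3   -3   -2   -8  -14  -20  -26
    C      -18   -7   -3    8    2    3   -3   -9  -15
    A      -24  -13   -9    2    6    0    1   -5   -4
    T      -30  -19  -15   -4    7    4   -2    6    0
    A      -36  -25  -21  -10    1    5    2    0   11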

31. Use traceback to recover the optimal alignment
- Start from the cell within the last row and last column that has the highest score
- Recall the step (color) that leads to this optimal score
  - Report this step in the alignment output
  - All the alignment decisions have already been made
- Repeat until we reach the beginning of the sequence
- Two options if multiple paths produce the same score
  - Report only one of the paths (pick arbitrarily)
  - Report all paths with the optimal score
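The traceback procedure can be sketched by storing, for each cell, the step that produced its score and then walking those steps backwards. This is an illustrative sketch under several assumptions: a linear gap penalty, the scoring values from the worked example, traceback starting at the bottom-right cell (M,N), and ties broken arbitrarily in favor of the align step so that only one path is reported. The name `global_align` and the step labels are hypothetical.

```python
def global_align(query, subject, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned_query, aligned_subject) for a global alignment."""
    m, n = len(query), len(subject)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    step = [[""] * (n + 1) for _ in range(m + 1)]

    # Leading gaps: first column consumes query bases, first row subject bases.
    for i in range(1, m + 1):
        score[i][0], step[i][0] = i * gap, "gap_in_subject"
    for j in range(1, n + 1):
        score[0][j], step[0][j] = j * gap, "gap_in_query"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "align"),
                (score[i - 1][j] + gap, "gap_in_subject"),
                (score[i][j - 1] + gap, "gap_in_query"),
            ]
            # max() keeps the first best candidate, so ties prefer "align".
            score[i][j], step[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right cell to the origin.
    i, j = m, n
    q_row, s_row = [], []
    while i > 0 or j > 0:
        move = step[i][j]
        if move == "align":
            q_row.append(query[i - 1]); s_row.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif move == "gap_in_subject":
            q_row.append(query[i - 1]); s_row.append("-")
            i -= 1
        else:  # gap_in_query
            q_row.append("-"); s_row.append(subject[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(q_row)), "".join(reversed(s_row))


if __name__ == "__main__":
    total, q, s = global_align("TGCTCGTA", "TTCATA")
    print(total)
    print("Query:  ", q)
    print("Subject:", s)
```

Reporting all co-optimal paths, the second option in the slide, would require storing every step that ties for the maximum at each cell rather than a single one.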

32. [Figure: completed global alignment matrix for query TGCTCGTA and subject TTCATA with the traceback in progress. Scores along the traceback path: 0, 5, -1, -7, -2, 3, 1, 6, 11.]
Alignment recovered so far:

    Query:   TCGTA
    Subject: TCATA

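33. Calculate the optimal score for the cell at position (5,3)
(Match = +5; Mismatch = -2; Gap = -6)
- Query base at position 5: C; subject base at position 3: C
- Align: S(4,2) + 𝛔(C,C) = -2 + (+5) = 3
- Gap in query: S(5,2) + 𝛾 = -8 + (-6) = -14
- Gap in subject: S(4,3) + 𝛾 = 2 + (-6) = -4
- S(5,3) = max {3, -14, -4} = 3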

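34. Traceback must follow the steps that produce the optimal cumulative global alignment score
[Figure: cell (5,3) = 3 (query C, subject C) with its neighbors S(4,2) = -2, S(5,2) = -8, and S(4,3) = 2; the traceback follows the align step back to (4,2).]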

35. [Figure: completed global alignment matrix for query TGCTCGTA and subject TTCATA with the full traceback path. Scores along the traceback path: 0, 5, -1, -7, -2, 3, 1, 6, 11.]
Optimal global alignment:

    Query:   TGCTCGTA
    Subject: T--TCATA

36. The Needleman-Wunsch algorithm is an example of a dynamic programming algorithm
- Problem must satisfy two criteria:
  - Optimal substructure
    - Optimal solution to the complete problem is composed of optimal solutions to the subproblems
  - Overlapping subproblems
    - Re-use the results for the subproblems (e.g., lookup table)
- Many bioinformatics problems satisfy these criteria
  - Sequence alignment, gene prediction, RNA folding
- Reference: Bellman B. The theory of dynamic programming. Bulletin of the American Mathematical Society. 1954; 60(6):503–516.
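The two criteria can be seen directly in a top-down version of the alignment recursion. The sketch below is an assumption-laden illustration rather than the presenter's code: it uses Python's `functools.lru_cache` as the lookup table, the scoring values from the worked example, and the hypothetical name `optimal_score`. It also reports how many distinct subproblems were actually computed.

```python
from functools import lru_cache


def optimal_score(query, subject, match=5, mismatch=-2, gap=-6):
    """Return (optimal global score, number of distinct subproblems solved)."""

    @lru_cache(maxsize=None)  # lookup table: each (i, j) is solved only once
    def S(i, j):
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        sub = match if query[i - 1] == subject[j - 1] else mismatch
        # Optimal substructure: S(i, j) is built from optimal smaller scores.
        return max(S(i - 1, j - 1) + sub, S(i - 1, j) + gap, S(i, j - 1) + gap)

    best = S(len(query), len(subject))
    return best, S.cache_info().misses


if __name__ == "__main__":
    best, solved = optimal_score("TGCTCGTA", "TTCATA")
    print(best, solved)
```

Without the cache, the same (i, j) cells would be recomputed along every path that reaches them; with it, the work is bounded by the (M+1) x (N+1) cells of the matrix. The recursive form is only practical for short sequences because of Python's recursion depth limit.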

37. Smith-Waterman algorithm (local alignment)
(Query length: M; Subject length: N)
- Three changes to the Needleman-Wunsch algorithm:
  - The minimum score for a cell is zero
    - Initiate a new alignment when the cumulative score is negative
  - Begin traceback from the cell within the entire matrix that has the highest score
  - Terminate traceback when the score is zero
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) \\
S(i-1,j) + \gamma \\
S(i,j-1) + \gamma \\
0
\end{cases}
$$
- Reference: Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7.
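The three changes can be sketched as a small variation on the global algorithm. This is a minimal illustration that assumes a linear gap penalty and the scoring values from the worked example; it recomputes the step during traceback instead of storing it, reports only one highest-scoring local alignment, and uses the hypothetical name `local_align`.

```python
def local_align(query, subject, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned_query, aligned_subject) for one best local alignment."""
    m, n = len(query), len(subject)
    # Change 1: first row/column are zero and no cell drops below zero.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align
                score[i - 1][j] + gap,      # gap in subject
                score[i][j - 1] + gap,      # gap in query
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Change 2: start traceback at the highest-scoring cell in the matrix.
    i, j = best_cell
    q_row, s_row = [], []
    # Change 3: stop as soon as the score reaches zero.
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == subject[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            q_row.append(query[i - 1]); s_row.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            q_row.append(query[i - 1]); s_row.append("-")
            i -= 1
        else:
            q_row.append("-"); s_row.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(q_row)), "".join(reversed(s_row))


if __name__ == "__main__":
    print(local_align("TGCTCGTA", "TTCATA"))
```

With the same two sequences as the global example, the local alignment drops the poorly matching prefix instead of paying for the two gaps.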

38. Global versus local alignments
- Global alignment
  - Optimal alignment along the entire length of two sequences
  - Compare protein sequences to identify orthologs
- Local alignment
  - Optimal alignment between parts of two sequences
  - Identify conserved domains within protein sequences
- Glocal (semi-global) alignment
  - Optimal global alignment for one sequence; optimal local alignment for the other sequence
  - Map a coding exon against a genomic sequence

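39. Initialize the local alignment matrix (Match = +5; Mismatch = -2; Gap = -6)
Columns: query TGCTCGTA (positions 0-8); rows: subject TTCATA (positions 0-6). Cells marked "." are not yet filled.

             -    T    G    C    T    C    G    T    A
    -        0    0    0    0    0    0    0    0    0
    T        0    .    .    .    .    .    .    .    .
    T        0    .    .    .    .    .    .    .    .
    C        0    .    .    .    .    .    .    .    .
    A        0    .    .    .    .    .    .    .    .
    T        0    .    .    .    .    .    .    .    .
    A        0    .    .    .    .    .    .    .    .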

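40. Calculate the possible local alignment scores for the cell at position (1,1)
- Query base at position 1: T; subject base at position 1: T
- Neighboring cells: S(0,0) = 0; S(1,0) = 0; S(0,1) = 0
$$
S(1,1) = \max
\begin{cases}
S(0,0) + \sigma(\mathrm{T},\mathrm{T}) & \text{(align)} \\
S(1,0) + \gamma & \text{(gap in query)} \\
S(0,1) + \gamma & \text{(gap in subject)} \\
0
\end{cases}
$$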

41. Calculate the optimal local alignment score for the cell at position (1,1)
(Match = +5; Mismatch = -2; Gap = -6)
- Align: S(0,0) + 𝛔(T,T) = 0 + (+5) = 5
- Gap in query: S(1,0) + 𝛾 = 0 + (-6) = -6
- Gap in subject: S(0,1) + 𝛾 = 0 + (-6) = -6
- Minimum score: 0
- S(1,1) = max {5, -6, -6, 0} = 5

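42. Local alignment matrix (Match = +5; Mismatch = -2; Gap = -6)
Columns: query TGCTCGTA (positions 0-8); rows: subject TTCATA (positions 0-6). In the original figure, each cell is colored by the step that produced it (Align, Gap in query, Gap in subject).

             -    T    G    C    T    C    G    T    A
    -        0    0    0    0    0    0    0    0    0
    T        0    5    0    0    5    0    0    5    0
    T        0    5    3    0    5    3    0    5    3
    C        0    0    3    8    2   10    4    0    3
    A        0    0    0    2    6    4    8    2    5
    T        0    5    0    0    7    4    2   13    7
    A        0    0    3    0    1    5    2    7   18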

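43. [Figure: local alignment matrix for query TGCTCGTA and subject TTCATA with the traceback path. Scores along the path: 5, 10, 8, 13, 18; the traceback starts at the highest-scoring cell (18) and terminates at a cell with score 0.]
Optimal local alignment:

    Query:   TCGTA
    Subject: TCATA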

44. Techniques to improve the performance of sequence alignment
- Time and space complexity: O(MN)
  - Doubling the length of both sequences leads to a four-fold increase in the amount of time and space required
- Reduce memory requirement
  - Myers EW, Miller W. Optimal alignments in linear space. Comput Appl Biosci. 1988 Mar;4(1):11-7.
- Fill the matrix in parallel (SIMD, CUDA)
  - Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007 Jan 15;23(2):156-61.
- Find high-scoring alignments instead of the best alignment
  - Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.
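As a small illustration of the memory point, the optimal global score alone can be computed while keeping only two columns of the matrix, because each cell depends only on the previous column and the current one. This sketch is not the Myers-Miller algorithm cited above (which also recovers the alignment in linear space); it returns the score only, assumes a linear gap penalty with the values from the worked example, and uses the hypothetical name `global_score_linear_space`.

```python
def global_score_linear_space(query, subject, match=5, mismatch=-2, gap=-6):
    """Optimal global alignment score using O(N) memory instead of O(MN)."""
    # prev[j] holds S(i-1, j); it starts as the column for i = 0.
    prev = [j * gap for j in range(len(subject) + 1)]
    for i, a in enumerate(query, start=1):
        curr = [i * gap]  # S(i, 0): leading gaps
        for j, b in enumerate(subject, start=1):
            sub = match if a == b else mismatch
            curr.append(max(
                prev[j - 1] + sub,   # align
                prev[j] + gap,       # gap in subject
                curr[j - 1] + gap,   # gap in query
            ))
        prev = curr  # discard the older column
    return prev[-1]


if __name__ == "__main__":
    print(global_score_linear_space("TGCTCGTA", "TTCATA"))
```

The running time is still O(MN); only the memory footprint changes, and the traceback information is lost unless a divide-and-conquer scheme is added on top.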

45. Questions?
Reference: Eddy SR. What is dynamic programming? Nat Biotechnol. 2004 Jul;22(7):909-10.


47. Rationale for calculating the scores for the entire alignment matrix
- Cannot determine the best global alignment without aligning the entire query and subject sequences
- Cannot evaluate all possible alignments
- If the alignment before we reached cell (i,j) is part of the optimal alignment:
  - Identify the next step (i.e., align, gap in query, gap in subject) that will be part of the optimal alignment
- Use traceback to determine the final alignment
  - Different alignments could produce the same score

48. Overview of the BLAST algorithm
- Heuristic algorithm to find local regions of similarity between the query and subject sequences
- Consists of four main stages:
  - Find common subsequences (words)
  - Extend the word matches into longer alignments
  - Evaluate the significance of the high-scoring segment pairs (HSPs)
  - Combine multiple HSPs into a longer alignment
- Reference: Korf, I., Yandell, M. and Bedell, J. (2003). The BLAST Algorithm. In BLAST (76-87). Sebastopol, CA: O’Reilly Media, Inc.

49. Number of alignments for two sequences with length N
- Stirling’s approximation

50. Number of alignments for two sequences with length N

51. Number of alignments for two sequences with length N

52. Brute-force alignment approach is computationally intractable

    Sequence length (N)    # possible alignments
    10                     1.87E+05
    50                     1.01E+29
    100                    9.07E+58
    200                    1.03E+119
    300                    1.35E+179
    400                    1.88E+239
    500                    2.70E+299
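The formulas on the preceding slides did not survive in this transcript, but the tabulated values appear consistent with estimating the central binomial coefficient C(2N, N) by 2^(2N) / sqrt(pi * N), which follows from Stirling's approximation. The sketch below rests on that assumption; the function names are hypothetical.

```python
import math


def exact_count(n: int) -> int:
    """Central binomial coefficient C(2n, n), computed exactly."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Estimate of C(2n, n) obtained from Stirling's approximation."""
    return 2.0 ** (2 * n) / math.sqrt(math.pi * n)


if __name__ == "__main__":
    print("N     estimate     exact")
    for n in (10, 50, 100, 200, 300, 400, 500):
        estimate = stirling_estimate(n)
        exact = float(exact_count(n))
        print(f"{n:<5} {estimate:.2E}   {exact:.2E}")
```

The estimate slightly overshoots the exact value for small N (187,079 versus 184,756 for N = 10), and the relative error shrinks as N grows.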