Graph Algorithms in Bioinformatics

wang
Uploaded On 2024-01-13


Presentation Transcript

1. Graph Algorithms in Bioinformatics

2. Outline
Definition and Data Structures of Graph
Eulerian & Hamiltonian Cycle
DNA Sequencing
Shortest Superstring Problem (SSP), SSP as Traveling Salesman Problem (TSP)
Sequencing by Hybridization (SBH), SBH as Hamiltonian Path Problem (HPP), SBH as Eulerian Path Problem (EPP)

3. Graph
[Figure: an example graph on vertices a to f, shown with its adjacency lists and its adjacency matrix]
Data structures for graph algorithms: adjacency list, adjacency matrix.
A directed graph G = (V,E) is represented by a set of vertices V and a set of edges E, where (u,v) belongs to E if and only if u and v belong to V and there is an edge from u to v.

4. Adjacency list: each vertex u has an adjacency list Adj[u].
(1) If G is a directed graph, the total size of all the adjacency lists is |E|; if G is an undirected graph, the total size is 2|E|. In both cases the adjacency lists use O(|V| + |E|) memory.
(2) For a weighted graph G, the weight w(u,v) of edge (u,v) is kept in Adj[u] together with vertex v.
Adjacency matrix: each vertex is given a number from 1, 2, …, |V|.
(1) For an undirected graph, the adjacency matrix is symmetric.
(2) For a weighted graph, the weight w(u,v) is kept in the adjacency matrix at row i and column j, where i and j are the numbers given to u and v.
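The two representations above can be sketched in a few lines of Python. This is only an illustration: the function names, the `(u, v, weight)` edge format, and the tiny three-vertex example are assumptions, not part of the slides.

```python
from collections import defaultdict

def build_adjacency_list(edges, directed=True):
    """Map each vertex u to a list of (v, weight) pairs, i.e. Adj[u]."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        if not directed:
            adj[v].append((u, w))  # an undirected edge is stored twice
    return adj

def build_adjacency_matrix(n, edges, directed=True):
    """n x n matrix; vertices are numbered 1..n, 0 means no edge."""
    matrix = [[0] * n for _ in range(n)]
    for u, v, w in edges:
        matrix[u - 1][v - 1] = w
        if not directed:
            matrix[v - 1][u - 1] = w  # keeps the matrix symmetric
    return matrix

# Hypothetical weighted graph on vertices 1..3.
edges = [(1, 2, 5), (2, 3, 1), (1, 3, 2)]
adj_directed = build_adjacency_list(edges, directed=True)
adj_undirected = build_adjacency_list(edges, directed=False)
matrix_undirected = build_adjacency_matrix(3, edges, directed=False)

total_directed = sum(len(neighbors) for neighbors in adj_directed.values())
total_undirected = sum(len(neighbors) for neighbors in adj_undirected.values())
```

Here `total_directed` equals |E| and `total_undirected` equals 2|E|, matching point (1) on the slide.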

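5. Comparison between adjacency list and adjacency matrix
[Figure: an undirected graph on vertices 1 to 5 and a directed graph on vertices 1 to 6, each shown with its adjacency lists and adjacency matrix]

Adjacency matrix of the undirected graph:
    1 2 3 4 5
1   0 1 0 0 1
2   1 0 1 1 1
3   0 1 0 1 0
4   0 1 1 0 1
5   1 1 0 1 0

Adjacency matrix of the directed graph:
    1 2 3 4 5 6
1   0 1 0 1 0 0
2   0 0 0 0 1 0
3   0 0 0 0 1 1
4   0 1 0 0 0 0
5   0 0 0 1 0 0
6   0 0 0 0 0 1

If |E| is much smaller than |V|^2, the adjacency list is better (it uses less memory). With adjacency lists it costs more time to find whether v is adjacent to u, because Adj[u] has to be scanned.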
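As a rough illustration of this trade-off, the sketch below takes the 5-vertex undirected matrix from this slide, derives adjacency lists from it, and answers the same "is v adjacent to u?" question both ways. The helper names are hypothetical.

```python
# Adjacency matrix of the undirected 5-vertex graph on this slide.
MATRIX = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
]

def matrix_to_lists(matrix):
    """Derive Adj[u] for vertices numbered from 1."""
    return {
        u + 1: [v + 1 for v, bit in enumerate(row) if bit]
        for u, row in enumerate(matrix)
    }

def adjacent_in_matrix(matrix, u, v):
    """Single lookup, independent of the graph size."""
    return matrix[u - 1][v - 1] == 1

def adjacent_in_lists(adj, u, v):
    """Scans Adj[u], so the cost grows with the degree of u."""
    return v in adj[u]

adj = matrix_to_lists(MATRIX)
cells_in_matrix = len(MATRIX) ** 2                 # |V|^2 entries
entries_in_lists = sum(len(n) for n in adj.values())  # 2|E| entries
```

For this small graph the lists hold 14 entries against 25 matrix cells; the gap widens as the graph gets sparser.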

6. Eulerian & Hamiltonian Cycle
The Bridge Obsession Problem
Bridges of Königsberg: find a tour crossing every bridge just once.
Leonhard Euler, 1735

7. Eulerian Cycle Problem
Find a cycle that visits every edge exactly once.
Linear time.
[Figure: a more complicated Königsberg]

8. Hamiltonian Cycle Problem
Find a cycle that visits every vertex exactly once.
NP-complete.
Game invented by Sir William Hamilton in 1857.
Traveling Salesman Problem: find the shortest path which visits every vertex exactly once.
NP-complete.

9. DNA Sequencing: History
Sanger method (1977): labeled ddNTPs terminate DNA copying at random points.
Gilbert method (1977): chemical method to cleave DNA at specific points (G, G+A, T+C, C).
Both methods generate labeled fragments of varying lengths that are further electrophoresed.

10. Sanger Method: Generating Read
Start at primer (restriction site).
Grow DNA chain.
Include ddNTPs.
Stops reaction at all possible points.
Separate products by length, using gel electrophoresis.

11. DNA Sequencing
Shear DNA into millions of small fragments.
Read 500 to 700 nucleotides at a time from the small fragments (Sanger method).

12. Fragment Assembly
Computational challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”).
Until the late 1990s, the shotgun fragment assembly of the human genome was viewed as an intractable problem.

13. Shortest Superstring Problem
Problem: Given a set of strings, find a shortest string that contains all of them.
Input: Strings s1, s2, …, sn
Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized.
Complexity: NP-complete
Note: this formulation does not take into account sequencing errors.

14. Shortest Superstring Problem: Example

15. Reducing SSP to TSP
Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.
si = aaaggcatcaaatctaaaggcatcaaa
sj = aaaggcatcaaatctaaaggcatcaaa
What is overlap(si, sj) for these strings?

16. Reducing SSP to TSP
Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.
aaaggcatcaaatctaaaggcatcaaa
               aaaggcatcaaatctaaaggcatcaaa
overlap = 12
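The overlap definition can be checked with a short Python sketch. One assumption is made loudly here: only proper overlaps are counted (shorter than both strings), since otherwise a string would trivially overlap itself along its whole length and the example above would not give 12.

```python
def overlap(si, sj):
    """Length of the longest prefix of sj that matches a suffix of si.

    Assumption: only proper overlaps (shorter than both strings) count.
    """
    longest_allowed = min(len(si), len(sj)) - 1
    for k in range(longest_allowed, 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0

s = "aaaggcatcaaatctaaaggcatcaaa"
self_overlap = overlap(s, s)
```

Trying the longest candidate first keeps the function simple; it takes quadratic time in the string length, which is fine for short reads.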

17. Reducing SSP to TSP
Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.
aaaggcatcaaatctaaaggcatcaaa
               aaaggcatcaaatctaaaggcatcaaa
Construct a graph with n vertices representing the n strings s1, s2, …, sn.
Insert edges of length overlap(si, sj) between vertices si and sj.
Find the path which visits every vertex exactly once and has the largest total overlap (equivalently, the shortest such path when each edge length is taken as the negative of the overlap).
This is the Traveling Salesman Problem (TSP), which is also NP-complete.

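18. SSP to TSP: An Example
S = { ATC, CCA, CAG, TCC, AGT }
SSP:
ATCCAGT
ATC
 TCC
  CCA
   CAG
    AGT
TSP:
[Figure: overlap graph on the vertices ATC, CCA, TCC, AGT, CAG with edge labels 2, 2, 2, 2, 1, 1, 1, 0, 1, 1; the resulting superstring is ATCCAGT]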
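For a set this small, the reduction can be tried by brute force: visit the strings in every possible order, merge neighbours along their overlaps, and keep the shortest result. The sketch below is illustrative only; the function names are hypothetical, it assumes no string in the set is a substring of another, and it takes factorial time, so it is not a practical assembler.

```python
from itertools import permutations

def overlap(si, sj):
    """Longest proper suffix of si that is also a prefix of sj."""
    for k in range(min(len(si), len(sj)) - 1, 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0

def merge_in_order(strings):
    """Glue the strings in the given order, sharing each pairwise overlap."""
    result = strings[0]
    for prev, nxt in zip(strings, strings[1:]):
        result += nxt[overlap(prev, nxt):]
    return result

def shortest_superstring_bruteforce(strings):
    """Try every visiting order (every Hamiltonian path in the overlap graph)."""
    best = None
    for order in permutations(strings):
        candidate = merge_in_order(list(order))
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

S = ["ATC", "CCA", "CAG", "TCC", "AGT"]
superstring = shortest_superstring_bruteforce(S)
```

On the slide's set this yields a superstring of length 7, which is the minimum possible for five distinct 3-mers.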

19. Sequencing by Hybridization (SBH): History
1988: SBH suggested as an alternative sequencing method. Nobody believed it would ever work.
1991: Light-directed polymer synthesis developed by Steve Fodor and colleagues.
1994: Affymetrix develops first 64-kb DNA microarray.
[Figures: first microarray prototype (1989); first commercial DNA microarray prototype with 16,000 features (1994); 500,000 features per chip (2002)]

20. How SBH Works
Attach all possible DNA probes of length l to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array.
Apply a solution containing fluorescently labeled DNA fragment to the array.
The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment.

21. How SBH Works (cont’d)
Using a spectroscopic detector, determine which probes hybridize to the DNA fragment to obtain the l-mer composition of the target DNA fragment.
Apply the combinatorial algorithm (below) to reconstruct the sequence of the target DNA fragment from the l-mer composition.

22. Hybridization on DNA Array

23. l-mer composition
Spectrum(s, l): the unordered multiset of all possible (n - l + 1) l-mers in a string s of length n.
The order of individual elements in Spectrum(s, l) does not matter.
For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}

24. l-mer composition
Spectrum(s, l): the unordered multiset of all possible (n - l + 1) l-mers in a string s of length n.
The order of individual elements in Spectrum(s, l) does not matter.
For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
We usually choose the lexicographically maximal representation as the canonical one.

25. Different sequences – the same spectrum
Different sequences may have the same spectrum:
Spectrum(GTATCT, 2) = Spectrum(GTCTAT, 2) = {AT, CT, GT, TA, TC}
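Spectrum(s, l) as defined above is easy to compute, and doing so confirms that the two sequences on this slide cannot be told apart by their 2-mers. In this sketch the multiset is represented as a sorted list, which is a choice made here for easy comparison rather than anything the slides prescribe.

```python
def spectrum(s, l):
    """All (n - l + 1) l-mers of s, as a sorted list (one multiset representation)."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))

spectrum_a = spectrum("GTATCT", 2)
spectrum_b = spectrum("GTCTAT", 2)
same_spectrum = spectrum_a == spectrum_b

example = spectrum("TATGGTGC", 3)
```

Sorting makes two representations of the same multiset compare equal regardless of the order in which the l-mers were read off.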

26. The SBH Problem
Goal: Reconstruct a string from its l-mer composition.
Input: A set S, representing all l-mers from an (unknown) string s
Output: String s such that Spectrum(s, l) = S

27. SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
[Figure: graph H whose vertices are the l-mers of S]
Path visited every VERTEX once: ATGCAGGTCC

28. SBH: Hamiltonian Path Approach
A more complicated graph:
S = { ATG TGG TGC GTG GGC GCA GCG CGT }

29. SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1: ATGCGTGGCA
Path 2: ATGGCGTGCA
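The Hamiltonian path formulation can be sketched with plain backtracking: each l-mer is a vertex, an edge goes from u to v when the last l - 1 letters of u equal the first l - 1 letters of v, and every path through all vertices spells a candidate sequence. The function name is hypothetical and the search takes exponential time in the worst case, which is the practical weakness of this formulation.

```python
def hamiltonian_reconstructions(lmers):
    """Return every string spelled by a Hamiltonian path in the l-mer graph."""
    lmers = list(lmers)
    n = len(lmers)
    # Edge i -> j when the (l-1)-suffix of lmers[i] equals the (l-1)-prefix of lmers[j].
    successors = {
        i: [j for j in range(n) if j != i and lmers[i][1:] == lmers[j][:-1]]
        for i in range(n)
    }
    results = set()

    def extend(path, used):
        if len(path) == n:
            # First l-mer in full, then the last letter of each later l-mer.
            results.add(lmers[path[0]] + "".join(lmers[i][-1] for i in path[1:]))
            return
        for j in successors[path[-1]]:
            if j not in used:
                used.add(j)
                path.append(j)
                extend(path, used)
                path.pop()
                used.remove(j)

    for start in range(n):
        extend([start], {start})
    return results

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
paths = hamiltonian_reconstructions(S)
```

For the set on this slide the search finds exactly the two reconstructions listed as Path 1 and Path 2.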

30. SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l - 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S.
[Figure: graph on the vertices AT, GT, CG, CA, GC, TG, GG]
Path visited every EDGE once.
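The graph construction described here can be sketched as follows. The function name is hypothetical, and S is assumed to include TGG, as it does on slides 28 and 29 (the vertex GG would otherwise have no incoming edge).

```python
from collections import defaultdict

def build_lmer_graph(lmers):
    """Vertices are (l-1)-mers; each l-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return graph

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = build_lmer_graph(S)

# Every (l-1)-mer that appears as an edge source or target is a vertex.
vertices = set(graph) | {v for targets in graph.values() for v in targets}
edge_count = sum(len(targets) for targets in graph.values())
```

Building the graph takes one pass over S, in contrast to the pairwise comparison needed for the Hamiltonian path graph.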

31. SBH: Eulerian Path Approach
The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths, which correspond to two different sequences:
ATGGCGTGCA
ATGCGTGGCA
[Figure: the two Eulerian paths drawn on the graph]
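To see both paths come out of the same graph, the sketch below enumerates every Eulerian path by backtracking over unused edges. This is an illustration for small inputs only (enumeration can take exponential time; finding a single Eulerian path is a linear-time task). The function name is hypothetical and S is assumed to include TGG.

```python
def eulerian_path_strings(lmers):
    """Return every string spelled by a path that uses each l-mer edge exactly once."""
    edges = [(lmer[:-1], lmer[1:]) for lmer in lmers]
    n = len(edges)
    results = set()

    def extend(node, used, text):
        if len(used) == n:
            results.add(text)
            return
        for i, (u, v) in enumerate(edges):
            if i not in used and u == node:
                used.add(i)
                extend(v, used, text + v[-1])  # each edge adds one letter
                used.remove(i)

    for start in {u for u, _ in edges}:
        extend(start, set(), start)
    return results

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
sequences = eulerian_path_strings(S)
```

Both results have the same spectrum, so the spectrum alone cannot decide between them.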

32. Euler Theorem
A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v).
Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
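The balance condition is a one-pass degree count. The sketch below applies it to the (l - 1)-mer graph of the SBH example (with TGG assumed to be in S); the function names are hypothetical, and connectivity, the other half of the theorem, is not checked here. That graph is not balanced, since AT has one extra outgoing edge and CA one extra incoming edge, which is why it has an Eulerian path rather than a cycle; adding an edge from CA back to AT balances it.

```python
from collections import Counter

def degree_balance(edges):
    """out(v) - in(v) for every vertex that appears in an edge."""
    balance = Counter()
    for u, v in edges:
        balance[u] += 1  # outgoing edge at u
        balance[v] -= 1  # incoming edge at v
    return balance

def is_balanced(edges):
    return all(b == 0 for b in degree_balance(edges).values())

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
edges = [(lmer[:-1], lmer[1:]) for lmer in S]
balance = degree_balance(edges)

path_graph_balanced = is_balanced(edges)
closed_graph_balanced = is_balanced(edges + [("CA", "AT")])
```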

33. Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian, this dead end is necessarily the starting point, i.e., vertex v.

34. Algorithm for Constructing an Eulerian Cycle (cont’d)
b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.

35. Algorithm for Constructing an Eulerian Cycle (cont’d)
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
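Steps (a) to (c) can be sketched compactly with an explicit stack, a common way of writing this procedure (often called Hierholzer's algorithm). Instead of building separate cycles and splicing them, the stack walks until a dead end, and backtracking to a vertex with unused edges starts the next sub-cycle, so the merge in step (c) happens implicitly. The function name and the example graph are assumptions: the graph is the SBH example with TGG included and one extra edge CA -> AT added so that every vertex is balanced.

```python
def eulerian_cycle(graph, start):
    """Return the vertex sequence of an Eulerian cycle.

    Assumes graph (vertex -> list of successors) is connected and balanced.
    """
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack = [start]
    cycle = []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            # Unused edge available: keep walking (steps a and b).
            stack.append(remaining[u].pop())
        else:
            # Dead end: this vertex is final in the merged cycle (step c).
            cycle.append(stack.pop())
    return cycle[::-1]

graph = {
    "AT": ["TG"],
    "TG": ["GG", "GC"],
    "GG": ["GC"],
    "GC": ["CA", "CG"],
    "CG": ["GT"],
    "GT": ["TG"],
    "CA": ["AT"],  # added edge that closes the path into a cycle
}
cycle = eulerian_cycle(graph, "AT")
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.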

36. Some Difficulties with SBH
Fidelity of hybridization: it is difficult to detect differences between probes hybridized with perfect matches and those with 1 or 2 mismatches.
Array size: the effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.
Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future.
Practicality again: although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques.
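The exponential growth in array size follows from the probe set itself: an array carrying all possible probes of length l over the four-letter DNA alphabet needs 4^l probes. A small arithmetic sketch (the function name and the chosen values of l are illustrative):

```python
def array_size(l):
    """Number of distinct DNA probes of length l (alphabet A, C, G, T)."""
    return 4 ** l

sizes = {l: array_size(l) for l in (8, 10, 12)}
# sizes[8] is 65,536 probes, the same order as the 64-kb array mentioned earlier.
```

Going from l = 8 to l = 12 multiplies the number of probes by 256.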