
Graph Algorithms 8.6-8.10 - PowerPoint Presentation

Uploaded On 2022-08-01



Presentation Transcript

Slide1

Graph Algorithms 8.6-8.10

CS 6030 – Bioinformatics

Summer II 2012

Jason Eric Johnson

Slide2

Sequencing by Hybridization

DNA Array gives all strings of length l

How do we find the order?

Spectrum(s, l)

String s of length n

Spectrum is the multiset of n - l + 1 l-mers in s

Slide3

Sequencing by Hybridization

s = TATGGTGC

l = 3

Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}

Problem:

Input: Set S of all l-mers from s

Output: String s such that Spectrum(s, l) = S
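A minimal sketch of the Spectrum definition, using the example string from this slide. The function name `spectrum` is chosen here for illustration and does not come from the slides.

```python
from collections import Counter

def spectrum(s: str, l: int) -> Counter:
    """Return the multiset of all n - l + 1 l-mers of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

spec = spectrum("TATGGTGC", 3)
print(sorted(spec.elements()))
```

A `Counter` is used rather than a `set` because the spectrum is a multiset: an l-mer that occurs twice in s is counted twice.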

Slide4

Hybridization on DNA Array

Slide5

Sequencing by Hybridization

Special case of Shortest Superstring Problem

SBH is linear-time

SSP (NP-Complete) is more general

In SSP, no guaranteed overlap

In SBH, we know the length of the target sequence

Slide6

Sequencing by Hybridization

There is a problem with DNA Arrays

No good way to distinguish a match from a highly stable mismatch

Mismatch could give strong hybridization signal

Need longer probes to deal with mutations

Slide7

SBH: Hamiltonian Path Approach

Two l-mers p and q overlap if overlap(p, q) = l - 1

The last l - 1 letters of p are the same as the first l - 1 letters of q

Make each l-mer in Spectrum(s, l) a node

Construct a directed graph with an edge from p to q for every overlapping pair (p, q)

There is a one-to-one correspondence between paths that visit each vertex exactly once (Hamiltonian paths) and DNA fragments with Spectrum(s, l)
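The graph construction above can be sketched as follows. This is an illustrative version that assumes all l-mers are distinct; the name `overlap_graph` is hypothetical.

```python
def overlap_graph(kmers):
    """Map each l-mer p to the l-mers q whose first l-1 letters
    equal the last l-1 letters of p (directed edge p -> q)."""
    return {p: [q for q in kmers if p != q and p[1:] == q[:-1]]
            for p in kmers}

S = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
graph = overlap_graph(S)
for p, successors in graph.items():
    print(p, "->", successors)
```

Comparing every pair takes quadratic time in the number of l-mers, which is fine for a small spectrum but would need an index on (l-1)-mer prefixes for larger inputs.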

Slide8

SBH: Hamiltonian Path Approach

S = { ATG AGG TGC TCC GTC GGT GCA CAG }

[Figure: overlap graph H on the l-mers of S; the path visits every VERTEX once]

Path: ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC

Sequence: ATGCAGGTCC

Slide9

SBH: Hamiltonian Path Approach

A more complicated graph:

S

= { ATG TGG TGC GTG GGC GCA GCG CGT }

Slide10

SBH: Hamiltonian Path Approach

S

= { ATG TGG TGC GTG GGC GCA GCG CGT }

Path 1: ATGCGTGGCA

Path 2: ATGGCGTGCA
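The two paths can be reproduced with a brute-force search. The sketch below enumerates Hamiltonian paths by backtracking, which takes exponential time in the worst case and is only practical for tiny spectra; the helper names are chosen here for illustration.

```python
def hamiltonian_paths(kmers):
    """Yield every ordering of kmers in which consecutive l-mers overlap by l-1."""
    graph = {p: [q for q in kmers if p != q and p[1:] == q[:-1]]
             for p in kmers}

    def extend(path, seen):
        if len(path) == len(kmers):
            yield list(path)
            return
        for q in graph[path[-1]]:
            if q not in seen:
                path.append(q)
                seen.add(q)
                yield from extend(path, seen)
                path.pop()
                seen.remove(q)

    for start in kmers:
        yield from extend([start], {start})

def spell(path):
    """Spell the sequence read along a path of overlapping l-mers."""
    return path[0] + "".join(q[-1] for q in path[1:])

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
sequences = sorted(spell(p) for p in hamiltonian_paths(S))
print(sequences)
```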

Slide11

SBH: Hamiltonian Path Approach

The problem is that there is no known efficient algorithm

As the overlap graph gets larger, this is not a useful technique, since the Hamiltonian Path problem is NP-Complete

Slide12

SBH: Eulerian Path Approach

The Eulerian path approach leads to a simple linear-time algorithm for sequence reconstruction

Construct a graph whose edges correspond to l-mers

Find path(s) that visit each edge exactly once

Slide13

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to (l - 1)-mers: { AT, TG, GC, GG, GT, CA, CG }

Edges correspond to l-mers from S

[Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG; the path visits every EDGE once]

Slide14

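SBH: Eulerian Path Approach

The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths, which spell two different sequences:

ATGGCGTGCA (AT → TG → GG → GC → CG → GT → TG → GC → CA)

ATGCGTGGCA (AT → TG → GC → CG → GT → TG → GG → GC → CA)

[Figure: the two Eulerian paths drawn on the graph]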
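One way to find such a path in linear time is Hierholzer's algorithm, which the slides do not name; it is used here as an assumed implementation choice. The sketch builds the graph whose vertices are (l-1)-mers and whose edges are l-mers, and it assumes the input really does admit an Eulerian path.

```python
from collections import defaultdict

def eulerian_path_sequence(kmers):
    """Spell one sequence whose l-mers are exactly kmers (Hierholzer's algorithm)."""
    out_edges = defaultdict(list)   # (l-1)-mer -> list of successor (l-1)-mers
    indegree = defaultdict(int)
    for kmer in kmers:
        out_edges[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1

    # Start at a vertex with one more outgoing than incoming edge, if there is one.
    start = next((v for v in list(out_edges)
                  if len(out_edges[v]) - indegree[v] == 1), kmers[0][:-1])

    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: vertex is finished
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
sequence = eulerian_path_sequence(S)
print(sequence)
```

Which of the two valid sequences is returned depends on the order in which edges are stored, so the result is one reconstruction, not the full set.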

Slide15

SBH: Eulerian Path Approach

A vertex is balanced if its number of incoming edges equals its number of outgoing edges; a graph is balanced if every vertex is balanced

A vertex is semi-balanced if its numbers of incoming and outgoing edges differ by exactly 1

Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced

Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced
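The degree conditions in the two theorems can be checked directly. The sketch below tests only the degree condition and assumes the graph is connected; the function names are illustrative.

```python
from collections import Counter

def classify_vertices(edges):
    """Return (balanced, semi_balanced) vertex sets for a list of directed edges."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    vertices = set(indeg) | set(outdeg)
    balanced = {v for v in vertices if indeg[v] == outdeg[v]}
    semi = {v for v in vertices if abs(indeg[v] - outdeg[v]) == 1}
    return balanced, semi

def meets_eulerian_path_condition(edges):
    """Degree condition only; connectivity is assumed, not checked."""
    balanced, semi = classify_vertices(edges)
    vertices = {x for edge in edges for x in edge}
    return len(semi) <= 2 and (balanced | semi) == vertices

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
edges = [(kmer[:-1], kmer[1:]) for kmer in S]
balanced, semi = classify_vertices(edges)
print(sorted(semi), meets_eulerian_path_condition(edges))
```

For this spectrum the two semi-balanced vertices are the start and end of every Eulerian path.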

Slide16

Some Difficulties with SBH

Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and 1 or 2 mismatches

Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.

Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future

Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques
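To make the exponential growth in array size concrete: assuming a complete array carries one probe for every possible l-mer over the alphabet {A, C, G, T}, it needs 4^l probes. The probe lengths below are arbitrary illustrations, not values from the slides.

```python
def array_size(l: int) -> int:
    """Number of distinct DNA l-mers, i.e. probes on a complete array."""
    return 4 ** l

for l in (8, 10, 12):
    print(l, array_size(l))
```

Each extra letter in the probe multiplies the array size by four.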

Slide17

Fragment Assembly

Now that we have our reads sequenced, we need to assemble them into the entire DNA sequence

Slide18

Fragment Assembly

We have some problems:

Errors in reads (1% to 3%)

Which strand did the read come from?

Did the read come from the target DNA sequence or its Watson-Crick complement?

Repeats in DNA (this is the major problem)

See page 278 for puzzle example
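The strand ambiguity means an assembler usually has to consider each read together with its reverse complement. A minimal sketch, assuming reads contain only the letters A, C, G and T:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read: str) -> str:
    """Return the read as it would appear on the opposite strand."""
    return read.translate(COMPLEMENT)[::-1]

def both_orientations(read: str) -> set:
    """A read and its reverse complement, since the source strand is unknown."""
    return {read, reverse_complement(read)}

print(reverse_complement("TATGGTGC"))
```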

Slide19

Fragment Assembly

Very difficult to put it all together if repeats are longer than read length

Could solve this by increasing read length, but the technology isn’t there yet

Slide20

Fragment Assembly

One approach is to break the sequence into about 30,000 Bacterial Artificial Chromosomes (BACs)

Sequence each BAC individually

Put them all together

Used and shown effective (if cumbersome) by the Human Genome Project

Slide21

Fragment Assembly

Another option (used in mouse genome assembly) is the Weber-Myers approach

Pairs reads that are separated by a fixed-size gap

Gap size L is chosen to be longer than most repeats

Unlikely both reads lie in large repeat

Read that is in unique portion of DNA tells us which copy of a repeat the mate is in

Slide22

Fragment Assembly

Most algorithms consist of these steps:

Overlap: Find potentially overlapping reads

Layout: Find order of reads along DNA

Consensus: Derive DNA sequence from layout

Slide23

Overlap

Find the best match between the suffix of one read and the prefix of another

Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment

Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
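The overlap alignment can be sketched with a dynamic program in which the alignment may start anywhere in the first read but must run to its end, and must start at the beginning of the second read. The scoring values (match +1, mismatch -1, gap -1) are assumptions made for illustration, not parameters from the slides.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Best score aligning a suffix of a with a prefix of b.

    Returns (score, j), where b[:j] is the aligned prefix of b.
    """
    n, m = len(a), len(b)
    # Row 0: no letters of a used yet, so a prefix of b costs only gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0 stays 0: skipping a prefix of a is free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The suffix must reach the end of a, so read the answer from the last row.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j

score, prefix_len = overlap_alignment("TTACGATTACA", "ATTACAGGT")
print(score, prefix_len)
```

Only two rows of the table are kept, so memory is linear in the length of the second read; recovering the alignment itself would require keeping the full table or traceback pointers.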

Slide24

Overlapping Reads

Sort all k-mers in reads (k ~ 24)

Find pairs of reads sharing a k-mer

Extend to full alignment; throw away if not >95% similar

[Figure: two reads aligned on the shared k-mer TAGATTACACAGATTAC, with mismatches in the flanking bases]
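The first two steps can be sketched as below. The slide sorts the k-mers; this version swaps in a hash index, which groups identical k-mers in the same way. The function name and the toy reads are illustrative, and the extension to a full alignment is left out.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer."""
    index = defaultdict(set)                 # k-mer -> ids of reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["TAGATTACACAG", "ACACAGATTACT", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, 6))
```

Each candidate pair would then be passed to a full alignment and kept only if the similarity threshold is met.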

Slide25

Overlapping Reads and Repeats

A k-mer that appears N times initiates N^2 comparisons

For an Alu that appears 10^6 times → 10^12 comparisons: too much

Solution:

Discard all k-mers that appear more than t × Coverage times (t ~ 10)
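A sketch of the filter described above: count every k-mer across the reads and keep only those that occur at most t × Coverage times. The function name, the toy reads and the small parameter values in the usage line are assumptions for illustration.

```python
from collections import Counter

def informative_kmers(reads, k, coverage, t=10):
    """Return the k-mers that appear at most t * coverage times across all reads."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    limit = t * coverage
    return {kmer for kmer, count in counts.items() if count <= limit}

reads = ["AAAAAA", "AAAAAA", "ACGTAC"]
kept = informative_kmers(reads, k=3, coverage=1, t=2)
print(sorted(kept))

# The quadratic blow-up from the slide: 10^6 copies give 10^12 comparisons.
print((10 ** 6) ** 2)
```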

Slide26

Finding Overlapping Reads

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG TTACACAGATTATTGA

TAGATTACACAGATTACTGA

Slide27

Layout

Repeats are a major challenge

Do two aligned fragments really overlap, or are they from two copies of a repeat?

Solution: repeat masking

hide the repeats!!!

Masking results in a high rate of misassembly (up to 20%)

Misassembly means a lot more work at the finishing step

Slide28

Consensus

A consensus sequence is derived from a profile of the assembled fragments

A sufficient number of reads is required to ensure a statistically significant consensus

Reading errors are corrected

Slide29

Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAAACTA

TAG TTACACAGATTATTGACTTCATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting
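A simplified sketch of the voting step, applied to the six aligned reads on this slide. It assumes every read carries equal weight (the slide's weighted voting would scale each vote, for example by base quality) and treats a space as a gap character.

```python
from collections import Counter

def consensus(rows, gap=" "):
    """Column-wise majority vote over equal-length aligned rows; gap winners are dropped."""
    result = []
    for column in zip(*rows):
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:
            result.append(base)
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
]
print(consensus(rows))
```

Each isolated error (a missing base, a substituted base, or an extra base) is outvoted by the other five reads, which is why sufficient coverage matters.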

Slide30

Protein Sequencing and Identification

Protein can be digested into peptides by proteases (such as trypsin)

Can then sequence the fragments individually and re-assemble

Mass spectrometry allows us to find proteins involved in cell death, for example

Slide31

Protein Sequencing and Identification

Tandem mass spectrometer breaks peptides into smaller fragments

These fragments have electrical charge

Fragments are spun around in a magnetic field until they hit a detector

Larger masses are harder to spin than smaller ones, so mass can be determined by the amount of energy required to fling fragments around

Slide32

Protein Sequencing and Identification

The problem we encounter is how to reconstruct the amino acid sequence of the peptide from the masses of the broken pieces

Slide33

References

Generated from:

An Introduction to Bioinformatics Algorithms, Neil C. Jones, Pavel A. Pevzner, A Bradford Book, The MIT Press, Cambridge, Mass., London, England, 2004

Slides 4, 8-10, 13, 14, 16, 23-29 from http://bix.ucsd.edu/bioalgorithms/slides.php#Ch8