CS 6030 Bioinformatics, Summer II 2012, Jason Eric Johnson
Slide1
Graph Algorithms 8.6-8.10
CS 6030 – Bioinformatics
Summer II 2012
Jason Eric Johnson
Slide2
Sequencing by Hybridization
DNA Array gives all strings of length l
How do we find the order?
Spectrum(s, l)
String s of length n
Spectrum is the multiset of n-l+1 l-mers in s
Slide3
Sequencing by Hybridization
s = TATGGTGC
l = 3
Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}
Problem:
Input: Set S of all l-mers from s
Output: String s s.t. Spectrum(s, l) = S
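The definition of Spectrum(s, l) can be sketched in a few lines of Python. This is only an illustration: the function name `spectrum` and the choice of a `Counter` to represent the multiset are assumptions, not part of the slides.

```python
from collections import Counter

def spectrum(s, l):
    """Return the multiset of all n - l + 1 l-mers of s as a Counter."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

# The example from the slide: s = TATGGTGC, l = 3
print(sorted(spectrum("TATGGTGC", 3).elements()))
```

A `Counter` keeps repeated l-mers, which matters because the spectrum is a multiset rather than a set.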
Slide4
Hybridization on DNA Array
Slide5
Sequencing by Hybridization
Special case of Shortest Superstring Problem
SBH is linear-time
SSP (NP-Complete) is more general
In SSP, no guaranteed overlap
In SBH, we know the length of the target sequence
Slide6
Sequencing by Hybridization
There is a problem with DNA Arrays
No good way to distinguish a match from a highly stable mismatch
Mismatch could give strong hybridization signal
Need longer probes to deal with mutations
Slide7
SBH: Hamiltonian Path Approach
Two l-mers p and q overlap if overlap(p, q) = l - 1
Last l-1 letters of p are same as first l-1 of q
Make each l-mer in Spectrum(s, l) a node
Construct a directed graph with an edge from p to q for every pair with overlap(p, q) = l - 1
1-to-1 correspondence between paths that visit each vertex exactly once (Hamiltonian paths) and DNA fragments with Spectrum(s, l)
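The construction above can be sketched as a brute-force search. This is a minimal illustration under stated assumptions: the names `overlaps` and `hamiltonian_reconstructions` are hypothetical, the l-mers are assumed distinct, and trying every ordering takes exponential time, so it is only usable on tiny spectra.

```python
from itertools import permutations

def overlaps(p, q):
    """True when the last l-1 letters of p equal the first l-1 letters of q."""
    return p[1:] == q[:-1]

def hamiltonian_reconstructions(lmers):
    """Try every ordering of the l-mers; keep those that form a Hamiltonian path."""
    results = set()
    for order in permutations(lmers):
        if all(overlaps(p, q) for p, q in zip(order, order[1:])):
            # Spell the string: the first l-mer plus the last letter of each later one
            results.add(order[0] + "".join(q[-1] for q in order[1:]))
    return sorted(results)

print(hamiltonian_reconstructions(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))
```

For the spectrum of TATGGTGC there is exactly one valid ordering, so the original string is recovered.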
Slide8
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
[Figure: overlap graph H on the l-mers of S; the Hamiltonian path ATG, TGC, GCA, CAG, AGG, GGT, GTC, TCC spells ATGCAGGTCC]
Path visited every VERTEX once
Slide9
SBH: Hamiltonian Path Approach
A more complicated graph:
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Slide10
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1: ATGCGTGGCA
Path 2: ATGGCGTGCA
Slide11
SBH: Hamiltonian Path Approach
Problem is that there is no known efficient algorithm
As the overlap graph gets larger, this is not a useful technique, since the Hamiltonian Path problem is NP-Complete
Slide12
SBH: Eulerian Path Approach
This leads to a simple linear-time algorithm for sequence reconstruction
Construct graph whose edges correspond to l-mers
Find path(s) that visit each edge exactly once
Slide13
SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l - 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S
[Figure: directed graph on the vertices AT, TG, GC, GG, GT, CA, CG]
Path visited every EDGE once
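As a rough sketch of this construction, the graph can be stored as an adjacency list keyed by (l-1)-mer. The function name `de_bruijn_graph` is an assumption made for illustration, and the example uses the eight l-mers from the earlier Hamiltonian example (ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT).

```python
from collections import defaultdict

def de_bruijn_graph(lmers):
    """Map each (l-1)-mer prefix to the (l-1)-mer suffixes it points to.

    Every l-mer contributes exactly one directed edge: prefix -> suffix.
    """
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return dict(graph)

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
for vertex, successors in de_bruijn_graph(S).items():
    print(vertex, "->", successors)
```

Vertices that only receive edges (here CA) appear as successors but not as keys, so the full vertex set is the union of keys and successors.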
Slide14
SBH: Eulerian Path Approach
The graph on vertices { AT, TG, GC, GG, GT, CA, CG } has two different Eulerian paths, which spell:
ATGGCGTGCA
ATGCGTGGCA
[Figure: the two Eulerian paths drawn on the same graph]
Slide15
SBH: Eulerian Path Approach
If for every vertex the number of incoming edges is equal to the number of outgoing edges, the graph is balanced
A vertex is semi-balanced if its numbers of incoming and outgoing edges differ by exactly one
Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced
Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced
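The balance conditions above can be checked directly, and an Eulerian path found in time linear in the number of edges. The sketch below uses Hierholzer's algorithm, which is a standard choice but is not named in the slides; the function name `eulerian_reconstruction` and the decision to return `None` when no Eulerian path exists are assumptions.

```python
from collections import defaultdict

def eulerian_reconstruction(lmers):
    """Spell one string whose l-mer spectrum equals lmers, or return None."""
    if not lmers:
        return ""
    graph = defaultdict(list)
    balance = defaultdict(int)          # outdegree minus indegree per vertex
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    # At most two semi-balanced vertices; every other vertex must be balanced
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1 or len(ends) > 1:
        return None
    start = starts[0] if starts else lmers[0][:-1]
    # Hierholzer's algorithm: walk until stuck, then back up and splice in detours
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(lmers) + 1:     # some edge was unreachable: not connected
        return None
    return path[0] + "".join(v[-1] for v in path[1:])

print(eulerian_reconstruction(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))
```

When several Eulerian paths exist, this returns only one of them; which one depends on the order in which edges are stored.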
Slide16
Some Difficulties with SBH
Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and those with 1 or 2 mismatches
Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.
Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future
Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques
Slide17
Fragment Assembly
Now that we have our reads sequenced, we need to assemble them into the entire DNA sequence
Slide18
Fragment Assembly
We have some problems:
Errors in reads (1% to 3%)
Which strand did the read come from?
Did the read come from the target DNA sequence or its Watson-Crick complement?
Repeats in DNA (this is the major problem)
See page 278 for puzzle example
Slide19
Fragment Assembly
Very difficult to put it all together if repeats are longer than read length
Could solve this by increasing read length, but the technology isn’t there yet
Slide20
Fragment Assembly
One approach is to break the sequence into about 30,000 Bacterial Artificial Chromosomes
Sequence each BAC individually
Put them all together
Used and shown effective (if cumbersome) by the Human Genome Project
Slide21
Fragment Assembly
Another option (used in mouse genome assembly) is the Weber-Myers approach
Pairs reads that are separated by a fixed-size gap
Gap size L is chosen to be longer than most repeats
Unlikely both reads lie in large repeat
Read that is in unique portion of DNA tells us which copy of a repeat the mate is in
Slide22
Fragment Assembly
Most algorithms consist of these steps:
Overlap: Find potentially overlapping reads
Layout: Find order of reads along DNA
Consensus: Derive DNA sequence from layout
Slide23
Overlap
Find the best match between the suffix of one read and the prefix of another
Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment
Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
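A minimal sketch of the dynamic programming step follows. The scoring values (match +1, mismatch -1, gap -2) and the function name `overlap_alignment` are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment of a suffix of read a against a prefix of read b.

    Returns (score, j), where j is how many letters of b are in the overlap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Skipping a prefix of a is free (column 0 stays 0);
    # skipping letters at the start of b is not.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The alignment must reach the end of a, but may stop anywhere in b.
    best_j = max(range(cols), key=lambda j: score[len(a)][j])
    return score[len(a)][best_j], best_j

print(overlap_alignment("TAGATTACACAGATTAC", "ACAGATTACTGA"))
```

Filling the table costs time proportional to the product of the two read lengths, which is why a cheap filtration step is applied before running it on every pair.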
Slide24
Overlapping Reads
Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer
Extend to full alignment; throw away if not >95% similar
[Figure: two reads sharing the k-mer region TAGATTACACAGATTAC, with the alignment extended into the flanking bases]
Slide25
Overlapping Reads and Repeats
A k-mer that appears N times initiates N^2 comparisons
For an Alu that appears 10^6 times: 10^12 comparisons, which is too much
Solution: Discard all k-mers that appear more than t × Coverage times (t ~ 10)
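The k-mer filter and the repeat cutoff can be sketched together. This is an illustration under assumptions: `candidate_pairs` and `max_occurrences` are hypothetical names, `max_occurrences` stands in for the t × Coverage threshold, and occurrences are counted as the number of distinct reads containing a k-mer. A hash index replaces the sort described on the slides.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k, max_occurrences):
    """Return pairs of read indices that share at least one non-repetitive k-mer."""
    index = defaultdict(set)            # k-mer -> indices of the reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_occurrences:
            continue                    # likely a repeat: skip to avoid ~N^2 comparisons
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["TAGATTACACAG", "TTACACAGATTA", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6, max_occurrences=10))
```

Only the surviving pairs would then be passed to the full overlap alignment.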
Slide26
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Slide27
Layout
Repeats are a major challenge
Do two aligned fragments really overlap, or are they from two copies of a repeat?
Solution: repeat masking (hide the repeats!)
Masking results in a high rate of misassembly (up to 20%)
Misassembly means a lot more work at the finishing step
Slide28
Consensus
A consensus sequence is derived from a profile of the assembled fragments
A sufficient number of reads is required to ensure a statistically significant consensus
Reading errors are corrected
Slide29
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
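Column-wise voting can be sketched as follows. Several things here are assumptions for illustration: the function name `consensus`, equal weights by default (a real assembler might weight by base quality), and treating a space in a row as a vote for "no base at this position".

```python
from collections import Counter

def consensus(rows, weights=None):
    """Pick the highest-weighted symbol in each column of a multiple alignment."""
    if weights is None:
        weights = [1.0] * len(rows)     # assumption: every read counts equally
    width = max(len(row) for row in rows)
    result = []
    for col in range(width):
        votes = Counter()
        for row, weight in zip(rows, weights):
            if col < len(row):
                # A space is a gap in that read and votes for "no base here"
                votes["-" if row[col] == " " else row[col]] += weight
        winner = votes.most_common(1)[0][0]
        if winner != "-":
            result.append(winner)
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
]
print(consensus(rows))
```

With these six rows every isolated difference is outvoted, and the extra A that appears in only one read is dropped because the gap wins that column.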
Slide30
Protein Sequencing and Identification
Protein can be digested into peptides by proteases (such as trypsin)
Can then sequence the fragments individually and re-assemble
Mass spectrometry allows us to find proteins involved in cell death, for example
Slide31
Protein Sequencing and Identification
Tandem mass spectrometer breaks peptides into smaller fragments
These fragments have electrical charge
Fragments are spun around in a magnetic field until they hit a detector
Larger masses are harder to spin than smaller ones, so mass can be determined by the amount of energy required to fling fragments around
Slide32
Protein Sequencing and Identification
The problem we encounter is how to reconstruct the amino acid sequence of the peptide from the masses of the broken pieces
Slide33
References
Generated from:
An Introduction to Bioinformatics Algorithms, Neil C. Jones, Pavel A. Pevzner, A Bradford Book, The MIT Press, Cambridge, Mass., London, England, 2004
Slides 4, 8-10, 13, 14, 16, 23-29 from http://bix.ucsd.edu/bioalgorithms/slides.php#Ch8