Ranade IIT Bombay Genome Constituent of living cells that determines hereditary characteristics aka DNA Sequence of nucleobases A denine G uanine T hymine C ytosine ID: 921266
Download Presentation The PPT/PDF document "On Genome Assembly Abhiram" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
On Genome Assembly
Abhiram
Ranade
IIT Bombay
Slide2Genome
Constituent of living cells that determines hereditary characteristics a.k.a. DNA
Sequence of
nucleobases
:
A
denine,
G
uanine,
T
hymine,
C
ytosine
ACCTGGA…
Human genome: 3 billion
nucleobases
Knowledge of sequence is very useful
Slide3Genome Sequencing
Biochemical techniques can “read” genomes of length ~ 700
nucleobases
Make many copies of the genome.
Break the copies
randomly
into pieces of length ~ 700.
Read the pieces.
Try to infer what original genome must have been.
Genome Assembly
Slide4Assembly Example
Input pieces “Reads”:
abcd
,
cdefghi
,
hijkl
Assembly:Input pieces “Reads”: abcd, cdefghi, hijkl, hicdAssembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl
abcdefghijkl
Slide5Strategy
Characterize all possible assemblies of the given read set
Assign a probability to each assembly
Pick the assembly with the highest probability
Slide6All possible assemblies
A = valid assembly of reads if
each read appears in A at least once.
Nothing else appears, i.e. A is made by pasting together the reads in possibly overlapping faction
Can we compute/represent the set
{A | A is a valid assembly}
Overlap/String Graph!
Slide7Overlap Graph: Intuition
Vertices = Reads.
Edge from read
u
to read
v
if read
u, read v likely to overlap in assembly.e.g. abcd, cdef => abcdefAssembly = walk in the graph: will encourage overlaps
Slide8Overlap graph
Vertices: reads + empty read
ϕ
Edges: (
r
i
,
rj) : if long suffix of ri = prefix of rj abcd cdefghiLong = ? Real genomes: 50? 100? Any value that indicates overlap is not coincidental Edge label: portion of rj not belonging to overlap.abcd
cdefghi
efghi
(Long = 2)
Slide9Overlap graph, long=2
abcd
cdefghi
hijkl
hicd
ϕ
abcd
efghi
jkl
cd
efghi
ϕ
ϕ
hicd
ϕ
Slide10Overlap graph, long=2
abcd
cdefghi
hijkl
hicd
ϕ
abcd
efghi
jkl
cd
efghi
ϕ
ϕ
hicd
ϕ
Slide11Walk => Assembly
Assembly = Walk in the overlap graph which
Starts at
ϕ
,
Ends at
ϕPasses through every vertex at least once.Passes through every edge at least once?Assembled sequence = concatenation of labels along the walk.Every read appears in the sequenceWalk revisits ϕ: reconstruction is incomplete, in several pieces.
Slide12Assembly => Walk
Input: Assembly A
Output: Walk which will generate A
Visit vertices in the order of appearance in A
Overlap graph characterizes assemblies.
Variations on graph also studied.
Slide13Approaches to assembly
Occam’s Razor: Most likely = “Shortest”
Shortest walk that visits every vertex at least once: NP-hard
Shortest walk that visits every edge at least once: Chinese Postman problem.
Polytime
.
Pragmatic: Use some greedy approach to find above.
Model probability more accurately
Slide14A Twist: pair constraints
Sequencing process may give additional constraints:
distance from
r
i
to
r
j in assembly is about DExample: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl
Systematic estimation of probability of a given assembly
Slide16Algebraic representation of walks
Walk is cyclic:
number of times vertex entered
= number of times it is exited.
Walk = fluid flow
Total fluid coming in = Total fluid going out
Xij = fluid going from i to j. = number of times walk goes from i to jFormulate conditions on Xij and solve
Slide17Algebraic representation: X
ij
,δ
j
X
ij
= Number of times walk goes from i to j.δj = Number of times walk goes over jLij = Length of label of edge (i,j)Length of genome L =L may be known.
Slide18Maximum likelihood reconstruction
(
Medvedev-Brudno
08)
Goal: Find assembly A most likely given the observations.
maximize
Pr(A
| r1,r2,…rn)=Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn
)
=Pr(r
1
,r
2
,…,
r
n
|A
) *
Pr(A
) / Pr(r
1
,r
2
,…,
r
n
)
Standard assumption: Unconditional probability
Pr(A
) same for all A.
Maximize Pr(r
1
,r
2
,…,
rn|A
) and output A that maximizes.
Slide19Computing Pr(r1
,r
2
,…,
r
n
|A
) A = abcdefghicdefghijkl r1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicdProcess of generating reads:Pick a random starting point.Pick a length at randomPr(r2) = ? = 2/Length * Pr(read length = 7) = δ
2
/Length *
Pr(read
length = 7)
Slide20Computing Pr(r1
,r
2
,…,
r
n
|A
) Generative model: For i=1 to nPick starting point for riPick length LiProbability of generating ri:= Number of times ri appears in A/Length of A * Probability of getting the correct length= δi/L
*
Pr(L
i
)
Slide21Computing Pr(r1
,r
2
,…,
r
n
|A
) Pr = Πi δi/L * Pr(Li)We want to pick A for which this probability is maximumScore(A) = Πi δi/L * Pr(Li
)
Best A will have max value of
Π
i
δ
i
/L
So now we have a program
Slide22Finding the best assembly
Maximize
Π
i
δ
i
/Ls.t.L known approximately. L = Lgeε ≈ Lg(1+ε)Lg, Lij : constants. Solve for Xij≥0, δj≥0, εConvex optimization
Slide23Concluding Remarks
Experiments seem to indicate our approach works well
.
www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf
Computationally intensive, but well founded.
May not be useful for large genomes – linear time algorithms only!
How to handle pair constraints: important open problem.
Graphs are everywhere!