/
On Genome Assembly Abhiram On Genome Assembly Abhiram

On Genome Assembly Abhiram - PowerPoint Presentation

mila-milly
mila-milly . @mila-milly
Follow
342 views
Uploaded On 2022-06-20

On Genome Assembly Abhiram - PPT Presentation

Ranade IIT Bombay Genome Constituent of living cells that determines hereditary characteristics aka DNA Sequence of nucleobases A denine G uanine T hymine C ytosine ID: 921266

walk assembly abcd length assembly walk length abcd read overlap graph genome probability cdefghi times hijkl number pieces efghi

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "On Genome Assembly Abhiram" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

On Genome Assembly

Abhiram

Ranade

IIT Bombay

Slide2

Genome

Constituent of living cells that determines hereditary characteristics a.k.a. DNA

Sequence of

nucleobases

:

A

denine,

G

uanine,

T

hymine,

C

ytosine

ACCTGGA…

Human genome: 3 billion

nucleobases

Knowledge of sequence is very useful

Slide3

Genome Sequencing

Biochemical techniques can “read” genomes of length ~ 700

nucleobases

Make many copies of the genome.

Break the copies

randomly

into pieces of length ~ 700.

Read the pieces.

Try to infer what original genome must have been.

Genome Assembly

Slide4

Assembly Example

Input pieces “Reads”:

abcd

,

cdefghi

,

hijkl

Assembly:Input pieces “Reads”: abcd, cdefghi, hijkl, hicdAssembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl

abcdefghijkl

Slide5

Strategy

Characterize all possible assemblies of the given read set

Assign a probability to each assembly

Pick the assembly with the highest probability

Slide6

All possible assemblies

A = valid assembly of reads if

each read appears in A at least once.

Nothing else appears, i.e. A is made by pasting together the reads in possibly overlapping faction

Can we compute/represent the set

{A | A is a valid assembly}

Overlap/String Graph!

Slide7

Overlap Graph: Intuition

Vertices = Reads.

Edge from read

u

to read

v

if read

u, read v likely to overlap in assembly.e.g. abcd, cdef => abcdefAssembly = walk in the graph: will encourage overlaps

Slide8

Overlap graph

Vertices: reads + empty read

ϕ

Edges: (

r

i

,

rj) : if long suffix of ri = prefix of rj abcd cdefghiLong = ? Real genomes: 50? 100? Any value that indicates overlap is not coincidental Edge label: portion of rj not belonging to overlap.abcd

cdefghi

efghi

(Long = 2)

Slide9

Overlap graph, long=2

abcd

cdefghi

hijkl

hicd

ϕ

abcd

efghi

jkl

cd

efghi

ϕ

ϕ

hicd

ϕ

Slide10

Overlap graph, long=2

abcd

cdefghi

hijkl

hicd

ϕ

abcd

efghi

jkl

cd

efghi

ϕ

ϕ

hicd

ϕ

Slide11

Walk => Assembly

Assembly = Walk in the overlap graph which

Starts at

ϕ

,

Ends at

ϕPasses through every vertex at least once.Passes through every edge at least once?Assembled sequence = concatenation of labels along the walk.Every read appears in the sequenceWalk revisits ϕ: reconstruction is incomplete, in several pieces.

Slide12

Assembly => Walk

Input: Assembly A

Output: Walk which will generate A

Visit vertices in the order of appearance in A

Overlap graph characterizes assemblies.

Variations on graph also studied.

Slide13

Approaches to assembly

Occam’s Razor: Most likely = “Shortest”

Shortest walk that visits every vertex at least once: NP-hard

Shortest walk that visits every edge at least once: Chinese Postman problem.

Polytime

.

Pragmatic: Use some greedy approach to find above.

Model probability more accurately

Slide14

A Twist: pair constraints

Sequencing process may give additional constraints:

distance from

r

i

to

r

j in assembly is about DExample: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl

Slide15

Systematic estimation of probability of a given assembly

Slide16

Algebraic representation of walks

Walk is cyclic:

number of times vertex entered

= number of times it is exited.

Walk = fluid flow

Total fluid coming in = Total fluid going out

Xij = fluid going from i to j. = number of times walk goes from i to jFormulate conditions on Xij and solve

Slide17

Algebraic representation: X

ij

j

X

ij

= Number of times walk goes from i to j.δj = Number of times walk goes over jLij = Length of label of edge (i,j)Length of genome L =L may be known.

Slide18

Maximum likelihood reconstruction

(

Medvedev-Brudno

08)

Goal: Find assembly A most likely given the observations.

maximize

Pr(A

| r1,r2,…rn)=Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn

)

=Pr(r

1

,r

2

,…,

r

n

|A

) *

Pr(A

) / Pr(r

1

,r

2

,…,

r

n

)

Standard assumption: Unconditional probability

Pr(A

) same for all A.

Maximize Pr(r

1

,r

2

,…,

rn|A

) and output A that maximizes.

Slide19

Computing Pr(r1

,r

2

,…,

r

n

|A

) A = abcdefghicdefghijkl r1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicdProcess of generating reads:Pick a random starting point.Pick a length at randomPr(r2) = ? = 2/Length * Pr(read length = 7) = δ

2

/Length *

Pr(read

length = 7)

Slide20

Computing Pr(r1

,r

2

,…,

r

n

|A

) Generative model: For i=1 to nPick starting point for riPick length LiProbability of generating ri:= Number of times ri appears in A/Length of A * Probability of getting the correct length= δi/L

*

Pr(L

i

)

Slide21

Computing Pr(r1

,r

2

,…,

r

n

|A

) Pr = Πi δi/L * Pr(Li)We want to pick A for which this probability is maximumScore(A) = Πi δi/L * Pr(Li

)

Best A will have max value of

Π

i

δ

i

/L

So now we have a program

Slide22

Finding the best assembly

Maximize

Π

i

δ

i

/Ls.t.L known approximately. L = Lgeε ≈ Lg(1+ε)Lg, Lij : constants. Solve for Xij≥0, δj≥0, εConvex optimization

Slide23

Concluding Remarks

Experiments seem to indicate our approach works well

.

www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf

Computationally intensive, but well founded.

May not be useful for large genomes – linear time algorithms only!

How to handle pair constraints: important open problem.

Graphs are everywhere!