CSE182-L16 LW statistics/Assembly

miller
Uploaded On 2024-02-09

Presentation Transcript

1. CSE182-L16: LW statistics/Assembly

2. Silly Quiz: Who are these people, and what is the occasion?

3. Genome Sequencing and Assembly

4. Sequencing: A break at T is shown here. Measuring the lengths using electrophoresis allows us to get the position of each T. The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides. (November 13, Bafna)

5. Automated detectors 'read' the terminating bases. The signal decays after 1000 bases.

6. Sequencing Genomes: Clone by Clone. Clones are constructed to span the entire length of the genome. These clones are ordered and oriented correctly (mapping). Each clone is sequenced individually.

7. Shotgun Sequencing: Shotgun sequencing of clones was considered viable. However, researchers in 1999 proposed shotgunning the entire genome.

8. Library: Create vectors of the sequence and introduce them into bacteria. As the bacteria multiply, you will have many copies of the same clone.

9. Whole Genome Shotgun: Break up the entire genome into pieces. Sequence the ends, and assemble using a computer. LW statistics and repeats argue against the success of such an approach. Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together.

11. Shotgun Sequencing: Shotgun sequencing of clones was considered viable for small genomes. However, researchers in 1999 proposed shotgunning the entire genome.

12. Massively parallel sequencing: Sanger sequencing allows us to sequence <=1000 bp in one lane, up to 96 lanes, in one run. Today, we can sequence many Mbp in a single run.

15. Questions. Algorithmic: How do you put the genome back together from the pieces? Statistical: How many pieces do you need to sequence, etc.? The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.

16. Lander Waterman Statistics: [Figure: fragments of length L on a genome of length G.] The fragments fall randomly on the genome. Overlapping fragments form islands of contiguous sequence. Ideally, we want one island for each chromosome. How many fragments should we sequence?

17. Lander Waterman Statistics: [Figure: genome of length G, fragments of length L.]

18. LW statistics: questions. As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island of overlapping contigs. Q1: What is the expected number of islands? The number should increase at first, and gradually decrease.

19. Analysis: Expected Number of Islands. Computing the expected # of islands: let $X_i = 1$ if an island ends at position $i$, and $X_i = 0$ otherwise. Number of islands = $\sum_i X_i$. Expected # of islands = $E(\sum_i X_i) = \sum_i E(X_i)$.

20. Prob. of an island ending at i: $E(X_i)$ = Prob(island ends at position $i$) = Prob(a clone began at position $i-L+1$ AND no clone began in the next $L-T$ positions). [Figure: a clone of length L ending at position i, with overlap T.]
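The probability above can be written out explicitly. What follows is a sketch of the standard Lander-Waterman calculation, assuming N clones of length L placed uniformly at random on a genome of length G, so that a clone starts at any given position with probability $\alpha = N/G$. The symbols $\alpha$ and $\sigma$ are notation introduced here for illustration; c = LN/G is the coverage.

```latex
\begin{align*}
E(X_i) &= \alpha\,(1-\alpha)^{L-T}, \qquad \alpha = N/G \\
       &\approx \alpha\, e^{-\alpha (L-T)} = \frac{N}{G}\, e^{-c\sigma},
         \qquad c = \frac{LN}{G},\quad \sigma = 1 - \frac{T}{L} \\
\sum_{i=1}^{G} E(X_i) &\approx G \cdot \frac{N}{G}\, e^{-c\sigma} = N e^{-c\sigma}
\end{align*}
```

The second line uses $(1-\alpha)^{L-T} \approx e^{-\alpha(L-T)}$ for small $\alpha$, together with $\alpha (L-T) = c\sigma$.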

21. Computing # islands: As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. Q1: What is the expected number of islands? Ans: $N e^{-c\sigma}$, where $\sigma = 1 - T/L$. The number increases at first, and gradually decreases.
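One way to sanity-check the expected number of islands is a small simulation. The sketch below is illustrative only: it assumes clones start at uniformly random integer positions, that two consecutive clones are joined when they overlap by at least T bases, and that the closed form is $N e^{-c\sigma}$ with $\sigma = 1 - T/L$. The function names and the toy parameters (G = 100,000, N = 400) are hypothetical choices, not values from the lecture.

```python
import math
import random

def count_islands(starts, L, T):
    """Count islands: consecutive clones join if they overlap by at least T bases."""
    starts = sorted(starts)
    islands = 1
    for prev, cur in zip(starts, starts[1:]):
        if cur - prev > L - T:  # overlap is shorter than T, so a new island begins
            islands += 1
    return islands

def simulate_islands(G, L, T, N, trials=200, seed=0):
    """Average island count over several random placements of N clones."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = [rng.randrange(G) for _ in range(N)]
        total += count_islands(starts, L, T)
    return total / trials

def expected_islands(G, L, T, N):
    """Closed-form approximation N * exp(-c * sigma)."""
    c = L * N / G
    sigma = 1 - T / L
    return N * math.exp(-c * sigma)

G, L, T, N = 100_000, 500, 50, 400  # toy genome, coverage c = 2
simulated = simulate_islands(G, L, T, N)
predicted = expected_islands(G, L, T, N)
print(f"simulated: {simulated:.1f}, predicted: {predicted:.1f}")
```

The two numbers should agree closely; small differences come from the exponential approximation and from ignoring the genome ends.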

22. Expected # of clones in an island: Expected # of clones in an island = $N / (N e^{-c\sigma}) = e^{c\sigma}$, with $\sigma = 1 - T/L$. Q: How? Why do we care? Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.
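To illustrate how such an equation could be used to determine genome length, here is a minimal sketch. It assumes the standard Lander-Waterman result that the expected number of clones per island is $e^{c\sigma}$ with $\sigma = 1 - T/L$: the observed mean is inverted to get the coverage c, and then G = LN/c. The function and parameter names are hypothetical.

```python
import math

def estimate_genome_length(n_clones, clone_len, min_overlap, mean_clones_per_island):
    """Estimate G from the observed mean number of clones per island."""
    sigma = 1 - min_overlap / clone_len
    c = math.log(mean_clones_per_island) / sigma  # invert mean = exp(c * sigma)
    return clone_len * n_clones / c               # coverage c = L * N / G

# Round trip on synthetic numbers: start from a known G and recover it.
true_G, L, T, c = 3_000_000_000, 500, 50, 8
N = c * true_G / L
observed_mean = math.exp(c * (1 - T / L))
estimate = estimate_genome_length(N, L, T, observed_mean)
print(f"estimated genome length: {estimate:.3e}")
```

In practice the observed mean is noisy, so this gives a rough estimate rather than an exact length.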

23. Problem 1: size of contigs. Islands might simply be too small in length. $\sigma = (1 - T/L) = (1 - 50/500) = 0.9$, $c = 8$. #Islands = $N e^{-c\sigma}$ = 36K. Size of an island = 54K. Not enough to make it an acceptable assembly! PLUS, there is the problem of repeats, chimerism, etc.
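The island count above can be reproduced with a few lines of arithmetic. This sketch assumes a genome length of G = 3000 Mb (the figure used elsewhere in the lecture for the overlap-detection estimate) and the formula $N e^{-c\sigma}$ with $\sigma = 1 - T/L$; the variable names are illustrative.

```python
import math

G = 3_000_000_000  # assumed genome length in bp (3000 Mb)
L = 500            # clone (read) length in bp
T = 50             # minimum overlap needed to join two clones
c = 8              # coverage

sigma = 1 - T / L                   # fraction of a clone that must be non-overlapping
N = c * G / L                       # number of clones implied by c = L * N / G
islands = N * math.exp(-c * sigma)  # expected number of islands

print(f"sigma = {sigma:.1f}, N = {N:.2e}, expected islands = {islands:,.0f}")
```

With these inputs the expected number of islands comes out at roughly 36 thousand, in line with the figure quoted on the slide.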

24. Assembly Basics: Three main components: Overlap, Layout, Consensus.

25. Overlap: Given a pair of fragments s1 and s2, do they belong together? Yes, if a prefix of s2 matches a suffix of s1. How would you compute such a match?

26. Overlap: S[i,j] = optimum score of an alignment of s1[1..i] against a substring of s2 that starts anywhere, but ends in j: s2[*..j]. The best prefix-suffix alignment is given by max_i { S[i,n] }, where n is the length of s2.
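The recurrence for S[i,j] can be sketched as a small dynamic program. This is an illustrative implementation, not the lecture's code: the scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions. Row 0 is all zeros so that the alignment may start anywhere in the second string, and the answer is the maximum over the last column.

```python
def prefix_suffix_overlap(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning a prefix of `a` against a suffix of `b`.

    S[i][j] is the best score of a[:i] aligned to a substring of b ending at j.
    Returns (score, i), where a[:i] is the prefix taking part in the overlap.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # S[0][j] = 0: free start in b
    for i in range(1, m + 1):
        S[i][0] = i * gap                      # a[:i] aligned against nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    best_i = max(range(m + 1), key=lambda i: S[i][n])  # max over the last column
    return S[best_i][n], best_i

# The prefix "GATT" of the first read matches the suffix of the second.
score, prefix_len = prefix_suffix_overlap("GATTACA", "CCCGATT")
print(score, prefix_len)
```

To test whether a prefix of s2 matches a suffix of s1, call `prefix_suffix_overlap(s2, s1)`. The table costs O(mn) time per pair of reads.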

27. Overlap Detection: Compute the best prefix-suffix alignments between each pair of fragments. Keep the "high-scoring" ones as evidence of true overlap. What is the problem?

28. Overlap detection problem: Consider the number of fragments. The LW statistics say that we need good coverage (c = 8, 10) to get most of the base pairs. G = 3000 Mb, L = 500. Coverage LN/G = 10, so $N = 10 \times 3 \times 10^9 / 500 = 6 \times 10^7$. Number of comparisons needed = $3.6 \times 10^{15}$. Number of alignments per minute = 6. Number of compute nodes = 100. Time needed (number of years): not good! (Only a small fraction are true overlaps.)
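The scale of the all-pairs comparison can be checked with a short calculation. The sketch below uses the values from the slide; the alignment rate of 6 per minute is taken as printed and may have lost an exponent in extraction, so the resulting time is only indicative. Variable names are illustrative.

```python
G = 3_000_000_000          # genome length in bp (3000 Mb)
L = 500                    # read length in bp
coverage = 10              # c = L * N / G
alignments_per_minute = 6  # as printed on the slide; treat as an assumption
nodes = 100                # number of compute nodes

N = coverage * G // L      # number of reads needed for the target coverage
comparisons = N * N        # all-against-all pairwise alignments, as on the slide
minutes = comparisons / (alignments_per_minute * nodes)
years = minutes / (60 * 24 * 365)

print(f"N = {N:.1e}, comparisons = {comparisons:.1e}, years = {years:.1e}")
```

Whatever the exact rate, the quadratic number of comparisons is what makes the naive approach impractical.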

29. k-mer based overlap (Pigeonhole principle again): Consider a 25bp sequence. Expected number of occurrences in the genome = $3 \times 10^9 \times 4^{-25} \approx 2.7 \times 10^{-6}$. A 25-bp sequence is effectively unique in the genome! Two overlapping sequences should share a 25-mer. Two non-overlapping sequences should not!
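The uniqueness argument rests on a simple expectation. The sketch below assumes a random genome with independent, equally likely bases (a simplification that real genomes, with their repeats, do not satisfy); the function name is hypothetical.

```python
def expected_occurrences(genome_len, k):
    """Expected number of times a fixed k-mer occurs in a random genome."""
    return genome_len * 4.0 ** (-k)

for k in (10, 15, 20, 25):
    print(k, expected_occurrences(3_000_000_000, k))
```

For k = 25 the expectation is on the order of $10^{-6}$, so a chance match between two unrelated reads is very unlikely, whereas short k-mers such as k = 10 occur thousands of times.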

30. Sorting k-mers: Build a list of k-mers that appear in the sequences and their reverse complements. Create a record with 4 entries: k-mer, sequence number, position in the sequence, reverse complementation flag. Sort a vector of these according to k-mer. How many records per k-mer are expected? If the number of records exceeds a threshold, discard (why?). [Figure: record layout: k-mer, S.id, Pos.]
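The record-and-sort scheme can be sketched as follows. This is a minimal illustration, not any assembler's actual code: the function names, the tuple layout, and the `max_records` threshold are assumptions. K-mers with too many records are skipped because they most likely come from repeats and would generate a flood of spurious candidate pairs.

```python
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmer_records(reads, k):
    """One record per k-mer: (k-mer, sequence number, position, reverse-complement flag)."""
    records = []
    for seq_id, read in enumerate(reads):
        for is_rc, s in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(s) - k + 1):
                records.append((s[pos:pos + k], seq_id, pos, is_rc))
    return records

def candidate_pairs(reads, k, max_records=10):
    """Pairs of reads that share at least one non-repetitive k-mer."""
    records = sorted(kmer_records(reads, k))  # sorting brings equal k-mers together
    pairs = set()
    for _, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue  # too frequent: likely a repeat, so discard this k-mer
        ids = sorted({r[1] for r in group})
        pairs.update(combinations(ids, 2))
    return pairs

reads = ["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"]
print(candidate_pairs(reads, k=5))
```

Only the candidate pairs returned here would then be passed on to the more expensive alignment step.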

31. Alignment module: Coalesce k-mer hits into longer, gap-free partial alignments. These extended k-mer hits are saved. For each pair of sequences, form a directed graph. For each maximal path in the graph, construct an alignment. Refine the alignment via banded DP.

32. Problem 2: Size. Islands might simply be too small in length. $\sigma = (1 - T/L) = (1 - 50/500) = 0.9$, $c = 8$. #Islands = $N e^{-c\sigma}$ = 36K. Size of an island = 54K. Not enough to make it an acceptable assembly! PLUS, there is the problem of repeats, chimerism, etc.

33. Solution 2: Clones can have mate-pairs. Recall that we sequence about 1000bp of the end of a clone. If we sequenced both ends, we get extra information, particularly if we know the length of the original clone.

34. Mate Pairs: Mate-pairs allow you to merge islands (contigs) into super-contigs.

35. Super-contigs are quite large: Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small. Use the mate-pairs to order and orient the contigs, and make super-contigs.
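To see why a predictable clone length matters, consider estimating the gap between two contigs from a single mate pair. The sketch below is a simplified illustration under stated assumptions: read 1 lies on the forward strand of contig A starting at `read1_start`, its mate lies on contig B and ends at `read2_end`, and contig B follows contig A in the same orientation. The function and parameter names are hypothetical.

```python
def estimate_gap(clone_len, len_a, read1_start, read2_end):
    """Estimate the gap between contig A and contig B from one mate pair.

    The clone spans from read1_start on A, across the gap, to read2_end on B:
        clone_len = (len_a - read1_start) + gap + read2_end
    """
    covered_in_a = len_a - read1_start  # part of the clone lying on contig A
    covered_in_b = read2_end            # part of the clone lying on contig B
    return clone_len - covered_in_a - covered_in_b

# A 2 Kb clone with 800 bp on contig A and 700 bp on contig B leaves a 500 bp gap.
print(estimate_gap(clone_len=2000, len_a=5000, read1_start=4200, read2_end=700))
```

Any error in the assumed clone length carries over directly into the gap estimate, which is why the variance of the clone lengths should be small. Averaging over several mate pairs linking the same two contigs would reduce the noise.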

36. Problem 3: Repeats

37. Repeats & Chimerisms: 40-50% of the human genome is made up of repetitive elements. Repeats can cause great problems in the assembly! Chimerism causes a clone to be from two different parts of the genome. This can again give a completely wrong assembly.

38. Repeat detection: Lander Waterman strikes again! The expected number of clones in a repeat-containing island is MUCH larger than in a non-repeat-containing island (contig). Thus, every contig can be marked as unique or non-unique. In the first step, throw away the non-unique islands.

39. Detecting Repeat Contigs 1: Read Density. Compute the log-odds ratio of two hypotheses. H1: The contig is from a unique region of the genome. H2: The contig is from a region that is repeated at least twice.
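A log-odds ratio of this kind can be sketched with a Poisson model of read counts. The assumptions here are mine, not the slide's: the number of reads starting in a unique contig of length ρ is Poisson with mean ρN/G, and the alternative hypothesis is simplified to exactly two collapsed copies, giving mean 2ρN/G. The ratio of the two likelihoods then reduces to ρN/G - k ln 2 for k observed reads. The function name is hypothetical.

```python
import math

def log_odds_unique(contig_len, n_reads, total_reads, genome_len):
    """Natural-log odds that a contig is unique rather than a collapsed 2-copy repeat.

    Unique:  n_reads ~ Poisson(lam),     lam = contig_len * total_reads / genome_len
    Repeat:  n_reads ~ Poisson(2 * lam)
    log [P(k | unique) / P(k | repeat)] = lam - k * ln(2)
    """
    lam = contig_len * total_reads / genome_len
    return lam - n_reads * math.log(2)

N, G = 60_000_000, 3_000_000_000  # about 10x coverage with 500 bp reads
print(log_odds_unique(10_000, 200, N, G))  # read density as expected for a unique region
print(log_odds_unique(10_000, 400, N, G))  # twice the expected density
```

A clearly positive score supports the unique hypothesis and a clearly negative one suggests a repeat; contigs with scores near zero are too short or too sparsely covered to call.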

40. Detecting Chimeric reads: Chimeric reads are reads that contain sequence from two genomic locations. Good overlaps: G(a,b) if a and b overlap with a high score. Transitive overlap: T(a,c) if G(a,b) and G(b,c). Find a point x across which only transitive overlaps occur. x is a point of chimerism.

41. Whole genome shotgun: Input: shotgun sequence fragments (reads) and mate pairs. Output: a single sequence created by consensus of overlapping reads. The first generation of assemblers did not include mate-pairs (Phrap, CAP..). Second generation: CA, Arachne, Euler.

42. Assembly: Use k-mers to detect potential overlaps; use alignments to build contig graphs; decide the unique contigs based on LW statistics; discard repeat contigs; break chimeric contigs; use mate-pairs to build scaffolds; fill gaps.

44. Consensus Derivation: The consensus sequence is created by converting pairwise read alignments into multiple-read alignments.

45. Summary: Whole genome shotgun is now routine: Human, Mouse, Rat, Dog, Chimpanzee.. Many prokaryotes (one can be sequenced in a day). Plant genomes: Arabidopsis, Rice. Model organisms: Worm, Fly, Yeast. A lot is not known about genome structure, organization and function. Comparative genomics offers low-hanging fruit.

46. Final exam syllabus: Take home. The entire course, but emphasis will be given to post-midterm lectures: HMMs, gene finding, mass spectrometry, microarray analysis, genome sequencing and assembly.

47. What we did not cover