Slide1
Genome assembly
BIO 446L/546L- Fall 2019
Purbendra yogi
Slide2
Introduction
Genome assembly is the process of taking many short DNA sequences and combining them to reconstruct the original chromosome.
First-generation assembly began in the late 1980s and early 1990s.
In 1977, the first genome (phiX174; 5.4 kb) was assembled by hand.
In the 1980s, work began on the yeast and worm genomes.
Shotgun sequencing was developed.
Slide3
Introduction
In 1995, TIGR used the shotgun sequencing technique to sequence the genome of the bacterium Haemophilus influenzae.
Hierarchical or BAC approach (50-150 kbp).
Sequencing of model organisms such as bacteria, yeast, worms, flies and mice.
In 2000, the full genome (175 Mb) of Drosophila melanogaster was successfully assembled.
Human genome sequencing (both WGS and BAC).
Slide4
Why assembly
Determining the DNA sequence of an organism.
Useful in biological research.
In medicine, it can be used to identify and diagnose genetic diseases and potentially to develop treatments for them.
Research into pathogens may lead to treatments for contagious diseases.
Slide5
Steps to assemble a genome
Find overlapping reads.
Merge good pairs of reads into contigs.
Link contigs to form supercontigs.
Derive the consensus sequence.
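The first two steps can be illustrated with a minimal Python sketch. It assumes error-free reads and exact suffix-prefix matching; the function names and the `min_len` threshold are illustrative choices, not part of the original slides.

```python
from typing import Optional


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_possible = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(max_possible, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Merge `b` onto the end of `a` if they overlap; otherwise return None."""
    length = overlap_length(a, b, min_len)
    if length == 0:
        return None
    # Keep `a` whole and append only the part of `b` beyond the overlap.
    return a + b[length:]
```

For example, the reads ATGCGT and CGTTAC share the three bases CGT, so merging them gives ATGCGTTAC. Real assemblers tolerate sequencing errors by using approximate alignment instead of exact matching.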
Slide6
Some terminology
Reads: sequences of about 500-900 bases produced by the sequencing machine.
Mate pair: two reads from the two ends of the same insert fragment.
Contigs: contiguous sequences formed by several overlapping reads.
Scaffolds: ordered and oriented sets of contigs.
Consensus: the sequence derived from the multiple alignment of the reads in a contig.
Image source: sciencedirect.com
Slide7
Short and long reads
Long reads: high error rates (PacBio / Nanopore).
Short reads: high accuracy (Illumina).
Read length limits the ability to resolve repeats.
Image source: Wikipedia
Slide8
Comparative assemblers
Reference based.
Find the best overlapping reads.
Merge reads into longer contigs.
Form supercontigs.
Derive the consensus sequence.
Image source: ResearchGate
Slide9
BAC genome assembly
Create a physical map of the whole genome.
The genome is randomly cut into several pieces.
The pieces are cloned into BACs.
Overlap is estimated from the number of shared sequence-tagged sites (STS).
Reads are constructed based on the STS.
Accurate assembly for highly repetitive and polyploid genomes.
Image source: ResearchGate
Slide10
Whole genome shotgun assembly
Genomic DNA is randomly shredded into small pieces (about 2,000 bp).
Each fragment is inserted into a plasmid.
The plasmid library is sequenced: about 500 bp from each end of each fragment are decoded, generating millions of sequences.
Overlap: the sequence overlaps between all available reads are computed.
Layout: the reads are arranged according to their pattern of overlap, producing a multiple alignment of the reads.
Consensus: a contig is generated by calculating the consensus base at each position of the layout.
Mate-pair information is used to order and orient the contigs and place them into larger structures called scaffolds.
Image source: Wikipedia
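The consensus step can be sketched as a majority vote over the columns of the layout. This is a simplified illustration, not the method of any particular assembler: it assumes the layout is already known as (offset, read) pairs with no insertions or deletions, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def consensus_from_layout(layout: List[Tuple[int, str]]) -> str:
    """Call the consensus base at each column of a gap-free layout.

    `layout` holds (offset, read) pairs, where offset is the column at
    which the read starts in the contig. Uncovered columns become "N".
    """
    contig_length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(contig_length)]
    # Tally the bases that each read contributes to each column.
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # The most frequent base in a column is taken as the consensus base.
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )
```

With the layout [(0, "ATGCGT"), (3, "CGTTAC"), (2, "GCCTT")], the third read disagrees with the other two at one column and is outvoted, so the consensus is ATGCGTTAC.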
Slide11
Human genome project
Slide12
De novo assembly
De novo assembly is not reference based.
Extract DNA.
Sequence the DNA using an appropriate technology.
Assembly process, in steps:
Get the read file from the sequencing machine.
Check the quality of the reads.
Clean up and trim the reads.
Assemble the data into contigs and scaffolds.
Check the quality of the assembly.
Slide13
Quality check of the reads
The total number of reads.
Read length.
Percent GC content.
Possible contamination: barcodes and adaptor sequences.
Presence of a large number of N's in the reads.
Source: Melbourne Bioinformatics
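The checks listed above can be computed directly from the read sequences. The sketch below is a toy illustration that assumes the reads are already loaded as plain strings; the function name, the returned keys, and the adaptor argument are hypothetical, and dedicated quality-control tools report far more detail.

```python
from typing import Dict, List


def read_qc_summary(reads: List[str], adaptor: str) -> Dict[str, float]:
    """Summarize basic quality metrics for a list of read sequences."""
    if not reads:
        raise ValueError("no reads given")
    upper = [r.upper() for r in reads]
    total_bases = sum(len(r) for r in upper)
    if total_bases == 0:
        raise ValueError("all reads are empty")
    gc_bases = sum(r.count("G") + r.count("C") for r in upper)
    n_bases = sum(r.count("N") for r in upper)
    return {
        "total_reads": len(upper),
        "mean_read_length": total_bases / len(upper),
        "gc_percent": 100.0 * gc_bases / total_bases,
        "n_percent": 100.0 * n_bases / total_bases,
        # Reads that still contain the adaptor hint at contamination.
        "reads_with_adaptor": sum(adaptor.upper() in r for r in upper),
    }
```

For the two reads ATGC and GGNN, the summary reports 2 reads, a mean length of 4, 50% GC and 25% N.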
Slide14
Cleanup/Trimming
Trim adaptors, barcodes and other contamination.
Measure the average base quality of the reads.
Count the number of orphaned reads.
Count the number of pairs lost.
Look at gaps or regions of "N"s.
Check for misassemblies.
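Adaptor and quality trimming can be sketched in a few lines of Python. This is a simplified illustration assuming Phred+33 quality strings, an exact adaptor match, and trimming from the 3' end only; the function names and the default threshold of 20 are assumptions, not values from the slides.

```python
from typing import Tuple


def trim_adaptor(seq: str, qual: str, adaptor: str) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of the adaptor."""
    idx = seq.find(adaptor)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(
    seq: str, qual: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove bases from the 3' end while their quality is below `min_q`."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def mean_quality(qual: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)
```

Applying `trim_adaptor` and then `trim_low_quality_tail` to each read, and comparing read counts before and after, gives the number of reads and pairs lost to trimming.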
Slide15
Assembly algorithms
The greedy graph
Remove any short input reads.
Find the ordered pair of reads with maximum overlap.
Find the best matching pairs.
Compare pairwise similarities.
Read overlaps that conflict with already constructed contigs are ignored.
Image source: Wikipedia
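The greedy strategy can be sketched as repeatedly merging the pair of sequences with the largest overlap. This is a minimal illustration assuming error-free reads from a single strand and exact overlaps. Dropping reads that are contained in another read is an assumption made for this sketch, and all names are hypothetical.

```python
from itertools import permutations
from typing import List


def overlap_length(a: str, b: str, min_len: int) -> int:
    """Longest exact suffix of `a` matching a prefix of `b` (0 if none)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    unique = list(dict.fromkeys(reads))  # drop duplicates, keep order
    # Assumption: reads fully contained in another read add nothing.
    contigs = [
        r for r in unique if not any(r != o and r in o for o in unique)
    ]
    while True:
        best_len, best_pair = 0, None
        # Examine every ordered pair and remember the largest overlap.
        for a, b in permutations(contigs, 2):
            length = overlap_length(a, b, min_len)
            if length > best_len:
                best_len, best_pair = length, (a, b)
        if best_pair is None:
            return contigs
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_len:])
```

Because each merge is chosen locally, repeats can mislead a greedy assembler into joining reads that come from different copies of the repeat.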
Slide16
Graph based approaches
OLC graphs and de Bruijn graphs.
Overlap: overlaps among the set of reads are detected, and the overlap information is organized into a graph.
Layout: the graph is used to determine an appropriate ordering and orientation of the reads.
Consensus: the contig sequence is computed from the ordered and oriented reads.
De Bruijn graph: each read is broken into a sequence of overlapping k-mers. The k-mers are added as vertices to the graph, and k-mers that originate from adjacent positions in a read are linked by an edge.
Image source: sysbio.unl.edu
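The de Bruijn construction described above, with k-mers as vertices and edges between k-mers from adjacent positions in a read, can be sketched as follows. The function names are hypothetical, and the path-spelling helper only handles the simplest case of an error-free graph that forms a single unbranched path.

```python
from typing import Dict, List, Set


def de_bruijn_graph(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph: Dict[str, Set[str]] = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # every k-mer is a vertex
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # adjacent positions are linked
    return graph


def spell_unbranched_path(graph: Dict[str, Set[str]]) -> str:
    """Spell the contig of a graph that has exactly one start vertex."""
    has_incoming = {t for targets in graph.values() for t in targets}
    starts = [v for v in graph if v not in has_incoming]
    if len(starts) != 1:
        raise ValueError("expected exactly one start vertex")
    node = starts[0]
    contig, seen = node, {node}
    # Follow edges while the path is unambiguous and unvisited.
    while len(graph[node]) == 1:
        node = next(iter(graph[node]))
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig
```

With the reads ATGCG and GCGTA and k = 3, both reads contribute the k-mer GCG, which joins them into one path that spells ATGCGTA.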
Slide17
Mate pairs
Mate-pair information is used to infer links between contigs.
Contig adjacencies are determined based on mate-pair information.
A linear arrangement of the contigs is determined.
Mate pairs are used to order and orient contigs with respect to each other (scaffolding).
Mate-pair information is used to resolve repeat regions.
Source: Theory and practice of genome sequence assembly
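The ordering part of scaffolding can be sketched by counting mate-pair links between contigs and chaining the best-supported ones. This is a deliberately reduced illustration: it assumes each link (a, b) already means "contig a lies before contig b", and it ignores orientation and insert-size distances. The names and the `min_links` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def scaffold_order(
    contigs: List[str],
    mate_links: List[Tuple[str, str]],
    min_links: int = 2,
) -> List[List[str]]:
    """Chain contigs into scaffolds using well-supported mate-pair links."""
    support = Counter((a, b) for a, b in mate_links if a != b)
    successor: Dict[str, str] = {}
    has_predecessor: Set[str] = set()
    # Consider the strongest links first; weaker conflicting links lose.
    for (a, b), count in support.most_common():
        if count < min_links:
            continue
        if a in successor or b in has_predecessor:
            continue
        # Skip a link that would close a cycle back to `a`.
        node = b
        while node in successor:
            node = successor[node]
        if node == a:
            continue
        successor[a] = b
        has_predecessor.add(b)
    scaffolds = []
    for contig in contigs:
        if contig in has_predecessor:
            continue  # it will appear inside another chain
        chain = [contig]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds
```

Requiring at least two supporting mate pairs per link is a common-sense guard against chimeric pairs; the right threshold depends on coverage.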
Slide18
Assemble contigs/scaffolds
Contigs: the reads are assembled into long consensus sequences called contigs.
Scaffolding: misassembled contigs are detected and corrected, and contigs are joined into scaffolds.
Paired-end reads are used as a guide map to order and orient contigs.
Paired-end data are useful for detecting chimeric contigs.
Gap filling: the gaps between contigs are carefully filled using other independent reads to complete the assembly.
Image source: ResearchGate, licensed under CC BY
Slide19
Main problems of assembly
Repeat sequences.
Low coverage.
Biased sequencing.
Error rate.
Chimeric reads.
Adaptor and other contamination.
Heterozygosity (bubbles).
Image sources: UsearchV11, Wikipedia
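The effect of low coverage can be estimated with a short calculation. The sketch below assumes the idealized Lander-Waterman model, in which reads are sampled uniformly at random, so the expected fraction of bases covered by no read is about e^(-c) for mean coverage c. The function names and the example numbers are illustrative, not taken from the slides.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Expected fraction of the genome covered by no read (Poisson model)."""
    return math.exp(-coverage)
```

For example, one million 500 bp reads from a 100 Mb genome give 5x coverage, which under this model leaves roughly 0.7% of the bases uncovered. Biased sequencing makes the real figure worse than the model predicts.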
Slide20
Conclusion
Slide21
Literature cited
Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287:2185-95.
Simpson JT, Pop M. The Theory and Practice of Genome Sequence Assembly. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742.
Gladman S. De novo Genome Assembly for Illumina Data. Melbourne Bioinformatics.
Miller JR, Koren S, Sutton G. Assembly Algorithms for Next-Generation Sequencing Data. J Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850-3343, USA.
Pop M. 2009. Genome assembly reborn: recent computational challenges. Brief Bioinform 10:354-366.
Slide22
Questions?
Thank you