Genome assembly BIO 446L/546L - Fall 2019

Uploaded On 2020-04-03

Presentation Transcript

Slide1

Genome assembly

BIO 446L/546L- Fall 2019

Purbendra yogi

Slide2

Introduction

Genome assembly is the process of taking many short DNA sequences and combining them to reconstruct the original chromosome.
First-generation assembly began in the late 1980s and early 1990s.
In 1977, the first genome (phiX174; 5.4 kb) was assembled by hand.
In the 1980s, work on the yeast and worm genomes.
The development of shotgun sequencing.

Slide3

Introduction

In 1995, TIGR used the shotgun sequencing technique to sequence the genome of the bacterium Haemophilus influenzae.
Hierarchical (BAC-based) approach, with inserts of 50-150 kbp.
Sequencing of model organisms such as bacteria, yeast, worms, flies, and mice.
In 2000, the full genome (175 Mb) of Drosophila melanogaster was successfully assembled.
Human genome sequencing (both WGS and BAC).

Slide4

Why assembly

Determining the DNA sequence of an organism.
Useful in biological research.
In medicine, it can be used to identify, diagnose, and potentially develop treatments for genetic diseases.
Research into pathogens may lead to treatments for contagious diseases.

Slide5

Steps to assemble a genome

Find overlapping reads.
Merge good pairs of reads into contigs.
Link contigs to form supercontigs.
Derive the consensus sequence.
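The first two steps above can be illustrated with a minimal Python sketch. It assumes error-free reads and exact suffix-prefix matches; the function names `overlap_length` and `merge_reads` and the `min_len` threshold are hypothetical choices for illustration, not part of the slides.

```python
from typing import Optional


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Append `b` to `a` if they overlap by at least `min_len` bases."""
    size = overlap_length(a, b, min_len)
    if size == 0:
        return None  # no usable overlap, so the pair is not merged
    return a + b[size:]


if __name__ == "__main__":
    left, right = "ATGCGTAC", "GTACCTGA"
    print(overlap_length(left, right))  # the shared bases are "GTAC"
    print(merge_reads(left, right))
```

Real assemblers tolerate mismatches and use indexes rather than this brute-force comparison, so this is only a sketch of the idea.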

Slide6

Some terminology

Reads: sequences of 500-900 bases, delivered as a file by the sequencing machine.
Mate pair: the two ends of the same insert fragment.
Contigs: contiguous sequences formed by several overlapping reads.
Scaffolds: ordered and oriented sets of contigs.
Consensus: the sequence derived from the multiple alignment of reads in a contig.

sciencedirect.com

Slide7

Short and long reads

Long reads (PacBio / Nanopore): high error rates.
Short reads (Illumina): high accuracy, but their length limits the ability to resolve repeats.

Wikipedia

Slide8

Comparative assemblers

Reference based.
Find the best overlapping reads.
Merge reads into longer contigs.
Form supercontigs.
Derive the consensus sequence.

research gate
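A reference-based approach can be sketched roughly as follows. This is a minimal illustration, assuming ungapped placement of each read at the reference position with the fewest mismatches, followed by a per-column majority vote. The names `best_position`, `reference_guided_consensus`, and `max_mismatches` are hypothetical, and real comparative assemblers use indexed, gapped alignment instead.

```python
from collections import Counter
from typing import List, Tuple


def best_position(reference: str, read: str) -> Tuple[int, int]:
    """Return (position, mismatches) of the ungapped placement with fewest mismatches."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for x, y in zip(window, read) if x != y)
        if mismatches < best_mm:
            best_pos, best_mm = pos, mismatches
    return best_pos, best_mm


def reference_guided_consensus(reference: str, reads: List[str],
                               max_mismatches: int = 2) -> str:
    """Place reads on the reference, then take the most common base per column."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos, mismatches = best_position(reference, read)
        if pos < 0 or mismatches > max_mismatches:
            continue  # read is too long or too divergent to place
        for i, base in enumerate(read):
            columns[pos + i][base] += 1
    # Columns without any read are reported as "N" (a choice made for this sketch)
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


if __name__ == "__main__":
    reference = "ATGCGTACCTGAGGTT"
    # Reads from a hypothetical sample that differs from the reference at one base
    reads = ["ATGCGTGC", "GTGCCTGA", "CTGAGGTT"]
    print(reference_guided_consensus(reference, reads))
```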

Slide9

BAC genome assembly

Create a physical map of the whole genome.
The genome is randomly cut into several pieces.
Each piece is cloned into a BAC.
Overlap between clones is estimated from the number of shared STS (sequence-tagged sites).
Reads are constructed based on the STS.
Accurate assembly for highly repetitive and polyploid genomes.

research gate

Slide10

Whole genome shotgun assembly

Genomic DNA is randomly shredded into small pieces (2,000 bp).
Each fragment is inserted into a plasmid.
The plasmid library is sequenced: 500 bp from each end of each fragment are decoded, generating millions of sequences.
Overlap: sequence overlaps are computed between all available reads.
Layout: the reads are arranged according to their pattern of overlap, producing a multiple alignment of the reads.
Consensus: the contig is generated by calculating the consensus base at each position of the layout.
Mate-pair information is used to order and orient the contigs and place them into larger structures called scaffolds.

Wikipedia
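The consensus step can be sketched as a per-column majority vote over a layout. The sketch below assumes the layout is already known and given as hypothetical `(offset, read)` pairs on a shared coordinate axis, with no insertions or deletions. The function name `consensus` is an illustrative choice.

```python
from collections import Counter
from typing import List, Tuple


def consensus(layout: List[Tuple[int, str]]) -> str:
    """Majority-vote consensus from (offset, read) pairs placed on a common axis."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Columns not covered by any read are reported as "N"
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


if __name__ == "__main__":
    layout = [
        (0, "ATGCGTAC"),
        (4, "GTACCTGA"),
        (2, "GCGTTCCT"),  # contains one simulated sequencing error (T instead of A)
    ]
    print(consensus(layout))
```

With three reads covering the erroneous position, the two correct bases outvote the error, which is why higher coverage tends to give a more reliable consensus.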

Slide11

Human genome project

Slide12

De novo assembly

De novo assembly is not reference based.
Extract DNA.
Sequence the DNA using an appropriate technology.
Assembly process (different steps):
Get the read file from the sequencing machine.
Quality check of the reads.
Cleanup / trimming.
Assemble the data into contigs and scaffolds.
Check for quality.

Slide13

Quality check of the reads

The total number of reads.
Read length.
Percent GC content.
Possible contamination: barcodes, adaptor sequences.
Presence of a large number of N's in the reads.

Melbourne Bioinformatics
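Several of these checks are simple counts over the read set. The sketch below assumes the reads are already loaded as a non-empty list of uppercase strings. The function name `read_stats` and the dictionary keys are hypothetical, and dedicated QC tools report far more than this.

```python
from typing import Dict, List


def read_stats(reads: List[str]) -> Dict[str, float]:
    """Summarise read count, read lengths, GC content and N content."""
    lengths = [len(read) for read in reads]
    total_bases = sum(lengths)
    gc_bases = sum(read.count("G") + read.count("C") for read in reads)
    n_bases = sum(read.count("N") for read in reads)
    return {
        "total_reads": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": total_bases / len(reads),
        "gc_percent": 100.0 * gc_bases / total_bases,
        "n_percent": 100.0 * n_bases / total_bases,
    }


if __name__ == "__main__":
    example_reads = ["ATGCGTAC", "GGCCNNAT", "ATAT"]
    for name, value in read_stats(example_reads).items():
        print(f"{name}: {value}")
```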

Slide14

Cleanup/ Trimming

Trim adaptors, barcodes, and other contamination.
Measure the average base quality of the reads.
Number of orphaned reads.
Number of pairs lost.
Look at gaps or regions of "N"s.
Check for misassemblies.
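Adaptor trimming and base-quality measurement can be sketched as below. The sketch assumes FASTQ-style quality strings in Phred+33 encoding and exact adaptor matches only. The `ADAPTER` value is an example sequence to be replaced with the one used in the actual library preparation, and the function names and the `min_q` threshold are hypothetical.

```python
from typing import Tuple

ADAPTER = "AGATCGGAAGAGC"  # example adaptor; substitute the one used in your library prep


def trim_adapter(seq: str, qual: str, adapter: str = ADAPTER) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of the adaptor."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below `min_q`."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def mean_quality(qual: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


if __name__ == "__main__":
    seq = "ATGCGTACCTGA" + ADAPTER
    qual = "I" * 10 + "##" + "I" * len(ADAPTER)  # "I" is Q40, "#" is Q2
    seq, qual = trim_adapter(seq, qual)
    seq, qual = trim_low_quality_tail(seq, qual)
    print(seq, mean_quality(qual))
```

Production trimmers also handle partial adaptor matches, mismatches, and paired reads, which this sketch leaves out.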

Slide15

Assembly algorithms

The Greedy Graph

Remove any short input reads.
Find the ordered pair of reads with maximum overlap.
Find the best-matching pairs.
Compare pairwise similarities.
Read overlaps that conflict with already constructed contigs are ignored.

Wikipedia
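A greedy assembler can be sketched in a few lines: repeatedly merge the ordered pair with the largest suffix-prefix overlap until no pair overlaps enough. The sketch assumes error-free reads from a single strand. Dropping reads that are fully contained in another read is an assumption of this sketch, and the names `overlap_length`, `greedy_assemble`, and `min_len` are hypothetical.

```python
from itertools import permutations
from typing import List


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Merge the best-overlapping ordered pair until no overlap of min_len remains."""
    unique = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    # Assumption of this sketch: reads contained in another read add no information
    contigs = [r for r in unique
               if not any(r != other and r in other for other in unique)]
    while True:
        best_size, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            size = overlap_length(a, b, min_len)
            if size > best_size:
                best_size, best_a, best_b = size, a, b
        if best_size == 0:
            return contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_size:])


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGAGGTT"]))
```

Because each merge is chosen locally, a greedy strategy can join reads across a repeat incorrectly, which is one motivation for the graph-based approaches.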

Slide16

Graph based approaches

OLC graph and de Bruijn graphs.
Overlap: overlaps are detected among the set of reads, and the overlap information is organized into a graph.
Layout: the graph is used to determine an appropriate ordering and orientation of the reads.
Consensus: the contig sequence is computed from the ordered and oriented reads.
De Bruijn graph: each read is broken into a sequence of overlapping k-mers. The k-mers are added as vertices to the graph, and k-mers that originate from adjacent positions in a read are linked by an edge.

sysbio.unl.edu
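The de Bruijn construction described on this slide, with k-mers as vertices and edges between k-mers from adjacent read positions, can be sketched as follows. The sketch assumes error-free, single-strand reads. The names `build_kmer_graph` and `walk_unambiguous` are hypothetical, and the walk simply follows vertices that have exactly one successor.

```python
from typing import Dict, List, Set


def build_kmer_graph(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph: Dict[str, Set[str]] = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # every k-mer becomes a vertex
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # edge between adjacent positions in the read
    return graph


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Follow single-successor edges from `start` and spell out the sequence."""
    path, seen, node = [start], {start}, start
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a cycle
            break
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


if __name__ == "__main__":
    reads = ["ATGCGTAC", "GCGTACCT", "TACCTGA"]
    graph = build_kmer_graph(reads, k=4)
    print(len(graph), "vertices")
    print(walk_unambiguous(graph, "ATGC"))
```

A vertex with more than one successor marks a branch, for example at a repeat or a heterozygous site, and the walk stops there instead of guessing.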

Slide17

Mate pairs

Mate-pair information is used to infer links between contigs.
Neighboring contigs are determined based on mate-pair information.
A linear arrangement of the contigs is determined.
Mate pairs are used to order and orient contigs with respect to each other (scaffolding).
Mate-pair information is used to resolve repeat regions.

Theory and practice of genome sequence assembly
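Ordering contigs from mate-pair links can be sketched as below. This is a simplified illustration that ignores orientation and insert size. It assumes the first read of each pair lies upstream of the second and that each read has already been assigned to a contig. The names `count_links`, `order_contigs`, and `min_links` and the example identifiers are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_links(mate_pairs: List[Tuple[str, str]],
                read_to_contig: Dict[str, str]) -> Counter:
    """Count mate pairs whose two reads fall in different contigs."""
    links: Counter = Counter()
    for first, second in mate_pairs:
        a, b = read_to_contig.get(first), read_to_contig.get(second)
        if a is not None and b is not None and a != b:
            links[(a, b)] += 1
    return links


def order_contigs(links: Counter, min_links: int = 2) -> List[List[str]]:
    """Chain contigs along their best-supported links into scaffolds."""
    successor: Dict[str, str] = {}
    for (a, b), support in links.most_common():
        # Keep at most one successor and one predecessor per contig
        if support >= min_links and a not in successor and b not in successor.values():
            successor[a] = b
    starts = [a for a in successor if a not in successor.values()]
    scaffolds = []
    for start in starts:
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


if __name__ == "__main__":
    read_to_contig = {
        "r1a": "c1", "r1b": "c2", "r2a": "c1", "r2b": "c2",
        "r3a": "c2", "r3b": "c3", "r4a": "c2", "r4b": "c3",
        "r5a": "c1", "r5b": "c3",
    }
    mate_pairs = [("r1a", "r1b"), ("r2a", "r2b"), ("r3a", "r3b"),
                  ("r4a", "r4b"), ("r5a", "r5b")]
    print(order_contigs(count_links(mate_pairs, read_to_contig)))
```

Requiring at least `min_links` supporting pairs is a common way to keep a single chimeric pair from joining unrelated contigs.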

Slide18

Assemble contigs/scaffolds

Contigs: the reads are assembled into long consensus sequences called contigs.
Scaffolding: misassembled contigs are detected and corrected, and contigs are joined into scaffolds.
Paired-end reads are used as a guide map to order and orient contigs.
Paired-end data are useful for detecting chimeric contigs.
Gap filling: the gaps between contigs are carefully filled using other independent reads to complete the assembly.

licensed under CC BY

research gate

Slide19

Main problems of assembly

Repeat sequences.
Low coverage.
Biased sequencing.
Error rate.
Chimeric reads.
Adaptor and other contamination.
Heterozygosity (bubbles).

UsearchV11

Wikipedia
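To make "low coverage" concrete, average coverage is the total number of sequenced bases divided by the genome size. The sketch below also estimates the fraction of bases left uncovered, assuming reads land uniformly at random (a Poisson approximation). The function names and example numbers are hypothetical, and real data with biased sequencing will deviate from this estimate.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical run: one million 150 bp reads from a 5 Mb genome
    cov = expected_coverage(1_000_000, 150, 5_000_000)
    print(f"coverage: {cov:.1f}x")
    for c in (2.0, 10.0, cov):
        print(f"{c:.0f}x leaves about {fraction_uncovered(c):.2e} of bases uncovered")
```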

Slide20

Conclusion

Slide21

Literature cited

Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287:2185-95.
Simpson JT, Pop M. The Theory and Practice of Genome Sequence Assembly. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742.
Gladman S. De novo Genome Assembly for Illumina Data. Melbourne Bioinformatics.
Miller JR, Koren S, Sutton G. Assembly Algorithms for Next-Generation Sequencing Data. J Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850-3343, USA.
Pop M. 2009. Genome assembly reborn: recent computational challenges. Brief Bioinform 10:354-366.

Slide22

Questions ?

Thank You