Slide1
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities
By Kevin Chen and Lior Pachter
PLoS Computational Biology, 2005
David Kelley
Slide2
State of metagenomics
In July 2005, 9 projects had been completed.
General challenges were becoming apparent.
The paper focuses on computational problems.
Slide3
Assembling communities
Goal
  Retrieval of nearly complete genomes from the environment
Challenges
  Need sufficient read depth: species must be prominent
  Avoid mis-assembling across species while maximizing contig size
Slide4
Comparative assembly
Align all reads to a closely-related “reference” genome
Infer contigs from read alignments
Rearrangements limit effectiveness
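The "infer contigs from read alignments" step can be pictured as merging overlapping alignment intervals on the reference. The sketch below is only an illustration under simplifying assumptions: the `(start, end)` input format, the `infer_contigs` name, and the `min_overlap` parameter are hypothetical, and it is not the procedure from Pop et al.

```python
def infer_contigs(alignments, min_overlap=0):
    """Merge reference intervals covered by reads into contig intervals.

    alignments: iterable of (start, end) half-open reference coordinates,
    one per aligned read (a simplified, hypothetical input format).
    min_overlap: bases two intervals must share before they are joined.
    """
    contigs = []
    for start, end in sorted(alignments):
        if contigs and start <= contigs[-1][1] - min_overlap:
            # The read overlaps the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # Gap in coverage: start a new contig.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]


reads = [(0, 100), (50, 150), (140, 240), (400, 500), (450, 550)]
print(infer_contigs(reads))
```

A rearrangement between the sample and the reference would split or misplace these intervals, which is one way to read the "rearrangements limit effectiveness" point.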
Pop M. et al. Comparative genome assembly. Briefings in Bioinformatics 2004.
Slide5
“Assisted” Assembly
De novo assembly
Complement by aligning reads to reference genome(s)
Short overlaps can be trusted
Single mate links can be trusted
Mis-assemblies can be detected
Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.
Slide6
Assisted Assembly
Gnerre S. et al. Assisted assembly: how to improve a de novo genome assembly by using related species. Genome Biology 2009.
Slide9
Metagenomics application
Pros
  Can help low-coverage species
  If conservative, unlikely to hurt
Cons
  Exotic microbes may have no good references
  Potential to propagate mis-assemblies
Slide10
Overlap-layout-consensus
Species-level
  Increased polymorphism
  Reads come from different individuals
  Missed overlaps
System-level
  Homologous sequence
  False overlaps
Slide11
Polymorphic diploid eukaryotes
Reads sequenced from 2 chromosomes
Single reference sequence expected
Keep duplications separate
Keep polymorphic haplotypes together
Slide12
Strategy 1
Form contigs aggressively
Detect alignments between contigs and resolve
Avoid merging duplications by respecting mate pair distances
Jones, T. et al. The diploid genome sequence of Candida albicans. PNAS 2004.
Slide13
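One way to picture "respecting mate pair distances" from Strategy 1 is a consistency check on mates that span a proposed merge. This is a minimal sketch, not the method of Jones et al.: the function names, the 3-standard-deviation window, and the 90% support threshold are all assumptions chosen for illustration.

```python
def mate_pair_consistent(pos_a, pos_b, mean_insert, sd_insert, n_sd=3.0):
    """Return True if the placed distance between two mates fits the library."""
    observed = abs(pos_b - pos_a)
    return abs(observed - mean_insert) <= n_sd * sd_insert


def merge_supported(mate_positions, mean_insert, sd_insert, min_fraction=0.9):
    """Accept a contig merge only if most spanning mate pairs stay consistent.

    mate_positions: list of (pos_a, pos_b) coordinates of mates on the
    merged contig (hypothetical input format).
    """
    if not mate_positions:
        return False  # no mate evidence: stay conservative
    ok = sum(mate_pair_consistent(a, b, mean_insert, sd_insert)
             for a, b in mate_positions)
    return ok / len(mate_positions) >= min_fraction


# Two mates at the expected ~3 kb spacing support the merge.
print(merge_supported([(0, 3000), (100, 3100)], mean_insert=3000, sd_insert=300))
```

Collapsing two copies of a duplication tends to stretch or compress the apparent mate distances, so a check of this kind would reject such a merge.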
Strategy 2
Assemble chromosomes separately
Erase overlaps with splitting rule
Vinson et al. Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi. Genome Research 2005.
Slide14
Back to metagenomics
Strategy 1
  Assemble aggressively
  Detect mis-assemblies and fix
Strategy 2
  Separate reads or filter overlaps
Slide15
Binning
Presence of informative genes
  E.g. 16S rRNA
Machine learning
  K-mers
  Codon bias
Worked well only with big scaffolds
Lots of progress in this area since 2005
Slide16
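The k-mer features mentioned under binning can be sketched as a normalized frequency vector per scaffold, which a classifier or clustering method would then consume. The choice of k = 4 and the `kmer_frequencies` name are assumptions for illustration; the sketch does not merge reverse-complement k-mers, which practical tools often do.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    The vector has 4**k entries in lexicographic k-mer order.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


vec = kmer_frequencies("ACGTACGTTTGACCA", k=2)
print(len(vec), round(sum(vec), 6))
```

Short fragments yield few k-mer windows and therefore noisy vectors, which is consistent with the note that binning worked well only with big scaffolds.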
Abundances
Depth of read coverage suggests relative abundance of species in sample
Difficult if polymorphism is significant
  Separating individuals of one species: estimate too low
  Merging distinct species: estimate too high
Depends on good classification
Slide17
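The idea that read depth suggests relative abundance can be written down directly: divide the sequenced bases assigned to each bin by the bin's length, then normalize. The sketch below assumes the binning is already correct and uses a hypothetical input format and invented example numbers.

```python
def relative_abundance(bins):
    """Estimate relative abundance from read depth.

    bins: {name: (total_read_bp, genome_or_scaffold_bp)}, a hypothetical
    summary of how many sequenced bases were assigned to each bin.
    Returns {name: fraction of total depth}.
    """
    depth = {name: read_bp / length for name, (read_bp, length) in bins.items()}
    total = sum(depth.values())
    if total == 0:
        raise ValueError("no reads assigned to any bin")
    return {name: d / total for name, d in depth.items()}


# Invented numbers: bin A at 4x depth, bin B at 1x depth.
print(relative_abundance({"A": (8_000_000, 2_000_000),
                          "B": (4_000_000, 4_000_000)}))
```

If reads from one species were split across two bins, each bin's depth would fall; if two species were merged into one bin, its depth would rise. That is how classification errors feed into the abundance estimate.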
How much sequencing
G = genome size (or sum of genomes)
c = global coverage
k = local coverage
n_k = number of bp with coverage k
Slide18
Poisson model
“Interval” = [x - l_r, x]
“Events” = read starts
“λ” = coverage
Slide19
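Under the Poisson model, the local coverage k at a position is Poisson-distributed with mean equal to the global coverage c, so the expected n_k is G times that probability. The sketch below assumes uniformly random read starts and ignores edge effects; the function names are hypothetical.

```python
import math


def expected_bp_at_coverage(G, c, k):
    """Expected n_k: number of bp covered exactly k times (Poisson, mean c)."""
    return G * math.exp(-c) * c ** k / math.factorial(k)


def coverage_for_fraction(target_fraction):
    """Global coverage c at which a given fraction of bases has coverage >= 1.

    Follows from P(k = 0) = exp(-c), so P(k >= 1) = 1 - exp(-c).
    """
    return -math.log(1.0 - target_fraction)


G, c = 1_000_000, 2.0
for k in range(4):
    print(k, round(expected_bp_at_coverage(G, c, k)))
print(round(coverage_for_fraction(0.99), 2))
```

In a community, c differs per species in proportion to its abundance, so a rare species needs much more total sequencing to reach the same local coverage.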
Gene Finding
Focus on genes, rather than genomes
Bacterial gene finders are very accurate
Assemble and run on scaffolds
BLAST leftover reads against protein db
Slide20
Partial genes
Tested GLIMMER on simulated 10 Kb contigs
Many genes crossed borders
GLIMMER often predicted a truncated version
Gene finding models could be adjusted to account for this case
Slide21
Gene-centric analysis
Cluster genes by orthology
  Orthology refers to genes in different species that derive from a common ancestor
Express sample as vector of abundances
Slide22
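Expressing a sample as a vector of abundances can be sketched as counting predicted genes per orthologous group and normalizing. The group labels (`OG_A` and so on), the gene names, and the input format are placeholders, not identifiers from any real database.

```python
from collections import Counter


def abundance_vector(gene_to_group, groups):
    """Turn gene -> orthologous-group assignments into a normalized vector.

    gene_to_group: {gene_id: group_id} for one sample (hypothetical format).
    groups: fixed, ordered list of group ids shared by all samples, so that
    vectors from different samples are comparable position by position.
    """
    counts = Counter(gene_to_group.values())
    total = sum(counts[g] for g in groups)
    return [counts[g] / total if total else 0.0 for g in groups]


sample = {"gene1": "OG_A", "gene2": "OG_A", "gene3": "OG_B", "gene4": "OG_C"}
print(abundance_vector(sample, ["OG_A", "OG_B", "OG_C", "OG_D"]))
```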
UPGMA on KEGG vectors
Slide23
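UPGMA repeatedly merges the two closest clusters and defines the distance to a merged cluster as the average over all member pairs. The sketch below applies it to per-sample abundance vectors; the Euclidean distance and the sample names are assumptions, and only the tree topology is returned, not branch heights.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma(vectors):
    """vectors: {sample_name: abundance_vector}. Returns a nested-tuple tree."""
    names = sorted(vectors)
    tree = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    size = {i: 1 for i in tree}
    dist = {(i, j): euclidean(vectors[names[i]], vectors[names[j]])
            for i in tree for j in tree if i < j}
    next_id = len(names)
    while len(tree) > 1:
        # Closest pair of clusters; ties are broken by cluster id.
        i, j = min(dist, key=lambda p: (dist[p], p))
        merged = {}
        for k in tree:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average equals the mean pairwise distance.
            merged[(k, next_id)] = (size[i] * d_ik + size[j] * d_jk) / (size[i] + size[j])
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(merged)
        tree[next_id] = (tree.pop(i), tree.pop(j))
        size[next_id] = size.pop(i) + size.pop(j)
        next_id += 1
    return next(iter(tree.values()))


print(upgma({"A": [0, 0], "B": [0, 1], "C": [5, 5]}))
```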
PCA on KEGG vectors
Principal components may correspond to interesting pathways or functions
Slide24
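To make the PCA step concrete, the sketch below finds the first principal component of a set of abundance vectors by power iteration on the covariance matrix. It is a stand-in for a library PCA routine, written with the standard library only; the function name, the iteration count, and the seeded random start are assumptions, and it needs at least two samples.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """rows: list of equal-length abundance vectors, one per sample.

    Returns a unit vector along the direction of greatest variance.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Random start: an all-ones start would be annihilated when every row
    # sums to the same constant, as normalized abundance vectors do.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


print(first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]]))
```

The entries with the largest absolute loadings indicate which functional categories drive the variation between samples, which is the sense in which a component may correspond to a pathway or function.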
How much sequencing
N = # genes in community
f = fraction found
Coupon collector’s problem
Slide25
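In the coupon collector framing, the expected number of gene samples needed to see a fraction f of N genes is a sum of N / (N - i) terms, which for large N and f < 1 is roughly N ln(1 / (1 - f)). The sketch assumes every gene is equally likely to be sampled, which real communities with uneven abundances will violate, and it counts sampled genes rather than reads.

```python
import math


def expected_genes_sampled(N, f):
    """Expected draws to observe ceil(f * N) distinct genes out of N.

    Assumes all N genes are equally abundant (a strong simplification).
    """
    m = math.ceil(f * N)
    return sum(N / (N - i) for i in range(m))


N = 1000
for f in (0.5, 0.9, 0.99):
    exact = expected_genes_sampled(N, f)
    approx = N * math.log(1 / (1 - f))
    print(f, round(exact), round(approx))
```

The cost rises steeply as f approaches 1: the last few unseen genes dominate the total.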
Phylogeny
Apply multiple sequence alignment and phylogeny reconstruction to gene sequences
Slide26
Partial sequences
Bad for common MSA programs
Semi-global alignment is required
Slide27
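Semi-global alignment differs from global alignment in that gaps at the ends of either sequence are not penalized, so a partial gene can align to the matching region of a full-length one. The sketch below computes only the score for a pair of sequences; the scoring values and the function name are assumptions, and it is not a multiple alignment method.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score with free end gaps in either sequence."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay 0: leading gaps are free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps are free: take the best cell in the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


# A fragment contained in a longer sequence scores as a full match.
print(semiglobal_score("ACGT", "TTACGTTT"))
```

A global aligner would charge the fragment for every unaligned flanking base, which is why partial sequences behave badly in programs that assume full-length input.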
Supertree methods
Construct tree from multiple subtrees
Split gene into segments?
Construct subtree on sequences that align fully to segment?
Slide28
Thanks!