1. GNUMap: Unbiased Probabilistic Mapping of Next-Generation Sequencing Reads
Nathan Clement
Computational Sciences Laboratory
Brigham Young University
Provo, Utah, USA
2. Next-Generation Sequencing
3. Problem Statement
Map next-generation sequence reads with variable nucleotide confidence to a model reference genome that may be different from the subject genome.
Speed
  Tens of millions of reads to a 3 Gbp genome
Accuracy
  Mismatches included?
  Repetitive regions
Visualization
4. Workflow
5. Indexing the genome
Fast lookup of possible hit locations for the reads
Hashing groups locations in the genome that have similar sequence content
k-mer hash of exact matches in genome can be used to narrow down possible match locations for reads
Sorting genome locations provides for content addressing of genome
GNUMap uses indexing of all 10-mers in the genome as seed points for read mapping
6. Building the Hash Table
ACTGAACCATACGGGTACTGAACCATGAATGGCACCTATACGAGATACGCCATAC
Hash Table
AACCAT
AACCAT
Sliding window indexes all locations in the genome
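The sliding-window indexing described on the two slides above can be sketched roughly as follows. This is a minimal illustration, not GNUMap's actual implementation: the function names `build_kmer_index` and `candidate_starts` are hypothetical, the toy genome is a short string chosen so that AACCAT occurs twice, and k = 6 is used only to keep the example small (the slides state that GNUMap indexes 10-mers).

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Slide a window of width k over the genome and record, for every
    k-mer, the 0-based positions at which it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def candidate_starts(read, index, k):
    """Use exact k-mer hits as seeds and return the genome positions
    where the read could start (seed position minus offset in the read)."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start >= 0:
                starts.add(start)
    return sorted(starts)


# Toy genome (an assumption for illustration) containing AACCAT twice.
genome = "ACTGAACCATACGGGTACTGAACCATGAA"
index = build_kmer_index(genome, 6)
print(index["AACCAT"])                         # [4, 20]
print(candidate_starts("GAACCATG", index, 6))  # [3, 19]
```

Each candidate start is only a seed; a later alignment step still has to decide how well the read actually matches there.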
7. Alignment
Given a possible genome match location, determine the quality of the match
If you call bases in the read
  Every base gets the same weight in the alignment, no matter what the quality
  Later bases in the read that have lower quality have equal weight in the alignment with high quality bases at the start of the read
GNUMap uses a Probabilistic Needleman-Wunsch to align reads found with seed points from the genome hash
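One way to avoid giving every called base the same weight is to keep a probability for each nucleotide at each read position. The sketch below assumes the confidence arrives as Phred quality scores (Q = -10 log10 of the error probability) and that the error probability is spread evenly over the three other bases; both are assumptions made for illustration, the slides do not say how GNUMap derives its per-base probabilities, and `read_to_pwm` is a hypothetical name.

```python
BASES = "ACGT"


def read_to_pwm(seq, phred):
    """Turn called bases plus Phred scores into one {base: probability}
    row per read position. Assumes seq contains only A, C, G and T."""
    pwm = []
    for base, q in zip(seq, phred):
        p_err = 10 ** (-q / 10)               # Phred definition
        row = {b: p_err / 3 for b in BASES}   # assumed: error split evenly
        row[base] = 1 - p_err                 # called base keeps the rest
        pwm.append(row)
    return pwm


# A confident call (Q30) followed by a weak one (Q10).
for row in read_to_pwm("AC", [30, 10]):
    print({b: round(p, 4) for b, p in row.items()})
```

With this representation a low-quality base near the end of a read contributes less evidence to an alignment than a high-quality base near the start.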
8. Average Read Probability
9. Single Read Variation
10. Probabilistic Needleman-Wunsch
Allows for probabilistic mismatches and gaps
Greater ability to map reads of variable confidence
Produces likelihood of alignment
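A rough sketch of how a Needleman-Wunsch recurrence can take per-base probabilities into account is given below. It is not GNUMap's scoring scheme: the substitution term is an expected match/mismatch score under the read's base probabilities, the gap penalty is linear, and the result is an alignment score rather than the likelihood mentioned on the slide. The names `pwm_from` and `prob_nw_score` and all parameter values are hypothetical.

```python
BASES = "ACGT"


def pwm_from(seq, confidences):
    """Hypothetical helper: each called base gets its confidence and the
    remainder is split evenly over the other three bases."""
    return [{b: (c if b == base else (1 - c) / 3) for b in BASES}
            for base, c in zip(seq, confidences)]


def prob_nw_score(pwm, ref, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment score of a probabilistic read against ref."""
    n, m = len(pwm), len(ref)
    # dp[i][j]: best score for the first i read positions vs first j ref bases
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = pwm[i - 1].get(ref[j - 1], 0.0)      # P(read base == ref base)
            sub = p * match + (1.0 - p) * mismatch   # expected substitution score
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # align the two positions
                           dp[i - 1][j] + gap,       # gap in the reference
                           dp[i][j - 1] + gap)       # gap in the read
    return dp[n][m]


# Same mismatch at the last base, once with a confident call, once a weak one.
confident = pwm_from("ACGT", [1.0, 1.0, 1.0, 1.0])
uncertain = pwm_from("ACGT", [1.0, 1.0, 1.0, 0.7])
print(prob_nw_score(confident, "ACGA"))
print(prob_nw_score(uncertain, "ACGA"))
```

In this sketch a mismatch at a weakly called position is penalized less than the same mismatch at a confidently called one, which is the behavior the slide describes for reads of variable confidence.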
11. Assignment
Given a read that has matches to possibly multiple locations in the genome, assign the read to locations where it matches
Discard read: Repeat masking repetitive regions
  Half of the human genome contains repeat regions, so you are not able to map to those regions
  Many regulatory regions are repeated in the genome
Map to all locations: Repeat regions will be over-represented since one read will generate multiple hits
Pick a random location: Highly variable if there are small numbers of reads
GNUMap uses probabilistic mapping to allocate a share of the read to matching locations in the genome according to the quality of the match
12. Assignment
ACTGAACCATACGGGTACTGAACCATGAAAACCATGGGTACAACCATTAC
Read from sequencer
GGGTACAACCAT
Read is added to both repeat regions proportionally to their match quality
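The proportional allocation described on the two Assignment slides can be illustrated with a short sketch. It assumes each candidate location already has a non-negative match-quality score (for example from an alignment step); the function names `allocate_read` and `add_to_coverage` and the example scores are hypothetical.

```python
def allocate_read(scores):
    """Split one read across its candidate locations in proportion to
    match quality. scores maps location -> non-negative quality."""
    total = sum(scores.values())
    if total <= 0:
        return {}  # nothing matched well enough to receive a share
    return {loc: s / total for loc, s in scores.items()}


def add_to_coverage(coverage, scores):
    """Accumulate the fractional shares of one read into a coverage map."""
    for loc, share in allocate_read(scores).items():
        coverage[loc] = coverage.get(loc, 0.0) + share
    return coverage


# One read matching two repeat copies, the first three times better.
coverage = add_to_coverage({}, {4: 3.0, 20: 1.0})
print(coverage)  # {4: 0.75, 20: 0.25}
```

Because the shares of a read sum to one, a repeat region is neither dropped nor counted multiple times per read.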
13. Equation for probabilistic mapping
Posterior Probability
Allows for multiple sequences of different matching quality
Includes probability of each read coming from any genomic position
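The equation itself did not survive in this transcript. As a general illustration only, and not a reconstruction of the slide, a posterior of this kind follows from Bayes' rule. Here r is a read, j and k index genomic positions, P(r | j) is the alignment likelihood at position j, and \pi_j is a prior over positions; all of these symbols are assumptions introduced for this sketch.

```latex
P(j \mid r) \;=\; \frac{P(r \mid j)\,\pi_j}{\sum_{k} P(r \mid k)\,\pi_k}
```

With a uniform prior the \pi terms cancel, and each matching location receives a share of the read equal to its likelihood divided by the sum of the likelihoods over all candidate locations.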
14. Which Program to Use?
Many different programs. How do they relate?
ELAND (included with Solexa 1G machine)
RMAP (Smith et al., BMC Bioinformatics 2008)
SOAP (Li et al., Bioinformatics 2008)
SeqMap (Jiang et al., Bioinformatics 2008)
Slider (Malhis et al., Bioinformatics 2008)
MAQ (Unpublished, http://maq.sourceforge.net/)
Novocraft (Unpublished, http://www.novocraft.com)
Zoom (Lin et al., Bioinformatics 2008)
Bowtie (Langmead et al., Genome Biology 2009)
…
15. Simulation Studies
Ambiguous reads cause:
  Missed (unmapped) regions
  Too many mapped regions (noise)
16. Simulation Studies
17. Actual Data
ETS1 binding domain
Repetitive region
18. Future Plans
Removal of adaptor sequences
Methylation analysis
Paired-end reads
SOLiD color space
19. Acknowledgements
Evan Johnson
Quinn Snell
Mark Clement
Huntsman Cancer Institute
http://dna.cs.byu.edu/gnumap