GNUMap: Unbiased Probabilistic Mapping of Next-Generation Sequencing Reads

Uploaded On 2024-02-09




Presentation Transcript

1. GNUMap: Unbiased Probabilistic Mapping of Next-Generation Sequencing Reads
Nathan Clement
Computational Sciences Laboratory
Brigham Young University
Provo, Utah, USA

2. Next-Generation Sequencing

3. Problem Statement
Map next-generation sequence reads with variable nucleotide confidence to a model reference genome that may be different from the subject genome.
- Speed
  - Tens of millions of reads to a 3 Gbp genome
- Accuracy
  - Mismatches included?
  - Repetitive regions
- Visualization

4. Workflow

5. Indexing the genome
- Fast lookup of possible hit locations for the reads
- Hashing groups locations in the genome that have similar sequence content
- k-mer hash of exact matches in genome can be used to narrow down possible match locations for reads
- Sorting genome locations provides for content addressing of genome
- GNUMap uses indexing of all 10-mers in the genome as seed points for read mapping

6. Building the Hash Table
[Figure: genome sequence ACTGAACCATACGGGTACTGAACCATGAATGGCACCTATACGAGATACGCCATAC, with a hash table in which the entry AACCAT is marked twice]
- Sliding window indexes all locations in the genome
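The sliding-window index described on slides 5 and 6 can be sketched as follows. This is a minimal illustration, not GNUMap's actual code: the function names `build_kmer_index` and `seed_locations` are hypothetical, and `k=6` is used only so that the slide's AACCAT example fits (the slides state that GNUMap indexes 10-mers).

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int = 10) -> dict[str, list[int]]:
    """Slide a window of width k over the genome and record where each k-mer starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


def seed_locations(read: str, index: dict[str, list[int]], k: int = 10) -> set[int]:
    """Candidate genome start positions for a read, implied by its exact k-mer hits."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            # Shift the hit back by the k-mer's offset within the read.
            candidates.add(pos - offset)
    return candidates


# Genome string taken from the slide's figure text.
genome = "ACTGAACCATACGGGTACTGAACCATGAATGGCACCTATACGAGATACGCCATAC"
index = build_kmer_index(genome, k=6)
print(index["AACCAT"])  # [4, 20]
print(sorted(seed_locations("GAACCATA", index, k=6)))  # [3, 19]
```

Each candidate position is only a seed; the alignment step still has to decide how well the read matches there.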

7. Alignment
- Given a possible genome match location, determine the quality of the match
- If you call bases in the read
  - Every base gets the same weight in the alignment, no matter what the quality
  - Later bases in the read that have lower quality have equal weight in the alignment with high quality bases at the start of the read
- GNUMap uses a Probabilistic Needleman-Wunsch to align reads found with seed points from the genome hash
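One way to keep per-base confidence instead of a hard base call is to turn each read position into a probability vector over A, C, G and T. The sketch below assumes Phred-style quality scores (error probability 10^(-Q/10)) and spreads the error evenly over the three other bases. Both are assumptions made for illustration; the slides do not say how GNUMap derives its probabilities, and `base_probabilities` is a hypothetical helper.

```python
BASES = "ACGT"


def base_probabilities(called: str, phred: int) -> dict[str, float]:
    """Probability of each base at one read position, given a called base and its Phred score."""
    p_error = 10 ** (-phred / 10)  # standard Phred definition
    # Assumption: the error mass is split evenly among the three uncalled bases.
    return {b: (1 - p_error) if b == called else p_error / 3 for b in BASES}


high = base_probabilities("A", 30)  # confident call, typical of the start of a read
low = base_probabilities("A", 5)    # weak call, typical of the end of a read
print(high["A"], low["A"])
```

With this representation a low-quality base contributes less to the alignment than a high-quality one, which is the weighting problem the slide raises.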

8. Average Read Probability

9. Single Read Variation

10. Probabilistic Needleman-Wunsch
- Allows for probabilistic mismatches and gaps
- Greater ability to map reads of variable confidence
- Produces likelihood of alignment
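A rough sketch of the idea: standard Needleman-Wunsch global alignment in which the substitution score at each read position is the expected score under that position's base probabilities. The scoring values, the linear gap penalty and the function names are assumptions for illustration. The result is an expected alignment score, not the likelihood that GNUMap reports, and the slides do not give GNUMap's exact recurrence.

```python
def expected_score(probs: dict[str, float], genome_base: str,
                   match: float = 1.0, mismatch: float = -1.0) -> float:
    """Expected substitution score of an uncertain read position against one genome base."""
    return sum(p * (match if b == genome_base else mismatch) for b, p in probs.items())


def probabilistic_nw(read_probs: list[dict[str, float]], genome: str,
                     gap: float = -2.0) -> float:
    """Global alignment score of a probabilistic read against a genome segment."""
    n, m = len(read_probs), len(genome)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + expected_score(read_probs[i - 1], genome[j - 1])
            up = score[i - 1][j] + gap      # gap in the genome
            left = score[i][j - 1] + gap    # gap in the read
            score[i][j] = max(diag, up, left)
    return score[n][m]


def certain(base: str) -> dict[str, float]:
    """Probability vector for a base called with full confidence."""
    return {b: 1.0 if b == base else 0.0 for b in "ACGT"}


uniform = {b: 0.25 for b in "ACGT"}  # a position with no usable signal
print(probabilistic_nw([certain("A"), certain("C"), certain("G")], "ACG"))
print(probabilistic_nw([certain("A"), certain("C"), uniform], "ACG"))
```

The second call scores lower than the first because the last position is uncertain, but it is penalized less than a confident mismatch would be.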

11. Assignment
- Given a read that has matches to possibly multiple locations in the genome, assign the read to locations where it matches
- Discard read: repeat masking repetitive regions
  - Half of the human genome contains repeat regions, so you are not able to map to those regions
  - Many regulatory regions are repeated in the genome
- Map to all locations: repeat regions will be over-represented since one read will generate multiple hits
- Pick a random location: highly variable if there are small numbers of reads
- GNUMap uses probabilistic mapping to allocate a share of the read to matching locations in the genome according to the quality of the match

12. Assignment
[Figure: genome sequence ACTGAACCATACGGGTACTGAACCATGAAAACCATGGGTACAACCATTAC; read from sequencer GGGTACAACCAT]
- Read is added to both repeat regions proportionally to their match quality
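The proportional allocation described on slides 11 and 12 can be illustrated with a small sketch. The function name `allocate_read`, the example positions and the likelihood values are all hypothetical; the only thing taken from the slides is that each matching location receives a share of the read in proportion to its match quality.

```python
def allocate_read(likelihoods: dict[int, float]) -> dict[int, float]:
    """Split one read across candidate positions in proportion to their match likelihoods."""
    total = sum(likelihoods.values())
    if total == 0:
        # No candidate matches at all: the read contributes nothing anywhere.
        return {pos: 0.0 for pos in likelihoods}
    return {pos: value / total for pos, value in likelihoods.items()}


# Hypothetical likelihoods for one read hitting two repeat copies.
shares = allocate_read({4: 0.9, 20: 0.3})
print(shares)
```

The shares always sum to one, so a read that hits a repeat is neither discarded nor counted multiple times.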

13. Equation for probabilistic mapping
- Posterior Probability
- Allows for multiple sequences of different matching quality
- Includes probability of each read coming from any genomic position
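The equation itself did not survive in this transcript. As a point of reference only, a posterior of this kind generally follows Bayes' rule. In the form below, r is a read, G_j is the hypothesis that the read came from genomic position j, and the sum runs over all candidate positions; this notation is an assumption and may differ from the formula on the original slide.

```latex
P(G_j \mid r) = \frac{P(r \mid G_j)\, P(G_j)}{\sum_{k} P(r \mid G_k)\, P(G_k)}
```

With a uniform prior over positions, the prior terms cancel and each position's share is its alignment likelihood divided by the sum of likelihoods over all candidate positions.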

14. Which Program to Use?
Many different programs. How do they relate?
- ELAND (included with Solexa 1G machine)
- RMAP (Smith et al., BMC Bioinformatics 2008)
- SOAP (Li et al., Bioinformatics 2008)
- SeqMap (Jiang et al., Bioinformatics 2008)
- Slider (Malhis et al., Bioinformatics 2008)
- MAQ (Unpublished, http://maq.sourceforge.net/)
- Novocraft (Unpublished, http://www.novocraft.com)
- Zoom (Lin et al., Bioinformatics 2008)
- Bowtie (Langmead et al., Genome Biology 2009)
- …

15. Simulation Studies
Ambiguous reads cause:
- Missed (unmapped) regions
- Too many mapped regions (noise)

16. Simulation Studies

17. Actual Data
[Figure labels: ETS1 binding domain; Repetitive region]

18. Future Plans
- Removal of adaptor sequences
- Methylation analysis
- Paired-end reads
- SOLiD color space

19. Acknowledgements
- Evan Johnson
- Quinn Snell
- Mark Clement
- Huntsman Cancer Institute
- http://dna.cs.byu.edu/gnumap
