/
Dr. James Abbott Read Mapping Dr. James Abbott Read Mapping

Dr. James Abbott Read Mapping - PowerPoint Presentation

reese
reese . @reese
Follow
67 views
Uploaded On 2023-07-14

Dr. James Abbott Read Mapping - PPT Presentation

jabbottdundeeacuk From your home directory run the following command jabbottlogin2 wget httpsdagcompbiodundeeacukworkshopsdownloadsreadmappingsh Now run jabbottlogin2 ID: 1009083

mapping read jabbott h395 read mapping h395 jabbott reads alignment bwa reference samtools c6320 bam h293 tmpdir sorted dag

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Dr. James Abbott Read Mapping" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

1. Dr. James AbbottRead Mappingj.abbott@dundee.ac.uk

2. From your home directory, run the following command:[jabbott@login2 ~]$ wget https://dag.compbio.dundee.ac.uk/workshops/downloads/read_mapping.shNow run[jabbott@login2 ~]$ bash read_mapping.shYou should now have a read_mapping directory - change into this…[jabbott@login2 ~]$ cd read_mappingCreate a conda environment for the session[jabbott@login2 read_mapping] conda create -n read_mapping[jabbott@login2 read_mapping] conda activate read_mappingRead Mapping session setup

3. Alignment basicsNGS alignment issuesReference sequencesBWAMinimap2File formatsPost-processingViewing alignments in genome browsersMulti-mapping readsDAG Workflows for alignmentOverview

4. Sequence reads represent isolated regions of sequence out of contextMany analysis approaches require knowledge of Read locations on reference sequenceDifferences between reads and reference sequence4Why map reads?

5. Sequence Alignment5

6. Alignment of two sequences originally carried out…To identify subsequences which are positionally similarTo uncover evolutionary relationshipsFirst optimal alignment of two sequences: Needleman and Wunsch (1970)Dynamic programmingGlobal alignment: optimal alignment of full length of sequencesAttempt to align every base between sequencesConserved domains may not be aligned --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | || | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C 6Sequence alignment basics

7. Optimal alignment of substring within sequencesUseful for identifying locally conserved regions ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA |||| |||||| ||||||||||||||| TACTCACGGATGAGGTACTTTAGAGGCSmith-Waterman (1981) modification of Needleman-Wunsch 7Local alignments

8. How do you asses if one alignment is better than another?Scoring – reward similarity, penalise differencesGCATCATCTCCG|| |||| ||||GCTTCATGTCCGi.e. (from BLAST):Match: score +1Mismatch: score -210 matches and 2 mismatches =  8Alignment scoring

9. Gaps inserted into alignments to allow for indels (insertions/deletions)For best alignment, gaps need to be minimisedGCATCCGCATGCTCCG BLAST scoring: match=+1, mismatch -2GCAT---CAT-CTCCGGap penalty applied to alignment score Constant: fixed score for each gap regardless of length. Favours fewer, longer gaps. For Penalty B of -3, score = (12x1)+(-3x2) = 6Linear: Penalty B for each insertion for gap length L. Favours shorter gaps. Penalty = Score = (12x1)+(-3x3)+(-3x1) = 0Affine: Separate penalty for opening gap A=-3, gap extension penalty B=-2. Penalty = Score = (12x1)+(-3+(-2x2))+(-3+(-2x0)) = 2Most common approach. Want closer matches? Increase A 9Gap penalties

10. Dynamic programming (DP) computationally expensiveDatabase searching requires better performanceHeuristic algorithms: not guaranteed to find optimal solutionexclude large parts of database from DP comparisonIdentify candidates with short stretches of similarity of length k Only carry out DP on candidate sequences10k-tuple/word algorithms

11. NGS Alignments11

12. Requirement to align many millions of reads against databasesExisting approaches far too slowFresh approaches requiredNow >70 NGS read alignersThese differ in SpeedAccuracyApproachPurposeChoosing which to use not always straightforward12NGS arrives…

13. 13The Mapping Problem?

14. 14The Mapping ProblemFragment size

15. 15The Mapping ProblemExon 1Exon 2Transcript

16. Want: all approximate matches between sequences where similarity score above threshold, or distance measure below a thresholdSimilarity score: defined by optimal alignment which minimises number (or weight) of edits, or has maximum scoreDistance threshold between two sequencesHamming distance: number of positions at which symbols are different (substitutions only)Levenshtein/edit distance: Number of single character deletions, insertions or substitutions required to transform one string into anotherWeighted edit distance: Different penalties for indels and substitutionsMatches: Black (12)Insertions/Deletions (INDELs): Blue (4)Mismatches: Red (1)Hamming distance = 1Edit distance = 5Different technologies need different scoringApproximate string matching16AACTAGA-AC-TACTGAAA-TACAGACTTAC-GAExample from Canzar and Salzberg (2017) http://doi.org/10.1109/JPROC.2015.2455551

17. Need to handle billions of readsTwo approaches:Filtering:exclude large sections of reference where no approximate match i.e. find matches of perfect seed of length k match and ignore remainder of reference (k-tuple/word match)Indexing:Pre-process reference, or reads, or bothSuffix array – linear time in relation to reference lengthFM-index/Burrows-Wheeler Transform - linear time in relation to reference length, economic memory usage Gruesome details: Reinert et al (2015) https://doi.org/10.1146/annurev-genom-090413-025358 Canzar and Salzberg (2017) https://doi.org/10.1109/JPROC.2015.2455551 17Approximate string matching: scaling up

18. What is the best mapper? There isn't one…Many options with different capabilitiesPaired ends? Read length?Gapped alignment?Use quality scores?Splice aware?Choice should be guided by your experimentGeneral recommendations:WGS/Exome alignment: BWA,Bowtie2Long reads: minimap2Bisulphite sequencing: bismarckRNA-Seq analysis: STAR, HISAT2Structural variation analysis: mrFASTBasic process similar for most mappersWe'll use BWA today for genomic alignmentsChoosing a read mapper18

19. Preparing your reference sequence19

20. What to align your reads to…Genome? Transcriptome?...it depends…Think about your experimentInclude full range of sequences likely to be in samplesEBV? Host genome?Your sample will almost never be identical to the reference sequenceWhat is the closest related sequence?Is this the most useful? How complete is it?Starting point is a fasta formatted representation of your referenceBeware of different genome versionsVersion and source should be reported in publications20Choosing a reference sequence

21. Maintenance of major reference genomesMajor releasesInfrequentAffect co-ordinatesMoving results between releases requires co-ordinate 'lift-over'See https://www.biostars.org/p/65558/Patch releasesMore frequentAllow fixes to be applied without affecting co-ordinatesNaming convention: GRCx00.p021Genome Reference Consortiumh=human m=mouse z=zebrafish g=chickenMajor version numberPatch numberGenomeGRC NameAlternativeUCSCHumanGRCh37GRCh38b37b38hg19hg38MouseGRCm38NCBI37mm10mm9ZebrafishGRCz11danRer11ChickenGRCg6agalGal6

22. GRC: https://www.ncbi.nlm.nih.gov/grcEnsembl: http://www.ensembl.org Wide range of annotated genomesNon-vertebrates: http://ensemblgenomes.orgFTP downloads: http://www.ensembl.org/info/data/ftp/index.htmlMany options:'dna' – unmasked genomic DNA sequence'dna_rm' – Repeats and low complexity regions masked with 'N''dna_sm' – Repeats and low complexity regions in lower case'toplevel' – chromosomes, unplaced scaffolds, alternate haplotypes'primary assembly' – as above without alternate haplotypei.e. Homo_sapiens.GRCh37.dna.toplevel.fa.gzENA: http://www.ebi.ac.uk/ena 22Where to find reference sequences

23. https://www.ebi.ac.uk/ena/browser/view/GCA_001039695.2Data in 'reference' directory is from EnsemblH293.fa: fasta formatted genome sequenceH293.gff3: GFF3 formatted annotations23Our reference: S. pyogenes H293

24. First…we need to install BWA[jabbott@login2 ~]$ qrsh[jabbott@c6320-3-4 ~]$ cd read_mapping/reference[jabbott@c6320-3-4 ~]$ conda install bwaBWA has multiple subcommands…[jabbott@c6320-3-4 ~]$ bwa indexUsage: bwa index [options] <in.fasta>Options: -a STR BWT construction algorithm: bwtsw, is or rb2 [auto] -p STR prefix of the index [same as fasta name][jabbott@login2 ~]$ bwa index –p H293 H293.faThis is a small genome so indexing is very quick[jabbott@c6320-3-4 reference] $ lsH293.fa H293.fa.amb H293.fa.ann H293.fa.bwt H293.fa.pac H293.fa.sa H293.gff324Creating a BWA index

25. Some programs need indexes of the fasta file in different formatsTwo different formats: fai and dictCan be created by samtools[jabbott@ c6320-3-4 reference]$ conda install samtoolsRun 'samtools' for a list of available subcommandsRun 'samtools [commandName]' for help on a command[jabbott@ c6320-3-4 reference]$ samtools faidx H293.fa[jabbott@ c6320-3-4 reference]$ samtools dict –o H293.dict H293.faCheck directory contents with 'ls' after running each commandBoth .dict and .fai indexes are plain text, so can be viewed with 'less’[jabbott@ c6320-3-4 reference]$ exit25Further indexes…

26. BWA

27. Not recommended for sequences with error rate > 2%Three different algorithmsBWA Backtrack (aln)Illumina reads <100bp Two stage process – find coordinates of read, followed by generation of alignments (either SE or PE)BWA-SWOutputs alignments directlySuitable for reads from 70bp upwardsBWA-MEMReads from 70bp – 1MbpConsiderably faster than BWA-SWNormally method of choice…27BWA

28. Seed and extend based local alignment methodSeeding: Finds longest maximally exact match (MEM) for each query positionRelated arguments-k: minimum seed length. Shorter matches will be missed (Default: 19)-r: trigger reseeding for MEM longer than . Reseeding aims to reduce mismappings due to missing seeds (Default 1.5)-c: discard MEM with more than occurrences (Default: 10000)ChainingGroup of colinear seeds and close to each other: chainShort chains largely contained within long chains filtered out to reduce later extensions 28BWA-MEM: How it works…

29. Seed extensionSeeds ranked on length of containing chain, then by seed lengthSeed dropped if already contained in known alignmentExtended with banded affine gap Smith-WatermanRelated arguments:-A: Match score (Default: 1)-B: Mismatch penalty (Default: 4)-O: Gap opening penalty (Default: 6)-E: Gap extension penalty (Default: 1)-d: Z-dropoff. Extension stopped when difference between best and current extension score > where i is current reference position and j is current query position. Reduces poor alignments within good ones…(Default: 100) 29BWA-MEM: How it works

30. Paired end rescueIn the event that one read of pair maps and second doesn't, rescue of unpaired read attempted using Smith-Waterman alignment, forcing alignment of poorly aligned readRelated arguments: -S: skip mate rescuePairingIf both reads are in correct orientation, distance is calculated Score for pair produced using function which takes individual alignment scores into account along with insert size and possibility of chimeric read pairs - score of alignment of reads i and j – match score – probability of observing insert longer than d – Pairing threshold if such that reads are pairedRelated arguments:-P: Disables pairing-U: Penalty for unpaired read-pair BWA MEM: How it works30

31. -t: Number of threads (default: 1)-p: Input is interleaved fastq -T: Don't output alignments with score < T (default: 30)-a: Output all found alignments for single-end/unpaired reads-M: Mark shorter split hits as secondary alignments. Results from using local alignment so may produce multiple alignments for different regions of read. Required for compatibility with some tools (i.e. picard)BWA-MEM: Other arguments31

32. bwa mem help output: bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] [-d zDropoff] [-r seedSplitRatio] [-c maxOcc] [-A matchScore] [-B mmPenalty][-O gapOpenPen] [-E gapExtPen] [-U unpairPen] db.prefix reads.fq [mates.fq]So a simple command would look like:bwa mem –t 8 –M reference/H293 reads/xxx.fq.gz reads/yyy.fq.gzChanging some default parameters:bwa mem –t 8 –M –k 25 –O 8 –E 2 reference/H293 reads/xxx.fq.gz reads/yyy.fq.gzNo output file name argument?Output written to STDOUTNeed to redirect to a file32BWA-MEM: Putting it all togetherPrefix specified during indexing

33. #!/bin/bash#$ -cwd#$ -pe smp 8#$ -j y#$ -o job_logs/$JOB_NAME.$JOB_IDcp -v reference/H293.* ${TMPDIR}cp -v Streptococcus/H395_?.fq.gz ${TMPDIR}bwa mem -t 8 -M ${TMPDIR}/H293 ${TMPDIR}/H395_1.fq.gz ${TMPDIR}/H395_2.fq.gz \ > ${TMPDIR}/H395.samcp -v ${TMPDIR}/H395.sam .Save your script as 'bwa_1.sh'Submit it to the cluster: 'qsub bwa.sh'Monitor your job with qstatDid it work? Do you have an H395.sam file in your directory? What do the jobs output logs show?33A BWA script…scripts/bwa_1.sh

34. Minimap2 - Long read alignments34

35. Designed to cope with long, noisy readsSimilar overall strategy seed - chain - align procedure as BWAPerformance optimisation through skipping alignment in repetitive regionsProvides presets optimised for particular tasksPreset: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont - PacBio/Nanopore vs reference mapping - ava-pb/ava-ont - PacBio/Nanopore read overlap - asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice/splice:hq - long-read/Pacbio-CCS spliced alignment - sr - genomic short-read mappingCan generates indexes on-the-fly[jabbott@c6320-3-4 ~]$ conda install minimap2Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]Minimap2

36. 'minion/SRR6917534_1.fastq.gz'#!/bin/bash#$ -cwd#$ -pe smp 8#$ -j y#$ -o job_logs/$JOB_NAME.$JOB_IDcp -v reference/H293.fa ${TMPDIR}cp -v minion/SRR6917534_1.fastq.gz ${TMPDIR}minimap2 -t 8 -a -x map-ont -o ${TMPDIR}/SRR6917534.sam ${TMPDIR}/H293.fa \ ${TMPDIR}/SRR6917534_1.fastq.gzcp -v ${TMPDIR}/SRR6917534.sam .36Running minimap2scripts/minimap.shRun 8 threadsOutput in SAM formatUse ONT mapping presetOutput filenameReference sequenceQuery sequences

37. Alignment file formats37

38. Three main formats used for storing alignmentsSAM (Sequence Alignment Map)Plain textUncompressedBAM (Binary Alignment Map)Binary equivalent of SAMCompressedCan be indexed to allow random accessCRAM Uses Reference-based compression45-50% space saving over BAMNot currently as well supported by toolsFormats

39. Official definition document: https://samtools.github.io/hts-specs/SAMv1.pdfFile contains two sections: header (optional) and alignmentsHuman readable(ish) text formatHeader lines start with an '@'H395.sam contains two header lines:@SQ SN:Chromosome LN:1726248@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -v 0 -t 8 -M H293 H395_1.fq.gz H395_2.fq.gz39SAM FormatTwo letter record type codeSQ: Reference sequence dictionarySN: Sequence nameLN: Sequence lengthPG:ProgramID:Unique identifierPN: Program nameVN: Program VersionCL: Command line

40. @HD: The header line – must be first line in file@HD VN:1.0 SO:coordinate@CO: One line text comment. Multiple @CO lines allowed@RG: Read Group…more on these laterMany other tags may be found for lines describedSee format specification for details40SAM Format: Other headersTagDescriptionValid valuesVNFormat version (required)Number i.e. 1.0SOSort orderunknown, unsorted, coordinate, queryname

41. Each line represents alignment of segment11 (or more) tab-delimited fieldsBeware 1-based vs 0-based coordinates…Additional optional fieldsFieldTypeDescriptionQNAMEStringQuery template nameFLAGIntegerbitwise flag – more laterRNAMEStringReference sequence namePOSInteger1-based leftmost mapping positionMAPQIntegerMapping qualityCIGARStringCIGAR string – more laterRNEXTStringRef name of paired readPNEXTIntegerPosition of paired readTLENIntegerTemplate lengthSEQStringSegment sequenceQUALStringSequence quality (Phred)SAM format: Alignment section41M00368:16:000000000-A0JCH:1:1:15079:1345 83 Chromosome 322231 60 151M = 322032 -350 CACAA C@CCA NM:i:0 MD:Z:151 MC:Z:151M AS:i:151 XS:i:0Coordinate comparison figure by Obi Griffith, Biostars

42. A diversion in binary and bitwise operations…A byte stores 8 bits of data allowing a value up to 255 to be storedWhy am I telling you this?SAM flag field is combination of 12 bitwise flags…Allows alignments to be filtered to isolate reads meeting particular criteria1286432168421Decimal value000001014+1=510101011128+32+8+2+1=17111111111128+64+32+16+8+4+2+1=255SAM Format: SAM Flags42

43. BitDescription1Read paired2Read mapped in proper pair4Read unmapped8Mate unmapped16Read on reverse strand32Mate on reverse strand64First in pair128Second in pair256Secondary alignment512Does not pass filters i.e. QC1024PCR/optical duplicate2048Supplementary alignmentFirst read on our SAM file: Flag is 83Only one way to make 83 64+16+2+1=83Read is paired, mapped in a proper pair, on the reverse strand and is first in pairNext read (pair): Flag is 163 128+32+2+1=163Read is paired, mapped in a proper pair, is on the reverse strand and is second in pairFortunately: samtools flagsSAM Format: SAM flags43204810245122561286432168421value00000101001183000010100011163

44. CIGAR string defines how read compares with referenceConcise Idiosyncratic Gapped Alignment ReportSequence of count and operators combined151M: 151 matches CACGATCA**GACCGATACGTCCGA CGATCAGAGA*CGATACigar string: 6M2I2M1D5MLimit of 65535 operationsMinimap2 -L argument stores CIGAR as supplementary tagOperatorDescriptionConsumes queryConsumes referenceMAlignment matchYesYesIInsertion to referenceYesNoDDeletion from referenceNoYesNSkipped region from referenceNoYesSSoft clippingYesNoHHard clippingNoNoPPaddingNoNo=Sequence matchYesYesXSequence mismatchYesYesSAM Format: CIGAR string44

45. Clipped alignmentsLocal alignments may not be aligned for full length of sequenceSubsequences at end may be clipped off GCTTAATGCGTGTGACAGTCGATGTACTGAC gtaGCGTGTGACAGTCGATcaCIGAR string: 3S16M2SSpliced alignmentscDNA to genomic alignments need to differentiate spanning introns from deletionsN: Skipped region from reference GCTTAATGCGTGTGACAGTCGATGTACTGAC AATG.........GTCGATCIGAR string: 4M9N6MSAM Format: CIGAR string45

46. Additional fields can be appended to each alignment lineTake the format of TAG:Type:ValueTag is string of two lettersStandard tag definitions:https://samtools.github.io/hts-specs/SAMtags.pdfTypeDescriptionACharacterBArrayfReal numberHHexadecimal arrayiIntegerZStringSAM Format: Additional tags46NM:i:0 MD:Z:151 MC:Z:151M AS:i:151 XS:i:0TagDescriptionNMEdit distanceMDString for mismatching positionsMCCIGAR string for mateASAlignment scoreXSScore of second best alignment (BWA specific tag)

47. Read groups allow combination of reads from different sources within one fileMap individual reads to e.g. sampleRequired by some toolsEntry in header for each read groupAdditional RG tag on each alignment mapping to read group i.e. RG:Z:1@RG ID:1 PL:ILLUMINA PU:1 BC:CGCTCAGTTC SM:H293@RG ID:2 PL:ILLUMINA PU:1 BC:TATCTGACCT SM:H355SAM Format: Read groups47TagDescriptionIDUnique identifierPLPlatformPUPlatform unit i.e. laneBCBarcodeSMSample

48. Read groups can be added by some aligners at run-timeBWA: -R argumentbwa –R "@RG\tID:1\tSM:H395\tPL:ILLUMINA\tPU:1" H293 H395_1.fq.gz H395_2.fq.gzPicard tools: set of utilities for manipulating NGS data[jabbott@login2 read_mapping]$ qrsh -pe smp 4[jabbott@lc6320-6-3 ~]$ cd read_mapping[jabbott@c6320-6-3 read_mapping]$ conda install picard[jabbott@c6320-6-3 read_mapping]$ picard[jabbott@c6320-6-3 read_mapping]$ picard AddOrReplaceReadGroups[jabbott@c6320-6-3 read_mapping]$ picard AddOrReplaceReadGroups I=H395.sam \ O=H395_RG.sam RGID=1 RGLB=Lib1 RGPL=ILLUMINA RGSM=H395 RGPU=148Adding read groups to a SAM file

49. Binary equivalent of SAMMuch more compact – should always be used in preference to SAMNot human readablesamtools view can be used to convert SAM/BAM/CRAM files[jabbott@c6320-6-3 read_mapping]$ samtools view[jabbott@c6320-6-3 read_mapping]$ samtools view -b -o H395_RG.bam -@ 4 H395_RG.sam[jabbott@c6320-6-3 read_mapping]$ du –h H395_RG.*124M H395_RG.bam325M H395_RG.samSamtools view outputs on STDOUT by defaultCan be used to view BAM files as SAMEverything: [jabbott@c6320-6-3 read_mapping]$ samtools view –h H395_RG.bamHeaders only: [jabbott@c6320-6-3 read_mapping]$ samtools view –H H395_RG.bamAlignments: [jabbott@c6320-6-3 read_mapping]$ samtools view H395_RG.bamBAM Format49Output bam formatOutput filenameRun 4 threads

50. 50BAM Formatsamtools can also read from STDINCan pipe SAM output from BWA straight to samtoolsUpdate your bwa script:#!/bin/bash#$ -cwd#$ -pe smp 8#$ -j y#$ -o job_logs/$JOB_NAME.$JOB_IDcp -v reference/H293.* ${TMPDIR}cp -v Streptococcus/H395_?.fq.gz ${TMPDIR}bwa mem -t 8 -M ${TMPDIR}/H293 ${TMPDIR}/H395_1.fq.gz ${TMPDIR}/H395_2.fq.gz | samtools view -b -o ${TMPDIR}/H395.bam -cp -v ${TMPDIR}/H395.bam .Now rerun it and look for your H395.bam file appearingTake a look at this with ‘samtools view’‘-’ in place of input filename - read from STDINscripts/bwa_2.sh

51. Post-processing51

52. BAM files generally need to be sorted by co-ordinateOutput of aligners will not generally be sortedCoordinate sorted file sorted first by reference name followed by base position[jabbott@login1 read_mapping]$ qrsh[jabbott@c6320-6-3 ~]$ cd read_mapping[jabbott@c6320-6-3 read_mapping]$ samtools sort [jabbott@c6320-6-3 read_mapping]$ samtools sort --threads 4 –o H395.sorted.bam H395.bamCheck the headers of H395.bam and H395.sorted.bam[jabbott@c6320-6-3 read_mapping]$ samtools view –H H395.bam[jabbott@c6320-6-3 read_mapping]$ samtools view –H H395.sorted.bamNow look at the sorted alignments – pay attention to the POS column[jabbott@c6320-6-3 read_mapping]$ samtools view H395.sorted.bam|less52BAM files: sorting and indexing

53. Indexes allow software to randomly access data within a BAM fileIndex is binary file with .bai filename suffixA BAM index can be created with samtools index[jabbott@c6320-6-3 read_mapping]$ samtools index [jabbott@c6320-6-3 read_mapping]$ samtools index -@ 4 H395.sorted.bamCheck the index is created (H395.sorted.bam.bai) with 'ls’[jabbott@c6320-6-3 read_mapping]$ rm H395.sorted.bam*Update bwa.sh to sort and index our bam file as we create it…#!/bin/bash#$ -cwd#$ -pe smp 8#$ -j y#$ -o job_logs/$JOB_NAME.$JOB_IDcp -v reference/H293.* ${TMPDIR}cp -v Streptococcus/H395_?.fq.gz ${TMPDIR}bwa mem -t 8 -M ${TMPDIR}/H293 ${TMPDIR}/H395_1.fq.gz ${TMPDIR}/H395_2.fq.gz | \ samtools sort -o ${TMPDIR}/H395.sorted.bam -samtools index ${TMPDIR}/H395.sorted.bamcp -v ${TMPDIR}/H395.sorted.* .BAM Format: Indexing53scripts/bwa_3.shAdd 'samtools index' commandreplace 'samtools view' command with 'samtools sort'Update copy command

54. (ngs_tools) ningal:read_mapping $ samtools flagstat H395.sorted.bam791395 + 0 in total (QC-passed reads + QC-failed reads)2947 + 0 secondary0 + 0 supplementary0 + 0 duplicates672450 + 0 mapped (84.97% : N/A)788448 + 0 paired in sequencing394224 + 0 read1394224 + 0 read2665336 + 0 properly paired (84.39% : N/A)668550 + 0 with itself and mate mapped953 + 0 singletons (0.12% : N/A)0 + 0 with mate mapped to a different chr0 + 0 with mate mapped to a different chr (mapQ>=5)Some basic statistics

55. Duplicate reads may occur due toOptical duplicates: One cluster being called as twoClustering: Occupying two adjacent wells during cluster generationPCR duplication: from amplification in sample prep.Duplicate reads can bias some analysis i.e. SNV identificationDon't provide any additional informationCan be either marked as duplicates (via the SAM flag 1024), or removed[jabbott@c6320-6-3 read_mapping]$ picard MarkDuplicates I=H395.sorted.bam \ O=H395.sorted.nodup.bam M=H395.duplicate_metrics.txtuse 'samtools flagstat' to compare the H395.sorted.nodup.bam stats with H395.sorted.bam55BAM files: marking duplicates

56. BAM files can be filtered to separate reads meeting particular criteriaInterested in the set of reads in your data which didn’t align?Flags: Read unmapped = 4, Mate unmapped = 8Therefore to extract read pairs where neither read is mapped: 8+4=12Samtools view can filter with the following arguments:-f INT only include reads with all of the FLAGs in INT present [0]-F INT only include reads with none of the FLAGS in INT present [0]-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]We want to only include reads with our flag present…[jabbott@c6320-6-3 read_mapping]$ samtools view –f 12 –b –o H395.unmapped.bam H395.sorted.bamCheck the contents of the new bam file with ‘samtools flagstat’We can convert these reads back to fastq format with picard[jabbott@c6320-6-3 read_mapping]$ picard SamToFastq I=H395.unmapped.bam \ F=H395.unmapped_1.fq.gz F2=H395.unmapped_2.fq.gz56BAM files: filtering

57. Want to separate alignments into chunks to allow parallel analysis?Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]Region defined as REF – select all alignments to specified referenceREF:start-end – select all alignments spanning coordinates on specified referenceSelect the reads from our alignment from between coordinates 1000-2000:[jabbott@c6320-6-3 read_mapping]$ samtools view H395.sorted.bam ‘Chromosome:1000-2000’See ‘samtools view’ for list of other filtering options57BAM files: filtering

58. Picard can report stats on the library insert size distribution:[jabbott@c6320-6-3 read_mapping]$ picard CollectInsertSizeMetrics[jabbott@c6320-6-3 read_mapping]$ picard CollectInsertSizeMetrics H=H395.pdf \ I=H395.sorted.bam O=H395.txtH395.txt includes stats such as median insert size, median absolute deviationWGS metrics (coverage)[jabbott@c6320-6-3 read_mapping]$ picard CollectRawWgsMetrics \ R=reference/H293.fa I=H395.sorted.bam O=H395.wgs_stats.txt58Other statistics of interest…

59. Viewing alignments in a genome browser59

60. Various browsers freely availableDiffer in capabilitiesPopular choices IGV (https://software.broadinstitute.org/software/igv/) IGB (http://bioviz.org) Tablet (https://ics.hutton.ac.uk/tablet/) Available through Apps Anywhere Search for tablet and start it up….Typically require:Reference sequence: fasta format and faidx indexedReference annotations: gff3 formatBam format alignmentsGenome browsers

61. 61Tablet – loading your dataIndexed, sorted bam filefaidx indexed fasta reference

62. 62Tablet - browsing

63. Multimapping Reads63

64. Interspersed repeats constitute~50% of human genomeUp to ~80% of graminae genomesHow can an aligner determine which copy of an identical repeat a read originates from?It can’t…Provide measure of confidence of mapping being correct – MAPQPhred-scaled likelihood Distinct from alignment scoreSequence repeats

65. Behavior of aligners differs when handling multimapping readsFor BWA:One mapping better scoring than others? MAPQ reduced accordinglyIdentically scoring mappings?MAPQ set to 0Additional tags added to alignment lineWhich mapping reported? Chosen at random…HWI-ST1398:100:C43B8ACXX:7:2207:16475:95789 163 1 152276198 0 100M = 152276256 158 ACTCA ?@=A? X0:i:2 X1:i:0 XA:Z:1,+152280086,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R65Multimapping readsX0: number of best hitsXA: Alternative hitsREF,POS,CIGAR,NM;

66. 66Multimapping reads: A real world example…

67. Bowtie2Default: Search for multiple alignments, report bestRandom selection of reported alignment-k: Search for one to k alignments, report eachAdditional alignments flagged as secondaryAlignments not found in particular order – not necessarily finding ‘best’ alignment before stopping-a: Search for and report all alignmentsReported in descending order by alignment scoreSlow…mrFAST:Designed to find all mappings as primary aimWorks with TARDIS SV detection package67Multimapping reads: Other Aligners

68. Mapping Workflows68

69. Workflow for alignment of paired-end Illumina style reads Creation of BWA/fasta indicesPer-sample:alignment with BWA-MEMDuplicates marked with Picard MarkDuplicatesGather alignment metrics (Picard):AlignmentSummaryMetricsBaseDistributionByCycleGcBiasMetricsInsertSizeMetricsOxoGMetricsWgsMetricsCreate MultiQC reportCreate summary HTML reportDAG Workflows: map-bwa-mem-pe

70. Setting up a workflow – dag-wf-setup:(dag-wf) ningal:read_mapping $ dag-wf-setup -i reads -d dag-wf-mapping -n Strep_mapping \ -w map-bwa-mem-pe -r reference/H293.faCreating dag-wf-map-bwa-mem-pe conda environment<snip>Populating dag-wf-mapping/fastq directory with data files... /homes/jabbott/read_mapping/reads/H395_1.fq.gz -> dag-wf-mapping/fastq/H395_1.fq.gz /homes/jabbott/read_mapping/reads/H395_2.fq.gz -> dag-wf-mapping/fastq/H395_2.fq.gz /homes/jabbott/read_mapping/reads/H411_1.fq.gz -> dag-wf-mapping/fastq/H411_1.fq.gz /homes/jabbott/read_mapping/reads/H411_2.fq.gz -> dag-wf-mapping/fastq/H411_2.fq.gz reference/H293.fa -> dag-wf-mapping/reference/H293.faPlease inspect the Snakefile in the 'dag-wf-mapping' directory and modify as required.The workflow can be run by running 'cd dag-wf-mapping; dag-wf-run'70DAG map-bwa-mem-pe workflow setupDirectory containing fastq readsDirectory to create for running workflowJob nameWorkflow requiredReference sequence fasta file

71. (dag-wf) ningal:read_mapping $ cd dag-wf-mapping(dag-wf) ningal:dag-wf-mapping $ dag-wf-runTypically takes ~7 mins to runAn HTML report file should be created in dag-wf-mappingBrowse to this via your network mount and double-click to open in a web browser71DAG map-bwa-mem-pe workflow: running

72. Any Questions?72