/
Comparison of University of Washington and Broad Institute exome sequencing of NA12878 Comparison of University of Washington and Broad Institute exome sequencing of NA12878

Comparison of University of Washington and Broad Institute exome sequencing of NA12878 - PowerPoint Presentation

carny
carny . @carny
Follow
66 views
Uploaded On 2023-06-25

Comparison of University of Washington and Broad Institute exome sequencing of NA12878 - PPT Presentation

ESP Meeting April 6 2010 1 Executive Summary Sequenced sample NA12878 at University of Washington UW and the Broad Institute BI shared the resulting data and compared SNP calls Once a common postsequencingandalignment analysis strategy is applied UW and BI callsets ID: 1003196

data broad calls indel broad data indel calls snps quality coverage variants snp reads sequencing false target sites cleaned

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Comparison of University of Washington a..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

1. Comparison of University of Washington and Broad Institute exome sequencing of NA12878ESP Meeting, April 6, 20101

2. Executive SummarySequenced sample NA12878 at University of Washington (UW) and the Broad Institute (BI), shared the resulting data and compared SNP calls.Once a common, post-sequencing-and-alignment analysis strategy is applied, UW and BI callsets overlap 98% for all calls, 90% for novel calls.Differences in novel call overlap are primarily due to alignment artifacts and lack of depth in the non-calling set.2

3. Data and Definitions: reads and alignments are generated by similar but distinct meansSample: NA12878Data:UW reads and alignmentsNimbleGen solution-based capture2 paired-end lanes76-bp readsAligned with BWABI reads and alignmentsSolution-based hybrid selection4 paired-end lanes3 76-bp read lanes and 1 101-bp read laneAligned with MAQ3

4. Data and Definitions: UW and BI downstream data-processing pipelines are synchronizedUW and BI data:PCR duplicates marked with Picard MarkDuplicates enables downstream applications to ignore these readsQuality-score recalibrated with GATK recalibrator more accurate estimates of variation probability can be computedIndel artifacts cleaned with GATK indel realigner prevents mismodeled indels for be misinterpreted as false-positive SNPsSubsetted to consensus intervalsUW and BI SNP calls:Generated with GATK UnifiedGenotyperMinimum base quality of 10Minimum mapping quality of 10Post-calling filters applied, rejected SNPs meeting any of the following criteriaQUAL ≤ 50.0 (discovery confidence is too low)AB > 0.75 (proportion of pileup’s reference bases is too high)QD < 5.0 (ratio of discovery confidence to depth is too low)HRun > 3 (SNP is found in > 3-base homopolymer run with matching alternate allele4

5. Evaluating callset quality with several metricsDid I get the right number of calls?The number of SNP calls should be close to the average human heterozygosity in exons (somewhere between 1 variant per 1600-1850 bases)Only detects gross under/over callingWhat fraction of my calls are already known?dbSNP catalogs most common variation, so most of the true variants found will be in dbSNPFor single sample calls, ~90 of variants should be in dbSNPNeed to adjust expectation when considering calls across samplesConcordance with other genotyping/sequencing efforts?Can use Hapmap to measure performance at millions of known sites (should be >99.5% consistent with Hapmap results and >99% of the variable sites should be found)Can compare to sequencing from other efforts (1,000 Genomes, Complete Genomics, and recent GA2 WGS sequencing)Great for this analysis, but future callset evaluations may not have the benefit of such comparison data.Reasonable transition to transversion ratio (Ti/Tv)?Transitions are twice as frequent as transversions in whole genome (see Ebersberger, 2002)Validated human SNP data suggests that the Ti/Tv should be ~2.1 genome-wide and ~3.3 in whole-exomeFP SNPs should have Ti/Tv around 0.5Ti/Tv is a good metric for assessing SNP call quality when genotyping/sequencing comparison data is unavailable.AGTCtransversionstransitions

6. 80-90 million reads in exome after filtering, constituting median coverage of ~80x per center6Many reads drop out for being off-target or low-quality, but reads that remain still constitute extremely deep coverage over most of the target.UW and BI coverage is comparable.(from Aaron Mackey)(from Mark Rieder)For this development exercise, Broad data was run with 4 lanes. In general for ESP, Broad will be running 2-3 lanes.

7. With an identical SNP-calling pipeline, UW and BI callsets are incredibly similar7UWcleanedBroadcleanedExpectation2Target regionUW and Broad consensus target intervals (25 Mb region)Q1,2,3 coverage57x, 126x, 246x82x, 170x, 295x30-60X, WGSAll variants114,57814,575~13.5KTi/Tv (all)3.273.25~3.3Known variants313,84913,816~13KTi/Tv (known)3.303.28~3.3Novel variants696728~600Ti/Tv (novel)2.872.81~2.9After indel cleaning, UW and BI have extremely similar callsets. Broad calls ~30 fewer known sites, and ~30 more novel sites.“All variants” is not simply the sum of known and novel variants. Another section that specifies known false-positives (SNPs at known dbSNP indel sites) is not shown and accounts for the remainder.Average of 1KG, Complete Genomics, and recent GA2 sequencing of NA12878Present in dbSNP build 129 (the last version of dbSNP before the addition of calls from the 1,000 Genomes Project).

8. Union of variants: 14,104, (670 novel);~98% overlap between callsets, ~90% for novels8Broad-onlyAll variants: 163Known: 118Novel: 438 corroborated by 1KG, new GA2 WGS, or CGUW-onlyAll variants: 135Known: 102Novel: 298 corroborated by 1KG, new GA2 WGS, or CGBothAll variants: 13,806 (Ti/Tv: 3.34)Known: 13,182 (Ti/Tv: 3.35)Novel: 598 (Ti/Tv: 3.07)

9. Example: a seemingly true, novel heterozygous call, corroborated in both datasets9UW (indel cleaned)Broad (indel cleaned)chr1:85,659,732-85,659,821This SNP has evidence in both the UW and Broad alignments, suggesting the validity of this site.

10. Example: MAQ marks many reads as having MQ0, but BWA appears (over?)confident10UW (indel cleaned)Broad (indel cleaned)chr1:246,804,138-246,804,297The presence of these reads with multiple plausible homes (mapping-quality-zero, indicated by the lighter-colored reads) suggests MAQ had difficulty uniquely aligning these reads.Whether BWA really does get it right requires further study.

11. Example: a site not called in the Broad dataset due to lack of depth11chr4:1,378,433-1,379,422UW (indel cleaned)Broad (indel cleaned)This site has ~30x coverage in the UW data, but < 10x coverage in the Broad data.Though I’ve marked this as a “lack of depth” error, this coverage drop-off suggests some other sequence-motif-related issue at this locus.With so many “SNPs” in the vicinity – far more than we’d expect – this is quite likely a false-positive site.

12. ConclusionUW and Broad callsets overlap by ~98% for all calls, 90% for novel calls.Discrepancies in private novel calls exhibit multiple pathologies, including alignment artifacts, lack of sufficient depth, inhospitable sequence motifs, etc.Quality of the callsets seems very high (will be even better when multiple samples are called simultaneously)12

13. ESP: EOMIProject Summary (C315)Used samples147/200Unused samples0 low quality, 0 with no usable lanes, 53 in flightSequencing protocolHybrid selectionBait designwhole_exome_agilent_designed_120Target size32,206,937 basesSequencing SummarySequencerIllumina GA2Used lanes433/436Unused lanes3 rejected by sequencing, 0 by analysisUsed lanes/sample3.0 ± 0.2 (median=3)Lane parities433 pairedRead lengths75.8 ± 0.2Sequencing dates2010-03-03 to 2010-03-15Bases Summary (excluding unused lanes/samples)Per lanePer sampleReads65M ± 6M190M ± 24MUsed bases3.5B ± 0.4B6B ± 1BTarget coverage63x ± 11x177x ± 37x% loci > 10x covered85% ± 11%90% ± 14%% loci > 20x covered76% ± 12%86% ± 15%% loci > 30x covered68% ± 11%83% ± 15%Variant SummaryFoundEst. FP rateAll SNPs~115K~5%Known SNPs~57K<1%Novel SNPs~57K~10%Indels/CNVs(functionality not available yet)3/9/10Breakdown of functional SNPsTi/Tv ratio for known and novel SNPs

14. AppendixComparison of UW and Broad exome sequencing of NA1287814

15. Example: Broad data contains clear evidence for an indel, while UW data contains none 15This site seems to be a clear indel in the Broad data, but no evidence for it appears in the UW data.If this is a heterozygous indel, BWA may not have been able to handle the reads containing the indel, instead choosing to not align them at all.On the other hand, if that were true, one might expect a corresponding drop off in coverage at this locus.Interpretation is unclear.UW (indel cleaned)Broad (indel cleaned)chr7:128,374,559-128,374,660

16. Low error rates, high quality scores per cycle for both UWash and Broad sequencing16Mean quality score per cycle (above) and error rate per cycle (below) for each UWash and Broad NA12878 lane.The error rates are very low for each lane (there are some spikes in the Broad rates, seemingly sequencer hiccups).As expected with low error rates, quality scores remain high through length of the read.The Broad SNP caller could be set with a very high base quality score limit (Q20) and retain the bulk of the data and read length.

17. Correlation between target coverage in Uwash and Broad capture is not immediately seen17Average coverage for 10,000 randomly selected overlapping targets between UWash and Broad target definitions. Targets are sorted by average coverage per target in Broad data.A correlation between UWash and Broad coverage is not immediately apparent. It is therefore not immediately clear why some Broad targets seem to fail while those same targets in UWash data succeed, and vice versa.

18. Most private variants are missed due to alignment artifactsWhy it was exclusive1UWBINo depth in other BAM55Alignment artifact58Unknown11Genomic artifact (repeats, etc.)21Indel disagreement2018Common pathologies for private SNPs, as determined by visual inspection of the alignments and putative SNP events in IGV.Most missed items seem to be due to alignment artifacts and missing depth (see following slides).These private variants are also undoubtedly enriched for false-positives.1. This is a very subjective measurement. More careful examination will be required to determine the precise nature of these artifacts.

19. With an identical SNP-calling pipeline, UWash and Broad callsets are incredibly similar19UWashcleanedBroadcleaned1KG expectationCG expectationRecent GA2 expectationTarget regionUWash and Broad consensus target intervals (24 Mb region)Q1,2,3 coverage57x, 126x, 246x82x, 170x, 295x60X, WGS55X, WGS25X, WGSAll variants114,57814,57512,34914,71613,153Ti/Tv (all)3.273.253.413.223.29Known variants13,84913,81611,83113,96512,534Ti/Tv (known)3.303.283.433.273.31Novel variants696728503725595Ti/Tv (novel)2.87(%FP = 14%)2.81(%FP = 17%)3.19(%FP=4%)2.57(%FP=26%)3.02(%FP=10%)After indel cleaning, UWash and Broad have extremely similar callsets. Broad calls ~30 fewer known sites, and ~30 more novel sites.“All variants” is not simply the sum of known and novel variants. Another section that specifies known false-positives (SNPs at known dbSNP indel sites) is not shown and accounts for the remainder.Based on 1KG, Complete Genomics, and recent GA2 sequencing of NA12878

20. Transition/transversion ratio (Ti/Tv) can be used to estimate false-positive rateThe observed Ti/Tv is the result of a mixture of true-positive SNPs and false-positive SNPs.Assume that the Ti/Tv that should be observed if one only includes true-positives is that of the known SNPs in the target region (for whole-exome data, Ti/TvTP = 3.3).Assume that the Ti/Tv that should be observed if one only includes false-positives is that of a purely random event (for each base, 1 transition mutation and 2 transversions possible, hence Ti/TvFP = 0.5).Thus,α(Ti/TvTP) + (1-α)(Ti/TvFP) = Ti/Tvobserved, where α is the mixture coefficient specifying the proportion of SNPs that are true-positives.The term (1-α) specifies the proportion of false-positives, so(1-α) = %FP = 1 – (Ti/Tvobserved – Ti/TvFP)/(Ti/TvTP – Ti/TvFP)20

21. 21

22. 22

23. 23

24. 24

25. 25

26. Calls per center using center-specific target definitions26UWashcleanedBroadcleanedTarget region25 Mb27 Mb regionQ1,2,3 coverage57x, 126x, 246x82x, 170x, 295xAll variants115,43616,605Ti/Tv (all)3.193.25Known variants14,64215,707Ti/Tv (known)3.233.30Novel variants753859Ti/Tv (novel)2.71(%FP = 25%)2.66(%FP = 26%)1. “All variants” is not simply the sum of known and novel variants. Another section that specifies known false-positives (SNPs at known dbSNP indel sites) is not shown and accounts for the remainder.