Slide1
Using the whole read: Structural Variation detection with RPSR
Presented by Derek Bickhart
Slide2
Presentation Outline
Variant classification and detection
Theory on read structure and bias
Simulations and real data
Slide3
Genetic Variation
Single nucleotide variations – SNP (human: millions of variants)
Indels – Insertions/Deletions (1 bp – 1000 bp)
Mobile Elements – SINE, LINE Transposition (300 bp – 6 kb)
Genomic structural variation (1 kb – 5 Mb)
Large-scale Insertions/Deletions (Copy Number Variation: CNV)
Segmental Duplications (> 1 kb, > 90% sequence similarity)
Chromosomal Inversions, Translocations, Fusions
How genomes change over time
Nuc
Chr
Slide4
SVs contribute to phenotype
Picture from Seo et al. 2007. BMC Genetics
Picture from Wright et al. 2009. PLoS Genet.
SOX5, KIT, ASIP
Nuc
Chr
Slide5
Tracking variants with LD
Nuc
Chr
In Linkage Disequilibrium
Form a Haplotype
A causative variant?
Slide6
Underestimating the number of genetic variants
Disease variants or new high-productivity QTV
Mostly impute these
Still some issues:
Relatively new mutations
Ref. assembly errors
Difficult-to-sequence locations
Can we get biological function from imputation?
Slide7
Sequencing to find novel variants
Raw Data
REF: ACGTAAAGGTACGACGATCGACG
     ACGTAAAGGGACG
       GTAAAGGTACGAC
       GTAAAGGTACGAC
            GGGACGACGATCGA
Highlighted in two reads: G where REF has T
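The pileup on this slide can be turned into a small worked example. This is only a sketch: the read offsets are inferred here as the positions at which each read lines up against REF, and `pileup_mismatches` is a hypothetical helper, not part of any variant caller mentioned in the talk.

```python
from collections import Counter

REF = "ACGTAAAGGTACGACGATCGACG"
# (inferred offset on REF, read sequence) for the four reads on the slide
READS = [
    (0, "ACGTAAAGGGACG"),
    (2, "GTAAAGGTACGAC"),
    (2, "GTAAAGGTACGAC"),
    (7, "GGGACGACGATCGA"),
]

def pileup_mismatches(ref, reads):
    """Return {ref position: Counter of non-reference bases} for ungapped reads."""
    mismatches = {}
    for offset, seq in reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if ref[pos] != base:
                mismatches.setdefault(pos, Counter())[base] += 1
    return mismatches

calls = pileup_mismatches(REF, READS)
```

Two of the four reads carry G at the same reference position where REF has T, which is the kind of support a SNP call is built on.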
Slide8
Making use of new variants
What do you do after variant calling?
Check for functional impact
Check for frequency
Variants can be placed on chips
Deletions can be tracked
WildType: ACGACTAGACGATGGACGA
Deletion: ACGACTA     TGGACGA
ACGACTA TGGACGA
ACGACTAG
ACGACTAT
Slide9
Presentation Outline
Variant classification and detection
Theory on read structure and bias
Simulations and real data
Slide10
Structural variant detection still proves to be a challenge
No Perfect Calls
Most datasets have high FDR
A majority of variants are missed
Good Performance
Pretty conservative
Moderate Performance
Too many False Positives
Poor performance
Slide11
Understanding the sequencing process
DNA sheared to fragments
Fragments follow a size distribution
Fragments sequenced from both sides
We don’t know the middle, but we know the size!
Slide12
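Since fragments follow a size distribution, a read pair whose mapped span falls far outside that distribution is a candidate SV signal. The sketch below assumes a median/MAD estimate and a three-deviation cutoff; the talk does not say how RPSR sets its thresholds, and the insert sizes are made up for illustration.

```python
import statistics

def insert_size_bounds(insert_sizes, n_dev=3.0):
    """Estimate (lower, upper) bounds for concordant pairs from median and MAD."""
    med = statistics.median(insert_sizes)
    mad = statistics.median([abs(x - med) for x in insert_sizes])
    # 1.4826 rescales the MAD to a standard deviation under a normal model
    spread = 1.4826 * mad
    return med - n_dev * spread, med + n_dev * spread

# Hypothetical mapped insert sizes (bp); one pair spans far more than the rest
sizes = [480, 495, 500, 505, 510, 520, 500, 1500]
lower, upper = insert_size_bounds(sizes)
discordant = [s for s in sizes if s < lower or s > upper]
```

Median and MAD are used instead of mean and standard deviation so that the outlying pairs being searched for do not inflate the bounds.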
Using information from alignments
Deletions
Reference Genome
ACGAGATAGTAGATACCATAGACG
ACGAGATAGT    ACCATAGACG
7bp
3bp
Insertions
Reference Genome
ACGAGATAGCCATAGACG
ACGAGATAG GG CCATAGACG
3bp
1bp
What the alignment should look like:
Slide13
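The deletion and insertion panels come down to comparing the span of a read pair on the reference with the expected fragment size: a deletion in the sample makes the pair map farther apart than expected, an insertion makes it map closer together. A minimal sketch, with hypothetical bounds and spans (`classify_span` is an illustrative name, not an RPSR function):

```python
def classify_span(mapped_span, lower, upper, expected):
    """Classify a read pair by its mapped span relative to the expected fragment size."""
    if mapped_span > upper:
        # Reference has extra sequence between the reads: deletion in the sample
        return ("deletion", mapped_span - expected)
    if mapped_span < lower:
        # Sample has extra sequence between the reads: insertion in the sample
        return ("insertion", expected - mapped_span)
    return ("concordant", 0)

# Hypothetical library: expected fragment 500 bp, concordant range 400-600 bp
examples = [classify_span(span, 400, 600, 500) for span in (900, 350, 510)]
```

The size returned is only a rough estimate, because each individual fragment deviates from the expected size.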
Using information from alignments
Tandem Duplication
Reference Genome
CGATAGACGAC GGAGAGAGATAG ACCCAGATAA
CGATAGACGAC GGAGAGAGATAG GGAGAGAGATAG ACCCAGATAA
What the alignment should look like:
Slide14
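For tandem duplications the signal is orientation rather than span. In a standard paired-end library the leftmost read maps to the forward strand and the rightmost to the reverse strand, while a pair spanning the junction between two tandem copies maps with the reads facing outward. The sketch below assumes that library layout; the tuple format and function name are illustrative only.

```python
def pair_orientation(pos_a, strand_a, pos_b, strand_b):
    """Return 'FR', 'RF', or 'same-strand' for a read pair on one chromosome."""
    # Order the two reads by mapped position, keeping their strands
    (_, left), (_, right) = sorted([(pos_a, strand_a), (pos_b, strand_b)])
    if left == "+" and right == "-":
        return "FR"  # inward facing: expected orientation
    if left == "-" and right == "+":
        return "RF"  # outward facing: tandem duplication candidate
    return "same-strand"  # both reads on one strand: inversion candidate

normal = pair_orientation(1000, "+", 1400, "-")
everted = pair_orientation(1400, "+", 1000, "-")
```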
Getting more from your reads
Reference Genome
ACGACGAGGGTGTGATTGACGATCGATA
Aligned Read: CGACGA
Unaligned Read: TTGCGA
Split Read: TTG CGA
Deletion call
ACGACGAGGGTGTGATTG     CGATA
Slide15
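The split-read idea can be sketched with the reference from this slide: cut the unaligned read in two, anchor the first part, and look for the second part further downstream. This is a toy exact-match search, not the aligner-based procedure a real caller would use, and the read below is a hypothetical one that is longer than the slide's TTGCGA so that each half is unambiguous.

```python
def split_read_deletion(ref, read, min_anchor=4):
    """Return (start, end) such that deleting ref[start:end] explains the read, else None."""
    if read in ref:
        return None  # the read aligns end to end; nothing to split
    for k in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:k], read[k:]
        p = ref.find(prefix)  # toy simplification: first exact match only
        if p == -1:
            continue
        left_end = p + k
        # The suffix must start strictly after the prefix ends
        s = ref.find(suffix, left_end + 1)
        if s != -1:
            return left_end, s
    return None

ref = "ACGACGAGGGTGTGATTGACGATCGATA"
breakpoints = split_read_deletion(ref, "GATTGCGATA")
```

Removing the reported interval from the reference gives ACGACGAGGGTGTGATTG followed by CGATA, matching the deletion call drawn on the slide.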
Overcoming read biases and creating useful information
Chemistry problems confuse detection
Alignment issues occur
Use clustering algorithm
Chimeric Read
Repeat
Repeat
Slide16
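The talk does not describe the clustering algorithm, so the following is a generic illustration of why clustering helps: a single chimeric or misaligned read stands alone, while a real variant is supported by several discordant reads close together. The gap and support thresholds are arbitrary assumptions, and the positions are made up.

```python
def cluster_positions(positions, max_gap=500, min_support=2):
    """Greedily group sorted positions; keep (start, end, count) for supported groups."""
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current group and start a new one
            if len(current) >= min_support:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_support:
        clusters.append((current[0], current[-1], len(current)))
    return clusters

# Hypothetical discordant-read positions; 9000 is an isolated artifact
supported = cluster_positions([1000, 1100, 1250, 9000, 20000, 20100])
```

Raising `min_support` is one simple way such a tool can be tuned to reduce false positives.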
Ease of Use
Designed to process BWA-aligned BAM files
Scalable to system resources
Multi-threaded
Tunable to reduce false positives
Slide17
RPSR: Read Pair, Split Read
Written in Java (version 8)
Currently two modes:
Preprocess
Cluster
Uses map-reduce paradigms for easy threading
Slide18
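RPSR itself is written in Java and its code is not shown in the talk. As a language-neutral illustration of the map-reduce pattern, here is a Python analogue: each chunk of records is processed independently (map), and the partial results are merged (reduce). The record layout and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_by_type(chunk):
    """Map step: tally candidate variant types within one chunk of records."""
    return Counter(record["type"] for record in chunk)

def parallel_tally(chunks, threads=2):
    """Run the map step on a thread pool, then merge the per-chunk tallies."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_by_type, chunks))
    # Reduce step: Counter addition merges two tallies
    return reduce(lambda a, b: a + b, partials, Counter())

chunks = [
    [{"type": "del"}, {"type": "dup"}],
    [{"type": "del"}],
]
tally = parallel_tally(chunks)
```

Because the map step shares no state between chunks, the number of threads can be scaled to the available system resources without changing the result.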
Presentation Outline
Variant classification and detection
Theory on read structure and bias
Simulations and real data
Slide19
Simulation dataset
Started with Cattle chr29
51 megabases
Acrocentric
Synthetic 10X coverage
Variants per simulated chromosome
Variant Type    Avg. Count    Avg. size
Deletion        12.3          350 bp
Tandem Dup      11.8          350 bp
Slide20
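How the simulated chromosomes were generated is not described in the talk. As a sketch of the basic operations involved, the two functions below plant a deletion and a tandem duplication in a sequence at chosen coordinates; they are checked against the toy sequences from the earlier alignment slides.

```python
def apply_deletion(seq, start, length):
    """Remove seq[start:start + length]."""
    return seq[:start] + seq[start + length:]

def apply_tandem_dup(seq, start, length):
    """Insert a second copy of seq[start:start + length] directly after the first."""
    segment = seq[start:start + length]
    return seq[:start + length] + segment + seq[start + length:]

# Toy references taken from the deletion and tandem duplication slides
deleted = apply_deletion("ACGAGATAGTAGATACCATAGACG", 10, 4)
duplicated = apply_tandem_dup("CGATAGACGACGGAGAGAGATAGACCCAGATAA", 11, 12)
```

A simulator would pick the coordinates at random and then generate synthetic reads from the altered sequence, keeping the planted coordinates as the truth set.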
Comparison Program
Delly / Duppy
Rausch et al. 2012. Bioinformatics
Combined read pair, split read caller
Discovers discordant reads, then uses split reads to validate
Run with default settings and in “split read” mode
Slide21
Program results: Simulation
[Figure: Duplications and Deletions panels; categories: True Positives, Actual Positives, Total Calls]
Slide22
Program results: Simulation
RPSR vs Delly/Duppy precision
RPSR is far more precise than Delly / Duppy
Program               Precision
RPSR Dels             8.7%
Delly Dels            0.9%
RPSR Tandem           73.7%
Duppy Duplications    1.8%
Slide23
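Precision here is the fraction of a program's calls that are true positives, which ties the table to the True Positives and Total Calls categories on the preceding results slide. The counts in the example are invented for illustration and are not taken from the talk.

```python
def precision(true_positives, total_calls):
    """Fraction of calls that are correct; defined as 0.0 when there are no calls."""
    if total_calls == 0:
        return 0.0
    return true_positives / total_calls

# Hypothetical counts: 3 correct calls out of 4 reported
example = precision(3, 4)
```

Precision says nothing about missed variants; that is measured by recall, true positives divided by actual positives.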
Real Dataset: Angus Individual
Provided by Jerry Taylor and Bob Schnabel
Sequence statistics
20X coverage of Illumina reads
Reads quality trimmed
Aligned with BWA to UMD3.1
Slide24
Program results: Angus
RPSR
Variant type    Total Calls    Avg. Size (bp)    Total Length    Largest call
Deletions       4171           237               991 kb          43.7 kb
Duplications    9617           554               5,335 kb        152.0 kb
Delly/Duppy
Variant type    Total Calls    Avg. Size (bp)    Total Length    Largest call
Deletions       1867           1,304,683         2.4 Gb          149 Mb
Duplications    10263          232,472           2.38 Gb         150 Mb
Slide25
Conclusions
Structural variants are a type of mutation we can track inexpensively
Accurate assessment of SVs is needed
Future developments for RPSR
Smaller resource footprint
Automatic threshold detection
Slide26
Acknowledgements
Colleagues at the USDA
George Liu
Tad Sonstegard
Curt Van Tassell
AGIL
Jerry Taylor
Bob Schnabel
The Reecy Lab
Projects mentioned in this presentation were funded in part by NRI grant numbers 2007-35205-17869 and 2011-67015-30183, and by USDA NIFA grant number 2013-00831
Slide27
Questions?
Slide28
Delly/Duppy
After removing the initial calls greater than 1 megabase and then merging:
Variant type    Total Calls    Avg. Size (bp)    Total Length    Largest call
Deletions       1833           5219              9.6 Mb          885 kb
Duplications    974071217118 Mb                                  2.3 Mb
Slide29
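The post-processing described for this table, dropping calls longer than 1 megabase and then merging, can be sketched as an interval operation. The talk does not give the exact merge rule; the version below assumes calls on a single chromosome as (start, end) pairs and merges any that overlap, with made-up coordinates.

```python
def filter_and_merge(calls, max_len=1_000_000):
    """Drop calls longer than max_len, then merge overlapping (start, end) intervals."""
    kept = sorted((s, e) for s, e in calls if e - s <= max_len)
    merged = []
    for start, end in kept:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical calls: two overlapping, one separate, one implausibly large
cleaned = filter_and_merge([(100, 500), (400, 900), (2000, 2500), (0, 5_000_000)])
```

Filtering before merging matters: a single huge call left in the set would absorb every call it overlaps.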
Pipeline: Resource Consumption