Slide1
Using the whole read: Structural Variation detection with RPSR
Presented by Derek Bickhart
Slide2
Presentation Outline
Variant classification and detection
Theory on read structure and bias
Simulations and real data
Slide3
Genetic Variation
Single nucleotide variations: SNP (human: millions of variants)
Indels: Insertions/Deletions (1 bp to 1000 bp)
Mobile Elements: SINE, LINE Transposition (300 bp to 6 kb)
Genomic structural variation (1 kb to 5 Mb)
Large-scale Insertions/Deletions (Copy Number Variation: CNV)
Segmental Duplications (> 1 kb, > 90% sequence similarity)
Chromosomal Inversions, Translocations, Fusions
How genomes change over time
Figure labels: Nuc, Chr
Slide4
SVs contribute to phenotype
Picture from Seo et al. 2007. BMC Genetics
Picture from Wright et al. 2009. PLoS Genet.
Figure labels: SOX5, KIT, ASIP; Nuc, Chr, KIT
Slide5
Tracking variants with LD
Figure labels: Nuc, Chr
In Linkage Disequilibrium
Form a Haplotype
A causative variant?
Slide6
Underestimating the number of genetic variants
Disease variants or new high productive QTV
Mostly impute these
Still some issues:
Relatively new mutations
Reference assembly errors
Difficult-to-sequence locations
Can we get biological function from imputation?
Slide7
Sequencing to find novel variants
Raw Data
REF: ACGTAAAGGTACGACGATCGACG
     ACGTAAAGGGACG
       GTAAAGGTACGAC
       GTAAAGGTACGAC
            GGGACGACGATCGA
The G that differs from the reference T is highlighted in the first and fourth reads.
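The alignment on this slide can be turned into a variant call by counting the bases observed at each reference position. The following is a minimal sketch, assuming ungapped reads whose alignment offsets are already known. The offsets were worked out by hand for this toy example, and the function name `call_snps` and the `min_alt_reads` threshold are illustrative choices, not part of any tool named in this talk.

```python
from collections import Counter

REF = "ACGTAAAGGTACGACGATCGACG"
# (0-based alignment start, read sequence); offsets are assumed, not computed
READS = [
    (0, "ACGTAAAGGGACG"),
    (2, "GTAAAGGTACGAC"),
    (2, "GTAAAGGTACGAC"),
    (7, "GGGACGACGATCGA"),
]

def call_snps(ref, reads, min_alt_reads=2):
    """Return (1-based position, ref base, alt base, alt count, depth) tuples."""
    pileup = [Counter() for _ in ref]
    for start, seq in reads:
        for i, base in enumerate(seq):
            pileup[start + i][base] += 1
    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        for base, n in counts.items():
            # Report a non-reference base only if enough reads support it.
            if base != ref[pos] and n >= min_alt_reads:
                calls.append((pos + 1, ref[pos], base, n, depth))
    return calls

print(call_snps(REF, READS))  # [(10, 'T', 'G', 2, 4)]
```

Two of the four reads covering position 10 carry G instead of the reference T, which is the pattern expected for a heterozygous SNP at this depth.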
Slide8
Making use of new variants
What do you do after variant calling?
Check for functional impact
Check for frequency
Variants can be placed on chips
Deletions can be tracked
WildType: ACGACTAGACGATGGACGA
Deletion: ACGACTA     TGGACGA
ACGACTA TGGACGA
ACGACTA G
ACGACTAT
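One way to read the short sequences on this slide is as probes that share the left flank of the deletion and differ only in the first base after the breakpoint. The sketch below builds such a pair under that assumption. The deletion coordinates (0-based start 7, length 5) are inferred by comparing the two sequences on the slide, and `junction_probes` and its `flank` parameter are hypothetical names used only for illustration.

```python
WILDTYPE = "ACGACTAGACGATGGACGA"

def junction_probes(wildtype, del_start, del_len, flank=7):
    """Build wild-type and deletion probes: left flank plus the next base.

    Returns (deletion allele, wild-type probe, deletion probe).
    """
    left = wildtype[del_start - flank:del_start]
    deleted = wildtype[:del_start] + wildtype[del_start + del_len:]
    wt_probe = left + wildtype[del_start]   # base that follows in the wild type
    del_probe = left + deleted[del_start]   # base that follows after the deletion
    return deleted, wt_probe, del_probe

print(junction_probes(WILDTYPE, 7, 5))
# ('ACGACTATGGACGA', 'ACGACTAG', 'ACGACTAT')
```

The two probes distinguish the alleles only when the base after the breakpoint differs between them, as it does here (G versus T).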
Slide9
Presentation Outline
Variant classification and detection
Theory on read structure and bias
Simulations and real data
Slide10
Structural variant detection still proves to be a challenge
No Perfect Calls
Most datasets have high FDR
A majority of variants are missed
Good Performance
Pretty conservative
Moderate Performance
Too many False Positives
Poor performance
Slide11
Understanding the sequencing process
DNA sheared to fragments
Fragments follow a size distribution
Fragments sequenced from both sides
We don’t know the middle, but we know the size!
Slide12
Using information from alignments
Deletions
Reference Genome:
ACGAGATAGTAGATACCATAGACG
What the Alignment should look like:
ACGAGATAGT    ACCATAGACG
Labels: 7bp, 3bp
Insertions
Reference Genome:
ACGAGATAGCCATAGACG
What the Alignment should look like:
ACGAGATAG GG CCATAGACG
Labels: 3bp, 1bp
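Because fragments follow a size distribution, a read pair whose mapped span is far from the typical fragment size hints at a variant: a larger span on the reference suggests a deletion in the sample, a smaller one an insertion. The sketch below illustrates that idea with a median and MAD model. The function names, the `n_mads` cutoff, and the example spans are assumptions for illustration and are not taken from RPSR.

```python
from statistics import median

def insert_size_model(spans):
    """Median and median absolute deviation (MAD) of observed pair spans."""
    med = median(spans)
    mad = median(abs(s - med) for s in spans)
    return med, mad

def classify_pair(span, med, mad, n_mads=3):
    """Label one read pair by how its mapped span compares to the library."""
    tol = n_mads * max(mad, 1)  # guard against a zero MAD
    if span > med + tol:
        return "deletion"       # mates map farther apart than expected
    if span < med - tol:
        return "insertion"      # mates map closer together than expected
    return "concordant"

# Hypothetical spans in bp; the last one is an outlier.
spans = [480, 495, 500, 505, 510, 520, 1500]
med, mad = insert_size_model(spans)
print(med, mad)                        # 505 10
print(classify_pair(1500, med, mad))   # deletion
print(classify_pair(300, med, mad))    # insertion
```

A median-based model is used here because a few discordant pairs would otherwise inflate a mean and standard deviation.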
Slide13
Using information from alignments
Tandem Duplication
Reference Genome:
CGATAGACGAC GGAGAGAGATAG ACCCAGATAA
What the Alignment should look like:
CGATAGACGAC GGAGAGAGATAG GGAGAGAGATAG ACCCAGATAA
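In standard paired-end libraries the forward-strand mate normally maps upstream of the reverse-strand mate. A pair that spans the junction between two tandem copies tends to map in the opposite, everted arrangement. The sketch below checks only that orientation signature. The function name and the strand encoding (`"+"`, `"-"`) are assumptions for illustration.

```python
def pair_orientation(pos1, strand1, pos2, strand2):
    """Return 'inward', 'everted', or 'same-strand' for one read pair."""
    if strand1 == strand2:
        return "same-strand"  # often associated with inversions
    # Order the mates by leftmost mapped coordinate.
    (left_pos, left_strand), _ = sorted([(pos1, strand1), (pos2, strand2)])
    return "inward" if left_strand == "+" else "everted"

print(pair_orientation(100, "+", 400, "-"))  # inward: the expected layout
print(pair_orientation(400, "+", 100, "-"))  # everted: tandem duplication signal
```

In practice this signal would be combined with the span and split-read evidence rather than used on its own.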
Slide14
Getting more from your reads
Reference Genome: ACGACGAGGGTGTGATTGACGATCGATA
Aligned Read: CGACGA
Unaligned Read: TTGCGA
Split Read: TTG CGA
Figure labels: Unaligned, Aligned
Deletion call: ACGACGAGGGTGTGATTG     CGATA
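The split-read idea on this slide can be sketched as a search for a cut point where the head of the read matches the reference exactly and the tail matches further downstream. This is a minimal sketch, not RPSR's implementation. The six-base read on the slide is too short to place unambiguously (it is also consistent with a 1 bp deletion), so the example uses a longer hypothetical read taken from the deletion allele shown above. `split_read_deletion` and `min_anchor` are illustrative names.

```python
REF = "ACGACGAGGGTGTGATTGACGATCGATA"

def split_read_deletion(ref, read, min_anchor=3):
    """Find a deletion that lets both halves of `read` match `ref` exactly.

    Returns (0-based deletion start, deletion length) or None.
    """
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        head, tail = read[:cut], read[cut:]
        start = ref.find(head)
        while start != -1:
            del_start = start + cut
            # The tail must begin at least one base past the head's end.
            tail_pos = ref.find(tail, del_start + 1)
            if tail_pos != -1:
                return del_start, tail_pos - del_start
            start = ref.find(head, start + 1)
    return None

call = split_read_deletion(REF, "GTGATTGCGATA")
print(call)                            # (18, 5)
print(REF[call[0]:call[0] + call[1]])  # ACGAT
```

Exact matching keeps the sketch short; real reads would need mismatch tolerance and a rule for breakpoints that can shift within repeated bases.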
Slide15
Overcoming read biases and creating useful information
Chemistry problems confuse detection
Alignment issues occur
Use clustering algorithm
Figure labels: Chimeric Read, Repeat, Repeat
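Clustering helps because a single chimeric or misaligned read rarely agrees with its neighbours, while a real variant is usually supported by several reads with nearby breakpoints. The slide does not say which clustering algorithm is used, so the sketch below swaps in a simple one-dimensional gap-based grouping. `max_gap` and `min_support` are hypothetical tuning parameters.

```python
def cluster_breakpoints(positions, max_gap=50, min_support=2):
    """Group breakpoint positions; keep clusters with enough supporting reads.

    Returns (first position, last position, read count) per kept cluster.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        # Start a new cluster when the gap to the previous position is too large.
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [(c[0], c[-1], len(c)) for c in clusters if len(c) >= min_support]

# Hypothetical breakpoints; the lone read at 5000 is dropped.
print(cluster_breakpoints([1000, 1010, 1025, 5000, 9000, 9040]))
# [(1000, 1025, 3), (9000, 9040, 2)]
```

Raising `min_support` trades sensitivity for fewer false positives.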
Slide16
Ease of Use
Designed to process BWA-aligned BAM files
Scalable to system resources
Multi-threaded
Tunable to reduce false positives
Slide17
RPSR: Read Pair, Split Read
Written in Java (version 8)
Currently two modes:
Preprocess
Cluster
Uses map-reduce paradigms for easy threading
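The map-reduce pattern mentioned here splits the work into independent chunks (map) and then combines the partial results (reduce), which makes it straightforward to spread across threads. RPSR itself is written in Java. The Python sketch below only illustrates the pattern, and the chunk contents (boolean discordant-pair flags) and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_discordant(chunk):
    """Map step: count discordant-pair flags in one chunk of reads."""
    return sum(1 for is_discordant in chunk if is_discordant)

def map_reduce(chunks, workers=4):
    """Run the map step on a thread pool, then sum the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_discordant, chunks)
        return reduce(lambda a, b: a + b, partials, 0)

# Hypothetical chunks, e.g. one per chromosome region.
print(map_reduce([[True, False], [True, True], []]))  # 3
```

Because each chunk is processed independently, the number of workers can be matched to the available cores without changing the result.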
Slide18
Presentation Outline
Variant classification and detection
Theory on read structure and bias
Simulations and real data
Slide19
Simulation dataset
Started with Cattle chr29
51 megabases
Acrocentric
Synthetic 10X coverage
Variants per simulated chromosome
Variant Type   Avg. Count   Avg. size
Deletion       12.3         350 bp
Tandem Dup     11.8         350 bp
Slide20
Comparison Program
Delly / Duppy
Rausch et al. 2012. Bioinformatics
Combined read pair, split read caller
Discovers discordant reads, then uses split reads to validate
Run with default settings and in “split read” mode
Slide21
Program results: Simulation
Figure labels: Duplications, Deletions, True Positives, Actual Positives, Total Calls
Slide22
Program results: Simulation
RPSR vs Delly/Duppy precision
RPSR is far more precise than Delly / Duppy
Program              Precision
RPSR Dels            8.7%
Delly Dels           0.9%
RPSR Tandem          73.7%
Duppy Duplications   1.8%
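Precision is conventionally the fraction of calls that are true positives, and recall the fraction of actual positives that are recovered. Those are the three quantities labelled on the previous slide. The sketch below shows the arithmetic with hypothetical counts, since the slides report only the resulting percentages.

```python
def precision(true_positives, total_calls):
    """Fraction of calls that are correct."""
    return true_positives / total_calls if total_calls else 0.0

def recall(true_positives, actual_positives):
    """Fraction of real variants that were called."""
    return true_positives / actual_positives if actual_positives else 0.0

# Hypothetical counts: 10 calls, 8 of them correct, 12 real variants.
print(f"{precision(8, 10):.1%}")  # 80.0%
print(f"{recall(8, 12):.1%}")     # 66.7%
```

A caller can score well on one measure and poorly on the other, so both are worth checking against a simulated truth set.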
Slide23
Real Dataset: Angus Individual
Provided by Jerry Taylor and Bob Schnabel
Sequence statistics
20 X coverage of Illumina reads
Reads quality trimmed
Aligned with BWA to UMD3.1
Slide24
Program results: Angus
RPSR
Variant type   Total Calls   Avg. Size (bp)   Total Length   Largest call
Deletions      4171          237              991 kb         43.7 kb
Duplications   9617          554              5,335 kb       152.0 kb
Delly/Duppy
Variant type   Total Calls   Avg. Size (bp)   Total Length   Largest call
Deletions      1867          1,304,683        2.4 Gb         149 Mb
Duplications   10263         232,472          2.38 Gb        150 Mb
Slide25
Conclusions
Structural variants are a type of mutation we can track inexpensively
Accurate assessment of SVs is needed
Future developments for RPSR
Smaller resource footprint
Automatic threshold detection
Slide26
Acknowledgements
Colleagues at the USDA
George Liu
Tad Sonstegard
Curt Van Tassell
AGIL
Jerry Taylor
Bob Schnabel
The Reecy Lab
Projects mentioned in this presentation were funded in part by NRI grant numbers 2007-35205-17869 and 2011-67015-30183, and by USDA NIFA grant number 2013-00831
Slide27
Questions?
Slide28
Delly/Duppy
After removing the initial calls greater than 1 megabase and then merging:
Variant type   Total Calls   Avg. Size (bp)   Total Length   Largest call
Deletions      1833          5219             9.6 Mb         885 kb
Duplications   974071217118 Mb   2.3 Mb
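The post-processing described on this slide, dropping calls longer than 1 megabase and then merging the rest, can be sketched as a filter followed by an interval merge. The slide does not specify the merge rule, so the sketch assumes that calls on the same chromosome are merged when they overlap. The function name and the `(chrom, start, end)` tuple layout are illustrative.

```python
def filter_and_merge(calls, max_len=1_000_000):
    """Drop calls longer than max_len, then merge overlapping intervals.

    calls: iterable of (chrom, start, end) with start < end.
    """
    kept = sorted(c for c in calls if c[2] - c[1] <= max_len)
    merged = []
    for chrom, start, end in kept:
        # Extend the previous interval when this one overlaps it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]

# Hypothetical calls; the 2 Mb call is removed before merging.
calls = [("chr1", 100, 500), ("chr1", 400, 900),
         ("chr1", 0, 2_000_000), ("chr2", 100, 200)]
print(filter_and_merge(calls))
# [('chr1', 100, 900), ('chr2', 100, 200)]
```

Filtering before merging matters: a single oversized call would otherwise absorb every smaller call it overlaps.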
Slide29
Pipeline: Resource Consumption