Using the whole read: Structural Variation detection with RPSR

Uploaded On 2016-03-17

Presentation Transcript

Slide1

Using the whole read: Structural Variation detection with RPSR

Presented by Derek Bickhart

Slide2

Presentation Outline

Variant classification and detection

Theory on read structure and bias

Simulations and real data

Slide3

Genetic Variation

How genomes change over time

Single nucleotide variations – SNP (human: millions of variants)

Indels – Insertions/Deletions (1 bp – 1000 bp)

Mobile Elements – SINE, LINE Transposition (300 bp – 6 kb)

Genomic structural variation (1 kb – 5 Mb)

Large-scale Insertions/Deletions (Copy Number Variation: CNV)

Segmental Duplications (> 1 kb, > 90% sequence similarity)

Chromosomal Inversions, Translocations, Fusions

[Figure labels: Nuc, Chr]

Slide4

SVs contribute to phenotype

Picture from Seo et al. 2007. BMC Genetics

Picture from Wright et al. 2009. PLoS Genet.

[Figure labels: SOX5, KIT, ASIP; Nuc, Chr]

Slide5

Tracking variants with LD

In Linkage Disequilibrium

Form a Haplotype

A causative variant?

[Figure labels: Nuc, Chr]

Slide6

Underestimating the number of genetic variants

Disease variants

Or

New high productive QTV

Mostly impute these

Still some issues:

Relatively new mutations

Ref. assembly errors

Difficult-to-sequence locations

Can we get biological function from imputation?

Slide7

Sequencing to find novel variants

Raw Data

ACGTAAAGGGACG

GTAAAGGTACGAC

GTAAAGGTACGAC

GGGACGACGATCGA

Aligned to the reference (the base G that differs from the reference is set apart on the slide):

REF: ACGTAAAGGTACGACGATCGACG
     ACGTAAAGGGACG
       GTAAAGGTACGAC
       GTAAAGGTACGAC
            GGGACGACGATCGA

Slide8
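The alignment on the "Sequencing to find novel variants" slide can be read as a pileup: count the bases that each read places over every reference position and report positions where enough reads disagree with the reference. The following is a minimal sketch of that idea, not the caller used in the talk. The read offsets, the function name `call_snps`, and the `min_alt_reads` threshold are assumptions made for illustration; the offsets were chosen so that each read from the slide lines up with the reference.

```python
from collections import Counter

REFERENCE = "ACGTAAAGGTACGACGATCGACG"

# (offset, read) pairs. The reads come from the slide; the offsets are assumed.
ALIGNED_READS = [
    (0, "ACGTAAAGGGACG"),
    (2, "GTAAAGGTACGAC"),
    (2, "GTAAAGGTACGAC"),
    (7, "GGGACGACGATCGA"),
]


def call_snps(reference, aligned_reads, min_alt_reads=2):
    """Return {position: (ref_base, alt_base, alt_count, depth)} for mismatches."""
    # One base counter per reference position.
    pileup = [Counter() for _ in reference]
    for offset, read in aligned_reads:
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1

    calls = {}
    for pos, counts in enumerate(pileup):
        ref_base = reference[pos]
        alt_counts = {b: n for b, n in counts.items() if b != ref_base}
        if not alt_counts:
            continue
        # Keep the most frequent non-reference base at this position.
        alt_base, alt_count = max(alt_counts.items(), key=lambda kv: kv[1])
        if alt_count >= min_alt_reads:
            calls[pos] = (ref_base, alt_base, alt_count, sum(counts.values()))
    return calls


print(call_snps(REFERENCE, ALIGNED_READS))
```

With these four reads, two support G and two support the reference T at 0-based position 9, which is the pattern a heterozygous site would produce.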

Making use of new variants

What do you do after variant calling?

Check for functional impact

Check for frequency

Variants can be placed on chips

Deletions can be tracked

WildType: ACGACTAGACGATGGACGA
Deletion: ACGACTA     TGGACGA

[Figure labels: ACGACTA TGGACGA; ACGACTA G; ACGACTAT]

Slide9

Presentation Outline

Variant classification and detection

Theory on read structure and bias

Simulations and real data

Slide10

Structural variant detection still proves to be a challenge

No Perfect Calls

Most datasets have high FDR

A majority of variants are missed

Good Performance

Pretty conservative

Moderate Performance

Too many False Positives

Poor performance

Slide11

Understanding the sequencing process

DNA sheared to fragments

Fragments follow a size distribution

Fragments sequenced from both sides

We don’t know the middle, but we know the size!

Slide12
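Because fragments follow a size distribution, a read pair whose mapped span falls far outside that distribution is a candidate signal for a structural variant. The sketch below estimates cutoffs from observed insert sizes using the median and the median absolute deviation (MAD), which are less sensitive to outliers than the mean and standard deviation. This is one common choice and an assumption here; the talk does not state how thresholds are set. The function names and the `n_mads` parameter are hypothetical.

```python
import statistics


def insert_size_cutoffs(insert_sizes, n_mads=3.0):
    """Return (lower, upper) cutoffs around the median insert size."""
    median = statistics.median(insert_sizes)
    deviations = [abs(size - median) for size in insert_sizes]
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # when the underlying distribution is roughly normal.
    scaled_mad = 1.4826 * statistics.median(deviations)
    return median - n_mads * scaled_mad, median + n_mads * scaled_mad


def flag_discordant(insert_sizes, n_mads=3.0):
    """Return the insert sizes that fall outside the estimated cutoffs."""
    lower, upper = insert_size_cutoffs(insert_sizes, n_mads)
    return [size for size in insert_sizes if size < lower or size > upper]


# Made-up insert sizes for illustration: seven typical pairs and one outlier.
sizes = [480, 490, 500, 510, 520, 495, 505, 2500]
print(insert_size_cutoffs(sizes))
print(flag_discordant(sizes))
```

In practice the cutoffs would be estimated from a large sample of properly paired reads rather than from a handful of values.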

Using information from alignments

Deletions

Reference Genome

ACGAGATAGTAGATACCATAGACG

What the alignment should look like:

ACGAGATAGT    ACCATAGACG

[Figure labels: 7bp, 3bp]

Insertions

Reference Genome

ACGAGATAGCCATAGACG

What the alignment should look like:

ACGAGATAG GG CCATAGACG

[Figure labels: 3bp, 1bp]

Slide13

Using information from alignments

Tandem Duplication

Reference Genome

CGATAGACGAC GGAGAGAGATAG ACCCAGATAA

What the alignment should look like:

CGATAGACGAC GGAGAGAGATAG GGAGAGAGATAG ACCCAGATAA

Slide14
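The deletion, insertion, and tandem duplication pictures on the two "Using information from alignments" slides each leave a different read-pair signature. For a standard paired-end library, a deletion makes the pair map farther apart than expected, an insertion makes it map closer together, and a tandem duplication makes the pair map in an everted orientation (leftmost read on the reverse strand, rightmost on the forward strand). The sketch below encodes those rules. The `ReadPair` class, the function name, and the cutoff values are hypothetical, and a real caller would also consider mapping quality and the number of supporting pairs.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    left_pos: int       # leftmost mapped coordinate of the pair
    right_pos: int      # rightmost mapped coordinate of the pair
    left_strand: str    # "+" or "-" for the leftmost read
    right_strand: str   # "+" or "-" for the rightmost read


def classify_pair(pair, lower, upper):
    """Classify one read pair given insert-size cutoffs (lower, upper)."""
    span = pair.right_pos - pair.left_pos
    if pair.left_strand == "-" and pair.right_strand == "+":
        return "tandem_duplication"   # everted pair
    if pair.left_strand == pair.right_strand:
        return "other"                # same-strand pairs are not handled here
    if span > upper:
        return "deletion"             # pair maps too far apart
    if span < lower:
        return "insertion"            # pair maps too close together
    return "concordant"


# Hypothetical cutoffs and pairs for illustration.
LOWER, UPPER = 400, 600
print(classify_pair(ReadPair(1000, 1500, "+", "-"), LOWER, UPPER))
print(classify_pair(ReadPair(1000, 3500, "+", "-"), LOWER, UPPER))
print(classify_pair(ReadPair(1000, 1200, "+", "-"), LOWER, UPPER))
print(classify_pair(ReadPair(1000, 1500, "-", "+"), LOWER, UPPER))
```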

Getting more from your reads

Reference Genome: ACGACGAGGGTGTGATTGACGATCGATA

Aligned Read: CGACGA

Unaligned Read: TTGCGA

Split Read: TTG + CGA

Deletion call: ACGACGAGGGTGTGATTG     CGATA

Slide15
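The split-read idea from "Getting more from your reads" can be illustrated with exact string matching: extend the read along the reference from a known anchor until it stops matching, then look for the remaining piece of the read farther downstream. The gap between the two pieces is the candidate deletion. This is a toy sketch under strong assumptions (no sequencing errors, anchor position already known from the aligner) and not the method implemented in RPSR. The function name, the `min_piece` parameter, and the 12-base read are made up for illustration; the read is longer than the 6-base read on the slide because a very short piece can match the reference in more than one place.

```python
def call_deletion_from_split_read(reference, read, anchor_pos, min_piece=3):
    """Return (del_start, del_end, deleted_sequence) or None.

    Coordinates are 0-based and del_end is exclusive.
    """
    # Extend the read along the reference while the bases agree.
    matched = 0
    while (matched < len(read)
           and anchor_pos + matched < len(reference)
           and reference[anchor_pos + matched] == read[matched]):
        matched += 1

    # Require both pieces of the split to be long enough to be meaningful.
    if matched < min_piece or len(read) - matched < min_piece:
        return None

    del_start = anchor_pos + matched
    suffix_pos = reference.find(read[matched:], del_start)
    if suffix_pos == -1:
        return None
    return del_start, suffix_pos, reference[del_start:suffix_pos]


reference = "ACGACGAGGGTGTGATTGACGATCGATA"
read = "GTGATTGCGATA"   # hypothetical read spanning the deletion on the slide
print(call_deletion_from_split_read(reference, read, anchor_pos=11))
```

Removing the reported bases from the reference reproduces the deletion call shown on the slide. Greedy extension can overshoot the true breakpoint when the deleted sequence begins with the same bases as the sequence that follows it, which is one reason real callers report breakpoint uncertainty.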

Overcoming read biases and creating useful information

Chemistry problems confuse detection

Alignment issues occur

Use clustering algorithm

[Figure labels: Chimeric Read, Repeat, Repeat]

Slide16
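One way to act on "Use clustering algorithm" is to group nearby breakpoint evidence and keep only groups with enough support, so that isolated chimeric reads or repeat-induced misalignments are discarded. The talk does not say which clustering algorithm RPSR uses, so the sketch below uses simple single-linkage clustering along one chromosome as a stand-in. The function name and the `max_gap` and `min_support` parameters are hypothetical.

```python
def cluster_breakpoints(positions, max_gap=50, min_support=3):
    """Group sorted positions; return (start, end, support) for supported clusters."""
    clusters = []
    current = []
    for pos in sorted(positions):
        # Start a new cluster when the gap to the previous position is too large.
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    # Drop clusters that are supported by too few observations.
    return [(c[0], c[-1], len(c)) for c in clusters if len(c) >= min_support]


# Made-up breakpoint positions: two supported sites and one isolated read.
positions = [1000, 1010, 1025, 5000, 9000, 9020, 9030, 9045]
print(cluster_breakpoints(positions))
```

Raising `min_support` trades sensitivity for fewer false positives, which matches the tunable behaviour described in the talk.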

Ease of Use

Designed to process BWA-aligned BAM files

Scalable to system resources

Multi-threaded

Tunable to reduce false positives

Slide17

RPSR: Read Pair, Split Read

Written in Java (version 8)

Currently two modes:

Preprocess

Cluster

Uses map-reduce paradigms for easy threading

Slide18
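The map-reduce pattern mentioned for RPSR splits work into independent chunks (map) and then combines the partial results (reduce), which makes it straightforward to spread chunks across threads. RPSR itself is written in Java; the sketch below only illustrates the pattern in Python and does not reflect RPSR's code. The chunk layout, the `UPPER_CUTOFF` value, and the function names are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

UPPER_CUTOFF = 1000  # hypothetical insert-size cutoff


def map_chunk(chunk):
    """Map step: count discordant pairs per chromosome within one chunk."""
    return Counter(chrom for chrom, size in chunk if size > UPPER_CUTOFF)


def map_reduce(chunks, workers=4):
    """Run the map step on each chunk in a thread pool, then sum the counters."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the per-chunk counters into one.
    return reduce(lambda a, b: a + b, partials, Counter())


# Made-up (chromosome, insert size) records split into three chunks.
chunks = [
    [("chr29", 500), ("chr29", 2500)],
    [("chr29", 3000), ("chr1", 1500)],
    [("chr1", 400)],
]
print(map_reduce(chunks))
```

Because each chunk is processed independently, the number of workers can be matched to the available system resources without changing the result.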

Presentation Outline

Variant classification and detection

Theory on read structure and bias

Simulations and real data

Slide19

Simulation dataset

Started with Cattle chr29

51 megabases

Acrocentric

Synthetic 10X coverage

Variants per simulated chromosome

Variant Type    Avg. Count    Avg. size
Deletion        12.3          350 bp
Tandem Dup      11.8          350 bp

Slide20

Comparison Program

Delly / Duppy

Rausch et al. 2012. Bioinformatics

Combined read pair, split read caller

Discovers discordant reads, then uses split reads to validate

Run with default settings and in “split read” mode

Slide21

Program results: Simulation

[Figure: panels for Duplications and Deletions; series: True Positives, Actual Positives, Total Calls]

Slide22

Program results: Simulation

RPSR vs Delly/Duppy precision

RPSR is far more precise than Delly / Duppy

Program               Precision
RPSR Dels             8.7%
Delly Dels            0.9%
RPSR Tandem           73.7%
Duppy Duplications    1.8%

Slide23
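Precision in the simulation table is the fraction of a program's calls that correspond to a simulated variant (true positives divided by total calls). The talk does not state how a call was matched to a simulated variant, so the sketch below assumes a 50% reciprocal-overlap rule, which is a common convention for structural variant benchmarks. The function names, the `min_overlap` parameter, and the intervals are made up for illustration.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions between intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def precision(calls, truth, min_overlap=0.5):
    """Fraction of calls that match at least one true variant."""
    if not calls:
        return 0.0
    true_positives = sum(
        1 for call in calls
        if any(reciprocal_overlap(call, t) >= min_overlap for t in truth)
    )
    return true_positives / len(calls)


# Made-up (start, end) intervals on one chromosome.
truth = [(1000, 1350), (5000, 5350)]
calls = [(1010, 1340), (5200, 5900), (9000, 9350), (20000, 20400)]
print(precision(calls, truth))
```

Here only the first call overlaps a simulated variant by at least half of both intervals, so one of the four calls counts as a true positive.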

Real Dataset: Angus Individual

Provided by Jerry Taylor and Bob Schnabel

Sequence statistics

20X coverage of Illumina reads

Reads quality trimmed

Aligned with BWA to UMD3.1

Slide24

Program results: Angus

RPSR

Variant type    Total Calls    Avg. Size (bp)    Total Length    Largest call
Deletions       4171           237               991 kb          43.7 kb
Duplications    9617           554               5,335 kb        152.0 kb

Delly/Duppy

Variant type    Total Calls    Avg. Size (bp)    Total Length    Largest call
Deletions       1867           1,304,683         2.4 Gb          149 Mb
Duplications    10263          232,472           2.38 Gb         150 Mb

Slide25

Conclusions

Structural variants are a type of mutation we can track inexpensively

Accurate assessment of SVs is needed

Future developments for RPSR

Smaller resource footprint

Automatic threshold detection

Slide26

Acknowledgements

Colleagues at the USDA

George Liu

Tad Sonstegard

Curt Van Tassell

AGIL

Jerry Taylor

Bob Schnabel

The Reecy Lab

Projects mentioned in this presentation were funded in part by NRI grant numbers 2007-35205-17869 and 2011-67015-30183, and by USDA NIFA grant number 2013-00831

Slide27

Questions?

Slide28

Delly/Duppy

After removing the initial calls greater than 1 megabase and then merging:

Variant type    Total Calls    Avg. Size (bp)    Total Length    Largest call
Deletions       1833           5219              9.6 Mb          885 kb
Duplications    97407          1217              118 Mb          2.3 Mb

Slide29
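The post-processing applied to the Delly/Duppy calls (drop calls larger than 1 megabase, then merge) can be sketched as an interval filter followed by an overlap merge. The talk does not give the merge rule, so this sketch assumes that calls on the same chromosome are merged whenever they overlap. The function name and the example intervals are hypothetical.

```python
def filter_and_merge(calls, max_size=1_000_000):
    """Drop (start, end) calls longer than max_size, then merge overlapping calls."""
    kept = sorted((start, end) for start, end in calls if end - start <= max_size)
    merged = []
    for start, end in kept:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up calls on a single chromosome; the last one exceeds 1 Mb and is removed.
example_calls = [(100, 600), (500, 900), (2000, 2500), (0, 5_000_000)]
print(filter_and_merge(example_calls))
```

Because the size filter runs before the merge, a merged interval can end up longer than the cutoff, which may be how a largest call above 1 Mb remains in the table.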

Pipeline: Resource Consumption