Finding Deletions with Exact Break Points from Noisy Low Coverage Paired-end Short Sequence Reads
Jin Zhang and Yufeng Wu
Department of Computer Science and Engineering
University of Connecticut
Slide 2: Introduction
Structural variants (figure: reference vs. alternative sequence, with a deletion and an insertion)
Data (from 1000 Genomes Project pilot 1): low coverage (2-6x); Illumina (2009_08); 45 individuals in the CEU population; 580 G combined, in BAM format
Both ends mapped: pairs with abnormal insertion size
Only one end mapped: 24 G in fastq format, 250 million reads; these go to split-read mapping
Sequence errors and other errors (more than 99%): reads in regions without a deletion
Real split-reads (these also contain errors): deletions with exact break points, found efficiently
Our problem: finding exact break points of deletions efficiently from noisy, low-coverage data
Abnormal insertion size: larger than the mean insertion size + 3 sds
Figure (labels): reference, alternative, deletion, anchor
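The abnormal-insertion-size rule (mean insertion size + 3 sds) can be sketched as below. This is a minimal illustration, not the authors' code: the function names and the sample insert sizes are hypothetical, and estimating the mean and standard deviation from a list of normally mapped pairs is an assumption.

```python
from statistics import mean, stdev

def abnormal_threshold(insert_sizes, n_sds=3):
    """Return mean + n_sds * sample standard deviation of the insert sizes."""
    return mean(insert_sizes) + n_sds * stdev(insert_sizes)

def is_abnormal(insert_size, threshold):
    """A pair is flagged when its insert size exceeds the threshold."""
    return insert_size > threshold

# Hypothetical insert sizes (bp) from pairs with both ends mapped.
library = [195, 200, 205, 198, 202, 200, 197, 203]
threshold = abnormal_threshold(library)

print(is_abnormal(1200, threshold))  # a pair stretched by a large deletion
print(is_abnormal(205, threshold))   # a pair within the normal range
```

A pair whose ends map much farther apart on the reference than the library allows is the kind of signal that may point to a deletion between the two ends.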
Slide 3: Method
Figure (labels): reference, alternative, deletion, anchor, split-read, spanning pair (abnormal insertion size)
We map split-reads using the Burrows-Wheeler Transform (BWT)
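As a rough illustration of BWT-based mapping, the sketch below builds a tiny FM-index and runs exact backward search. It is a toy under stated assumptions: the slides describe inexact mapping, which this does not implement, the suffix array is built naively, and all function names are hypothetical. The reference string is the one used in the candidate-calling example of this deck.

```python
def build_fm_index(text):
    """Build suffix array, C table and Occ table for text (naive construction)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the sentinel
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ

def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

reference = "TACGTTTAACCATACGGCCAAAACGTAACGT"
index = build_fm_index(reference)
print(backward_search("TTAACCAT", *index))  # a unique hit
print(backward_search("ACGT", *index))      # several hits: not unique
```

Short patterns such as `ACGT` hit several positions, which is one reason hits on a whole genome are often not unique.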
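Inexact mapping
reference:  CACAAT A CCCTCTCACACCAACGT T ACG
split-read: CAAT CCCTC / ACGT A ACG
indel: the read has no base where the reference has A after CAAT
mismatch: the read has A where the reference has T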
SVs near SNPs and indels can be found; reads with errors can be used
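One simple way to picture "allowing mismatches and indels" is an edit-distance check between a read segment and a reference segment. The sketch below is an illustration only, not the authors' BWT-based search; the function names and the `max_errors` tolerance are assumptions. The sequences are the two halves of the split-read in the inexact-mapping example.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]

def matches_inexactly(read_part, ref_part, max_errors=1):
    """Accept the alignment if it needs at most max_errors edits (assumed tolerance)."""
    return edit_distance(read_part, ref_part) <= max_errors

# Left half of the split-read: one base missing relative to the reference (indel).
print(edit_distance("CAATCCCTC", "CAATACCCTC"))
# Right half of the split-read: one substituted base (mismatch).
print(edit_distance("ACGTAACG", "ACGTTACG"))
```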
Hits not unique (e.g., a hit at 2 positions on the reference), or split not unique
We pre-build local BWTs of length 102k (2k for overlap) on each strand.
The anchor decides which BWT to search.
Search locally
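The choice of a local BWT from the anchor might look like the sketch below. It assumes that "102k (2k for overlap)" means windows of 102,000 bp that start every 100,000 bp, so consecutive windows share 2,000 bp; the slides do not spell this out, and the constant and function names are hypothetical.

```python
STEP = 100_000    # assumed distance between consecutive window starts
OVERLAP = 2_000   # assumed extra bases shared with the next window

def local_window(anchor_pos):
    """Return (index, start, end) of the window whose start is at or before anchor_pos."""
    index = anchor_pos // STEP
    start = index * STEP
    return index, start, start + STEP + OVERLAP

# A position just before a window boundary still has 2,000 bp of room after the boundary.
print(local_window(199_950))
```

Under this assumption, a short read that starts near the end of one 100k block still fits inside a single window, so it never has to be searched across two indexes.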
Figure (labels): reference, alternative, deletion, anchor, split-read
Report the hits and splits with the best quality
Efficiency: e.g., search only a nearby region of 1 Mb
Slide 4: Method
Calling candidate deletions
reference:  TACGTTTAACCATACGGCCAAAACGTAACGT
split-read: TTAACCATACGTAACGT
split as TTAACCAT | ACGTAACGT (leftmost) or TTAACCATACG | TAACGT (rightmost)
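The leftmost and rightmost splits in this example can be reproduced by enumerating every way to cut the read into a prefix and a suffix that both match the reference exactly, with a gap between them. This is a brute-force sketch for illustration, not the authors' BWT search; the function names are hypothetical and exact matching is an assumption.

```python
def find_all(text, pattern, start=0):
    """Return every start position of pattern in text at or after start."""
    positions = []
    i = text.find(pattern, start)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions

def enumerate_splits(reference, read):
    """Return sorted (left_break, right_start) pairs; reference[left_break:right_start] is deleted."""
    splits = set()
    for k in range(1, len(read)):
        prefix, suffix = read[:k], read[k:]
        for p in find_all(reference, prefix):
            left_break = p + k  # first reference position not covered by the prefix
            # The suffix must start after a gap of at least one base.
            for q in find_all(reference, suffix, left_break + 1):
                splits.add((left_break, q))
    return sorted(splits)

reference = "TACGTTTAACCATACGGCCAAAACGTAACGT"
read = "TTAACCATACGTAACGT"
splits = enumerate_splits(reference, read)
leftmost, rightmost = splits[0], splits[-1]
print(splits)
print(reference[leftmost[0]:leftmost[1]], reference[rightmost[0]:rightmost[1]])
```

Every split removes 9 bases; the break point can slide because the bases ACG appear on both sides of the deletion, which is why a fixed convention such as the leftmost break point is useful.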
(1) Sort split-reads by leftmost break point
(2) Cluster the split-reads supporting the same candidate
(3) Call candidates
Cutoff value: the minimum number of split-reads that must support a candidate
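The three steps and the cutoff can be sketched as follows. This is a minimal illustration, assuming that two split-reads support the same candidate when their leftmost break point and right start coincide exactly; the record layout, read identifiers, and function name are hypothetical.

```python
from itertools import groupby

def call_candidates(split_reads, cutoff=2):
    """split_reads: iterable of (read_id, leftmost_break, right_start) tuples."""
    def key(record):
        return (record[1], record[2])

    ordered = sorted(split_reads, key=key)                 # (1) sort by leftmost break point
    candidates = []
    for (left, right), group in groupby(ordered, key=key):  # (2) cluster identical splits
        support = len(list(group))
        if support >= cutoff:                              # (3) call if enough support
            candidates.append({"left": left, "right": right, "support": support})
    return candidates

reads = [("r1", 13, 22), ("r3", 500, 900), ("r2", 13, 22)]
print(call_candidates(reads, cutoff=2))
print(call_candidates(reads, cutoff=1))
```

A higher cutoff trades sensitivity for fewer spurious candidates, which matters at low coverage where few reads cross any given break point.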
Slide 5: Method
Candidate validation (calling deletions)
Notice that the candidates come from split-read mapping.
Figure (labels): reference, candidate, alternative, deletion, abnormal insertion size
Has spanning pair:
  If there exist spanning pairs with abnormal insertion size (which may be caused by a deletion), we validate the deletion.
  Otherwise the split-read is not mapped correctly or contains errors, and there is no deletion.
No spanning pair (not enough information, so we can't validate):
  Low coverage may leave no spanning pair even though there is a deletion.
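The validation logic can be written as a three-way decision. The sketch below is one reading of the slide, not the authors' implementation: it assumes a pair "spans" a candidate when its two ends lie on either side of the deleted interval, takes the insert size as the distance between the two mapped positions, and treats a spanning pair with normal insert size as evidence against the candidate. All names are hypothetical.

```python
def validate_candidate(left, right, pairs, threshold):
    """pairs: list of (end1_pos, end2_pos) with end1_pos < end2_pos, both ends mapped.

    reference[left:right] is the candidate deletion; threshold is the
    abnormal-insertion-size cutoff (e.g., mean + 3 sds).
    """
    spanning = [(a, b) for a, b in pairs if a < left and b >= right]
    if not spanning:
        return "not validated"   # not enough information
    if any(b - a > threshold for a, b in spanning):
        return "deletion"        # abnormal insertion size supports the candidate
    return "no deletion"         # spanning pairs look normal

print(validate_candidate(1000, 1500, [(900, 1600)], threshold=210))
print(validate_candidate(1000, 1500, [(100, 300)], threshold=210))
print(validate_candidate(1000, 1020, [(900, 1100)], threshold=210))
```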
Slide 6: Results
Benchmarks (Benchmark 1 and Benchmark 2)
The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing, Nature, 467, 1061-1073.
Mills, R.E., et al. (2011) Mapping copy number variation by population-scale genome sequencing, Nature, 470, 59-65.
We run our tests with the 1000 Genomes Project releases as benchmarks; the deletions in these releases were found by multiple methods.
Data: low coverage (can be as low as 2x); Illumina (2009_08); 45 individuals in the CEU population; 580 G combined, in BAM format
Comparison
Pindel v1: exact match; max event size 1 Mb
Pindel v2: with mismatch; max event size 8092 bp
Ye, K., et al. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, Bioinformatics, 25, 2865-2871.
Slide 7: Results
Vs. Pindel v1 (max event size 1 Mb)
What is the percentage of true deletions? We use the 1000 Genomes Project releases as benchmarks.
Figure panels (x-axis: chromosomes): total number of called deletions; true positives with precise deletions in benchmark 1; true positives with precise deletions in benchmark 2
Against the benchmarks, our method with cutoff 2 has more true positives and fewer false positives.
There might be deletions found by our method that are not in the benchmarks.
Slide 8: Results
Vs. Pindel v2 (max event size 8092 bp)
Figure panels (x-axis: chromosomes): total number of deletions found; true positives with precise deletions in benchmark 1; true positives with precise deletions in benchmark 2
The comparison with v2 shows the same trend as the comparison with v1.
Slide 9: Results
Running time
Method       Inexact match               Threads   Hours
Pindel v1    Not allowed                 1         About 10
Pindel v2    Allows mismatch             20        30 (still running)
Our method   Allows mismatch and indel   1         About 3.5
Data: 45 individuals on chromosome 1
Running on our Xeon server with 24 CPUs
Finding SVs with a maximum event size of 1 Mb
An example of inexact mapping (P1_M_061510_1 9_22)
Experiments were run on workstations supported by NSF grant IIS-0916948
Research partly supported by NSF grant IIS-0953563