Slide1
Aligning Reads
Ramesh Hariharan
Strand Life Sciences
IISc
Slide2
What is Read Alignment?
Slide3
Subject’s Genome: AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC
Reference Genome: AGGCTACGCAT[G]TCCCATAA[T]GACCCAC[A]CTTAAGTTC
Close but not quite the same as the Subject’s Genome
Where do these match in the Reference?
Slide4
What does “Match” mean?
Slide5
Reference Genome: AGGCTACGCAT[G]TCCCATAA[T]GACCCAC[A]CTTAAGTTC
Exact Match: GCTACGCA
With Mismatches: CATAA[A]GAC
With Gaps: CACTT_AGT
Slide6
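The difference between an exact match and a match with mismatches can be sketched in a few lines of Python. This is only an illustration: the function name `find_with_mismatches` is hypothetical, the reference string is the one from the slide with its highlighted bases written inline, and gaps are not handled here because they need an alignment step rather than a position-by-position comparison.

```python
def find_with_mismatches(reference, read, max_mismatches):
    """Return (position, mismatches) for every window within the mismatch budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        # Count positions where the read and the reference window disagree.
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Reference genome from the slide (0-based positions).
REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"

exact_hits = find_with_mismatches(REFERENCE, "GCTACGCA", 0)
one_off_hits = find_with_mismatches(REFERENCE, "CATAAAGAC", 1)
```

With a budget of 0 the function reduces to exact matching; raising the budget lets the read CATAAAGAC land where the reference reads CATAATGAC.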
Why mismatches and gaps?
Slide7
The subject genome could be different from the reference
Slide8
[Figure: Reads aligned against the Reference Genome, with a SNP and a Deletion marked]
Mismatches and Gaps
Slide9
The reading process could be erroneous
Slide10
How many mismatches and gaps?
Slide11
Short reads (~50 bases): few mismatches and gaps
Long reads (~1000 bases): many more mismatches and gaps
Slide12
How do aligners fare?
Slide13
BWA: Very few mismatches and gaps
CoBWeb
BWA-SW: Many mismatches and gaps
BowTie: only mismatches, no gaps
No paired read handling
No handling of adaptor trimming for small RNA
Separate handling for RNASeq
BowTie2
Slide14
How does an Aligner work?
Slide15
For simplicity, assume Exact Match
Slide16
For each read, scan the entire reference genome sequence
SLOW!!!!
Slide17
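The brute-force approach, scanning the whole reference once per read, can be sketched as follows. The names `scan_for_read`, `align_all` and `windows_examined` are hypothetical, and the reference is the short example sequence from the earlier slide. The point of the sketch is the cost: every read examines roughly one window per reference position, so the work grows with (number of reads) x (genome length).

```python
def scan_for_read(reference, read):
    """Return every 0-based position where the read occurs exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def align_all(reference, reads):
    """Scan the full reference once per read (no index)."""
    return {read: scan_for_read(reference, read) for read in reads}


def windows_examined(reference_length, read_length, number_of_reads):
    """Rough count of candidate windows a full scan has to look at."""
    return number_of_reads * (reference_length - read_length + 1)


REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
placements = align_all(REFERENCE, ["GCTACGCA", "CCC", "TTTT"])
```

For a toy reference this is instant; the window count is what makes the same loop impractical for a whole genome and millions of reads.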
[Figure: The Reference and a tree index built over it]
Index the Reference
Slide18
How can we find Exact Matches of a read quickly with this index?
Slide19
[Figure: The Reference and its tree index, with the letters C, G, C followed through the index]
Slide20
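One way to picture exact matching with a tree index is a suffix trie: every suffix of the reference is inserted letter by letter, and a read is found by walking down from the root. This is a minimal sketch under that assumption, not necessarily the exact tree drawn on the slide; `build_suffix_trie` and `find_exact` are hypothetical names.

```python
def new_node():
    # Each node keeps its children and the start positions of all
    # suffixes that pass through it.
    return {"children": {}, "starts": []}


def build_suffix_trie(reference):
    root = new_node()
    for start in range(len(reference)):
        node = root
        for base in reference[start:]:
            node = node["children"].setdefault(base, new_node())
            node["starts"].append(start)
    return root


def find_exact(root, read):
    """Walk one edge per read letter; the work depends on the read length only."""
    node = root
    for base in read:
        node = node["children"].get(base)
        if node is None:
            return []
    return node["starts"]


REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
trie = build_suffix_trie(REFERENCE)
```

The lookup no longer scans the reference, but this naive trie stores a node for every letter of every suffix, which is why memory becomes the concern.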
The problem: 24GB
Slide21
Can this structure be compressed?
Slide22
The Reference: CGAC$
All its circular shifts, sorted lexicographically:
AC$CG
CGAC$
C$CGA
GAC$C
$CGAC
This column is the BWT (the last column of the sorted shifts)
The Index: now an array instead of a tree
The Burrows-Wheeler based Index
Sampled to reduce memory at the expense of speed (Ferragina and Manzini)
Slide23
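The construction on this slide can be sketched directly: list all circular shifts of the reference, sort them, and read off the last column. The function names are hypothetical. Note one assumption: Python's default string order places `$` before the letters, so the rows may come out in a different order than on the slide; the idea is the same under either convention.

```python
def circular_shifts(text):
    """All rotations of text, one per starting position."""
    return [text[i:] + text[:i] for i in range(len(text))]


def burrows_wheeler(text):
    """text must end with a unique '$' sentinel."""
    rows = sorted(circular_shifts(text))
    # The BWT is the last column of the sorted rotations.
    bwt = "".join(row[-1] for row in rows)
    # Start position in the original text of each sorted rotation:
    # an array that plays the role the tree played before.
    starts = [len(text) - 1 - row.index("$") for row in rows]
    return rows, bwt, starts


rows, bwt, starts = burrows_wheeler("CGAC$")
```

This sketch keeps the full array of start positions. An index that is sampled, as the slide describes, would store only some of these entries and recompute the others when needed, trading speed for memory.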
How about Mismatches and Gaps?
Slide24
BWA, BWA-SW and BowTie force mismatches and gaps into the BW Index searching procedure
Slide25
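Folding mismatches into the index search can be sketched as a backward search that, at each step, also tries the letters that disagree with the read and charges them against a mismatch budget. This is an illustrative sketch only, not the actual code of BWA, BWA-SW or BowTie: the names are hypothetical, the occurrence table is stored in full rather than sampled, and gaps are left out.

```python
def build_fm_index(text):
    """text must end with a unique '$' sentinel."""
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that sort before c.
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] for ch in alphabet}
    for seen in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (1 if ch == seen else 0))
    return suffix_array, first, occ


def search_with_mismatches(read, index, max_mismatches):
    """Return sorted start positions matching the read within the budget."""
    suffix_array, first, occ = index
    hits = set()

    def extend(pos, lo, hi, budget):
        if lo >= hi:
            return  # no suffix of the reference matches this branch
        if pos < 0:
            hits.update(suffix_array[lo:hi])
            return
        for ch in first:
            if ch == "$":
                continue
            cost = 0 if ch == read[pos] else 1
            if cost <= budget:
                extend(pos - 1,
                       first[ch] + occ[ch][lo],
                       first[ch] + occ[ch][hi],
                       budget - cost)

    # The read is consumed from its last letter to its first.
    extend(len(read) - 1, 0, len(suffix_array), max_mismatches)
    return sorted(hits)


REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
index = build_fm_index(REFERENCE + "$")
```

With a budget of 0 this is plain exact search; each extra allowed mismatch multiplies the branches explored, which is why this style of search suits reads with few differences.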
CoBWeb uses the BW Index to find a ‘seed’ exact match and does Smith-Waterman around this seed
This 15-mer occurs at locations x1, x2…
This 15-mer occurs at locations x3, x4…
This whole 30-mer occurs at location x5
Slide26
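The seeding step can be sketched as follows: cut the read into fixed-length pieces, look each piece up exactly, and convert every hit into the read start position it implies on the reference. This is a hedged illustration, not CoBWeb's implementation. A plain dictionary of k-mers stands in for the BW Index, the function names are hypothetical, and seeds of length 5 are used instead of 15 so the short example reference from the earlier slides is enough.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer to the positions where it occurs (stand-in for the BW Index)."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def candidate_locations(read, index, k):
    """Map each implied read start on the reference to the number of seeds supporting it."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    return dict(votes)


REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
kmer_index = build_kmer_index(REFERENCE, 5)
# A read taken from the subject's genome; it differs from the reference by one base.
candidates = candidate_locations("CCCATAAAGACCCAC", kmer_index, 5)
```

The middle seed contains the mismatch and finds nothing, but the two outer seeds agree on the same location, which is the place a Smith-Waterman step would then examine.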
Dynamic Programming
Given a location in the reference with a read anchor, how well does the read match here?
[Figure: Reference and Read aligned at an Anchor 14-mer]
Smith-Waterman (optimized for large gaps)
Slide27
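The dynamic programming step can be sketched with a textbook Smith-Waterman local alignment over a reference window and a read. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, and the gap penalty is a simple per-base one; the variant optimized for large gaps mentioned on the slide is not reproduced here.

```python
def smith_waterman(reference, read, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned reference, aligned read) for a local alignment."""
    rows, cols = len(read) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,  # match or mismatch
                              score[i - 1][j] + gap,       # gap in the reference
                              score[i][j - 1] + gap)       # gap in the read
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_cell
    aligned_ref, aligned_read = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if read[i - 1] == reference[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            aligned_ref.append(reference[j - 1])
            aligned_read.append(read[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append("_")
            aligned_read.append(read[i - 1])
            i -= 1
        else:
            aligned_ref.append(reference[j - 1])
            aligned_read.append("_")
            j -= 1
    return best, "".join(reversed(aligned_ref)), "".join(reversed(aligned_read))


# The gapped example from the earlier slide: the read lacks one A.
best_score, ref_row, read_row = smith_waterman("CACTTAAGT", "CACTTAGT")
```

In a seed-and-extend aligner the reference argument would be a window around the anchor rather than the whole genome, which keeps the quadratic table small.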
Comparison with BWA
Read Length 50
Read Length 150
20% faster than BWA with comparable results
CoBWeb: 3 mismatches and 2 gaps
BWA: 2 mismatches + 1 gap, possibly several bases long
Slide28
Comparison with BWA-SW
Read Length 400
8 mismatches plus 10 gaps
                     CoBWeb    BWA-SW
Reads                1m        1m
Time taken           1130s     2242s
Incorrectly Mapped   125989819
5650 mapped incorrectly by BWA-SW
The remainder has poor BWA mapping quality
Slide29
Avadis NGS
Slide30
Avadis NGS
Alignment, DNA Var Detection, RNASeq, ChIPSeq, Small RNASeq
Slide31
Thank You