Slide1
High throughput sequencing: informatics & software aspects
Gabor T. Marth
Boston College Biology Department
BI543 Fall 2013
January 29, 2013
Slide2
Traditional DNA sequencing
Slide3
Genetics of living organisms
DNA
Chromosomes
Slide4
Radioactive label gel sequencing
Slide5
Four-color capillary sequencing
~1 Mb
~100 Mb
>100 Mb
~3,000 Mb
ABI 3700 four-color sequence trace
Slide6
Individual human resequencing
Slide7
Next-generation DNA sequencing
Slide8
New sequencing technologies…
Slide9
… vast throughput, many applications
Figure: read length (axis ticks 10 bp, 100 bp, 1,000 bp) versus bases per machine run (axis ticks 1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb, 100 Gb, 1 Tb); platform labels: ABI / capillary, 454, Illumina, SOLiD
Slide10
Sequencing chemistries
DNA ligation
DNA base extension
Church, 2005
Slide11
Template clonal amplification
Church, 2005
Slide12
Massively parallel sequencing
Church, 2005
Slide13
Chemistry of paired-end sequencing
Double-stranded DNA is folded into a bridge shape and then separated into single strands. The end of each strand is then sequenced.
(Figure courtesy of Illumina)
Slide14
Paired-end reads
fragment amplification: fragment length 100 - 600 bp
fragment length limited by amplification efficiency
circularization: fragment length 500 bp - 10 kb (sweet spot ~3 kb)
fragment length limited by library complexity
Korbel et al. Science 2007
Slide15
Features of NGS data
Short sequence reads
100-200bp
25-35bp (micro-reads)
Huge amount of sequence per run
Up to gigabases per run
Huge number of reads per run
Up to hundreds of millions
Higher error rate than Sanger sequencing
Error profile different from Sanger
Slide16
Application areas of next-gen sequencing
Slide17
Application areas
Genome resequencing
variant discovery
somatic mutation detection
mutational profiling
De novo assembly
Identification of protein-bound DNA
chromatin structure
methylation
transcription binding sites
RNA-Seq
expression
transcript discovery
Mikkelsen et al. Nature 2007
Cloonan et al. Nature Methods, 2008
Slide18
SNP and short-INDEL discovery
Slide19
Structural variation detection
structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
copy number (for amplifications, deletions) from depth of read coverage
Slide20
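For the read-depth approach to copy number named on the structural variation detection slide, a minimal sketch might look like the following. It assumes that mean read depth has already been computed in fixed windows, that the sample and the reference were sequenced to the same overall depth, and that the reference is diploid; the function name and the numbers are hypothetical and only illustrate the idea.

```python
def copy_number_from_depth(sample_depth, reference_depth, ploidy=2):
    """Estimate copy number per window from the sample/reference depth ratio.

    Assumes both depth lists use the same windows and the same overall
    sequencing depth (no normalization is done here).
    """
    estimates = []
    for s, r in zip(sample_depth, reference_depth):
        if r == 0:
            # No information where the reference has no coverage.
            estimates.append(None)
        else:
            estimates.append(round(ploidy * s / r))
    return estimates


# Hypothetical mean read depth in five consecutive windows.
sample = [30.0, 31.0, 15.0, 14.0, 61.0]
reference = [30.0, 30.0, 30.0, 30.0, 30.0]
print(copy_number_from_depth(sample, reference))  # prints [2, 2, 1, 1, 4]
```

Windows with an estimate of 1 would be deletion candidates and the window with an estimate of 4 an amplification candidate. Real callers also correct for GC content and mappability, which this sketch ignores.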
Identification of protein-bound DNA
genome sequence
aligned reads
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)
Transcription binding sites (Robertson et al. Nature Methods, 2007)
Slide21
Novel transcript discovery (genes)
Mortazavi et al. Nature Methods
novel exons
novel transcripts containing known exons
Slide22
Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
Slide23
Expression profiling
Figure panels, each showing aligned reads over a gene:
tag counting (e.g. SAGE, CAGE)
shotgun transcript sequencing
Jones-Rhoades et al. PLoS Genetics, 2007
Slide24
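The expression profiling slide reduces to counting aligned reads per gene. A minimal sketch of that counting step follows; the gene names, coordinates and read positions are made up, coordinates are assumed to be 1-based and inclusive, and each read is represented only by its chromosome and start position.

```python
from collections import Counter


def count_reads_per_gene(reads, genes):
    """Count aligned reads whose start position falls inside each gene.

    reads: iterable of (chrom, pos) tuples.
    genes: dict mapping gene name -> (chrom, start, end), inclusive.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in reads:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)


# Hypothetical annotation and read alignments.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 1000, 1800)}
reads = [("chr1", 120), ("chr1", 450), ("chr1", 1200),
         ("chr2", 130), ("chr1", 700)]
print(count_reads_per_gene(reads, genes))  # prints {'geneA': 2, 'geneB': 1}
```

For shotgun transcript sequencing the raw counts would normally be normalized by gene length and total read count before genes are compared; the nested loop is fine for a toy example but an interval index would be needed for real data.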
De novo genome sequencing
assembled sequence contigs
short reads
longer reads
read pairs
Lander et al. Nature 2001
Slide25
The informatics of sequencing
Slide26
Re-sequencing informatics pipeline
Figure labels: REF, IND
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation
Slide27
The variation discovery toolbox
base callers
read mappers
SNP callers
SV callers
assembly viewers
Slide28
Raw data processing / base calling
Trace extraction
Base calling
These steps are usually handled well by the machine manufacturers’ software
What most analysts want to see is base calls and well-calibrated base quality values
Slide29
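The base quality values mentioned on the base calling slide are conventionally reported on the Phred scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two and decodes a FASTQ quality string. It assumes the Phred+33 ASCII encoding; the function names and the example string are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality value."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred values.

    offset=33 assumes the Phred+33 encoding.
    """
    return [ord(ch) - offset for ch in quality_string]


quals = decode_fastq_qualities("II5+")
print(quals)  # prints [40, 40, 20, 10]
for q in quals:
    print(q, phred_to_error_prob(q))
```

A quality of 20 therefore corresponds to a 1 in 100 chance of a miscall, and a quality of 40 to 1 in 10,000. "Well calibrated" means that these stated probabilities match the observed error rates.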
Sequence traces are machine-specific
Base calling is increasingly left to machine manufacturers
Slide30
Read mapping…
…is like a jigsaw puzzle…
…where they give you the cover on the box
Slide31
Some pieces are easier to place than others…
…pieces with unique features
pieces that look like each other…
Slide32
Repeats: multiple mapping problem
Lander et al. 2001
Slide33
Paired-end (PE) reads
fragment length: 100 – 600 bp
fragment length: 1 – 10 kb
Korbel et al. Science 2007
PE reads are now the standard for whole-genome short-read sequencing
Slide34
Mapping quality values
Figure labels: 0.8, 0.19, 0.01
Slide35
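One way to read the three numbers on the mapping quality slide is as the probabilities that a read belongs at each of three candidate locations; that reading is an assumption, since only the figure labels survive. Under it, a Phred-scaled mapping quality for each placement is -10 log10 of the probability that the placement is wrong. The sketch below uses a hypothetical function name and an arbitrary cap of 60.

```python
import math


def mapping_qualities(placement_probs, cap=60):
    """Phred-scaled mapping quality for each candidate placement of a read.

    placement_probs: relative probabilities of the candidate locations
    (normalized here so they sum to 1).
    cap: arbitrary upper limit chosen for this sketch.
    """
    total = sum(placement_probs)
    qualities = []
    for p in placement_probs:
        p_wrong = 1.0 - p / total
        if p_wrong <= 0:
            qualities.append(cap)
        else:
            qualities.append(min(cap, round(-10 * math.log10(p_wrong))))
    return qualities


print(mapping_qualities([0.8, 0.19, 0.01]))  # prints [7, 1, 0]
```

Even the best placement here has a 20% chance of being wrong, so its mapping quality is only about 7. Reads in repeats end up with low values like this, while a uniquely placed read reaches the cap.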
SNP calling
Slide36
SNP calling: what goes into it?
sequencing error
true polymorphism
Base qualities
Base coverage
Prior expectation
Slide37
Bayesian SNP calling
Figure: the four possible nucleotides (A, C, T, G) at each of five aligned reads; labels: polymorphic permutation, monomorphic permutation
Bayesian posterior probability
Base call + Base quality
Expected polymorphism rate
Base composition
Depth of coverage
Slide38
The PolyBayes software
http://bioinformatics.bc.edu/~marth/PolyBayes
Marth et al., Nature Genetics, 1999
First statistically rigorous SNP discovery tool
Correctly analyzes alternative cDNA splice forms
Slide39
SNP calling (continued)
Figure: base calls at one site from the reads of individuals 1, i and n: B1 = aacc, Bi = aaaac, Bn = cccc
“genotype likelihoods”
P(B1=aacc|G1=aa)
P(B1=aacc|G1=cc)
P(B1=aacc|G1=ac)
P(Bi=aaaac|Gi=aa)
P(Bi=aaaac|Gi=cc)
P(Bi=aaaac|Gi=ac)
P(Bn=cccc|Gn=aa)
P(Bn=cccc|Gn=cc)
P(Bn=cccc|Gn=ac)
Prior(G1,..,Gi,..,Gn)
“genotype probabilities”
P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(SNP)
Slide40
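The genotype likelihoods and genotype probabilities on the "SNP calling (continued)" slide can be illustrated with a deliberately simplified sketch. It is not the method of any particular caller. It assumes two alleles (a and c), a single fixed per-base error rate in place of individual base qualities, and independent per-sample genotype priors in place of the joint Prior(G1,..,Gi,..,Gn) on the slide. It also assumes that a is the reference allele. The prior values are hypothetical.

```python
GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihood(bases, genotype, error_rate=0.001):
    """P(B | G) for one sample: product over reads of P(base | genotype)."""
    likelihood = 1.0
    for base in bases:
        # Each read samples one of the two alleles with probability 1/2
        # and is miscalled as the other allele with probability error_rate.
        p = 0.0
        for allele in genotype:
            p += 0.5 * ((1.0 - error_rate) if base == allele else error_rate)
        likelihood *= p
    return likelihood


def genotype_posteriors(bases, prior, error_rate=0.001):
    """P(G | B) for one sample by Bayes' rule with a per-sample prior."""
    joint = {g: prior[g] * genotype_likelihood(bases, g, error_rate)
             for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}


def snp_probability(samples, prior, error_rate=0.001):
    """P(SNP) = 1 - P(every sample is homozygous for the reference allele a)."""
    p_all_ref = 1.0
    for bases in samples:
        p_all_ref *= genotype_posteriors(bases, prior, error_rate)["aa"]
    return 1.0 - p_all_ref


# Hypothetical prior that strongly favors the reference genotype.
prior = {"aa": 0.9985, "ac": 0.001, "cc": 0.0005}
samples = ["aacc", "aaaac", "cccc"]  # B1, Bi, Bn from the slide
for bases in samples:
    post = genotype_posteriors(bases, prior)
    print(bases, max(post, key=post.get), post)
print("P(SNP) =", snp_probability(samples, prior))
```

With these assumed numbers the most probable genotypes are ac for B1, aa for Bi (the single c is more plausibly a sequencing error) and cc for Bn, which shows how base evidence, depth of coverage and the prior expectation interact.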
Insertion/deletion (INDEL) variants
These variants have been on the “radar screen” for decades
Accurate automated detection is difficult
Different mutation mechanisms
Often appear in repetitive sequence and are therefore difficult to align
Often multi-allelic
Deleted allele has no base quality values
Slide41
Alignment methods became more refined
Original alignment
After left realignment
After haplotype-aware realignment
Slide42
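One simple interpretation of the "left realignment" step on the alignment methods slide is left-shifting: an INDEL inside a repeat can be placed at several equivalent positions, and moving it to the leftmost one gives every read the same representation. The sketch below does this against a reference string using 0-based positions. The function names and the sequences are hypothetical, and haplotype-aware realignment is not covered.

```python
def left_align_deletion(ref, pos, length):
    """Shift a deletion of ref[pos:pos+length] as far left as possible.

    The deletion can move one base left whenever the base entering the
    deleted window on the left equals the base leaving it on the right.
    """
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


def left_align_insertion(ref, pos, inserted):
    """Shift an insertion placed before ref[pos] as far left as possible.

    Each step rotates the inserted sequence by one base so the resulting
    haplotype sequence is unchanged.
    """
    while pos > 0 and ref[pos - 1] == inserted[-1]:
        inserted = inserted[-1] + inserted[:-1]
        pos -= 1
    return pos, inserted


ref = "TTCACACAGG"
# Deleting the last "CA" unit (position 6) is equivalent to deleting the first.
print(left_align_deletion(ref, 6, 2))            # prints 2
# Inserting "CA" after the repeat is equivalent to inserting it before.
print(left_align_insertion("TTCACAGG", 6, "CA"))  # prints (2, 'CA')
```

Both placements produce the same sequence, so without a convention like this the same INDEL can appear at different positions in different reads and be miscounted.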
Medium-length INDELs are still a problem
Guillermo Angel
Slide43
Structural variation detection
Feuk et al. Nature Reviews Genetics, 2006
Slide44
Structural variant detection (cont’d)
Slide45
Detection Approaches
Figure labels: Sample, Reference, Lmap, read, contig
Read Depth: good for big CNVs
Paired-end: all types of SV
Split-Reads: good break-point resolution
deNovo Assembly: ~ the future
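The paired-end approach in the list above compares the mapped fragment length of each read pair (labelled Lmap in the figure) with the fragment length expected from the library. A minimal sketch follows, assuming the library's mean and standard deviation are already known; the threshold of three standard deviations, the function name and the example values are hypothetical.

```python
def classify_fragment(l_map, mean_length, sd_length, n_sd=3.0):
    """Classify one read pair by its mapped fragment length.

    A pair that maps much farther apart than the library allows suggests
    sequence missing from the sample (deletion); a pair that maps much
    closer together suggests extra sequence in the sample (insertion).
    """
    if l_map > mean_length + n_sd * sd_length:
        return "deletion candidate"
    if l_map < mean_length - n_sd * sd_length:
        return "insertion candidate"
    return "concordant"


# Hypothetical library: mean fragment length 400 bp, standard deviation 30 bp.
for l_map in [395, 2900, 250, 430]:
    print(l_map, classify_fragment(l_map, 400, 30))
```

A single discordant pair is weak evidence. In practice a call would require a cluster of pairs supporting the same event, and inversions or translocations would be recognized from read orientation and chromosome rather than from length alone.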
SV slides courtesy of Chip Stewart, Boston College
Slide46
SV detection – resolution
Figure: axes are relative numbers of events and CNV event length [bp]; labels: Expected CNVs, Karyotype, Micro-array, Sequencing
Slide47
Standard data formats
Reads: FASTQ
Alignments: SAM/BAM
Variants: VCF
Slide48
Tools for analyzing & manipulating 1000G data
samtools: http://samtools.sourceforge.net/
BamTools: http://sourceforge.net/projects/bamtools/
GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
VCFTools: http://vcftools.sourceforge.net/
VcfCTools: https://github.com/AlistairNWard/vcfCTools
Alignments: SAM/BAM
Variants: VCF
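To give a feel for the VCF format listed above, the sketch below splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not a replacement for the tools on this slide: header lines and per-sample genotype columns are ignored, and the function name and the example record are illustrative only.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]

    info_dict = {}
    for entry in info.split(";"):
        if entry == ".":
            continue  # "." marks a missing INFO field
        key, sep, value = entry.partition("=")
        # Flags such as "DB" have no value; store them as True.
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Illustrative record, not taken from real data.
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["ref"], record["alt"])
print(record["info"])
```

Splitting ALT on commas matters for the multi-allelic sites that are common among INDELs.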