High throughput sequencing

Uploaded On 2016-08-03


Presentation Transcript

Slide1

High throughput sequencing: informatics & software aspects

Gabor T. Marth

Boston College Biology Department

BI543 Fall 2013

January 29, 2013

Slide2

Traditional DNA sequencing

Slide3

Genetics of living organisms

DNA

Chromosomes

Slide4

Radioactive label gel sequencing

Slide5

Four-color capillary sequencing

[Figure: ABI 3700 four-color sequence trace, with size labels ~1 Mb, ~100 Mb, >100 Mb and ~3,000 Mb]

Slide6

Individual human resequencing

Slide7

Next-generation DNA sequencing

Slide8

New sequencing technologies…

Slide9

… vast throughput, many applications

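[Figure: sequencing platforms plotted by read length and bases per machine run. Read length axis: 10 bp, 100 bp, 1,000 bp. Bases per machine run axis: 1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb, 100 Gb, 1 Tb. Platforms shown: ABI / capillary; 454; Illumina, SOLiD]

Slide10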

Sequencing chemistries

DNA ligation

DNA base extension

Church, 2005

Slide11

Template clonal amplification

Church, 2005

Slide12

Massively parallel sequencing

Church, 2005

Slide13

Chemistry of paired-end sequencing

Double-stranded DNA is folded into a bridge shape, then separated into single strands. The end of each strand is then sequenced.

(Figure courtesy of Illumina)

Slide14

Paired-end reads

fragment amplification: fragment length 100 - 600 bp

fragment length limited by amplification efficiency

circularization: 500bp - 10kb (sweet spot ~3kb)

fragment length limited by library complexity
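As a rough illustration of how the fragment-length ranges above are typically used once read pairs have been mapped, the sketch below computes the fragment length implied by a pair and checks it against the expected library range. The function names, the coordinate convention (0-based leftmost positions, left read forward, right read reverse) and the example coordinates are assumptions made for this example; they do not come from the slides.

```python
def fragment_length(left_start: int, right_start: int, right_read_len: int) -> int:
    """Outer distance spanned by a read pair mapped to the same chromosome.

    Assumes 0-based leftmost mapping positions, with the left read on the
    forward strand and the right read on the reverse strand.
    """
    if right_start < left_start:
        raise ValueError("right_start must not precede left_start")
    return right_start + right_read_len - left_start


def is_concordant(length: int, min_len: int = 100, max_len: int = 600) -> bool:
    """True if the length falls inside the expected library range.

    The defaults are the 100 - 600 bp range quoted for fragment amplification.
    """
    return min_len <= length <= max_len


# Hypothetical pairs: (left read start, right read start, right read length)
pairs = [(1_000, 1_250, 100), (5_000, 8_400, 100)]
for left, right, read_len in pairs:
    size = fragment_length(left, right, read_len)
    print(size, is_concordant(size))
```

The second pair implies a fragment far longer than the library allows, which is the kind of signal paired-end structural variation detection relies on.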

Korbel et al. Science 2007

Slide15

Features of NGS data

Short sequence reads

100-200bp

25-35bp (micro-reads)

Huge amount of sequence per run

Up to gigabases per run

Huge number of reads per run

Up to hundreds of millions

Higher error rate compared with Sanger sequencing

Error profile different from Sanger

Slide16

Application areas of next-gen sequencing

Slide17

Application areas

Genome resequencing: variant discovery, somatic mutation detection, mutational profiling

De novo assembly

Identification of protein-bound DNA: chromatin structure, methylation, transcription binding sites

RNA-Seq: expression, transcript discovery

Mikkelsen et al. Nature 2007

Cloonan et al. Nature Methods, 2008

Slide18

SNP and short-INDEL discovery

Slide19

Structural variation detection

structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

copy number (for amplifications, deletions) from depth of read coverage

Slide20

Identification of protein-bound DNA

genome sequence

aligned reads

Chromatin structure (ChIP-seq)

(Mikkelsen et al. Nature 2007)

Transcription binding sites. (Robertson et al. Nature Methods, 2007)

Slide21

Novel transcript discovery (genes)

Mortazavi et al. Nature Methods

novel exons

novel transcripts containing known exons

Slide22

Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006

Slide23

Expression profiling

[Figure: two panels, each showing aligned reads over a gene]

Jones-Rhoads et al. PLoS Genetics, 2007

tag counting (e.g. SAGE, CAGE)

shotgun transcript sequencing

Slide24

De novo genome sequencing

assembled sequence contigs

short reads

longer reads

read pairs

Lander et al. Nature 2001

Slide25

The informatics of sequencing

Slide26

Re-sequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) SNP and short INDEL calling

(iv) SV calling

(v) data viewing, hypothesis generation

[Figure labels: REF, IND, IND]

Slide27

The variation discovery toolbox

base callers

read mappers

SNP callers

SV callers

assembly viewers

Slide28

Raw data processing / base calling

Trace extraction

Base calling

These steps are usually handled well by the machine manufacturers’ software

What most analysts want to see is base calls and well-calibrated base quality values

Slide29

Sequence traces are machine-specific

Base calling is increasingly left to machine manufacturers

Slide30

Read mapping…

Is like a jigsaw puzzle…

…where they give you the cover on the box

Slide31

Some pieces are easier to place than others…

…pieces with unique features

pieces that look like each other…

Slide32

Repeats → multiple mapping problem
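To make the multiple mapping problem concrete, here is a small sketch on a toy reference. It uses a naive gap-free scan, which is not how production read mappers work; the function name `find_placements`, the toy sequences and the mismatch parameter are all assumptions made for illustration.

```python
def find_placements(read: str, reference: str, max_mismatches: int = 0) -> list[int]:
    """Return every 0-based offset where `read` aligns to `reference`
    with at most `max_mismatches` substitutions (no gaps allowed)."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Toy reference (hypothetical): the 8-mer below occurs twice, i.e. it is a repeat.
repeat = "ACGTACGA"
unique = "CCATGGTC"
reference = "TTG" + repeat + unique + repeat + "GGA"

print(find_placements(repeat, reference))  # [3, 19] -> two equally good placements
print(find_placements(unique, reference))  # [11]    -> a single placement
```

A read drawn from the repeat cannot be assigned to one location from its sequence alone, whereas the read with unique features places unambiguously.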

Lander et al. 2001

Slide33

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

PE reads are now the standard for whole-genome short-read sequencing

Slide34

Mapping quality values

[Figure: mapping quality values 0.8, 0.19 and 0.01]

Slide35

SNP calling

Slide36

SNP calling: what goes into it?

sequencing error

true polymorphism

Base qualities

Base coverage

Prior expectation

Slide37

Bayesian SNP calling

[Figure: columns of base calls (A, C, T, G) contrasting a polymorphic permutation with a monomorphic permutation]

Bayesian posterior probability

Base call + Base quality

Expected polymorphism rate

Base composition

Depth of coverage

Slide38

The PolyBayes software

http://bioinformatics.bc.edu/~marth/PolyBayes

Marth et al., Nature Genetics, 1999

First statistically rigorous SNP discovery tool

Correctly analyzes alternative cDNA splice forms

Slide39

SNP calling (continued)

“genotype probabilities”

$P(G_1 = aa \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
$P(G_1 = cc \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
$P(G_1 = ac \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$

$P(G_i = aa \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
$P(G_i = cc \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
$P(G_i = ac \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$

$P(G_n = aa \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
$P(G_n = cc \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
$P(G_n = ac \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$

$P(\mathrm{SNP})$

“genotype likelihoods”

$P(B_1 = aacc \mid G_1 = aa)$
$P(B_1 = aacc \mid G_1 = cc)$
$P(B_1 = aacc \mid G_1 = ac)$

$P(B_i = aaaac \mid G_i = aa)$
$P(B_i = aaaac \mid G_i = cc)$
$P(B_i = aaaac \mid G_i = ac)$

$P(B_n = cccc \mid G_n = aa)$
$P(B_n = cccc \mid G_n = cc)$
$P(B_n = cccc \mid G_n = ac)$

$\mathrm{Prior}(G_1, \ldots, G_i, \ldots, G_n)$
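The relationship between genotype likelihoods, the prior and genotype probabilities can be sketched in a few lines. This is a deliberately simplified version: each sample is treated independently with its own prior rather than through a joint $\mathrm{Prior}(G_1, \ldots, G_n)$, every base call is given the same error rate, and the prior values are invented for illustration. None of these choices should be read as the method used by PolyBayes or any other specific caller.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihood(bases: str, genotype: str, error: float = 0.01) -> float:
    """P(B | G): probability of the observed base calls given a diploid genotype.

    Each read is assumed to come from either allele with probability 1/2, and a
    miscall is assumed to be equally likely to produce any of the other 3 bases.
    """
    def p_base(base: str, allele: str) -> float:
        return 1.0 - error if base == allele else error / 3.0

    return prod(
        0.5 * p_base(b, genotype[0]) + 0.5 * p_base(b, genotype[1]) for b in bases
    )


def genotype_posteriors(bases: str, prior: dict, error: float = 0.01) -> dict:
    """P(G | B) by Bayes' rule: prior times likelihood, normalized over genotypes."""
    joint = {g: prior[g] * genotype_likelihood(bases, g, error) for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}


# Hypothetical per-sample prior that strongly favors the reference homozygote "aa".
PRIOR = {"aa": 0.998, "ac": 0.0015, "cc": 0.0005}

# Base calls from the slide: B1 = aacc, Bi = aaaac, Bn = cccc.
for name, bases in {"B1": "aacc", "Bi": "aaaac", "Bn": "cccc"}.items():
    post = genotype_posteriors(bases, PRIOR)
    best = max(post, key=post.get)
    print(name, bases, best, round(post[best], 3))
```

Under these assumptions the single c among five reads in Bi is not enough to overcome the prior, so the most probable genotype stays aa, while B1 is called ac and Bn is called cc.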

[Figure: aligned reads at the candidate site, showing the bases a, a, c, c (sample 1), a, a, a, a, c (sample i) and c, c, c, c (sample n)]

Slide40

Insertion/deletion (INDEL) variants

These variants have been on the “radar screen” for decades

Accurate automated detection is difficult

Different mutation mechanisms

Often appear in repetitive sequence and therefore difficult to align

Often multi-allelic

Deleted allele has no base quality values

Slide41

Alignment methods became more refined

Original alignment

After left realignment

After haplotype-aware realignment

Slide42

Medium length INDELs still a problem

Guillermo Angel

Slide43

Structural variation detection

Feuk et al. Nature Reviews Genetics, 2006

Slide44

Structural variant detection (cont’d)

Slide45

Detection Approaches

Read Depth: good for big CNVs

Paired-end: all types of SV

Split-Reads: good break-point resolution

deNovo Assembly: ~ the future

[Figure labels: Sample, Reference, Lmap, read, contig]
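The read-depth approach can be sketched as a ratio of observed to expected coverage per genomic window. The window depths, the flat expected depth of 30x and the function name below are hypothetical; production callers typically also correct for biases such as GC content and mappability, which is omitted here.

```python
from typing import Optional


def copy_number_calls(sample_depth: list[float],
                      expected_depth: list[float],
                      ploidy: int = 2) -> list[Optional[int]]:
    """Estimate copy number per window as ploidy * observed / expected, rounded.

    Windows with no expected coverage cannot be assessed and yield None.
    """
    calls: list[Optional[int]] = []
    for observed, expected in zip(sample_depth, expected_depth):
        if expected <= 0:
            calls.append(None)
        else:
            calls.append(round(ploidy * observed / expected))
    return calls


# Hypothetical mean depth in seven consecutive windows of one sample.
sample = [30.5, 29.0, 14.8, 15.3, 31.2, 46.0, 44.1]
expected = [30.0] * len(sample)

print(copy_number_calls(sample, expected))  # [2, 2, 1, 1, 2, 3, 3]
```

Windows three and four look like a heterozygous deletion and the last two like an amplification. Because the signal is averaged over windows, this approach suits large events far better than small ones, consistent with "good for big CNVs".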

SV slides courtesy of Chip Stewart, Boston College

Slide46

SV detection – resolution

[Figure: relative numbers of expected CNV events plotted against CNV event length [bp], with the size ranges resolved by karyotype, micro-array and sequencing]

Slide47

Standard data formats

Reads: FASTQ

Alignments: SAM/BAM

Variants: VCF

Slide48

Tools for analyzing & manipulating 1000G data

samtools: http://samtools.sourceforge.net/

BamTools: http://sourceforge.net/projects/bamtools/

GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

VCFTools: http://vcftools.sourceforge.net/

VcfCTools: https://github.com/AlistairNWard/vcfCTools

Alignments: SAM/BAM

Variants: VCF
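For a feel of what the read-level format contains, the sketch below parses FASTQ records and converts the quality string into Phred scores and error probabilities. It assumes simple four-line records and the Phred+33 quality encoding; the example record is invented, and none of the tools listed above are used or required.

```python
from typing import Iterable, Iterator


def parse_fastq(lines: Iterable[str]) -> Iterator[tuple[str, str, list[int]]]:
    """Yield (read name, sequence, Phred qualities) from four-line FASTQ records.

    Assumes Phred+33 encoding: quality = ord(character) - 33.
    """
    records = [line.rstrip("\n") for line in lines if line.strip()]
    if len(records) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(records), 4):
        header, seq, plus, qual = records[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def error_probability(phred: int) -> float:
    """Convert a Phred quality Q into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical single-record input.
record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for name, seq, quals in parse_fastq(record):
    print(name, seq, quals)  # read1 ACGT [40, 40, 20, 0]
    print([error_probability(q) for q in quals])
```

The same Phred scale is what makes "well-calibrated base quality values" directly usable as per-base error probabilities in downstream SNP calling.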