Next-Generation Sequencing: Challenges and Opportunities

Uploaded On 2018-01-08


Presentation Transcript

Slide1

Next-Generation Sequencing: Challenges and Opportunities

Ion Mandoiu

Computer Science and Engineering Department

University of Connecticut

Slide2

Outline

Background on high-throughput sequencing

Identification of tumor-specific epitopes

Estimation of gene and isoform expression levels

Viral quasispecies reconstruction

Future work

Slide3

Advances in High-Throughput Sequencing (HTS)

http://www.economist.com/node/16349358

Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. length

Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length

SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length

Slide4

Illumina Workflow – Library Preparation

[Figure: library preparation starting from genomic DNA or mRNA]

Slide5

Illumina Workflow – Cluster Generation

Slide6

Illumina Workflow – Sequencing by Synthesis

Slide7

Cost of Whole Genome Sequencing

[Figure labels: C. Venter (Sanger @ 7.5x); J. Watson (454 @ 7.4x); NA18507 (Illumina @ 36x, SOLiD @ 12x)]

Slide8

HTS applications

HTS is a transformative technology

Numerous applications besides de novo genome sequencing:

RNA-Seq

Non-coding RNAs

ChIP-Seq

Epigenetics

Structural variation

Metagenomics

Paleogenomics

Slide9

Outline

Background on high-throughput sequencing

Identification of tumor-specific epitopes

Estimation of gene and isoform expression levels

Viral quasispecies reconstruction

Future work

Slide10

Genomics-Guided Cancer Immunotherapy

[Figure: tumor mRNA is sequenced (example reads: ACCATCTGTGTGGGCTACCATG, AGGCAAGCTCATGGCCAAATCATGAGA); tumor-specific epitopes (e.g., SYFPEITHI, ISETDLSLL, CALRRNESL) are identified; peptide synthesis; immune system stimulation; tumor remission]

Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html

Slide11

Bioinformatics Pipeline

[Pipeline diagram; stages and their outputs:]

Tumor mRNA reads

CCDS Mapping: CCDS mapped reads

Genome Mapping: Genome mapped reads

Read Merging: Mapped reads

SNVs Detection: Tumor-specific SNVs

Haplotyping: Close SNV Haplotypes

Epitope Prediction: Tumor specific epitopes

Primers Design: Primers for Sanger Sequencing

Slide12

Bioinformatics Pipeline

[Pipeline diagram repeated from Slide11: tumor mRNA reads, CCDS and genome mapping, read merging, SNV detection, haplotyping, epitope prediction, primer design]

Slide13

Mapping mRNA Reads

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Slide14

Read Merging

Genome        CCDS          Agree?   Hard Merge   Soft Merge
Unique        Unique        Yes      Keep         Keep
Unique        Unique        No       Throw        Throw
Unique        Multiple      No       Throw        Keep
Unique        Not mapped    No       Keep         Keep
Multiple      Unique        No       Throw        Keep
Multiple      Multiple      No       Throw        Throw
Multiple      Not mapped    No       Throw        Throw
Not mapped    Unique        No       Keep         Keep
Not mapped    Multiple      No       Throw        Throw
Not mapped    Not mapped    Yes      Throw        Throw

Slide15
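The hard- and soft-merge rules in the Read Merging table (Slide14) can be written as a small decision function. This is only a sketch of the table: the function name, the status strings ("unique", "multiple", "unmapped"), and the `mode` parameter are illustrative assumptions, not names from the authors' software.

```python
def keep_read(genome: str, ccds: str, agree: bool = False, mode: str = "hard") -> bool:
    """Return True if a read is kept after merging its genome and CCDS alignments.

    genome, ccds: "unique", "multiple", or "unmapped" (hypothetical labels).
    agree: whether the two alignments point to the same location.
    mode: "hard" or "soft", as in the Read Merging table.
    """
    if genome == "unique" and ccds == "unique":
        return agree                  # both unique: keep only if they agree
    if {genome, ccds} == {"unique", "unmapped"}:
        return True                   # a single unique alignment is kept
    if {genome, ccds} == {"unique", "multiple"}:
        return mode == "soft"         # soft merge trusts the unique alignment
    return False                      # all remaining combinations are thrown


# Example: a read unique in the genome but multiply mapped in CCDS
print(keep_read("unique", "multiple", mode="hard"))  # False
print(keep_read("unique", "multiple", mode="soft"))  # True
```

The only rows on which the two modes differ are those pairing a unique alignment with a multiple one, which is why a single `mode` check is enough.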

SNV Detection and Genotyping

[Figure: reads aligned to the reference; the first line is the reference]

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
  CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
   GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
      GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
          GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
               CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
                      CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
                         CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                         CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                            GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

Figure labels: Locus i, R_i

r(i) : Base call of read r at locus i

ε_r(i) : Probability of error reading base call r(i)

G_i : Genotype at locus i

Slide16

SNV Detection and Genotyping

Use Bayes' rule to calculate posterior probabilities and pick the genotype with the largest one

Slide17

SNV Detection and Genotyping

Calculate conditional probabilities by multiplying contributions of individual reads
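The formulas on these two slides did not survive in the transcript, so the following is a hedged sketch of the general idea rather than the authors' exact model. It multiplies per-read contributions to get P(reads | G_i) and applies Bayes' rule. The assumptions are labeled in the comments: a uniform genotype prior, each read sampling either allele with probability 1/2, and a miscalled base being equally likely to be any of the other three. Function names are illustrative.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(calls):
    """calls: list of (r_i, eps) pairs, one per read covering locus i, where
    r_i is the base call and eps its error probability (0 < eps < 1)."""
    genotypes = list(combinations_with_replacement(BASES, 2))  # 10 diploid genotypes
    log_prior = math.log(1.0 / len(genotypes))  # ASSUMPTION: uniform prior

    def p_call(base, allele, eps):
        # ASSUMPTION: an error yields each of the 3 other bases equally often
        return 1.0 - eps if base == allele else eps / 3.0

    log_post = {}
    for g in genotypes:
        total = log_prior
        for base, eps in calls:
            # ASSUMPTION: the read comes from either allele with probability 1/2
            total += math.log(0.5 * p_call(base, g[0], eps)
                              + 0.5 * p_call(base, g[1], eps))
        log_post[g] = total

    # Normalize in log space so high coverage does not underflow
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {g: math.exp(v - m) / z for g, v in log_post.items()}


def call_genotype(calls):
    post = genotype_posteriors(calls)
    return max(post, key=post.get)  # genotype with the largest posterior


# Five reads support A and five support G, each with 1% error probability
print(call_genotype([("A", 0.01)] * 5 + [("G", 0.01)] * 5))  # ('A', 'G')
```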

Slide18

Data Filtering

Slide19

Accuracy per RPKM bins

Slide20

Bioinformatics Pipeline

[Pipeline diagram repeated from Slide11: tumor mRNA reads, CCDS and genome mapping, read merging, SNV detection, haplotyping, epitope prediction, primer design]

Slide21

Haplotyping

Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.

ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA

Heterozygous variants

Slide22

Haplotyping

Locus   Event       Alleles
1       SNV         C,T
2       Deletion    C,-
3       SNV         A,G
4       Insertion   -,GC

Locus   Event       Alleles Hap 1   Alleles Hap 2
1       SNV         T               C
2       Deletion    C               -
3       SNV         A               G
4       Insertion   -               GC

Slide23

RefHap Algorithm

Reduce the problem to Max-Cut.

Solve Max-Cut

Build haplotypes according to the cut

Locus   1   2   3   4   5
f1      -   0   1   1   0
f2      1   1   0   -   1
f3      1   -   -   0   -
f4      -   0   0   -   1

[Figure: graph on fragments 1-4 with edge weights (1,2) = 3, (1,3) = 1, (1,4) = 1, (2,3) = -1, (2,4) = -1]

h1   00110

h2   11001
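The three RefHap steps can be traced on the fragment matrix above. This is a minimal sketch under stated assumptions: the edge weight is taken to be mismatches minus matches over shared loci, which reproduces the weights in the figure, and Max-Cut is solved by exhaustive search, which is only feasible for a toy instance and is not the heuristic RefHap itself uses. All function names are illustrative.

```python
from itertools import product


def score(f, g):
    """Edge weight: mismatches minus matches over loci covered by both fragments."""
    s = 0
    for a, b in zip(f, g):
        if a != "-" and b != "-":
            s += 1 if a != b else -1
    return s


def max_cut_bruteforce(frags):
    """Exhaustive Max-Cut (toy sizes only); fragment 0 is pinned to side 0."""
    n = len(frags)
    best_sides, best_weight = None, None
    for rest in product((0, 1), repeat=n - 1):
        sides = (0,) + rest
        weight = sum(score(frags[i], frags[j])
                     for i in range(n) for j in range(i + 1, n)
                     if sides[i] != sides[j])
        if best_weight is None or weight > best_weight:
            best_sides, best_weight = sides, weight
    return best_sides, best_weight


def build_haplotypes(frags, sides):
    """Majority vote per locus; alleles of side-1 fragments are flipped first."""
    flip = {"0": "1", "1": "0", "-": "-"}
    h1 = []
    for locus in range(len(frags[0])):
        votes = {"0": 0, "1": 0}
        for f, side in zip(frags, sides):
            a = f[locus] if side == 0 else flip[f[locus]]
            if a != "-":
                votes[a] += 1
        if votes["0"] == 0 and votes["1"] == 0:
            h1.append("-")                      # locus not covered by any fragment
        else:
            h1.append("0" if votes["0"] >= votes["1"] else "1")
    h1 = "".join(h1)
    return h1, "".join(flip[c] for c in h1)


frags = ["-0110", "110-1", "1--0-", "-00-1"]    # f1..f4 from the slide
sides, weight = max_cut_bruteforce(frags)
print(sides, weight)                            # (0, 1, 1, 1) 5
print(build_haplotypes(frags, sides))           # ('00110', '11001')
```

The best cut separates f1 from f2, f3, and f4, and the resulting consensus matches h1 and h2 on the slide.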

Slide24

Bioinformatics Pipeline

[Pipeline diagram repeated from Slide11: tumor mRNA reads, CCDS and genome mapping, read merging, SNV detection, haplotyping, epitope prediction, primer design]

Slide25

Immunology Background

J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003

Slide26

Epitope Prediction

C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004

Slide27

Results on Tumor Data

Mouse strain      BALB/C   BALB/C   B10.D2 TRAMP   B10.D2 TRAMP   B10.D2 TRAMP   B10.D2 TRAMP
Tumor             Meth-A   CMS5     prostate1      prostate2      prostate3      prostate4
#lanes            1        3        4              3              3              3
HQ Het SNPs       465      77       86             17             292            193
Dd     Weak       119      17       14             12             63             70
Dd     Strong     20       2        2              0              7              12
Kd     Weak       111      21       10             0              19             54
Kd     Strong     3        1        1              0              1              3
Ld     Weak       99       12       25             4              47             75
Ld     Strong     8        0        0              0              2              9
Total  Weak       329      50       49             16             129            199
Total  Strong     31       3        3              0              10             24

Slide28

Experimental Validation

Mutations reported by [Noguchi et al. 94] were found by the pipeline

Sanger sequencing confirmed 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5

Immunogenic potential is under experimental validation in the Srivastava lab at UCHC

Slide29

Outline

Background on high-throughput sequencing

Identification of tumor-specific epitopes

Estimation of gene and isoform expression levels

Viral quasispecies reconstruction

Future work

Slide30

RNA-Seq

[Figure labels: A B C D E; A B C; A C; D E]

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

Gene Expression (GE)

Isoform Discovery (ID)

Isoform Expression (IE)

Slide31

Alternative Splicing

[Griffith and Marra 07]

Slide32

Challenges to Accurate Estimation of Gene Expression Levels

Read ambiguity (multireads)

What is the gene length?

[Figure labels: A B C D E]

Slide33

Previous approaches to GE

Ignore multireads

[Mortazavi et al. 08]: Fractionally allocate multireads based on unique read estimates

[Pasaniuc et al. 10]: EM algorithm for solving ambiguities

Gene length: sum of lengths of exons that appear in at least one isoform

Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Slide34

Read Ambiguity in IE

[Figure labels: A B C D E; A C]

Slide35

Previous approaches to IE

[Jiang&Wong 09]: Poisson model + importance sampling, single reads

[Richard et al. 10]: EM Algorithm based on Poisson model, single reads in exons

[Li et al. 10]: EM Algorithm, single reads

[Feng et al. 10]: Convex quadratic program, pairs used only for ID

[Trapnell et al. 10]: Extends Jiang's model to paired reads; fragment length distribution

Slide36

Our contribution

Unified probabilistic model and Expectation-Maximization Algorithm for IE considering

Single and/or paired reads

Fragment length distribution

Strand information

Base quality scores

Slide37

Read-Isoform Compatibility

Slide38

Fragment length distribution

Paired reads

[Figure: reads aligned to isoforms A B C and A C; fragment lengths i and j with probabilities F_a(i) and F_a(j)]

Slide39

Fragment length distribution

Single reads

[Figure: reads aligned to isoforms A B C and A C; fragment lengths i and j with probabilities F_a(i) and F_a(j)]

Slide40

IsoEM algorithm

E-step

M-step

Slide41
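The E-step and M-step formulas of the IsoEM slide (Slide40) are not preserved in the transcript. The sketch below shows a generic EM of this kind under simplifying assumptions and should not be read as the authors' implementation: each read carries a compatibility weight per isoform (here simply 1 for compatible isoforms), the E-step fractionally assigns reads to isoforms, and the M-step divides expected read counts by an effective isoform length. The names `iso_em`, `read_weights`, and `eff_len` are hypothetical.

```python
def iso_em(read_weights, eff_len, n_iter=200):
    """Estimate isoform frequencies by expectation-maximization.

    read_weights: one dict per read, mapping isoform -> compatibility weight.
    eff_len: dict mapping isoform -> effective length (ASSUMED to be given).
    """
    isoforms = list(eff_len)
    f = {j: 1.0 / len(isoforms) for j in isoforms}   # uniform starting point
    for _ in range(n_iter):
        # E-step: expected number of reads emitted by each isoform
        n = {j: 0.0 for j in isoforms}
        for w in read_weights:
            z = sum(w[j] * f[j] for j in w)
            if z == 0.0:
                continue                              # read explained by no isoform
            for j in w:
                n[j] += w[j] * f[j] / z
        # M-step: frequency proportional to expected reads per unit length
        s = sum(n[j] / eff_len[j] for j in isoforms)
        f = {j: (n[j] / eff_len[j]) / s for j in isoforms}
    return f


# Toy gene with isoforms ABC (300 bp) and AC (200 bp):
# 30 reads fall in exon B (ABC only), 90 reads fall in exons A or C (both).
reads = [{"ABC": 1.0}] * 30 + [{"ABC": 1.0, "AC": 1.0}] * 90
freqs = iso_em(reads, {"ABC": 300.0, "AC": 200.0})
print({j: round(v, 3) for j, v in freqs.items()})    # {'ABC': 0.667, 'AC': 0.333}
```

In the toy example, exon B sees 0.30 reads per base and exons A and C see 0.45, so ABC accounts for two thirds of the transcripts, which is the fixed point the iteration reaches.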

Error Fraction Curves - Isoforms

30M single reads of length 25 (simulated)

Slide42

Error Fraction Curves - Genes

30M single reads of length 25 (simulated)

Slide43

Validation on MAQC Samples

Slide44

Outline

Background on high-throughput sequencing

Identification of tumor-specific epitopes

Estimation of gene and isoform expression levels

Viral quasispecies reconstruction

Future work

Slide45

Viral Quasispecies

RNA viruses (HIV, HCV)

Many replication mistakes

Quasispecies (qsps) = co-existing closely related variants

Variants differ in

virulence

ability to escape the immune system

resistance to antiviral therapies

tissue tropism

How do qsps contribute to viral persistence and evolution?

Slide46

454 Pyrosequencing

Pyrosequencing = Sequencing by Synthesis.

GS FLX Titanium:

Fragments (reads): 300-800 bp

Sequence of the reads

System software assembles reads into a single genome

We need software that assembles reads into multiple genomes!

Slide47

Quasispecies Spectrum Reconstruction (QSR) Problem

Given

pyrosequencing reads from a quasispecies population of unknown size and distribution

Reconstruct

the quasispecies spectrum:

sequences

frequencies

Slide48

ViSpA

Viral Spectrum Assembler

Slide49

454 Sequencing Errors

Error rate ~0.1%.

Fixed number of incorporated bases vs. light intensity value.

Incorrect resolution of homopolymers =>

over-calls (insertions): 65-75% of errors

under-calls (deletions): 20-30% of errors

Slide50

Preprocessing of Aligned Reads

Deletions in reads: D

Replace a deletion that is confirmed by a single read with either the allele value that is present in all other reads or N.

Insertions into reference: I

Remove insertions that are confirmed by a single read.

Imputation of missing values N

Slide51

Read Graph: Vertices

Subread = completely contained in some read with n mismatches.

Superread = not a subread => the vertex in the read graph.

[Figure: example reads ACTGGTCCCTCCTGAGTGT, GGTCCCTCCT, and TGGTCACTCGTGAG, with mismatching bases highlighted; the remaining reads in the figure are not recoverable from the transcript]

Slide52

Read Graph: Edges

An edge between two vertices exists if

there is an overlap between the superreads

they agree on their overlap with m mismatches.

Auxiliary vertices: source and sink

Slide53

Read Graph: Edge Cost

The most probable source-sink path through each vertex

Cost: uncertainty that two superreads are from the same qsps.

Overhang Δ is the shift in start positions of two overlapping superreads.

Slide54

Contig Assembling

Max Bandwidth Path through vertex

path minimizing maximum edge cost for the path and each subpath

Consensus of path’s superreads

Each position: >70%-majority or N

Weighted consensus obtained on all reads

Remove duplicates

Duplicated sequences = statistical evidence

[Formula not preserved in the transcript. Symbols: read r of length l; qsps s of length L; k is #mismatches; t/L is a mutation rate]

Slide55
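The Max Bandwidth Path step of Contig Assembling (Slide54) asks for a source-sink path through a given vertex whose largest edge cost is as small as possible. A hedged sketch follows: it uses a Dijkstra-style search in which a path's cost is its maximum edge cost, assumes nonnegative costs, and joins the best source-to-vertex and vertex-to-sink halves. It does not enforce the slide's extra condition on every subpath, and the graph encoding and function names are illustrative rather than taken from ViSpA.

```python
import heapq


def minimax_path(graph, start, goal):
    """Path from start to goal minimizing the maximum edge cost.

    graph: dict vertex -> list of (neighbor, cost), with costs >= 0.
    Returns (bottleneck, path) or None if goal is unreachable.
    """
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        b, u = heapq.heappop(heap)
        if b > best[u]:
            continue                      # stale heap entry
        if u == goal:
            break
        for v, cost in graph.get(u, []):
            nb = max(b, cost)             # bottleneck if the path is extended to v
            if nb < best.get(v, float("inf")):
                best[v] = nb
                parent[v] = u
                heapq.heappush(heap, (nb, v))
    if goal not in best:
        return None
    path, v = [], goal
    while v is not None:
        path.append(v)
        v = parent[v]
    return best[goal], path[::-1]


def minimax_path_through(graph, source, sink, vertex):
    """Best source-sink path forced through `vertex`, joined from two halves."""
    head = minimax_path(graph, source, vertex)
    tail = minimax_path(graph, vertex, sink)
    if head is None or tail is None:
        return None
    return max(head[0], tail[0]), head[1] + tail[1][1:]


# Toy read graph; vertices a-d stand for superreads
g = {
    "source": [("a", 0.2), ("b", 0.5)],
    "a": [("c", 0.9), ("d", 0.3)],
    "b": [("c", 0.4)],
    "c": [("sink", 0.1)],
    "d": [("sink", 0.3)],
}
print(minimax_path(g, "source", "sink"))               # (0.3, ['source', 'a', 'd', 'sink'])
print(minimax_path_through(g, "source", "sink", "c"))  # (0.5, ['source', 'b', 'c', 'sink'])
```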

Expectation Maximization

Bipartite graph:

Q_q is a candidate with frequency f_q

R_r is a read with observed frequency o_r

Weight h_{q,r} = probability that read r is produced by qsps q with j mismatches

E step: [formula not preserved in the transcript]

M step: [formula not preserved in the transcript]

Slide56
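Since the E-step and M-step formulas of the Expectation Maximization slide (Slide55) are missing from the transcript, the sketch below illustrates one standard way to estimate the candidate frequencies f_q from the weights h_{q,r} and observed read frequencies o_r. Two things are assumptions and not taken from the slides: the binomial form of the weight, with a per-base error rate `eps`, and the exact update rules. All function names are illustrative.

```python
from math import comb


def mismatch_weight(read_len, j, eps):
    """ASSUMED binomial form for h_{q,r}: j mismatches among read_len bases."""
    return comb(read_len, j) * (1.0 - eps) ** (read_len - j) * eps ** j


def qsps_em(h, o, n_iter=100):
    """h[q][r]: weight of read r under candidate q; o[r]: observed frequency.

    Returns the estimated candidate frequencies f[q].
    """
    n_q, n_r = len(h), len(o)
    f = [1.0 / n_q] * n_q
    for _ in range(n_iter):
        # E step: p[q][r] = probability that read r was produced by candidate q
        p = [[0.0] * n_r for _ in range(n_q)]
        for r in range(n_r):
            z = sum(f[q] * h[q][r] for q in range(n_q))
            if z > 0.0:
                for q in range(n_q):
                    p[q][r] = f[q] * h[q][r] / z
        # M step: re-estimate frequencies from the expected read assignments
        mass = [sum(p[q][r] * o[r] for r in range(n_r)) for q in range(n_q)]
        total = sum(mass)
        f = [m / total for m in mass]
    return f


# Two candidates, three distinct reads: read 0 fits only candidate 0,
# read 1 fits only candidate 1, read 2 fits both equally well.
h = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
o = [30, 10, 40]
est = qsps_em(h, o)
print([round(x, 3) for x in est])   # [0.75, 0.25]
```

In the example, the ambiguous reads are split 3:1, in proportion to the unambiguous evidence for each candidate.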

HCV Qsps (P. Balfe)

30927 reads from 5.2Kb-long region of HCV-1a genomes

intravenous drug user being infected for less than 3 months => mutation rate is in [1.75%, 8%]

27764 reads

average length=292bp

Indels: ~77% of reads

Insertions length: 1 (86%), 3 (9.8%)

Deletions length: 1 (98%)

N: ~7% of reads

Slide57

HCV Data Statistics

Slide58

NJ Tree for 12 Most Frequent Qsps (No Insertions)

The top sequence: 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.

In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.

The reconstructed sequence with the highest frequency is 99% identical to one of the ORFs obtained by cloning the quasispecies.

Slide59

Conclusions & Future Work

Implementations of these methods are freely available at http://dna.engr.uconn.edu/software/

Ongoing work

Monitoring immune responses by TCR sequencing

Isoform discovery

Computational deconvolution of heterogeneous samples

Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads

Slide60

Acknowledgments

Immunogenomics

Jorge Duitama (KU Leuven)

Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)

Matt Alessandri and Kelly Gonzalez (Ambry Genetics)

IsoEM

Marius Nicolae (UConn)

Alex Zelikovsky, Serghei Mangul (GSU)

ViSpA

Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)

Peter Balfe (Birmingham University, UK)

Funding

NSF awards IIS-0546457, IIS-0916948, and DBI-0543365

UConn Research Foundation UCIG grant