Ion Mandoiu, Computer Science and Engineering Department, University of Connecticut
Slide1
Next-Generation Sequencing: Challenges and Opportunities
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Slide2
Outline
Background on high-throughput sequencing
Identification of tumor-specific epitopes
Estimation of gene and isoform expression levels
Viral quasispecies reconstruction
Future work
Slide3
http://www.economist.com/node/16349358
Advances in High-Throughput Sequencing (HTS)
Roche/454 FLX Titanium
400-600 million bases/run
400bp avg. length
Illumina HiSeq 2000
Up to 6 billion PE reads/run
35-100bp read length
SOLiD 4
1.4-2.4 billion PE reads/run
35-50bp read length
Slide4
Illumina Workflow – Library Preparation
Genomic DNA
mRNA
Slide5
Illumina Workflow – Cluster Generation
Slide6
Illumina Workflow – Sequencing by Synthesis
Slide7
Cost of Whole Genome Sequencing
C. Venter: Sanger@7.5x
J. Watson: 454@7.4x
NA18507: Illumina@36x, SOLiD@12x
Slide8
HTS applications
HTS is a transformative technology
Numerous applications besides de novo genome sequencing:
RNA-Seq
Non-coding RNAs
ChIP-Seq
Epigenetics
Structural variation
Metagenomics
Paleogenomics
…
Slide9
Outline
Background on high-throughput sequencing
Identification of tumor-specific epitopes
Estimation of gene and isoform expression levels
Viral quasispecies reconstruction
Future work
Slide10
Genomics-Guided Cancer Immunotherapy
Tumor mRNA Sequencing: ACCATCTGTGTGGGCTACCATG … AGGCAAGCTCATGGCCAAATCATGAGA
Tumor-Specific Epitopes: SYFPEITHI, ISETDLSLL, CALRRNESL, …
Peptide Synthesis
Immune System Stimulation
Tumor Remission
Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html
Slide11
Bioinformatics Pipeline
Tumor mRNA reads
CCDS Mapping -> CCDS mapped reads
Genome Mapping -> Genome mapped reads
Read Merging -> Mapped reads
SNV Detection -> Tumor-specific SNVs
Haplotyping -> Close SNV haplotypes
Epitope Prediction -> Tumor-specific epitopes
Primer Design -> Primers for Sanger sequencing
Slide12
Bioinformatics Pipeline
Tumor mRNA reads
CCDS Mapping -> CCDS mapped reads
Genome Mapping -> Genome mapped reads
Read Merging -> Mapped reads
SNV Detection -> Tumor-specific SNVs
Haplotyping -> Close SNV haplotypes
Epitope Prediction -> Tumor-specific epitopes
Primer Design -> Primers for Sanger sequencing
Slide13
Mapping mRNA Reads
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
Slide14
Read Merging
Genome       CCDS         Agree?  Hard Merge  Soft Merge
Unique       Unique       Yes     Keep        Keep
Unique       Unique       No      Throw       Throw
Unique       Multiple     No      Throw       Keep
Unique       Not mapped   No      Keep        Keep
Multiple     Unique       No      Throw       Keep
Multiple     Multiple     No      Throw       Throw
Multiple     Not mapped   No      Throw       Throw
Not mapped   Unique       No      Keep        Keep
Not mapped   Multiple     No      Throw       Throw
Not mapped   Not mapped   Yes     Throw       Throw
Slide15
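The hard and soft merge rules from the Read Merging slide can be sketched as a plain lookup table. This is only an illustration of the table: the names `RULES` and `merge_decision` and the string labels are hypothetical and do not come from the authors' pipeline.

```python
# Hypothetical encoding of the Read Merging table.
# Key: (genome status, CCDS status, do the two alignments agree?)
# Value: (hard-merge decision, soft-merge decision)
RULES = {
    ("unique", "unique", True): ("keep", "keep"),
    ("unique", "unique", False): ("throw", "throw"),
    ("unique", "multiple", False): ("throw", "keep"),
    ("unique", "not mapped", False): ("keep", "keep"),
    ("multiple", "unique", False): ("throw", "keep"),
    ("multiple", "multiple", False): ("throw", "throw"),
    ("multiple", "not mapped", False): ("throw", "throw"),
    ("not mapped", "unique", False): ("keep", "keep"),
    ("not mapped", "multiple", False): ("throw", "throw"),
    ("not mapped", "not mapped", True): ("throw", "throw"),
}


def merge_decision(genome, ccds, agree, mode="hard"):
    """Return 'keep' or 'throw' for one read under the chosen merge mode."""
    hard, soft = RULES[(genome, ccds, agree)]
    return hard if mode == "hard" else soft
```

The two modes differ only when exactly one of the two alignments is unique and the other is ambiguous: the soft merge trusts the unique alignment, the hard merge discards the read.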
SNV Detection and Genotyping
Reference:
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
Mapped reads (not shown at their alignment offsets):
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
Locus i
R_i: reads covering locus i
r(i): Base call of read r at locus i
ε_r(i): Probability of error reading base call r(i)
G_i: Genotype at locus i
Slide16
SNV Detection and Genotyping
Use Bayes' rule to calculate posterior probabilities and pick the genotype with the largest one
Slide17
SNV Detection and Genotyping
Calculate conditional probabilities by multiplying contributions of individual reads
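The two SNV Detection and Genotyping steps (Bayes' rule, then a product of per-read contributions) can be sketched as follows. The slides do not give the formulas, so the read model here is an assumption: a read samples either allele of the diploid genotype with probability 1/2, and a miscall is spread evenly over the three other bases. The names `base_likelihood`, `genotype_posteriors`, and `call_genotype` are hypothetical, and error probabilities are assumed to lie strictly between 0 and 1.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"


def base_likelihood(call, err, allele):
    """P(observed call | true allele), assuming errors are uniform over the other 3 bases."""
    return 1.0 - err if call == allele else err / 3.0


def genotype_posteriors(calls, errs, prior=None):
    """Posterior over the 10 unordered diploid genotypes at one locus.

    calls: base calls r(i) of the reads covering the locus
    errs:  matching error probabilities, each assumed to be in (0, 1)
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if prior is None:
        prior = {g: 1.0 / len(genotypes) for g in genotypes}  # flat prior (assumption)
    log_post = {}
    for g in genotypes:
        total = log(prior[g])
        for call, err in zip(calls, errs):
            # Each read contributes one factor; work in log space to avoid underflow.
            p = 0.5 * base_likelihood(call, err, g[0]) + 0.5 * base_likelihood(call, err, g[1])
            total += log(p)
        log_post[g] = total
    top = max(log_post.values())
    norm = sum(exp(v - top) for v in log_post.values())
    return {g: exp(v - top) / norm for g, v in log_post.items()}


def call_genotype(calls, errs):
    """Pick the genotype with the largest posterior probability."""
    post = genotype_posteriors(calls, errs)
    return max(post, key=post.get)
```

For example, eight reads split evenly between A and G with a 1% error rate favor the heterozygous genotype, while eight A calls favor the homozygous one.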
Slide18
Data Filtering
Slide19
Accuracy per RPKM bins
Slide20
Bioinformatics Pipeline
Tumor mRNA reads
CCDS Mapping -> CCDS mapped reads
Genome Mapping -> Genome mapped reads
Read Merging -> Mapped reads
SNV Detection -> Tumor-specific SNVs
Haplotyping -> Close SNV haplotypes
Epitope Prediction -> Tumor-specific epitopes
Primer Design -> Primers for Sanger sequencing
Slide21
Haplotyping
Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.
ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA
Heterozygous variants
Slide22
Haplotyping
Locus  Event      Alleles
1      SNV        C,T
2      Deletion   C,-
3      SNV        A,G
4      Insertion  -,GC

Locus  Event      Alleles Hap 1  Alleles Hap 2
1      SNV        T              C
2      Deletion   C              -
3      SNV        A              G
4      Insertion  -              GC
Slide23
RefHap Algorithm
Reduce the problem to Max-Cut.
Solve Max-Cut
Build haplotypes according to the cut
Fragment matrix:
Locus  1  2  3  4  5
f1     -  0  1  1  0
f2     1  1  0  -  1
f3     1  -  -  0  -
f4     -  0  0  -  1
Graph on fragments 1, 2, 3, 4 with edge weights 3, 1, 1, -1, -1
h1 00110
h2 11001
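The three RefHap steps can be illustrated on the fragment matrix above. This is a toy sketch, not the RefHap implementation: the edge weight (mismatches minus matches over shared loci) is an assumption chosen because it reproduces the weights 3, 1, 1, -1, -1 shown in the graph, and Max-Cut is solved by exhaustive search, which is only feasible for a handful of fragments. All function names are hypothetical.

```python
from itertools import combinations


def edge_weight(f, g):
    """Mismatches minus matches over loci covered by both fragments (assumed scoring)."""
    w = 0
    for a, b in zip(f, g):
        if a != "-" and b != "-":
            w += 1 if a != b else -1
    return w


def max_cut(frags):
    """Exhaustive Max-Cut; fragment 0 is pinned to side 0 to break the symmetry."""
    n = len(frags)
    weights = {(i, j): edge_weight(frags[i], frags[j])
               for i, j in combinations(range(n), 2)}
    best_side, best_val = None, None
    for mask in range(2 ** (n - 1)):
        side = [0] + [(mask >> k) & 1 for k in range(n - 1)]
        val = sum(w for (i, j), w in weights.items() if side[i] != side[j])
        if best_val is None or val > best_val:
            best_side, best_val = side, val
    return best_side, best_val


def build_haplotypes(frags, side):
    """Majority vote per locus; fragments on side 1 vote for the complementary allele."""
    h1 = []
    for locus in range(len(frags[0])):
        votes = {"0": 0, "1": 0}
        for f, s in zip(frags, side):
            a = f[locus]
            if a == "-":
                continue
            if s == 1:
                a = "1" if a == "0" else "0"
            votes[a] += 1
        h1.append("0" if votes["0"] >= votes["1"] else "1")
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2


frags = ["-0110", "110-1", "1--0-", "-00-1"]  # f1..f4 from the slide
side, cut_value = max_cut(frags)
haplotypes = build_haplotypes(frags, side)
```

On this input the best cut separates f1 from f2, f3, and f4, and the majority vote recovers the haplotypes h1 and h2 listed on the slide.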
Slide24
Bioinformatics Pipeline
Tumor mRNA reads
CCDS Mapping -> CCDS mapped reads
Genome Mapping -> Genome mapped reads
Read Merging -> Mapped reads
SNV Detection -> Tumor-specific SNVs
Haplotyping -> Close SNV haplotypes
Epitope Prediction -> Tumor-specific epitopes
Primer Design -> Primers for Sanger sequencing
Slide25
Immunology Background
J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003
Slide26
Epitope Prediction
C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004
Slide27
Results on Tumor Data
Mouse strain      BALB/C    BALB/C    B10.D2 TRAMP  B10.D2 TRAMP  B10.D2 TRAMP  B10.D2 TRAMP
Tumor             Meth-A    CMS5      prostate1     prostate2     prostate3     prostate4
#lanes            1         3         4             3             3             3
HQ Het SNPs       465       77        86            17            292           193
Dd Weak           119       17        14            12            63            70
Dd Strong         20        2         2             0             7             12
Kd Weak           111       21        10            0             19            54
Kd Strong         3         1         1             0             1             3
Ld Weak           99        12        25            4             47            75
Ld Strong         8         0         0             0             2             9
Total Weak        329       50        49            16            129           199
Total Strong      31        3         3             0             10            24
Slide28
Experimental Validation
Mutations reported by [Noguchi et al 94] were found by the pipeline
Sanger sequencing confirmed 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5
Immunogenic potential is under experimental validation in the Srivastava lab at UCHC
Slide29
Outline
Background on high-throughput sequencing
Identification of tumor-specific epitopes
Estimation of gene and isoform expression levels
Viral quasispecies reconstruction
Future work
Slide30
RNA-Seq
[Figure: exons A, B, C, D, E]
Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
Gene Expression (GE)
Isoform Discovery (ID)
Isoform Expression (IE)
Slide31
Alternative Splicing
[Griffith and Marra 07]
Slide32
Challenges to Accurate Estimation of Gene Expression Levels
Read ambiguity (multireads)
What is the gene length?
[Figure: exons A, B, C, D, E]
Slide33
Previous approaches to GE
Ignore multireads
[Mortazavi et al. 08] Fractionally allocate multireads based on unique read estimates
[Pasaniuc et al. 10] EM algorithm for solving ambiguities
Gene length: sum of lengths of exons that appear in at least one isoform
Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
Slide34
Read Ambiguity in IE
[Figure: exons A, B, C, D, E and an isoform with exons A, C]
Slide35
Previous approaches to IE
[Jiang&Wong 09] Poisson model + importance sampling, single reads
[Richard et al. 10] EM Algorithm based on Poisson model, single reads in exons
[Li et al. 10] EM Algorithm, single reads
[Feng et al. 10] Convex quadratic program, pairs used only for ID
[Trapnell et al. 10] Extends Jiang’s model to paired reads
Fragment length distribution
Slide36
Our contribution
Unified probabilistic model and Expectation-Maximization Algorithm for IE considering
Single and/or paired reads
Fragment length distribution
Strand information
Base quality scores
Slide37
Read-Isoform Compatibility
Slide38
Fragment length distribution
Paired reads
[Figure: isoforms A B C and A C, with fragment lengths i and j labeled F_a(i) and F_a(j)]
Slide39
Fragment length distribution
Single reads
[Figure: isoforms A B C and A C, with lengths i and j labeled F_a(i) and F_a(j)]
Slide40
IsoEM algorithm
E-step
M-step
Slide41
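The E-step and M-step formulas of the IsoEM slide did not survive extraction, so the following is a generic expectation-maximization sketch for allocating ambiguous reads to isoforms, not IsoEM's exact update. It assumes each read comes with a compatibility weight per isoform (in IsoEM these would combine fragment length, strand, and base quality information) and that every read is compatible with at least one isoform. The names `em_read_fractions` and `isoform_frequencies` are hypothetical.

```python
def em_read_fractions(compat, n_isoforms, iters=200):
    """Estimate the fraction of reads originating from each isoform.

    compat: one dict per read, mapping isoform index -> compatibility weight.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(iters):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in proportion
        # to (current read fraction) x (compatibility weight).
        for weights in compat:
            norm = sum(theta[j] * w for j, w in weights.items())
            if norm == 0.0:
                continue
            for j, w in weights.items():
                counts[j] += theta[j] * w / norm
        total = sum(counts)
        if total == 0.0:
            break
        # M-step: the new read fractions are the normalized expected counts.
        theta = [c / total for c in counts]
    return theta


def isoform_frequencies(theta, lengths):
    """Convert read fractions to isoform frequencies by normalizing for length."""
    per_base = [t / length for t, length in zip(theta, lengths)]
    total = sum(per_base)
    return [p / total for p in per_base]


# Toy input: 3 reads unique to isoform 0, 1 unique to isoform 1, 4 ambiguous.
compat = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 1.0, 1: 1.0}] * 4
theta = em_read_fractions(compat, 2)
freqs = isoform_frequencies(theta, [1000, 2000])
```

In this toy case the ambiguous reads end up split 3:1, following the ratio of uniquely assigned reads, and the length normalization then shifts the frequency estimate further toward the shorter isoform.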
Error Fraction Curves - Isoforms
30M single reads of length 25 (simulated)
Slide42
Error Fraction Curves - Genes
30M single reads of length 25 (simulated)
Slide43
Validation on MAQC Samples
Slide44
Outline
Background on high-throughput sequencing
Identification of tumor-specific epitopes
Estimation of gene and isoform expression levels
Viral quasispecies reconstruction
Future work
Slide45
Viral Quasispecies
RNA viruses (HIV, HCV)
Many replication mistakes
Quasispecies (qsps) = co-existing closely related variants
Variants differ in
virulence
ability to escape the immune system
resistance to antiviral therapies
tissue tropism
How do qsps contribute to viral persistence and evolution?
Slide46
454 Pyrosequencing
Pyrosequencing = Sequencing by Synthesis.
GS FLX Titanium:
Fragments (reads): 300-800 bp
Sequence of the reads
System software assembles reads into a single genome
We need software that assembles reads into multiple genomes!
Slide47
Quasispecies Spectrum Reconstruction (QSR) Problem
Given
pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct
the quasispecies spectrum
sequences
frequencies
Slide48
ViSpA
Viral Spectrum Assembler
Slide49
454 Sequencing Errors
Error rate ~0.1%.
Fixed number of incorporated bases vs. light intensity value.
Incorrect resolution of homopolymers =>
over-calls (insertions): 65-75% of errors
under-calls (deletions): 20-30% of errors
Slide50
Preprocessing of Aligned Reads
Deletions in reads: D
Replace a deletion confirmed by only a single read with either the allele value present in all other reads or N.
Insertions into reference: I
Remove insertions confirmed by only a single read.
Imputation of missing values N
Slide51
Read Graph: Vertices
Subread = completely contained in some read with ≤ n mismatches.
Superread = not a subread => the vertex in the read graph.
ACTGGTCCCTCCTGAGTGT
GGTCCCTCCT
TGGTCACTCGTGAG
[Figure: further example reads with highlighted mismatches]
Slide52
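The subread and superread definitions from the Read Graph: Vertices slide can be sketched as below. The sketch assumes reads are already aligned and given as (start position, sequence) pairs; the start positions used for the three example reads are an assumption, since the slide text does not record them. Function names are hypothetical, and the quadratic all-pairs scan is for illustration only.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_subread(read, start, other, other_start, n):
    """True if `read` lies entirely within `other` and differs in at most n positions."""
    offset = start - other_start
    if offset < 0 or offset + len(read) > len(other):
        return False
    return mismatches(read, other[offset:offset + len(read)]) <= n


def superreads(aligned, n):
    """Return the reads that are not subreads of any other read.

    aligned: list of (start, sequence) pairs on a common alignment.
    """
    kept = []
    for i, (s, r) in enumerate(aligned):
        contained = False
        for k, (t, o) in enumerate(aligned):
            if i == k or not is_subread(r, s, o, t, n):
                continue
            # Reads with identical extent contain each other; keep the earliest one.
            if (s, len(r)) == (t, len(o)) and i < k:
                continue
            contained = True
            break
        if not contained:
            kept.append((s, r))
    return kept


# Example reads from the slide, with assumed start positions.
reads = [
    (0, "ACTGGTCCCTCCTGAGTGT"),
    (3, "GGTCCCTCCT"),
    (2, "TGGTCACTCGTGAG"),
]
```

With n = 2 the third read, which differs from the first in two positions, is absorbed as a subread; with n = 1 it remains a superread and becomes its own vertex.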
Read Graph: Edges
Edge b/w two vertices exists if
there is an overlap between superreads
they agree on their overlap with ≤ m mismatches.
Auxiliary vertices: source and sink
Slide53
Read Graph: Edge Cost
The most probable source-sink path through each vertex
Cost: uncertainty that two superreads are from the same qsps.
Overhang Δ is the shift in start positions of two overlapping superreads.
Slide54
Contig Assembling
Max Bandwidth Path through vertex
path minimizing maximum edge cost for the path and each subpath
Consensus of path’s superreads
Each position: >70%-majority or N
Weighted consensus obtained on all reads
Remove duplicates
Duplicated sequences = statistical evidence
read r of length l
qsps s of length L
k is #mismatches, t/L is a mutation rate
Slide55
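The path criterion on the Contig Assembling slide, minimizing the maximum edge cost, can be computed with a Dijkstra-style search in which a path's value is its largest edge cost rather than the sum. The sketch below covers only the source-to-sink case and does not enforce the slide's additional requirement on every subpath; a path through a given vertex could be assembled from a source-to-vertex and a vertex-to-sink search. The function name and graph encoding are hypothetical.

```python
import heapq


def min_bottleneck_path(graph, source, sink):
    """Return (bottleneck, path) minimizing the maximum edge cost, or None if unreachable.

    graph: dict mapping a vertex to a list of (neighbor, cost) pairs.
    """
    best = {source: float("-inf")}
    prev = {}
    heap = [(float("-inf"), source)]
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if bottleneck > best.get(u, float("inf")):
            continue  # stale heap entry
        if u == sink:
            break
        for v, cost in graph.get(u, []):
            candidate = max(bottleneck, cost)
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    if sink not in best:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[sink], path[::-1]


# Toy read graph with hypothetical edge costs.
toy_graph = {
    "s": [("a", 1), ("b", 5)],
    "a": [("t", 4), ("b", 2)],
    "b": [("t", 1)],
}
result = min_bottleneck_path(toy_graph, "s", "t")
```

In the toy graph the direct routes have bottlenecks 4 and 5, while the detour through both intermediate vertices never uses an edge costlier than 2.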
Expectation Maximization
Bipartite graph:
Q_q is a candidate with frequency f_q
R_r is a read with observed frequency o_r
Weight h_q,r = probability that read r is produced by qsps q with j mismatches
E step:
M step:
Slide56
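The E-step and M-step formulas of the Expectation Maximization slide were lost in extraction. One standard form consistent with the definitions of f_q, o_r, and h_q,r is sketched below. The binomial form of h_q,r, the per-base error rate \varepsilon, and the intermediate quantity p_{q,r} are assumptions introduced for illustration and may differ from the formulas on the original slide.

```latex
\begin{align*}
% Assumed read model: each of the l bases of read r is misread independently
% with probability \varepsilon, and read r differs from candidate q in j positions.
h_{q,r} &= \binom{l}{j}\,(1-\varepsilon)^{\,l-j}\,\varepsilon^{\,j} \\
% E step: expected share of read r attributed to candidate q
p_{q,r} &= \frac{f_q\, h_{q,r}}{\sum_{q'} f_{q'}\, h_{q',r}} \\
% M step: re-estimate candidate frequencies from the observed read frequencies
f_q &= \frac{\sum_{r} o_r\, p_{q,r}}{\sum_{r} o_r}
\end{align*}
```

Iterating the two steps until the f_q stop changing yields frequency estimates for the candidate quasispecies sequences.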
HCV Qsps (P. Balfe)
30927 reads from 5.2Kb-long region of HCV-1a genomes
intravenous drug user infected for less than 3 months => mutation rate is in [1.75%, 8%]
27764 reads
average length=292bp
Indels: ~77% of reads
Insertions length: 1 (86%), 3 (9.8%)
Deletions length: 1 (98%)
N: ~7% of reads
Slide57
HCV Data Statistics
Slide58
NJ Tree for 12 Most Frequent Qsps (No Insertions)
The top sequence: 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.
In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.
The reconstructed sequence with the highest frequency is 99% identical to one of the ORFs obtained by cloning the quasispecies.
Slide59
Conclusions & Future Work
Implementations of these methods are freely available at http://dna.engr.uconn.edu/software/
Ongoing work
Monitoring immune responses by TCR sequencing
Isoform discovery
Computational deconvolution of heterogeneous samples
Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads
Slide60
Acknowledgments
Immunogenomics
Jorge Duitama (KU Leuven)
Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)
Matt Alessandri and Kelly Gonzalez (Ambry Genetics)
IsoEM
Marius Nicolae (UConn)
Alex Zelikovsky, Serghei Mangul (GSU)
ViSpA
Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)
Peter Balfe (Birmingham University, UK)
Funding
NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
UCONN Research Foundation UCIG grant