Slide1
Steven Skiena
Dept. of Computer Science
Stony Brook University
http://www.cs.sunysb.edu/~skiena
Optimizing the Design of Coding Sequences
Slide2
Sequencing vs. Synthesis
DNA sequencing is the technology that reads DNA molecules, identifying the defining string over {A,C,G,T}.
DNA synthesis is the technology that constructs ("writes") DNA molecules to specification, given by a defining string over {A,C,G,T}.
The freedom unleashed by large-scale synthesis opens up many exciting new directions for research in molecular biology.
Imagination is needed to properly exploit these technologies.
Slide3
DNA Synthesis Technologies
Short oligos (50-100 bases) are readily synthesized.
Long molecules can be constructed by hybridizing short oligos, but it takes work.
The cost of large-scale synthesis is dropping rapidly, and is now about $0.20/base for kilobase sequences, or a couple thousand dollars for a virus.
Agilent's array-based synthesis technologies produce mixtures of 55,000 custom-designed 200-base oligos: several megabases of synthesis!
Cheaper, faster, better…
Slide4
Synthetic Biology
New synthesis technologies facilitate the engineering of novel biological structures and functions.
Recent successes include eliminating all occurrences of the TAG stop codon in E. coli (Isaacs et al. 2011) and refactoring a yeast chromosome (Dymond et al. 2011).
But large-scale synthesis promises to revolutionize how natural organisms are studied as well: "what happens if we change this?"
Slide5
Why Me?
Slide6
Gene Design and Computer Science
Designing genomic sequences under constraints is an algorithmic optimization problem.
Issues arising in our sequence designs include: bipartite matching, Hamming distance, simulated annealing, Gray codes, combinatorial group testing, de Bruijn sequences, etc.
My interest in genome design optimization arose well before genome synthesis became a reality.
Slide7
What can we/you do with Sequence Design and Synthesis?
Attenuate viruses to serve as vaccines (SAVE).
Design genes to increase or decrease gene expression.
Efficiently search for signals in coding sequences.
Design synonymous but homologous genes which avoid recombination.
Refactor viral genomes to make them easier to work with.
Design large sets of interesting oligos to exploit massive multiplex synthesis technologies.
Slide8
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work
Slide9
DNA to RNA to Protein
DNA sequences act as templates for building proteins according to the triplet code.
Slide10
The Triplet Code
Slide11
Which Encoding is Best?
There are roughly 3^n possible gene sequences coding for any n-amino acid protein, e.g. 10^75 encodings of a 147-residue hemoglobin protein.
By comparison, there are only 10^79 atoms in the known universe.
Why did nature select one particular encoding?
Alternately, can we exploit this redundancy to design the 'best' coding sequence?
We seek to design the DNA/RNA coding sequence for a particular protein/amino acid sequence which optimizes some particular objective.
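The size of the design space can be counted directly for any given protein. The sketch below multiplies the number of synonymous codons per residue under the standard genetic code; the function names are illustrative and not from the talk, and the 3^n figure above is best read as a rule of thumb, since per-residue degeneracy ranges from 1 to 6.

```python
import math

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def count_encodings(protein):
    """Exact number of coding sequences for `protein` (stop codon ignored)."""
    count = 1
    for aa in protein:
        count *= DEGENERACY[aa]
    return count


def log10_encodings(protein):
    """Base-10 logarithm of the count, convenient for long proteins."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)


# Met and Trp have one codon each; Leu and Ser have six each.
print(count_encodings("MLSW"))
```

For a full-length protein, `log10_encodings` gives the exponent directly without building a very large integer.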
Slide12
What Drives the Evolution of Coding Sequences?
Sequences exhibit organism-specific codon bias.
Coding helps regulate gene expression with common/scarce codons.
RNA secondary structure affects stability.
Many signals can be embedded in the coding regions of genes.
Sequence design to optimize gene expression is still very much an open problem. See Plotkin and Kudla's review in Nature Reviews Genetics, January 2011.
Slide13
Design Criteria for Artificial Genes
Matching a given codon/codon-pair distribution
Eliminating or inserting specific patterns (S. '01)
Optimizing secondary structure (Cohen and S. '02)
Encoding additional gene sequences in alternate reading frames (WPMS '06)
Slide14
Incorporating/Excluding Sequence Patterns
Many biological features are encoded as substring patterns: restriction sites, miRNA targets, stop codons, etc.
Differing objectives mandate either including or excluding specific patterns.
For example, the restriction enzyme EcoRI cuts DNA at the pattern GAATTC.
Slide15
Sequence Optimization Algorithms (S. '01)
Dynamic programming can be used to include/exclude many short patterns efficiently, in O(n p 4^k) time.
Since the longest known cutter is only 16 bases, this is a tractable computation.
The problem is NP-complete for long patterns with wildcards, but heuristics work.
Our algorithms can remove 90% of the restriction sites of all known enzymes.
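A minimal sketch of pattern exclusion by dynamic programming, in the spirit of the slide above but not taken from the cited paper. It assumes the standard genetic code, literal patterns without wildcards, and a non-empty pattern list; the DP state is the last k-1 bases of the partial design, and the names `design_without` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def design_without(protein, forbidden):
    """Return one coding sequence for `protein` containing no forbidden
    pattern, or None if no such sequence exists."""
    keep = max(max(len(p) for p in forbidden) - 1, 1)
    # Map: suffix of the partial design -> one full partial design.
    # Future feasibility depends only on the suffix, so one witness suffices.
    states = {"": ""}
    for aa in protein:
        next_states = {}
        for suffix, seq in states.items():
            for codon, coded in CODON_TABLE.items():
                if coded != aa:
                    continue
                tail = suffix + codon
                # Any new occurrence must end inside the codon just added.
                if any(p in tail for p in forbidden):
                    continue
                next_states.setdefault(tail[-keep:], seq + codon)
        states = next_states
        if not states:
            return None
    return next(iter(states.values()))


design = design_without("MEFEFW", ["GAATTC"])
print(design, translate(design))
```

EcoRI's site is its own reverse complement, so one pattern covers both strands here; for non-palindromic sites the reverse complement would need to be added to `forbidden` as well.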
Slide16
Results by Cutter Length
Slide17
Optimizing Secondary Structure
Nucleotides bind to complementary bases (A-T/U, G-C) so as to minimize their energy.
Secondary structures affect molecular interactions and stability.
Our algorithms design genes with prescribed secondary structure while coding for a given protein.
Slide18
The Zuker-Turner RNA Model
Dynamic programming optimizes binding energy over different substructures.
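The energy model named on this slide depends on measured thermodynamic parameters that the slides do not list. As a stand-in, the sketch below uses the simpler Nussinov recurrence, which maximizes the number of base pairs rather than minimizing free energy. It shows the same O(n^3) interval structure of the dynamic program; the function name and the minimum loop length of 3 are assumptions for illustration.

```python
# Watson-Crick pairs plus the G-U wobble pair, for RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Nussinov dynamic program: maximum number of nested base pairs,
    requiring at least `min_loop` unpaired bases inside any hairpin."""
    n = len(rna)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])       # i or j unpaired
            if (rna[i], rna[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

A realistic design tool would replace the pair count with stacking and loop energies, which changes the scores but not the shape of the recurrence.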
Slide19
Designing Secondary Structure (Cohen and S. '02)
We can adapt the Zuker-Turner recurrence relations to design a coding sequence maximizing secondary structure in O(n^3).
Minimizing secondary structure in the model is NP-complete, but heuristics exist.
These algorithms do not preserve the codon distribution of the input sequence.
Slide20
Maximizing RNA Secondary Structure
The optimized encoding requires twice as much energy to unfold.
Slide21
How Much Freedom does Nature Have with Respect to Secondary Structure?
Slide22
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work
Slide23
Chemical Synthesis of Poliovirus
Cello et al. synthesized poliovirus cDNA de novo without a natural template.
This groundbreaking study made international headlines in July 2002 and opened the field of synthetic virology.
Cello et al. Science. 2002 Aug 9;297(5583):1016-8.
Molla et al. Science. 1991;254(5038):1647-51.
Slide24
Cello, Paul & Wimmer, 2002
Molla, Paul & Wimmer, 1991
Reverse genetics of poliovirus.
Slide25
How might we rapidly create vaccines for new pathogens?
Slide26
Difficulties in Vaccine Design
Small numbers of attenuating mutations, each having a large effect, can easily revert to virulence.
RNA viruses (e.g. Polio, HIV, Ebola, SARS, Dengue, Hanta, Influenza) have a high mutation rate (1/10,000 bases), which confers high adaptability to changing conditions.
Attenuation via passaging is costly and time consuming.
The poliovirus vaccine strain Sabin1 was derived by 52 rounds of monkey infections and 16 rounds of monkey kidney cell culture passages, requiring several years of work at prohibitive cost:
100,000 monkeys @ $10,000 = $1 Billion
Slide27
Synthetic Attenuated Virus Engineering (SAVE)
Motivation: viral diseases like SARS, 1918 influenza; bioterrorism
Input: the genome sequence of a virus
Output: a synthetic, attenuated variant of the virus designed to generate immune response and serve as a vaccine.
We seek a cumulative phenotype of many mutations, each with a small effect: difficult to revert, genetically stable, "death by a thousand cuts".
Slide28
Codon-Pair Bias
Certain pairs of synonymous codons for two given amino acids are found adjacent to one another more (less) frequently than should be expected.
Statistically significant codon-pair bias has been observed in human genes and other organisms.
The mechanisms behind this are still unclear, but we can use it to design attenuated viruses.
We measure bias with the codon pair score (CPS).
Slide29
Some Codon Pairs Are Selected Over The Others
[Figure: Leu-Arg codon pairs CUU-CGA and CUU-AGG. Values shown: [Observed]/[Expected] = 1.95, CPS = 0.67; [Observed]/[Expected] = 0.44, CPS = -0.82.]
Leu codons: UUA, UUG, CUU, CUC, CUA, CUG
Arg codons: CGU, CGC, CGA, CGG, AGA, AGG
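The two value pairs on this slide are consistent with the codon pair score being the natural logarithm of the observed-to-expected ratio. The sketch below checks that arithmetic and averages scores into a codon pair bias, matching the "average codon pair score per codon pair" axis label used later in the deck. How the expected count itself is estimated is not shown on the slides, so it is taken as an input here, and the function names are illustrative.

```python
import math


def codon_pair_score(observed, expected):
    """CPS as the natural log of observed/expected occurrences of a codon pair."""
    return math.log(observed / expected)


def codon_pair_bias(pair_scores):
    """Average codon pair score over all adjacent codon pairs in a gene."""
    return sum(pair_scores) / len(pair_scores)


# Reproduce the two values shown on the slide from their ratios.
print(round(codon_pair_score(1.95, 1.0), 2))  # 0.67
print(round(codon_pair_score(0.44, 1.0), 2))  # -0.82
```

A positive score marks an over-represented pair and a negative score an under-represented one.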
Slide30
Codon-Pair Bias Designs
Codon-pair optimization is essentially the traveling salesman problem.
We use simulated annealing to shuffle the wildtype codons.
By changing the optimization criteria, we can construct sequences with either highly favorable or highly unfavorable codon-pair scores.
[Figure: original sequence vs. CPB "altered" sequence, showing codons for residues such as P, G, and M at positions 1 through 18 being rearranged.]
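A minimal sketch of the shuffling idea, assuming a lookup table of codon pair scores is available. Swapping two codons that encode the same amino acid keeps both the protein and the codon usage fixed, so only the pairing changes. The score table below is made-up toy data, and the cooling schedule and function names are illustrative choices rather than the settings used in the actual designs.

```python
import math
import random


def total_score(codons, pair_score):
    """Sum of codon pair scores over adjacent codons (missing pairs score 0)."""
    return sum(pair_score.get((a, b), 0.0) for a, b in zip(codons, codons[1:]))


def anneal_codon_pairs(codons, aa_of, pair_score,
                       steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Maximize the total pair score by swapping synonymous codons.
    To deoptimize instead, pass a table with negated scores."""
    rng = random.Random(seed)
    groups = {}
    for i, codon in enumerate(codons):
        groups.setdefault(aa_of[codon], []).append(i)
    swappable = [g for g in groups.values() if len(g) >= 2]
    current = list(codons)
    cur = total_score(current, pair_score)
    best, best_score = list(current), cur
    if not swappable:
        return best
    for _ in range(steps):
        i, j = rng.sample(rng.choice(swappable), 2)
        current[i], current[j] = current[j], current[i]
        new = total_score(current, pair_score)
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if new >= cur or rng.random() < math.exp((new - cur) / temp):
            cur = new
            if cur > best_score:
                best, best_score = list(current), cur
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temp *= cooling
    return best


# Toy example with hypothetical scores (not measured values).
aa_of = {"CUU": "L", "CUG": "L", "CGA": "R", "AGG": "R"}
toy_scores = {("CUU", "AGG"): 1.0, ("CUG", "CGA"): 1.0,
              ("CUU", "CGA"): -1.0, ("CUG", "AGG"): -1.0}
wildtype = ["CUU", "CGA", "CUG", "AGG"]
print(anneal_codon_pairs(wildtype, aa_of, toy_scores))
```

Recomputing the full score after each swap keeps the sketch short; only the pairs adjacent to the two swapped positions actually change, so an incremental update would be the natural optimization.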
Slide31
Experimental Results
Random scrambles of synonymous codons generally have little effect on phenotype, unless a critical signal gets stepped on.
Codon-pair bias deoptimized designs have proven to be significantly attenuated for both polio and influenza virus.
These attenuated strains have proven to be safe and effective vaccines in mice, conferring immunity against the wildtype virus.
Slide32
Deoptimization of three Flu segments
[Figure: segments PB1, PB1-F2, HA, NP.]
A minimum of 120 nt at either end is excluded from recoding (replication/packaging signals).
Slide33
Codon Pair Bias of selected Influenza A/PR8/3/34 genes and their deoptimized counterparts in relationship to the human ORFeome.
[Figure: codon pair bias before and after deoptimization. Axes: codon pair bias (average codon pair score per codon pair), running from increasingly deoptimized to increasingly optimized; length of ORF (amino acids). Points: HA(wt), NP(wt), PB1(wt) and, after "deoptimization", HA^Min, PB1^Min, NP^Min.]
Mueller et al., Nature Biotechnology 28(7), July 2010.
Slide34
[Figure: growth curves. Axes: virus titer [log10 PFU]; hours post infection (0 to 50). Viruses: PR8, NP^Min, HA^Min, PB1^Min, NP/HA^Min, HA/PB1^Min, PR8^3F.]
PR8^3F combines all three codon pair deoptimized segments in one virus.
All viruses carrying one, two, or three deoptimized segments are viable.
Mueller et al., Nature Biotechnology 28(7), July 2010.
Slide35
Vaccine Experiments
Slide36
[Figure: axes: relative body weight [%]; days post infection (0 to 16), with the endpoint marked. Groups: Mock, PR8, PR8^3F.]
Experiment
• 10^4 PFU intranasal
• monitor weight daily
• avg. of five mice per group
PR8^3F causes no disease in infected mice.
Mueller et al., Nature Biotechnology 28(7), July 2010.
Slide37
[Figure: survival after primary inoculation [%] and survival after PR8-wt challenge [%], plotted against vaccine dose [log10 PFU] from 10^0 to 10^6, for PR8^3F-vaccinated and PR8(wt)-vaccinated mice, with a region labeled "Safety Margin".]
Protective Dose 50 (PD50) of codon-pair deoptimized virus PR8^3F (a)
virus       PD50 (PFU)
PR8 (wt)    1.0 x 10^0
PR8^3F      1.3 x 10^1
(a) The dose required to protect 50% of mice with a single inoculation of vaccine virus from a challenge infection with 1000 LD50 of the wt virus.
PR8^3F has a wide safety margin in mice.
One ml of PR8^3F culture supernatant = 10,000,000 mouse PD50 doses.
Mueller et al., Nature Biotechnology 28(7), July 2010.
Slide38
Pros and Cons
Columns: Inactivated; LAIV - FluMist; LAIV - SAVE
Rows: Pros; Cons
• safest option
• most expensive
• potential need of adjuvants
• no/limited cellular immunity
• well characterized backbone (done once)
• high genetic stability
• low protective dose (wide safety margin)
• applicable to any strain as a whole
• 100% antigenic match in all segments to target strain
• guaranteed "take" in the population at highest risk (those naive to the impending seasonal strain)
• attenuation in HA and NA (lower reassortment risk)
• only one backbone
• repeated use may induce heterosubtypic immunity against backbone - no "take"
• potential incompatibilities btw. backbone and HA/NA
• "seeding" of WT HA and NA in the population (higher reassortment risk)
• annual reformulation may require new regulatory considerations
• less expensive
• no adjuvants
• long lived humoral and cellular immunity
Slide39
SAVE: Other Viruses, Other Groups
Several groups have been exploring SAVE to design attenuated strains for vaccines against a variety of pathogens, including:
Dengue
VSV (related to rabies)
HIV [Martrus et al., Retrovirology 2013]
Chikungunya virus [Nougairede et al., PLOS Pathogens 2013]
Slide40
Banner Moment
Slide41
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work
Slide42
What is the Mechanism?
Although we can reliably attenuate viruses through codon pair-deoptimized designs, it is embarrassing that we have little understanding of why it works.
To provide insight, we have been studying codon pair bias and other issues related to translation in yeast.
Our experiments show that codon pair bias design modulates expression in this cellular eukaryote similarly to how it does in viruses.
Slide43
Effects of Codon-Pair Bias in Yeast: Growth on 3-AT plate
[Figure: plate with strains WT, HIS3-Scr, dHIS3-2, and his3 (replicates 1 and 2) on -His + 60mM 3-AT. Labels: "Growth inhibited"; "Similar growth as WT".]
3-AT: competitive inhibitor of the HIS3 enzyme.
Conclusion: dHIS3-2 is attenuated, but the scrambled HIS3 is not.
Slide44
Mechanisms of Codon-Pair Bias
The picture seems quite complicated. Presumably there is a mix of different reasons why particular codon-pairs are selected for/against.
One class of preferentially "in-frame" pairs presumably works through a translation mechanism.
Another class of "frame-neutral" pairs presumably works through various signaling mechanisms.
Mutation studies show that changes in the HRP1 gene can rescue deoptimized strains, although no mutations were seen in our deoptimized sequences themselves.
Slide45
Mutant    DNA sequence          Protein sequence
M3        T605G                 F202C
M4        A600G                 R200S
M6        A598G, Δ1158-1206     R200G, Δ387-402
M8        A598G                 R200G
Point mutation and deletion found in HRP1 gene in mutants (but not the original dLys2-2 strain)
Slide46
Mechanisms underlying Codon Bias
It has long been presumed that genomes exhibit particular codon preferences (bias) to optimize gene expression: more frequently used codons presumably translate faster.
This remains unproven, and even controversial. Recent work [Qian et al., PLoS Genetics, 2012] claims to disprove it.
Before we can understand the phenomenon of pair-bias, we need to better understand the mechanism underlying codon bias.
Slide47
Ribosome Footprinting
Exciting new technology for studying translation.
Instantaneously bind ribosomes to the RNA they are translating, then dissolve away unprotected RNA.
Sequencing the resulting 30-base tags and aligning to a reference genome identifies where the ribosomes were.
Slide48
Ribosome Footprint Data
Aligning the sequenced footprints to a reference genome gives higher peaks where ribosomes are more frequently observed.
The sources of the large peaks generally remain unexplained, but clearly cannot result from the (necessarily) small variance associated with differential codon translation.
Slide49
The Wrong Approach to Analysis
The obvious idea is to count the number of observed footprints containing each codon as a measure of translation rate, but:
This requires correction for the highly differential expression among different genes.
It also requires correction for any biases of footprint frequency, perhaps differential rates on 5' vs. 3' ends.
It also requires correcting for or ignoring the large bumps observed in the footprints data.
No attempt to correct for all these ill-understood issues will end up being convincing.
Slide50
The Method of Independent Trials
There are thousands of genes, each with hundreds of codons.
To simplify analysis, consider every length 2l-1 window where a particular codon c appears in a unique position in the length l sequence reads.
The relative frequency of reads with c in the active position determines whether it is a fast or slow codon.
All windows are local and of equivalent significance, so effects of positional/expression bias are minimized.
The relative frequency distributions can be fairly averaged over all windows centered around codon c.
Intuition: detecting discrimination at an airport security line.
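One way to read the window construction above is sketched below. It is an interpretation, not the published pipeline. It assumes reads are whole-codon footprints of length l, that `read_starts[s]` holds the number of reads whose first codon is at index s, and that a window is used only when the target codon occurs exactly once in it. Read positions are 0-based in the code, whereas the slides number them from 1.

```python
def rrt_profile(codons, read_starts, target, read_len=10):
    """Average, over all valid windows, of the relative frequency with which
    `target` sits at each position of a covering read (1.0 = uniform)."""
    sums = [0.0] * read_len
    windows = 0
    for i, codon in enumerate(codons):
        lo, hi = i - (read_len - 1), i + (read_len - 1)
        if codon != target or lo < 0 or hi >= len(codons):
            continue
        if codons[lo:hi + 1].count(target) != 1:
            continue  # the target must be unique within the 2l-1 window
        # A read starting at codon i - pos has the target at position pos.
        counts = [read_starts[i - pos] for pos in range(read_len)]
        total = sum(counts)
        if total == 0:
            continue
        # Normalize within the window so every window carries equal weight.
        for pos, n in enumerate(counts):
            sums[pos] += read_len * n / total
        windows += 1
    return [s / windows for s in sums] if windows else None


# Tiny example with 3-codon reads: one window centered on the single CTC.
gene = ["AAA", "CCC", "CTC", "GGG", "TTT"]
starts = [2, 1, 1, 0, 0]
print(rrt_profile(gene, starts, "CTC", read_len=3))
```

Normalizing inside each window before averaging is what removes the dependence on how strongly a given gene is expressed.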
Slide51
RRT Analysis Pipeline
Slide52
Ribosome Residency Time (RRT) Histograms
The RRT histograms show that position 6 is the active (A) site of decoding.
CTC has the highest RRT at position 6 of all Leu codons, implying the slowest decoding rate.
Proline codons also have peaks at position 5, the P-site where the peptide bond is formed, befitting their chemistry as an imino acid.
Slide53
Relative Codon Decoding Speeds
Rare codons have relatively high RRT values, indicating that decoding time is greater.
Common codons have relatively low RRT values.
Position 6 RRT strongly correlates with codon frequency (-0.7).
Codons with lower CG content tend to have lower RRT values (meaning faster translation).
Slide54
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work
Slide55
Secondary Structure Design Respecting Codon Preferences
Because C-G bonds are twice as strong as A-U bonds, our previous algorithm to optimize secondary structure will wildly change codon usage in its quest for energy.
But holding codon distribution constant between designs is an important control when studying relative expression…
Incorporating the state of how many of which codons have been used so far will increase the state space of our dynamic programming algorithm by a factor of O(n^61)…
Thus we developed a search-based algorithm based on rearranging codons in a sequence and assessing the impact on structure by calls to Mfold.
Slide56
Min/Max Structure Design Algorithm
Slide57
Min/Max Design Space for GFP (WT codon distribution)
Our resulting designs modulate structure much better than random sampling, but of course less extensively than our non-codon constrained designs can.
Slide58
Min/Max Design Space for GFP (Cohen-Skiena codon distributions)
Given the optimal max-structure distribution, we find the max structure design.
Given good min-structure distributions, we find better min structure designs.
Slide59
Experiments in Yeast
Replace WT His3 with high and low secondary structure designs.
Histidine can be supplied in the growth media, so even defective His3 strains can be grown.
Dilution studies enable assaying the relative health of strains with different His3 variants on His- media.
Experiments show that low secondary structure variants grow better than those with high secondary structure.
Slide60
Secondary Structure Level Affects Expression
Slide61
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work
Slide62
Mappings: Codon - tRNA - Amino Acid
Multiple codons correspond to the same tRNA.
Multiple tRNAs correspond to the same amino acid.
[Figure: codons (GCT, GCC, GCA, GCG, ...) map to tRNAs (tRNA1, tRNA2, ...), which map to amino acids (Ala, ...).]
Slide63
tRNA Reuse and Expression
It has been observed that codons tend to appear near each other in coding sequences more often than statistical independence would suggest.
tRNA molecules must repeatedly do two things: recharge with the appropriate amino acid, and find the right place to deposit it.
Frequent recurrence of codons in principle increases the likelihood that a recently used tRNA molecule can be used again in translating a particular message.
Slide64
Diffusion vs. Reuse
[Figure: ribosome and tRNA schematic contrasting tRNA re-use with a tRNA that has diffused away.]
Slide65
The TPI Measure of Autocorrelation
The most popular measure of sequence autocorrelation is TPI, which counts the number of transitions between synonymous codon occurrences.
ABABAB has five transitions, vs. one for AAABBB.
But TPI does not account for the genomic distance between occurrences of these codons, which governs what the diffusion time really is.
This suggests the need for a more sophisticated measure of sequence autocorrelation to model tRNA reuse.
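The transition count behind the two examples on this slide is easy to state in code. This sketch only counts changes between consecutive labels; how that raw count is normalized into the TPI value is not given on the slide and is left out. The function name is illustrative.

```python
def count_transitions(labels):
    """Number of adjacent positions where the tRNA/codon label changes."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


print(count_transitions("ABABAB"))  # 5
print(count_transitions("AAABBB"))  # 1
```

Both strings use three A's and three B's, so the count separates them purely by ordering, which is also why it cannot see how far apart the occurrences are along the gene.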
Slide66
Possible Distance Functions
tRNA reuse presumably falls off slowly with genomic distance between synonymous codons, but according to what function?
Slide67
Scoring Function Test
[Figure panels: Threshold Function; Inverse Distance Function; Exponential Function. X-axis: parameter values. Y-axis: fraction of 1000 random samples scored below the wildtype gene using the specified scoring function. Annotation: "Best distinguishes fast vs. slow genes".]
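The three families named on this slide can be written as simple weight functions of the distance d between consecutive uses of the same tRNA. The parameterizations below (a cutoff, a power, a decay rate) are assumptions for illustration, and `reuse_score` is a hypothetical aggregate showing how a distance-aware score differs from a transition count; it is not the DICA definition, which the slides do not spell out.

```python
import math


def threshold(d, cutoff=5):
    """Full credit for reuse within `cutoff` codons, none beyond it."""
    return 1.0 if d <= cutoff else 0.0


def inverse_distance(d, power=1.0):
    """Credit decaying as 1 / d^power."""
    return 1.0 / d ** power


def exponential(d, rate=0.1):
    """Credit decaying as exp(-rate * d)."""
    return math.exp(-rate * d)


def reuse_score(trna_labels, weight):
    """Sum of weight(d) over consecutive occurrences of the same label,
    where d is the distance in codons between the two occurrences."""
    last_seen = {}
    score = 0.0
    for pos, label in enumerate(trna_labels):
        if label in last_seen:
            score += weight(pos - last_seen[label])
        last_seen[label] = pos
    return score


print(reuse_score("AAABBB", inverse_distance))
print(reuse_score("ABABAB", inverse_distance))
```

Under any decreasing weight, the clustered arrangement scores higher than the alternating one, and the gap between them depends on how quickly the chosen function falls off.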
Slide68
Identifying Fast Genes: DICA vs. TPIs
Rank order 400 genes (200 fast and 200 slow) based on the scores for both DICA and TPIs. The function that better differentiates fast vs. slow genes should show an increasing trend (#fast - #slow genes) along the Y-axis and should start decreasing as we cross the middle along the X-axis, which is theoretically the boundary between fast and slow genes.
DICA (Red) outperforms TPIs (Blue) in both data sets.
Slide69
Distance Incorporated Design
Amino Acid Sequence: L S L S S S L
Constraints:
tRNA type   Amino Acid   #tRNAs
A           L            2
B           L            1
C           S            4
Candidate designs for L: (i) AAB, (ii) BAA, (iii) ABA (figure labels: Max, Min)
We can maximize or minimize autocorrelation according to the DICA score.
Slide70
Autocorrelation Design Optimization
Design goal for TPI is trivial:
Just place similar tRNAs together.
Design goal for DICA is not!
Must place codons so as to minimize the sum of the tRNA distances.
Applied exhaustive search with pruning.
[Figure: placement B B A B B A, captioned "Not necessarily the optimal placement by genomic distance".]
Slide71
Exhaustive Search with Pruning
AA Sequence: L S L L
Constraints:
tRNA type   AA   #tRNA
A           L    2
B           L    1
C           S    1
[Figure: search tree for L, starting at "Root", with branches labeled A and B and node scores such as 0, 0.98, and 0.99.]
Prune a branch as soon as we can prove it cannot lead to a better solution.
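A small branch-and-bound sketch of the search described above, using the example constraints from the slide. The objective here is an assumed stand-in (the sum of gaps between consecutive uses of the same tRNA type), not the DICA score, whose formula the slides do not give. Because gaps are never negative, the cost of a partial assignment is a valid lower bound, which is what justifies the pruning test.

```python
def best_assignment(positions, counts):
    """Assign tRNA labels to the sorted codon `positions` of one amino acid,
    using each label exactly counts[label] times, so as to minimize the
    summed distance between consecutive uses of the same label."""
    best = {"cost": float("inf"), "labels": None}
    remaining = dict(counts)

    def search(idx, labels, last_pos, cost):
        if cost >= best["cost"]:
            return  # prune: this branch cannot beat the best solution so far
        if idx == len(positions):
            best["cost"], best["labels"] = cost, "".join(labels)
            return
        pos = positions[idx]
        for label in sorted(remaining):
            if remaining[label] == 0:
                continue
            previous = last_pos.get(label)
            gap = pos - previous if previous is not None else 0
            remaining[label] -= 1
            last_pos[label] = pos
            labels.append(label)
            search(idx + 1, labels, last_pos, cost + gap)
            labels.pop()
            if previous is None:
                del last_pos[label]
            else:
                last_pos[label] = previous
            remaining[label] += 1

    search(0, [], {}, 0)
    return best["labels"], best["cost"]


# L S L L: leucine sits at codon positions 0, 2, 3; tRNA A twice, B once.
print(best_assignment([0, 2, 3], {"A": 2, "B": 1}))
```

Maximizing instead of minimizing needs a different bound, since a partial cost no longer limits what the remaining positions can add.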
Slide72
Algorithm Validation Through Experiments
The scores of our DICA-optimized genes (Red and Green arrows) lie beyond the distribution of scores of randomly generated genes.
Slide73
Optimal Scores for all Yeast Genes
The greater the height along the Y-axis, the more autocorrelated the sequence.
Red dots (optimal max) are always above, and blue dots (optimal min) always below, the black dots.
[Legend: WT vs. WT; WT vs. Max AC score; WT vs. Min AC score]
Slide74
Autocorrelation Level Affects Expression
Wild conjecture: codon-bias exists to increase sequence autocorrelation.
Slide75
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work
Slide76
Take Home Message
Declining DNA synthesis costs open the doors to exciting new ways of doing molecular biology.
Taking full advantage of large-scale synthesis requires innovative sequence design, often based on sophisticated algorithmic ideas.
Slide77
Current and Future Work
Development of human and animal vaccines.
Why does CPB/autocorrelation/secondary structure optimization work? Translation speed? Protein misfolding? CG dinucleotides?
Improved signal detection search and other novel designs.
Sequence design algorithms to arbitrarily modulate expression using multiple mechanisms.
Overlapping gene design (world's shortest gene).
Understanding reversion in codon-pair deoptimized sequences.
Slide78
Thanks
CS: Charles Ward, Dimitris Papamichail, Barry Cohen, Rukhsana Yeasmin, Yaw-Ling Lin, Jesmin Tithi, Jeff Chen, Bharat Jain, Bei Wang, Pablo Montes, Heraldo Memelli, Joondong Kim, Joe Mitchell.
SAVE: Steffen Mueller, Rob Coleman, Bruce Futcher, Yutong Song, Molly Arabov, Anjaruwee Nimnual, Chen Yang, Aniko Paul, Eckard Wimmer
Yeast: Justin Gardin, Shuqi Yan, Sangeet Honey, Sabine Keppler-Ross, Alisa Yurovsky, Bruce Futcher.
Short Genes: David Green, Vadim Patsalo
Gene Therapy: Wadie Bahou, Varsha Sitaraman, Pat Hearing
Support from NSF, NIH, Microsoft
Slide79
Publications: Experimental Papers
Reduction of the rate of poliovirus protein synthesis through large scale codon deoptimization causes attenuation of viral virulence. S. Mueller, D. Papamichail, J.R. Coleman, S. Skiena and E. Wimmer, Journal of Virology, October 2006, p. 9687-9696, Vol. 80, No. 19.
Virus attenuation by genome-scale changes in codon-pair bias. J. Coleman, D. Papamichail, S. Skiena, B. Futcher, S. Mueller, and E. Wimmer, Science, July 2008, p. 1784-1787, Vol. 320.
Live Attenuated Influenza Vaccines by Computer-Aided Rational Design. S. Mueller, R. Coleman, D. Papamichail, C. Ward, A. Nimnual, S. Skiena, B. Futcher, and E. Wimmer, Nature Biotechnology, Vol. 28, 2010.
Computationally-recoded AAV Rep 78 is Efficiently Maintained within an Adenovirus Vector. V. Sitaraman, P. Hearing, C. Ward, D. Gnatenko, E. Wimmer, S. Mueller, S. Skiena, and W. Bahou, PNAS 108 (2011) 14294-14299.
Identification of two functionally redundant RNA elements in the coding sequence of poliovirus using computer-generated design. Yutong Song, Y. Liu, CB. Ward, S. Mueller, B. Futcher, S. Skiena, A. Paul, and E. Wimmer, PNAS 109 (2012) 14301-14307.
Deliberate reduction of hemagglutinin and neuraminidase expression of influenza virus leads to an ultraprotective live vaccine in mice. Chen Yang, S. Skiena, B. Futcher, S. Mueller, and E. Wimmer, PNAS 110 (23) (2013) 9481-9486.
Measurement of decoding rates of all individual codons in vivo. J. Gardin, R. Yeasmin, A. Yurovsky, Y. Cai, S. Skiena, and B. Futcher, eLife pending.
Slide80
Publications: Theory Papers
Designing Better Phages. S. Skiena, Bioinformatics 17 (2001) S253-261. Also ISMB 2001.
Natural selection and algorithmic design of mRNA. B. Cohen and S. Skiena, J. Computational Biology 10 (2003) 419-432 and RECOMB 2002.
Two proteins for the price of one: The design of maximally compressed coding sequences. B. Wang, D. Papamichail, S. Mueller and S. Skiena, 11th International Meeting on DNA Computing (DNA11) 2005 and Lecture Notes in Computer Science 2006, Vol. 3892, pp. 387-398.
Optimizing Restriction Site Placement for Synthetic Genomes. P. Montes, H. Memelli, C. Ward, J. Kim, J. Mitchell, S. Skiena, 21st Symp. Combinatorial Pattern Matching (CPM 2010).
Constructing Orthogonal de Bruijn Sequences. Y. Lin, C. Ward, B. Jain, S. Skiena, Workshop on Algorithms and Data Structures (WADS 2011).
Redesigning Viral Genomes. S. Skiena, IEEE Computer 45(3): 47-53 (2012).
Designing RNA Secondary Structure in Coding Regions. R. Yeasmin and S. Skiena, ISBRA 2012: 299-314.
Designing Autocorrelated Genes. R. Yeasmin, J. Tithi, J. Chen, and S. Skiena, BCB 2013: 458.
Slide81
Slide82
Slide83
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Synthetic Attenuated Virus Engineering (SAVE)
Signal Location Detection
Other Applications
Future Work
Slide84
Searching for Signals
Biologically important signals like binding sites and splice signals often lurk within coding sequences. Identifying them is hard.
If a synonymously coded gene (designed using our maximal codon "scramble") results in a dead phenotype, some critical signal must have been stepped on, but where?
We invented a search procedure to narrow down the location by synthesizing four synthetic designs implementing a combinatorial group testing procedure, employing balanced Gray codes to minimize cross-sequence boundaries.
Slide85
Localizing sequence specific signals
[Figure: designs I, II, III, IV, each a mix of scrambled and WT regions.]
Prediction of viruses capable of replication corresponding to each possible location of sequence specific signal:
I, II | I, II, IV | I, II, III | I, III | I, III, IV | I, IV | I | IV | III, IV | III | II, III | II, III, IV | II, IV | II
I
II
III
IV
-
+
+
+
Scrambled
WT
1,2
1,2,4
1,2,3
1,3
1,3,4
1,4
1
4
3,4
3
2,3
2,3,4
2,4
2
Prediction of viruses capable of replication corresponding to each possible location of sequence specific signal:
Titer (Pfu/cell)
Ad/Rep78 -II: 1.07x10
3
Ad/Rep78-III: 5.45x10
3
Ad/Rep78 -IV: 3.10x10
3
Ad/sRep78 : 3.75x10
3
Ad/Rep78-I
1 12
-DpnI +DpnI
Ad/Rep78-II
1 12
-DpnI +DpnI
Ad/Rep78-III
1 12
-DpnI +DpnI
Ad/Rep78-IV
1 12
-DpnI +DpnI
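Decoding such an experiment amounts to matching the set of designs that replicate against the label of each candidate region. The sketch below assumes each label lists the designs predicted to replicate when the signal lies in that region, and it takes the region order from the slide text as extracted, which may not match the original figure. The names are illustrative.

```python
# Region labels in the order they appear on the slide: each set lists the
# designs (1-4) predicted to replicate if the signal lies in that region.
REGION_LABELS = [
    {1, 2}, {1, 2, 4}, {1, 2, 3}, {1, 3}, {1, 3, 4}, {1, 4}, {1},
    {4}, {3, 4}, {3}, {2, 3}, {2, 3, 4}, {2, 4}, {2},
]


def locate_signal(viable_designs):
    """Return the indices of regions whose prediction matches the outcome."""
    viable = set(viable_designs)
    return [i for i, label in enumerate(REGION_LABELS) if label == viable]


# Outcome reported on the slide: design I fails, designs II, III, IV replicate.
print(locate_signal({2, 3, 4}))
```

An outcome that matches no label, such as all four designs replicating, returns an empty list and would indicate either no single localized signal or an inconsistent reading.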
Slide87
Confirmation by Subcloning
[Figure: constructs I, II, III, IV and subclones S(wt1) Rep78, S(wt2) Rep78, S(wt3) Rep78, each marked + or -. Markers shown: -, +, -, +, +, +, +.]
Slide88
Sequence-Ordered Group Testing
Group testing strategies permit the identification of up to d defects in n samples using t tests in each of r rounds.
Historically, group tests consist of arbitrary labeled subsets of samples, but in our application the permutation of samples makes a difference.
By sequencing the subsets as a balanced Gray code, we reduce the chances of signals crossing boundaries of our test, giving ambiguous results.
Slide89
Consecutive Positive Group Testing
We have proposed a class of designs to locate a signal spanning up to d consecutive regions out of n.
Our M3(7,3) design uses 10 tests to detect up to 3 consecutive positives in 105 regions, six times the resolution of previous designs.
Our designs incorporate check bits for robustness to false readings.
Slide90
Outline of Talk
DNA Translation and the Redundancy of the Triplet Code
Synthetic Attenuated Virus Engineering (SAVE)
Signal Location Detection
Other Applications
Future Work
Slide91
Genome Refactoring
The presence of unique restriction sites facilitates cloning.
Abundant, regularly-spaced unique restriction sites can be designed into synthetic sequences to make them easy to work with (refactored).
We have employed our placement algorithm in our latest virus design.
Wildtype vs. refactored polio:
Slide92
Adding/Removing Restriction Sites
We seek synonymous mutations to create frequent, well-spaced unique restriction sites.
Slide93
Hardness of Unique Restriction Site Placement
Given the set of achievable placements for each enzyme, select one from each row to minimize the maximum gap.
Hard to approximate better than 3/2, but 2-factor approximation and good heuristics exist.
Slide94
Sequences with Many Patterns
We are convinced that synthetic sequences designed to contain large sets of interesting motifs/patterns open up entirely new classes of experiments (e.g. how do transcription factors respond to large numbers of new binding sites?).
De Bruijn sequences are strings containing all patterns of length k on a given alphabet (here {A,C,G,T}) exactly once.
A readily-synthesized 4 kilobase sequence can contain all possible 6-base patterns.
To avoid recombination, we seek many such de Bruijn sequences, all of which are very dissimilar from each other.
Slide95
Orthogonal de Bruijn Sequences
AAACAAGCACCAGACGAGTCATACTCCCGCCTAGGCGTGATCTGTAATTTGCTTCGGTTATGGGAAAAAGAATGAGGATAGTATCGACAGCGGGTGGCATTGTCTACGCTCAACCCTGCCGTTCCACTTTAAAAATAACTATTACATCACGTAGATGTTTCTTGACCTCGCAGTGCGAAGGGCTGGTCCGGAGCCCAA
A set S of order-k de Bruijn sequences on an alphabet of size a is orthogonal if any string of length k+1 occurs at most once in S.
Each de Bruijn sequence corresponds to an Eulerian cycle in the de Bruijn graph.
Orthogonal sequences/tours do not share edge-pairs/turns.
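For reference, a single de Bruijn sequence can be generated with the standard Lyndon-word (FKM) construction, and the orthogonality condition above can be checked directly. The sketch treats sequences as cyclic, which is an assumption about the definition, and it does not attempt the orthogonal-set construction itself. Function names are illustrative.

```python
def de_bruijn(alphabet, n):
    """Cyclic de Bruijn sequence of order n via the Lyndon-word construction."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


def kmers_cyclic(seq, k):
    """All length-k windows of `seq`, wrapping around the end."""
    extended = seq + seq[:k - 1]
    return [extended[i:i + k] for i in range(len(seq))]


def is_orthogonal(seqs, k):
    """True if no string of length k+1 occurs more than once across `seqs`."""
    seen = set()
    for s in seqs:
        for window in kmers_cyclic(s, k + 1):
            if window in seen:
                return False
            seen.add(window)
    return True


s6 = de_bruijn("ACGT", 6)
print(len(s6))  # 4^6 = 4096 bases covering every 6-base pattern once
```

A linear molecule needs the first k-1 bases repeated at the end to cover the wrap-around patterns, which adds five bases in the order-6 case.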
Slide96Constructing Orthogonal SequencesWe can efficiently construct sets of at least a/2 mutually orthogonal de Bruijn sequences for any k>1. Proof idea: For the ith sequence, start with an arbitrary matching at each vertex to create a set of edge-disjoint cycles avoiding all previously used edge pairs. Enough unused-pairs exist to guarantee a way to merge disconnected cycles.Through heuristic/combinatorial search, we have constructed sets of a-1 orthogonal sequences for all reasonable a and k.We conjecture a-1 orthogonal sequences exist for all k, but seems intimately connected to a well-known conjecture on Hamiltonian decompositions of line graphs (Bermond-78)
Slide97
Slide98
Genomes
Species                                                Length (nt)        Reference
Poliovirus                                             7,500              Cello, Paul, Wimmer 2002
Phage ΦX174                                            5,386              Smith et al. 2003
Phage T7 “refactoring”                                 11,515 of 39,937   Chan, Kosuri, Endy 2005
1918 Influenza virus                                   13,500             Tumpey et al. 2005
“Phoenix” (fossil) progenitor of hum. endog. retrov.   9,472              Dewannieux et al. 2006
HERV-K (same as Phoenix)                               9,472              Lee & Bieniasz 2007
SIVcpz                                                 9,912              Takehisa et al. 2007
Human coronavirus (SARS)                               29,700             Donaldson et al. 2008
Mycoplasma genitalium                                  582,970            Gibson et al. 2008
Codon-Pair Bias is conserved across species
Slide100
Codon pair bias is conserved but diverges with evolutionary distance
Slide101
Poliovirus Genome and Polyprotein Processing
Utilizes an IRES in the 5’ NTR to initiate translation of a single open reading frame
Viral proteins are produced by cis-catalyzed cleavage events
The poliovirus genome is only 7.5 kb in length.
[Figure, adapted from Wang, C.: the 7.5 kb poliovirus genome, from the 5’ NTR (cloverleaf, IRES) through the structural region P1 (VP4, VP2, VP3, VP1) and the non-structural regions P2 (2A, 2B, 2C) and P3 (3A, 3B, 3C, 3D) to the 3’ NTR and poly(A) tail. Primary processing (2A) yields P1, P2, and P3, which are cleaved into the mature structural capsid proteins and the nonstructural proteins.]
Slide102
Slide103
Motivation: Restriction Sites in Bacteriophages
Slide104
Why Eliminate Restriction Sites?
Restriction enzymes exist in bacteria as a defense against phages.
Phages have been proposed as an agent against bacterial infections.
A therapeutic phage might be enhanced by removing all restriction sites from its genome.
Slide105
Minimizing Sequence Patterns is Hard with Wildcards
The reduction is from 3-satisfiability. Note that every CNF clause has exactly one pattern of literal values (all false) under which it fails.
The string to design has n positions, one per variable, each of which can be either T or G.
Each pattern to avoid has length n and corresponds to one clause. All positions are don’t-care except the three positions of the clause’s variables, so the pattern matches (cuts) only when all three of its literals are false.
A string that avoids every pattern therefore encodes a satisfying assignment.
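A small sketch of this reduction, under two labeled assumptions: T stands for "true" and G for "false" (the slide does not say which is which), and clauses are written DIMACS-style as tuples of signed integers, a purely illustrative encoding. It checks exhaustively on a toy formula that a string avoids all patterns exactly when it satisfies all clauses.

```python
import re
from itertools import product


def clause_to_pattern(clause: tuple[int, ...], n: int) -> str:
    """Length-n wildcard pattern matching exactly the strings that falsify the clause."""
    pattern = ["."] * n  # '.' is a don't-care position
    for lit in clause:
        # Assumption: T = true, G = false. A positive literal fails on G,
        # a negated literal fails on T.
        pattern[abs(lit) - 1] = "G" if lit > 0 else "T"
    return "".join(pattern)


def avoids_all(s: str, patterns: list[str]) -> bool:
    """True if s matches none of the forbidden patterns."""
    return not any(re.fullmatch(p, s) for p in patterns)


def satisfies(s: str, clauses: list[tuple[int, ...]]) -> bool:
    """True if the assignment encoded by s satisfies every clause."""
    return all(any((s[abs(l) - 1] == "T") == (l > 0) for l in c) for c in clauses)


clauses = [(1, -2, 3), (-1, 2, 4), (2, -3, -4)]  # toy 3-CNF over 4 variables
patterns = [clause_to_pattern(c, 4) for c in clauses]
print(patterns)  # ['GTG.', 'TG.G', '.GTT']
for bits in product("TG", repeat=4):
    s = "".join(bits)
    assert avoids_all(s, patterns) == satisfies(s, clauses)
```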
Slide106
Slide107
Determine RRT of Codons
Time period: T
The ribosome spends more time on C1 and moves faster on C2.
Get read statistics surrounding a codon
If the ribosome spends more time on a codon, more reads should be gathered (C1 at right).
If the ribosome moves faster, fewer reads will be found when the ribosome is on the codon (C2 at right).
For 10-codon reads, only the codons being decoded inside the ribosome should show peaks/troughs. Other positions should show a stable trend.
[Figure: panels labeled C1 and C2]
Slide108
Auto and Anti-Correlated Sequence
[Figure: arrangements of the L codons CUA and CUG]
CUACUACUA CUGCUGCUG
CUACUA CUG CUA CUGCUG
CUA CUG CUA CUG CUA CUG
Slide109
Slide110
Slide111
Synonymous codon usage
Synonymous codons are used at unequal frequencies in the host genome
Codon, amino acid, frequency (* = stop codon):
UUU F 0.46   UCU S 0.19   UAU Y 0.44   UGU C 0.45
UUC F 0.54   UCC S 0.22   UAC Y 0.56   UGC C 0.55
UUA L 0.08   UCA S 0.15   UAA * 0.30   UGA * 0.47
UUG L 0.13   UCG S 0.05   UAG * 0.23   UGG W 1.00
CUU L 0.13   CCU P 0.28   CAU H 0.42   CGU R 0.08
CUC L 0.20   CCC P 0.33   CAC H 0.58   CGC R 0.19
CUA L 0.07   CCA P 0.28   CAA Q 0.26   CGA R 0.11
CUG L 0.40   CCG P 0.11   CAG Q 0.74   CGG R 0.20
AUU I 0.36   ACU T 0.25   AAU N 0.47   AGU S 0.15
AUC I 0.47   ACC T 0.36   AAC N 0.53   AGC S 0.24
AUA I 0.17   ACA T 0.28   AAA K 0.43   AGA R 0.21
AUG M 1.00   ACG T 0.11   AAG K 0.57   AGG R 0.21
GUU V 0.18   GCU A 0.26   GAU D 0.46   GGU G 0.16
GUC V 0.24   GCC A 0.40   GAC D 0.54   GGC G 0.34
GUA V 0.12   GCA A 0.23   GAA E 0.42   GGA G 0.25
GUG V 0.46   GCG A 0.11   GAG E 0.58   GGG G 0.25
Frequency classes highlighted on the slide: 0-0.1, 0.1-0.15, 0.15-1
Adapted from www.planetgene.com
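The frequencies in this table appear to be normalized within each amino acid (for example, the two F codons sum to 1.00). A minimal sketch of that normalization follows, assuming raw codon counts are available; the counts and the partial codon-to-amino-acid map below are made up for illustration.

```python
from collections import defaultdict


def synonymous_fractions(codon_counts: dict[str, int],
                         codon_to_aa: dict[str, str]) -> dict[str, float]:
    """Fraction of each codon among all codons for the same amino acid."""
    totals: dict[str, int] = defaultdict(int)
    for codon, count in codon_counts.items():
        totals[codon_to_aa[codon]] += count
    return {codon: count / totals[codon_to_aa[codon]]
            for codon, count in codon_counts.items()}


# Hypothetical counts, chosen only to reproduce the F row of the table.
counts = {"UUU": 46, "UUC": 54, "UGG": 10}
aa_of = {"UUU": "F", "UUC": "F", "UGG": "W"}
fractions = synonymous_fractions(counts, aa_of)
print(fractions)
```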
Slide112
Calculating Codon Pair Score (CPS)
\[ \mathrm{CPS} = \ln\frac{\text{observed}}{\text{expected}} = \ln\left(\frac{F(AB)}{\dfrac{F(A)\times F(B)}{F(X)\times F(Y)}\times F(XY)}\right) \]
Here codons A and B encode amino acids X and Y.
F(AB): observed genome-wide frequency of codon pair AB
F(XY): observed genome-wide frequency of amino acid pair XY
The denominator is the expected genome-wide frequency of codon pair AB.
There are 3721 possible pairs (61 x 61; excluding stop codons)
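The codon pair score on this slide is the log of observed over expected, where the expected count of codon pair AB scales the amino-acid-pair count F(XY) by how often A and B are chosen for X and Y. A minimal sketch of that arithmetic, with hypothetical parameter names and made-up counts:

```python
import math


def expected_pair_count(n_a: float, n_b: float,
                        n_x: float, n_y: float, n_xy: float) -> float:
    """Expected count of codon pair AB: F(A)F(B) / (F(X)F(Y)) * F(XY)."""
    return (n_a * n_b) / (n_x * n_y) * n_xy


def codon_pair_score(n_ab: float, n_a: float, n_b: float,
                     n_x: float, n_y: float, n_xy: float) -> float:
    """CPS = ln(observed / expected); negative means underrepresented."""
    return math.log(n_ab / expected_pair_count(n_a, n_b, n_x, n_y, n_xy))


# Made-up counts: the expected count is 2*3/(4*6)*8 = 2.0.
print(codon_pair_score(2, 2, 3, 4, 6, 8))  # observed equals expected, CPS = 0.0
print(codon_pair_score(4, 2, 3, 4, 6, 8))  # observed is twice expected, CPS = ln 2
```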
Slide113
Slide114
Encoding Genes in Alternate Reading Frames
In theory, six coding sequences/ORFs can co-exist on a single DNA sequence.
In reality, many viruses do encode overlapping genes to:
Reduce genome size
Facilitate co-expression
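The six frames come from three codon offsets on each of the two strands. The sketch below simply enumerates them for a short made-up sequence; the frame labels (+1 to +3, -1 to -3) are a common convention assumed here, not taken from the slides.

```python
def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Split a DNA sequence into complete codons in all six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = [
                seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames


frames = six_frames("ATGGCCA")
for name, codons in frames.items():
    print(name, codons)
```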
Slide115
Long Overlaps Exist in Viruses
Slide116
Compression Algorithm (WPMS ’06)
Worst-case quadratic time
Expected time linear because overlaps are usually short
Slide117
Two arbitrary proteins cannot be significantly interleaved…
Overlapping genes in viruses evolved by losing stop codons, not by design
Slide118
… Unless we are free to replace amino acids with similar residues
Slide119
Why might we want to design overlapping genes?
Inserting new genes in a bacterial host is fundamental to biotechnology.
But the host doesn’t need these genes and deletes them.
Interleaving an antibiotic resistance gene with the target gene means we can select for hosts that retain the target.
There seems to be enough flexibility to make this work.
Slide120
Slide121
C 332,652   H 492,388   N 98,245   O 131,196   P 7,501   S 2,340
Slide122
Can we debilitate virus genome translation/replication by increasing the number of unfavorable synonymous codon-pairs in the virus genome?
We seek a cumulative phenotype of many mutations, each with a small effect: difficult to revert and genetically stable (“death by a thousand cuts”).
Does large-scale codon pair de-optimization of the influenza genome result in attenuation?
Slide123
RNA viruses
Poliovirus belongs to the Picornaviridae, a family of (+)-stranded, non-enveloped RNA viruses.
RNA viruses are the largest virus group, containing dreaded human pathogens (HIV, Ebola, SARS, Dengue, Hanta, Influenza).
A high mutation rate (1/10,000 bases) confers high adaptability to changing conditions.
Slide124
Poliovirus Experiments
We first validated the SAVE methodology in poliovirus.
We maximally scrambled (rearranged) the codons in a large hunk of the P1 region. Such viruses typically grew.
We inserted large numbers of infrequent synonymous codons. These viruses were typically weak/attenuated.
Designs deoptimized according to the codon-pair bias criterion were (surprisingly) weak/attenuated.
Our attenuated viruses functioned as vaccines in mice (paper published in Science)!
The story is even better in our recent flu work, so I will present these experimental results instead.
Slide125
Future work – Sequence design tools
Slide126
Slide127
Lethal Dose 50 (LD50) of codon-pair deoptimized viruses
Virus          LD50 (PFU)
PR8 (wt)       6.1 x 10^1
PR8-NP^Min     5.0 x 10^2
PR8-HA^Min     1.7 x 10^3
PR8-PB1^Min    3.2 x 10^4
PR8^3F         7.9 x 10^5
All deoptimized viruses attenuated in mice
Attenuation correlates roughly with amount of deoptimization
Mueller et al., Nature Biotechnology 28(7), July 2010.
Slide128
Codon Alteration Sequence Design
To achieve maximum Hamming distance without altering codon bias, we used maximum-weight bipartite matching between codon positions and codons, using as weight the number of bases changed.
Restriction sites were inserted uniquely (inserted in specific areas and then eliminated everywhere else).
Certain regions were locked to preserve secondary structure.
Evaluation of secondary structure: [figure not recoverable]
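To make the matching step on this slide concrete, the sketch below reassigns the codons used for one amino acid among that amino acid's positions, so the codon multiset (and hence codon bias) is unchanged while the number of changed bases is maximized. It finds the maximum-weight assignment by exhaustive search, which is only a stand-in suitable for tiny inputs; a real design tool would use a polynomial-time matching algorithm. The function names and the leucine example are hypothetical.

```python
from itertools import permutations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))


def total_changed(original: list[str], assigned: list[str]) -> int:
    """Total number of bases changed across all positions."""
    return sum(hamming(o, n) for o, n in zip(original, assigned))


def best_reassignment(codons: list[str]) -> list[str]:
    """Maximum-weight assignment of the same codons to the same positions.

    `codons` holds the original codons at the positions of ONE amino acid.
    Exhaustive search over permutations: fine for a handful of positions only.
    """
    best = max(set(permutations(codons)),
               key=lambda perm: total_changed(codons, list(perm)))
    return list(best)


leucines = ["CUG", "CUG", "UUA", "CUA"]  # made-up example
shuffled = best_reassignment(leucines)
print(shuffled, total_changed(leucines, shuffled))
```

Running this independently for each amino acid leaves the protein and the overall codon usage untouched, which is the constraint the slide describes.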
Slide129
Codon use statistics in PV(M), PV-SD, and PV-AB
[Figure: panels labeled PV(M), PV-SD, and PV-AB]
Slide130
Slide131
Degeneracy Of The Genetic Code
Most amino acids can be encoded by more than one synonymous codon.
Pentapeptide: Gly - Ala - Met - Phe - Leu
Gly: GGA GGC GGG GGU
Ala: GCA GCC GCG GCU
Met: AUG
Phe: UUC UUU
Leu: CUA CUC CUG CUU UUA UUG
4 x 4 x 1 x 2 x 6 = 192 encodings
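The count of 192 is just the product of the number of synonymous codons at each position. A short sketch that enumerates the encodings of the pentapeptide from this slide (the dictionary covers only the five amino acids shown):

```python
from itertools import product
from math import prod

# Synonymous codons for the five amino acids on the slide.
SYNONYMS = {
    "Gly": ["GGA", "GGC", "GGG", "GGU"],
    "Ala": ["GCA", "GCC", "GCG", "GCU"],
    "Met": ["AUG"],
    "Phe": ["UUC", "UUU"],
    "Leu": ["CUA", "CUC", "CUG", "CUU", "UUA", "UUG"],
}


def count_encodings(peptide: list[str]) -> int:
    """Number of distinct RNA sequences encoding the peptide."""
    return prod(len(SYNONYMS[aa]) for aa in peptide)


def all_encodings(peptide: list[str]) -> list[str]:
    """Every RNA sequence encoding the peptide (exponential in its length)."""
    return ["".join(codons) for codons in product(*(SYNONYMS[aa] for aa in peptide))]


peptide = ["Gly", "Ala", "Met", "Phe", "Leu"]
print(count_encodings(peptide))     # 192
print(len(all_encodings(peptide)))  # 192
```

Because the count grows exponentially with protein length, design methods search this space rather than enumerate it.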
Slide132
Synthetic Attenuated Virus Engineering (SAVE)
Data from Coleman et al., Science, 2008.
[Figure panels: Wild Type, High codon pair score, Low codon pair score]
Conclusion: low codon pair scores can reduce genetic function.
Slide133
Open Problem 1
Is finding the encoding which includes the minimum number of forbidden patterns hard for long patterns without wildcards?
Slide134
Open Problem 2
We say that two order-k de Bruijn sequences are d-orthogonal if they contain no (k+d)-mer in common, generalizing the notion beyond d=1.
This holds the potential to create much larger families of orthogonal sequences, whose cardinality should grow exponentially with d.
How many mutually d-orthogonal sequences can you construct?
Slide135
Scoring codon pairs
#      aa pair   Codon pair   Expected   Observed   Obs./Exp.   CPS
1      AA        GCAGCA       2856.40    4196       1.469        0.385
2      AA        GCAGCC       4961.56    6033       1.216        0.196
3      AA        GCAGCG       1341.51    1420       1.059        0.057
4      AA        GCAGCT       3262.97    4711       1.444        0.367
5      AA        GCCGCA       4961.56    1122       0.226       -1.487
6      AA        GCCGCC       8618.21    5141       0.597       -0.517
7      AA        GCCGCG       2330.20    2042       0.876       -0.132
8      AA        GCCGCT       5667.77    1378       0.243       -1.414
9      AA        GCGGCA       1341.51    1142       0.851       -0.161
10     AA        GCGGCC       2330.20    4032       1.730        0.548
11     AA        GCGGCG       630.04     2870       4.555        1.516
12     AA        GCGGCT       1532.46    1472       0.961       -0.040
13     AA        GCTGCA       3262.97    4357       1.335        0.289
14     AA        GCTGCC       5667.77    7014       1.238        0.213
15     AA        GCTGCG       1532.46    1533       1.000        0.000
16     AA        GCTGCT       3727.41    5562       1.492        0.400
17     AC        GCATGC       1343.47    554        0.412       -0.886
18     AC        GCATGT       1131.59    881        0.779       -0.250
...
3716   YW        TACTGG       1609.87    2212       1.374        0.318
3717   YW        TATTGG       1308.13    706        0.540       -0.617
3718   YY        TACTAC       2256.03    2854       1.265        0.235
3719   YY        TACTAT       1833.19    1760       0.960       -0.041
3720   YY        TATTAC       1833.19    1339       0.730       -0.314
3721   YY        TATTAT       1489.60    1459       0.979       -0.021
“Expected” = the statistically predicted frequency in the human genome at which a given aa pair is encoded by each possible synonymous codon pair (based on the genome-wide synonymous codon usage)
“Observed” = the actual genome-wide frequency of each codon pair
“Codon Pair Score - CPS” = a measure of the degree of under- or overrepresentation of the codon pair in the genome = ln(Obs./Exp.)
Underrepresented: CPS < 0; overrepresented: CPS > 0
“Codon Pair Bias - CPB” = the average of CPS values across a protein coding sequence
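As a sketch of how these definitions combine, the code below recomputes two CPS values from the Expected and Observed columns of this slide and averages them into a CPB for a toy three-codon sequence. The function name and the toy sequence are assumptions for illustration; a real table would hold all 3721 codon pairs.

```python
import math

# (expected, observed) for two codon pairs, copied from the table on this slide.
PAIR_COUNTS = {
    "GCAGCA": (2856.40, 4196),
    "GCAGCC": (4961.56, 6033),
}
# CPS = ln(observed / expected) for each pair.
CPS = {pair: math.log(obs / exp) for pair, (exp, obs) in PAIR_COUNTS.items()}


def codon_pair_bias(cds: str, cps_table: dict[str, float]) -> float:
    """Average CPS over all adjacent codon pairs of a coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    scores = [cps_table[a + b] for a, b in zip(codons, codons[1:])]
    return sum(scores) / len(scores)


print(round(CPS["GCAGCA"], 3))  # 0.385, matching the table
print(round(CPS["GCAGCC"], 3))  # 0.196, matching the table
print(codon_pair_bias("GCAGCAGCC", CPS))  # mean of the two scores above
```

De-optimizing a gene in this framework means choosing, among its synonymous encodings, one whose CPB is as negative as possible.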
Slide136
Encoding Matters: Different Codon-pair Encodings Have Different Levels Of Function
Shuqi Yan*, Sangeet Honey*, Justin Gardin, Sabine Keppler-Ross, C. Ward#, S. Mueller*, S. Skiena#, E. Wimmer*, B. Futcher*
Stony Brook University, *Dept. of Molecular Genetics and Microbiology, and #Dept. of Computer Science.
Slide137