Steven Skiena Dept. of Computer Science - PowerPoint Presentation


Presentation Transcript

Slide1

Steven Skiena
Dept. of Computer Science
Stony Brook University
http://www.cs.sunysb.edu/~skiena

Optimizing the Design of Coding Sequences

Slide2

Sequencing vs. Synthesis

DNA sequencing is the technology that reads DNA molecules, identifying the defining string over {A,C,G,T}.
DNA synthesis is the technology that constructs ("writes") DNA molecules to specification, given a defining string over {A,C,G,T}.
The freedom unleashed by large-scale synthesis opens up many exciting new directions for research in molecular biology.
Imagination is needed to properly exploit these technologies.

Slide3

DNA Synthesis Technologies

Short oligos (50-100 bases) are readily synthesized.
Long molecules can be constructed by hybridizing short oligos, but it takes work.
The cost of large-scale synthesis is dropping rapidly, and is now about $0.20/base for kilobase sequences, or a couple thousand dollars for a virus.
Agilent's array-based synthesis technologies produce mixtures of 55,000 custom-designed 200-base oligos: several megabases of synthesis!
Cheaper, faster, better…

Slide4

Synthetic Biology

New synthesis technologies facilitate the engineering of novel biological structures and functions.
Recent successes include eliminating all occurrences of the TAG stop codon in E. coli (Isaacs et al. 2011) and refactoring a yeast chromosome (Dymond et al. 2011).
But large-scale synthesis promises to revolutionize how natural organisms are studied as well: "what happens if we change this?"

Slide5

Why Me?

Slide6

Gene Design and Computer Science

Designing genomic sequences under constraints is an algorithmic optimization problem.
Issues arising in our sequence designs include bipartite matching, Hamming distance, simulated annealing, Gray codes, combinatorial group testing, de Bruijn sequences, etc.
My interest in genome design optimization arose well before genome synthesis became a reality.

Slide7

What can we/you do with Sequence Design and Synthesis?

Attenuate viruses to serve as vaccines (SAVE).
Design genes to increase or decrease gene expression.
Efficiently search for signals in coding sequences.
Design synonymous but homologous genes which avoid recombination.
Refactor viral genomes to make them easier to work with.
Design large sets of interesting oligos to exploit massive multiplex synthesis technologies.

Slide8

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work

Slide9

DNA to RNA to Protein

DNA sequences act as templates for building proteins according to the triplet code.

Slide10

The Triplet Code

Slide11

Which Encoding is Best?

There are roughly 3^n possible gene sequences coding for any n-amino acid protein, e.g. 10^75 encodings of a 147-residue hemoglobin protein.
By comparison, there are only 10^79 atoms in the known universe.
Why did nature select one particular encoding?
Alternatively, can we exploit this redundancy to design the 'best' coding sequence?
We seek to design the DNA/RNA coding sequence for a particular protein/amino acid sequence which optimizes some particular objective.
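To make the size of this design space concrete, here is a minimal sketch that counts the synonymous encodings of a protein by multiplying the number of codons available for each residue. The per-residue counts come from the standard genetic code; the function names `count_encodings` and `log10_encodings` are illustrative and not from the talk. The exact count depends on amino acid composition, which is why "3^n" is only a rough rule of thumb.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total).
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "D": 2, "E": 2, "N": 2, "Q": 2, "H": 2,
    "K": 2, "F": 2, "Y": 2, "C": 2,
    "M": 1, "W": 1,
}

def count_encodings(protein: str) -> int:
    """Exact number of distinct coding sequences for a one-letter protein string."""
    return math.prod(DEGENERACY[aa] for aa in protein)

def log10_encodings(protein: str) -> float:
    """Base-10 logarithm of the count, convenient for long proteins."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)
```

For example, a Leu-Arg dipeptide has 6 x 6 = 36 encodings, while Met-Trp has exactly one.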

Slide12

What Drives the Evolution of Coding Sequences?

Sequences exhibit organism-specific codon bias.
Coding helps regulate gene expression with common/scarce codons.
RNA secondary structure affects stability.
Many signals can be embedded in the coding regions of genes.
Sequence design to optimize gene expression is still very much an open problem. See Plotkin and Kudla's review in Nature Reviews Genetics, January 2011.

Slide13

Design Criteria for Artificial Genes

Matching a given codon/codon-pair distribution
Eliminating or inserting specific patterns (S. '01)
Optimizing secondary structure (Cohen and S. '02)
Encoding additional gene sequences in alternate reading frames (WPMS '06)

Slide14

Incorporating/Excluding Sequence Patterns

Many biological features are encoded as substring patterns: restriction sites, miRNA targets, stop codons, etc.
Differing objectives mandate either including or excluding specific patterns.
For example, the restriction enzyme EcoRI cuts DNA at the pattern GAATTC.

Slide15

Sequence Optimization Algorithms (S. '01)

Dynamic programming can be used to include/exclude many short patterns efficiently, in O(n p 4^k) time.
Since the longest known cutter is only 16 bases, this is a tractable computation.
The problem is NP-complete for long patterns with wildcards, but heuristics work.
Our algorithms can remove 90% of the restriction sites of all known enzymes.
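The dynamic programming idea can be illustrated with a simplified, single-pattern sketch: walk along the protein one residue at a time, and for each possible "last k-1 bases" of the partial design keep the encoding with the fewest occurrences of the forbidden pattern. This is an assumption-laden toy, not the algorithm from the cited paper: it handles one pattern without wildcards, and the names `design_avoiding`, `count_hits`, and `translate` are made up for illustration.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)

def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))

def count_hits(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    k = len(pattern)
    return sum(text[i:i + k] == pattern for i in range(len(text) - k + 1))

def design_avoiding(protein, pattern):
    """Return (dna, hits): an encoding of protein with the fewest pattern hits."""
    k = len(pattern)
    # State: last k-1 bases of the partial design -> (hits so far, design).
    states = {"": (0, "")}
    for aa in protein:
        nxt = {}
        for suffix, (hits, seq) in states.items():
            for codon in AA_TO_CODONS[aa]:
                window = suffix + codon
                # The suffix is shorter than the pattern, so every hit in
                # the window ends inside the newly appended codon.
                total = hits + count_hits(window, pattern)
                new_suffix = window[max(0, len(window) - (k - 1)):]
                if new_suffix not in nxt or total < nxt[new_suffix][0]:
                    nxt[new_suffix] = (total, seq + codon)
        states = nxt
    hits, seq = min(states.values())
    return seq, hits
```

For the dipeptide Glu-Phe, the encoding GAA+TTC would create an EcoRI site (GAATTC); the sketch picks a synonymous encoding without it. For Met-Trp there is only one encoding, so a pattern such as ATGTGG cannot be avoided.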

Slide16

Results by Cutter Length

Slide17

Optimizing Secondary Structure

Nucleotides bind to complementary bases (A-T/U, G-C) so as to minimize their energy.
Secondary structures affect molecular interactions and stability.
Our algorithms design genes with prescribed secondary structure while coding for a given protein.

Slide18

The Zuker-Turner RNA Model

Dynamic programming optimizes binding energy over different substructures.
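The flavor of this dynamic program can be shown with a much simpler stand-in: Nussinov-style base-pair maximization, which counts pairs instead of summing the experimentally derived energies of the full model. The sketch below is therefore not the Zuker-Turner model itself; the function name `max_base_pairs`, the minimum hairpin loop of 3 unpaired bases, and the inclusion of G-U wobble pairs are all assumptions made for illustration.

```python
def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA string (Nussinov-style)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the substring rna[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

The triple loop gives the same O(n^3) running time that the energy-based recurrences have; only the scoring of each substructure differs.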

Slide19

Designing Secondary Structure (Cohen and S. '02)

We can adapt the Zuker-Turner recurrence relations to design a coding sequence maximizing secondary structure in O(n^3).
Minimizing secondary structure in the model is NP-complete, but heuristics exist.
These algorithms do not preserve the codon distribution of the input sequence.

Slide20

Maximizing RNA Secondary Structure

The optimized encoding requires twice as much energy to unfold.

Slide21

How Much Freedom does Nature Have with Respect to Secondary Structure?

Slide22

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work

Slide23

Chemical Synthesis of Poliovirus

Cello et al. synthesized poliovirus cDNA de novo without a natural template.
This groundbreaking study made international headlines in July 2002 and opened the field of synthetic virology.

Cello et al. Science. 2002 Aug 9;297(5583):1016-8.
Molla et al. Science. 1991;254(5038):1647-51.

Slide24

Cello, Paul & Wimmer, 2002

Molla, Paul & Wimmer, 1991

Reverse genetics of poliovirus.

Slide25

How might we rapidly create vaccines for new pathogens?

Slide26

Difficulties in Vaccine Design

Small numbers of attenuating mutations, each having a large effect, can easily revert to virulence.
RNA viruses (e.g. Polio, HIV, Ebola, SARS, Dengue, Hanta, Influenza) have a high mutation rate (1/10,000 bases), which confers high adaptability to changing conditions.
Attenuation via passaging is costly and time consuming.
The poliovirus vaccine strain Sabin1 was derived by 52 rounds of monkey infections and 16 rounds of monkey kidney cell culture passages, requiring several years of work at prohibitive cost:
100,000 monkeys @ $10,000 = $1 Billion

Slide27

Synthetic Attenuated Virus Engineering (SAVE)

Motivation: viral diseases like SARS, 1918 influenza; bioterrorism
Input: the genome sequence of a virus
Output: a synthetic, attenuated variant of the virus designed to generate an immune response and serve as a vaccine.
We seek a cumulative phenotype of many mutations, each with a small effect (difficult to revert; genetically stable): "death by a thousand cuts"

Slide28

Codon-Pair Bias

Certain pairs of synonymous codons for two given amino acids are found adjacent to one another more (or less) frequently than would be expected.
Statistically significant codon-pair bias has been observed in human genes and other organisms.
The mechanisms behind this are still unclear, but we can use it to design attenuated viruses.
We measure bias with the codon-pair score (CPS).

Slide29

Some Codon Pairs Are Selected Over The Others

Example Leu-Arg codon pairs: CUU-CGA, CUU-AGG

[Observed]/[Expected] = 1.95, CPS = 0.67
[Observed]/[Expected] = 0.44, CPS = -0.82

Leu codons: UUA, UUG, CUU, CUC, CUA, CUG
Arg codons: CGU, CGC, CGA, CGG, AGA, AGG
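The two numbers on this slide are consistent with the codon-pair score being the natural logarithm of the observed/expected ratio: ln(1.95) is about 0.67 and ln(0.44) is about -0.82. The sketch below encodes that relationship. The expected-count formula (codon frequencies scaled by how often the two amino acids are adjacent) follows the definition used in the codon-pair bias literature as I understand it and should be read as an assumption here; all function and parameter names are illustrative.

```python
import math

def codon_pair_score(n_pair, n_codon_a, n_codon_b, n_aa_x, n_aa_y, n_aa_pair):
    """ln(observed / expected) for one codon pair.

    n_pair    : observed count of the codon pair A-B
    n_codon_a : count of codon A;  n_codon_b : count of codon B
    n_aa_x    : count of amino acid X (encoded by A); n_aa_y likewise for Y
    n_aa_pair : count of the adjacent amino acid pair X-Y
    """
    expected = (n_codon_a * n_codon_b) / (n_aa_x * n_aa_y) * n_aa_pair
    return math.log(n_pair / expected)

def codon_pair_bias(codons, cps_table):
    """Average codon-pair score over all adjacent codon pairs of a gene."""
    pairs = list(zip(codons, codons[1:]))
    return sum(cps_table[p] for p in pairs) / len(pairs)
```

A positive score marks an over-represented pair and a negative score an under-represented one; averaging the scores along a gene gives its codon pair bias.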

Slide30

Codon-Pair Bias Designs

Codon-pair optimization is essentially the traveling salesman problem.
We use simulated annealing to shuffle the wildtype codons.
By changing the optimization criteria, we can construct sequences with either highly favorable or highly unfavorable codon-pair scores.

Original Sequence

CPB “Altered” Sequence
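A minimal sketch of the shuffling idea follows, assuming a table of codon-pair scores is available. Swapping two codons that encode the same amino acid keeps both the protein and the overall codon usage unchanged, so only the codon-pair score moves. The linear cooling schedule, the parameter defaults, and the names `anneal_codon_pairs` and `pair_score_sum` are illustrative choices, not the settings used in the actual designs.

```python
import math
import random

def pair_score_sum(codons, cps):
    """Sum of codon-pair scores over adjacent codons (missing pairs score 0)."""
    return sum(cps.get((a, b), 0.0) for a, b in zip(codons, codons[1:]))

def anneal_codon_pairs(codons, codon_to_aa, cps, steps=2000, t0=1.0,
                       seed=0, minimize=True):
    """Shuffle synonymous codons to lower (or raise) the total pair score."""
    rng = random.Random(seed)
    sign = 1.0 if minimize else -1.0
    current = list(codons)
    # Group positions by amino acid; swaps happen only within a group.
    groups = {}
    for i, c in enumerate(current):
        groups.setdefault(codon_to_aa[c], []).append(i)
    swappable = [g for g in groups.values() if len(g) > 1]
    cur_cost = sign * pair_score_sum(current, cps)
    best, best_cost = list(current), cur_cost
    if not swappable:
        return best
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling
        i, j = rng.sample(rng.choice(swappable), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = sign * pair_score_sum(current, cps)
        accept = (new_cost <= cur_cost or
                  rng.random() < math.exp((cur_cost - new_cost) / temp))
        if accept:
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best
```

Calling it with `minimize=True` yields a deoptimized arrangement, and `minimize=False` an optimized one. Recomputing the full score after every swap is wasteful; a real implementation would update only the pairs touched by the swap.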

Slide31

Experimental Results

Random scrambles of synonymous codons generally have little effect on phenotype, unless a critical signal gets stepped on.
Codon-pair bias deoptimized designs have proven to be significantly attenuated for both polio and influenza viruses.
These attenuated strains have proven to be safe and effective vaccines in mice, conferring immunity against the wildtype virus.

Slide32

Deoptimization of three Flu segments

PB1

PB1-F2

A minimum of 120 nt at either end is excluded from recoding (replication/packaging signals)

Slide33

Codon Pair Bias of selected Influenza A/PR8/3/34 genes and their deoptimized counterparts in relationship to the human ORFeome

[Figure: codon pair bias (average codon pair score per codon pair) vs. length of ORF (amino acids), before and after deoptimization. Labeled points: HA (wt), NP (wt), PB1 (wt), and their "Min" counterparts after deoptimization. The axis runs from increasingly deoptimized to increasingly optimized.]

Mueller et al., Nature Biotechnology 28(7), July 2010.

Slide34

[Figure: virus titer (log PFU) vs. hours post infection for PR8 and for PR8 variants carrying Min (deoptimized) segments: PB1, NP/HA, HA/PB1, and HA.]

One PR8 variant combines all three codon pair deoptimized segments in one virus.
All viruses carrying one, two, or three deoptimized segments are viable.

Mueller et al., Nature Biotechnology 28(7), July 2010.

Slide35

Vaccine Experiments

Slide36

[Figure: relative body weight (%) vs. days post infection for mock-infected mice, PR8, and codon-pair deoptimized PR8.]

Experiment:
• 10 PFU intranasal
• monitor weight daily
• avg. of five mice per group

Codon-pair deoptimized PR8 causes no disease in infected mice.

Mueller et al., Nature Biotechnology 28(7), July 2010.

Slide37

[Figure: survival after primary inoculation (%) and survival after PR8-wt challenge (%) vs. vaccine dose (log PFU), for mice vaccinated with codon-pair deoptimized PR8 and with PR8 (wt); the separation between the curves is the safety margin.]

Protective Dose 50 (PD50) of codon-pair deoptimized virus PR8: the dose required to protect 50% of mice with a single inoculation of vaccine virus from a challenge infection with 1000 LD50 of the wt virus.

Codon-pair deoptimized PR8 has a wide safety margin in mice.

PD50 (PFU) by virus: PR8 (wt) 1.0 x 10^?; deoptimized PR8 1.3 x 10^?
One ml of deoptimized PR8 culture supernatant = 10,000,000 mouse PD50 doses

Mueller et al., Nature Biotechnology 28(7), July 2010.

Slide38

Pros and Cons

Compared (Pros / Cons): Inactivated, LAIV - FluMist, LAIV - SAVE

• safest option
• most expensive
• potential need of adjuvants
• no/limited cellular immunity
• well characterized backbone (done once)
• high genetic stability
• low protective dose (wide safety margin)
• applicable to any strain as a whole
• 100% antigenic match in all segments to target strain
• guaranteed "take" in the population at highest risk (those naive to the impending seasonal strain)
• attenuation in HA and NA (lower reassortment risk)
• only one backbone
• repeated use may induce heterosubtypic immunity against backbone - no "take"
• potential incompatibilities btw. backbone and HA/NA
• "seeding" of WT HA and NA in the population (higher reassortment risk)
• annual reformulation may require new regulatory considerations
• less expensive
• no adjuvants
• long lived humoral and cellular immunity

Slide39

SAVE: Other Viruses, Other Groups

Several groups have been exploring SAVE to design attenuated strains for vaccines against a variety of pathogens, including:
Dengue
VSV (related to rabies)
HIV [Martrus et al., Retrovirology 2013]
Chikungunya virus [Nougairede et al., PLOS Pathogens 2013]

Slide40

Banner Moment

Slide41

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work

Slide42

What is the Mechanism?

Although we can reliably attenuate viruses through codon pair-deoptimized designs, it is embarrassing that we have little understanding of why it works.
To provide insight, we have been studying codon pair bias and other issues related to translation in yeast.
Our experiments show that codon pair bias design modulates expression in this cellular eukaryote similarly to how it does in viruses.

Slide43

Effects of Codon-Pair Bias in Yeast: Growth on 3-AT plate

[Figure: growth on a -His + 60mM 3-AT plate for the HIS3-Scr, dHIS3-2, and his3 strains, annotated "Growth inhibited" and "Similar growth as WT".]

3-AT: competitive inhibitor of the HIS3 enzyme.

Conclusion: dHIS3-2 is attenuated, but the scrambled HIS3 is not.

Slide44

Mechanisms of Codon-Pair Bias

The picture seems quite complicated. Presumably there is a mix of different reasons why particular codon pairs are selected for/against.
One class of preferentially "in-frame" pairs presumably works through a translation mechanism.
Another class of "frame-neutral" pairs presumably works through various signaling mechanisms.
Mutation studies show that changes in the HRP1 gene can rescue deoptimized strains, although no mutations were seen in our deoptimized sequences themselves.

Slide45

HRP1 mutations (DNA sequence change → protein sequence change):
T605G → F202C
A600G → R200S
A598G → R200G
A598G → R200G
Deletions: Δ1158-1206, Δ387-402

Point mutations and deletions found in the HRP1 gene in mutants (but not in the original dLys2-2 strain)

Slide46

Mechanisms underlying Codon Bias

It has long been presumed that genomes exhibit particular codon preferences (bias) to optimize gene expression: more frequently used codons presumably translate faster.
This remains unproven, and even controversial. Recent work [Qian et al., PLoS Genetics, 2012] claims to disprove it.
Before we can understand the phenomenon of pair bias, we need to better understand the mechanism underlying codon bias.

Slide47

Ribosome Footprinting

Exciting new technology for studying translation.
Instantaneously bind ribosomes to the RNA they are translating, then dissolve away unprotected RNA.
Sequencing the resulting 30-base tags and aligning them to a reference genome identifies where the ribosomes were.

Slide48

Ribosome Footprint Data

Aligning the sequenced footprints to a reference genome gives higher peaks where ribosomes are more frequently observed.
The sources of the large peaks generally remain unexplained, but clearly cannot result from the (necessarily) small variance associated with differential codon translation.

Slide49

The Wrong Approach to Analysis

The obvious idea is to count the number of observed footprints containing each codon as a measure of translation rate, but:
This requires correction for the highly differential expression among different genes.
It also requires correction for any biases of footprint frequency, perhaps differential rates on 5' vs. 3' ends.
It also requires correcting for or ignoring the large bumps observed in the footprint data.
No attempt to correct for all these ill-understood issues will end up being convincing.

Slide50

The Method of Independent Trials

There are thousands of genes, each with hundreds of codons.
To simplify analysis, consider every length 2l-1 window in which a particular codon c appears at a distinct position in each of the length-l sequence reads covering it.
The relative frequency of reads with c in the active position determines whether it is a fast or slow codon.
All windows are local and of equivalent significance, so effects of positional/expression bias are minimized.
The relative frequency distributions can be fairly averaged over all windows centered around the codon.
Intuition: detecting discrimination at an airport security line.
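One way to read the windowing scheme is sketched below, under several assumptions that are mine rather than the slide's: positions are counted in codons, footprints are summarized as a count per start codon, and a window is used only if the target codon occurs exactly once in it. Within each window the l reads covering the codon are normalized to relative frequencies, and those per-window distributions are averaged with equal weight. The name `residency_profile` and the input layout are hypothetical.

```python
def residency_profile(codons, read_starts, target, read_len):
    """Average relative read frequency by position of `target` within a read.

    codons      : list of codons of one gene, in order
    read_starts : dict mapping start codon index -> number of footprints
    target      : the codon of interest
    read_len    : footprint length l, measured in codons
    Returns a list of length read_len, or None if no window qualifies.
    """
    totals = [0.0] * read_len
    windows = 0
    for i, codon in enumerate(codons):
        if codon != target:
            continue
        lo, hi = i - read_len + 1, i + read_len - 1  # window of 2l-1 codons
        if lo < 0 or hi >= len(codons):
            continue
        if codons[lo:hi + 1].count(target) != 1:
            continue  # target must occupy a unique position in each read
        # The read starting at i - p holds the target at position p.
        counts = [read_starts.get(i - p, 0) for p in range(read_len)]
        total = sum(counts)
        if total == 0:
            continue
        for p in range(read_len):
            totals[p] += counts[p] / total
        windows += 1
    if windows == 0:
        return None
    return [t / windows for t in totals]
```

Because every window contributes a distribution that sums to one, a highly expressed gene does not outweigh a weakly expressed one, which is the point of treating windows as independent trials.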

Slide51

RRT Analysis Pipeline

Slide52

Ribosome Residency Time (RRT) Histograms

The RRT histograms show that position 6 is the active (A) site of decoding.
CTC has the highest RRT at position 6 of all Leu codons, implying the slowest decoding rate.
Proline codons also have peaks at position 5, the P-site where the peptide bond is formed, befitting their chemistry as an imino acid.

Slide53

Relative Codon Decoding Speeds

Rare codons have relatively high RRT values, indicating that decoding time is greater.
Common codons have relatively low RRT values.
Position 6 RRT strongly correlates with codon frequency (-0.7).
Codons with lower CG content tend to have lower RRT values (meaning faster translation).

Slide54

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work

Slide55

Secondary Structure Design Respecting Codon Preferences

That C-G bonds are twice as strong as A-U bonds means that our previous algorithm to optimize secondary structure will wildly change codon usage in its quest for energy.
But holding codon distribution constant between designs is an important control when studying relative expression…
Incorporating the state of how many of which codons have been used so far would increase the state space of our dynamic programming algorithm by a factor of O(n^61)…
Thus we developed a search-based algorithm based on rearranging codons in a sequence and assessing the impact on structure by calls to Mfold.

Slide56

Min/Max Structure Design Algorithm

Slide57

Min/Max Design Space for GFP (WT codon distribution)

Our resulting designs modulate structure much better than random sampling, but of course less extensively than our non-codon constrained designs can.

Slide58

Min/Max Design Space for GFP (Cohen-Skiena codon distributions)

Given the optimal max-structure distribution, we find the max structure design.
Given good min-structure distributions, we find better min structure designs.

Slide59

Experiments in Yeast

Replace WT His3 with high and low secondary structure designs.
Histidine can be supplied in the growth media, so even defective His3 strains can be grown.
Dilution studies enable assaying the relative health of strains with different His3 variants on His- media.
Experiments show that low secondary structure variants grow better than those with high secondary structure.

Slide60

Secondary Structure Level Affects Expression

Slide61

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work

Slide62

Mappings: Codon - tRNA - Amino Acid

Multiple codons correspond to the same tRNA.
Multiple tRNAs correspond to the same amino acid.

Codons: GCT, GCC, GCA, GCG
tRNAs: tRNA1, tRNA2
Amino acid: Ala

Slide63

tRNA Reuse and Expression

It has been observed that codons tend to recur near each other in coding sequences more often than statistical independence would suggest.
tRNA molecules must repeatedly do two things: recharge with the appropriate amino acid, and find the right place to deposit it.
Frequent recurrence of codons in principle increases the likelihood that a recently used tRNA molecule can be used again in translating a particular message.

Slide64

Diffusion vs. Reuse

[Figure: two ribosome/tRNA panels contrasting tRNA re-use at the ribosome with a tRNA that has diffused away.]

Slide65

The TPI Measure of Autocorrelation

The most popular measure of sequence autocorrelation is TPI, which counts the number of transitions between synonymous codon occurrences.
ABABAB has five transitions, vs. one for AAABBB.
But TPI does not account for the genomic distance between occurrences of these codons, which governs what the diffusion time really is.
This suggests the need for a more sophisticated measure of sequence autocorrelation to model tRNA reuse.
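The transition count behind the ABABAB vs. AAABBB comparison is easy to state in code. The sketch below computes only that raw count over a sequence of tRNA labels for one amino acid; any normalization that a full TPI calculation applies is left out, and the function name is illustrative.

```python
def count_transitions(labels):
    """Number of adjacent positions where the tRNA label changes."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)
```

Note that the count is identical whether the occurrences are adjacent codons or hundreds of codons apart, which is exactly the blind spot described above.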

Slide66

Possible Distance Functions

tRNA reuse presumably falls off slowly with genomic distance between synonymous codons, but according to what function?

Slide67

Scoring Function Test

X-axis: parameter values
Y-axis: fraction of 1000 random samples scored below the wildtype gene using the specified scoring function
Best distinguishes fast vs. slow genes

Panels: Threshold Function, Inverse Distance Function, Exponential Function
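The three candidate families can be written as weight functions of the distance between consecutive uses of the same tRNA, then summed along a gene. This is a hedged sketch of the general idea only: the exact DICA score is not given in these slides, so the summation rule in `reuse_score`, the parameter names `cutoff` and `rate`, and measuring distance in codon positions are all assumptions.

```python
import math

def threshold_weight(distance, cutoff):
    """Full credit for reuse within `cutoff` positions, none beyond it."""
    return 1.0 if distance <= cutoff else 0.0

def inverse_weight(distance):
    """Credit falls off as 1/distance."""
    return 1.0 / distance

def exponential_weight(distance, rate):
    """Credit decays exponentially with distance."""
    return math.exp(-rate * distance)

def reuse_score(trna_labels, weight):
    """Sum weight(d) over consecutive occurrences of the same tRNA label."""
    last_seen = {}
    score = 0.0
    for pos, label in enumerate(trna_labels):
        if label in last_seen:
            score += weight(pos - last_seen[label])
        last_seen[label] = pos
    return score
```

Unlike a pure transition count, such a score distinguishes AABB (each reuse at distance 1) from ABAB (each reuse at distance 2); a one-parameter family is selected with a small wrapper such as `lambda d: exponential_weight(d, 0.5)`.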

Slide68

Identifying Fast Genes: DICA vs. TPI

Rank order 400 genes (200 fast and 200 slow) based on the scores for both DICA and TPI. The function that better differentiates fast vs. slow genes should show an increasing trend in (#fast - #slow genes) along the Y-axis, and should start decreasing as we cross the middle of the X-axis, which is theoretically the boundary between fast and slow genes.

DICA (red) outperforms TPI (blue) in both data sets.

Slide69

Distance Incorporated Design

Amino Acid Sequence: L S L S S
Constraints: tRNA type, amino acid, #tRNAs
Legend: (i) AAB, (ii) BAA, (iii) ABA
Max / Min

We can maximize or minimize autocorrelation according to the DICA score.

Slide70

Autocorrelation Design Optimization

Design goal for TPI is trivial:
Just place similar tRNAs together
Design goal for DICA is not!
Must place codons so as to minimize the sum of the tRNA distances
Applied exhaustive search with pruning
Not necessarily the optimal placement by genomic distance

Slide71

Exhaustive Search with Pruning

[Figure: search tree for the AA sequence L S L under constraints on tRNA type and #tRNA, starting from the root, with node scores such as 0.98, 0.0, and 0.99.]

Prune a branch as soon as it is proven that it cannot lead to a better solution.

Slide72

Algorithm Validation Through Experiments

Our DICA-optimized gene scores (red and green arrows) lie beyond the distribution of randomly generated gene scores.

Slide73

Optimal Scores for all Yeast Genes

The greater the height along the Y-axis, the more autocorrelated the sequence.
Red dots (optimal max) are always above, and blue dots (optimal min) always below, the black dots.

Legend: WT vs. WT; WT vs. Max AC score; WT vs. Min AC score

Slide74

Autocorrelation Level Affects Expression

Wild conjecture: codon bias exists to increase sequence autocorrelation.

Slide75

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Codon-Pair Bias and Vaccine Design
Codon-Bias and Translation Rates
Optimizing Secondary Structure
Optimizing Sequence Autocorrelation
Future Work

Slide76

Take Home Message

Declining DNA synthesis costs open the doors to exciting new ways of doing molecular biology.
Taking full advantage of large-scale synthesis requires innovative sequence design, often based on sophisticated algorithmic ideas.

Slide77

Current and Future Work

Development of human and animal vaccines.
Why does CPB/autocorrelation/secondary structure optimization work? Translation speed? Protein misfolding? CG dinucleotides?
Improved signal detection search and other novel designs.
Sequence design algorithms to arbitrarily modulate expression using multiple mechanisms.
Overlapping gene design (world's shortest gene)
Understanding reversion in codon-pair deoptimized sequences.

Slide78

Thanks

CS: Charles Ward, Dimitris Papamichail, Barry Cohen, Rukhsana Yeasmin, Yaw-Ling Lin, Jesmin Tithi, Jeff Chen, Bharat Jain, Bei Wang, Pablo Montes, Heraldo Memelli, Joondong Kim, Joe Mitchell.
SAVE: Steffen Mueller, Rob Coleman, Bruce Futcher, Yutong Song, Molly Arabov, Anjaruwee Nimnual, Chen Yang, Aniko Paul, Eckard Wimmer
Yeast: Justin Gardin, Shuqi Yan, Sangeet Honey, Sabine Keppler-Ross, Alisa Yurovsky, Bruce Futcher
Short Genes: David Green, Vadim Patsalo
Gene Therapy: Wadie Bahou, Varsha Sitaraman, Pat Hearing
Support from NSF, NIH, Microsoft

Slide79

Publications: Experimental Papers

Reduction of the rate of poliovirus protein synthesis through large scale codon deoptimization causes attenuation of viral virulence. S. Mueller, D. Papamichail, J.R. Coleman, S. Skiena, and E. Wimmer. Journal of Virology, October 2006, p. 9687-9696, Vol. 80, No. 19.

Virus attenuation by genome-scale changes in codon-pair bias. J. Coleman, D. Papamichail, S. Skiena, B. Futcher, S. Mueller, and E. Wimmer. Science, July 2008, p. 1784-1787, Vol. 320.

Live Attenuated Influenza Vaccines by Computer-Aided Rational Design. S. Mueller, R. Coleman, D. Papamichail, C. Ward, A. Nimnual, S. Skiena, B. Futcher, and E. Wimmer. Nature Biotechnology, Vol. 28, 2010.

Computationally-recoded AAV Rep 78 is Efficiently Maintained within an Adenovirus Vector. V. Sitaraman, P. Hearing, C. Ward, D. Gnatenko, E. Wimmer, S. Mueller, S. Skiena, and W. Bahou. PNAS 108 (2011) 14294-14299.

Identification of two functionally redundant RNA elements in the coding sequence of poliovirus using computer-generated design. Yutong Song, Y. Liu, CB. Ward, S. Mueller, B. Futcher, S. Skiena, A. Paul, and E. Wimmer. PNAS 109 (2012) 14301-14307.

Deliberate reduction of hemagglutinin and neuraminidase expression of influenza virus leads to an ultraprotective live vaccine in mice. Chen Yang, S. Skiena, B. Futcher, S. Mueller, and E. Wimmer. PNAS 110 (23) (2013) 9481-9486.

Measurement of decoding rates of all individual codons in vivo. J. Gardin, R. Yeasmin, A. Yurovsky, Y. Cai, S. Skiena, and B. Futcher. eLife, pending.

Slide80

Publications: Theory Papers

Designing Better Phages. S. Skiena. Bioinformatics 17 (2001) S253-261. Also ISMB 2001.

Natural selection and algorithmic design of mRNA. B. Cohen and S. Skiena. J. Computational Biology 10 (2003) 419-432 and RECOMB 2002.

Two proteins for the price of one: The design of maximally compressed coding sequences. B. Wang, D. Papamichail, S. Mueller, and S. Skiena. 11th International Meeting on DNA Computing (DNA11) 2005 and Lecture Notes in Computer Science 2006, Vol. 3892, pp. 387-398.

Optimizing Restriction Site Placement for Synthetic Genomes. P. Montes, H. Memelli, C. Ward, J. Kim, J. Mitchell, S. Skiena. Symp. Combinatorial Pattern Matching (CPM 2010).

Constructing Orthogonal de Bruijn Sequences. Y. Lin, C. Ward, B. Jain, S. Skiena. Workshop on Algorithms and Data Structures (WADS 2011).

Redesigning Viral Genomes. S. Skiena. IEEE Computer 45(3): 47-53 (2012).

Designing RNA Secondary Structure in Coding Regions. R. Yeasmin and S. Skiena. ISBRA 2012: 299-314.

Designing Autocorrelated Genes. R. Yeasmin, J. Tithi, J. Chen, and S. Skiena. BCB 2013: 458.

Slide81

Slide82

Slide83

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Synthetic Attenuated Virus Engineering (SAVE)
Signal Location Detection
Other Applications
Future Work

Slide84

Searching for Signals

Biologically important signals like binding sites and splice signals often lurk within coding sequences. Identifying them is hard.
If a synonymously coded gene (designed using our maximal codon "scramble") results in a dead phenotype, some critical signal must have been stepped on, but where?
We invented a search procedure to narrow down the location by synthesizing four synthetic designs implementing a combinatorial group testing procedure, employing balanced Gray codes to minimize cross-sequence boundaries.

Slide85

Localizing sequence specific signals

[Figure: scrambled designs I-IV. Prediction of viruses capable of replication corresponding to each possible location of the sequence-specific signal: I, II; I, II, IV; I, II, III; I, III; I, III, IV; I, IV; III, IV; III; II, III; II, III, IV; II, IV.]

Slide86

Signal location in Ad/Rep78

[Figure: scrambled designs. Prediction of viruses capable of replication corresponding to each possible location of the sequence-specific signal: 1,2; 1,2,4; 1,2,3; 1,3; 1,3,4; 1,4; 3,4; 2,3; 2,3,4; 2,4.]

Titer (Pfu/cell):
Ad/Rep78-II: 1.07x10^?
Ad/Rep78-III: 5.45x10^?
Ad/Rep78-IV: 3.10x10^?
Ad/sRep78: 3.75x10^?

[Gel panels: Ad/Rep78-I, Ad/Rep78-II, Ad/Rep78-III, Ad/Rep78-IV, each with lanes 1-12, -DpnI and +DpnI.]

Slide87

Confirmation by Subcloning-

III

S(wt1)

Rep78

S(wt2)

Rep78

S(wt3)

Rep78

Slide88

Sequence-Ordered Group Testing

Group testing strategies permit the identification of up to d defects in n samples using t tests in each of r rounds.
Historically, group tests consist of arbitrarily labeled subsets of samples, but in our application the permutation of samples makes a difference.
By sequencing the subsets as a balanced Gray code, we reduce the chances of signals crossing the boundaries of our tests, which would give ambiguous results.
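The benefit of a Gray-code ordering can be seen with the standard binary reflected Gray code, used here as a simpler stand-in for the balanced Gray codes mentioned above (a balanced code additionally spreads the bit flips evenly across the tests). In this sketch each genome region gets a code word, bit d of the word says whether design d+1 scrambles that region, and neighboring regions differ in exactly one design. The helper names are illustrative.

```python
def gray_code(bits):
    """Binary reflected Gray code: 2**bits code words, neighbors differ in one bit."""
    return [i ^ (i >> 1) for i in range(1 << bits)]

def designs_for_region(code_word, bits):
    """1-based indices of the designs whose bit is set in this code word."""
    return [d + 1 for d in range(bits) if (code_word >> d) & 1]
```

With four designs this distinguishes 16 regions, and a signal straddling the boundary between two adjacent regions is affected by a change in only one design, which keeps the readout interpretable.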

Slide89

Consecutive Positive Group Testing

We have proposed a class of designs to locate a signal spanning up to d consecutive regions out of n.
Our M3(7,3) design uses 10 tests to detect up to 3 consecutive positives in 105 regions, six times the resolution of previous designs.
Our designs incorporate check bits for robustness to false readings.

Slide90

Outline of Talk

DNA Translation and the Redundancy of the Triplet Code
Synthetic Attenuated Virus Engineering (SAVE)
Signal Location Detection
Other Applications
Future Work

Slide91

Genome Refactoring

The presence of unique restriction sites facilitates cloning.
Abundant, regularly-spaced unique restriction sites can be designed into synthetic sequences to make them easy to work with (refactored).
We have employed our placement algorithm in our latest virus design.
Wildtype vs. refactored polio:

Slide92

Adding/Removing Restriction Sites

We seek synonymous mutations to create frequent, well-spaced unique restriction sites.

Slide93

Hardness of Unique Restriction Site Placement

Given the set of achievable placements for each enzyme, select one from each row to minimize the maximum gap.
Hard to approximate better than 3/2, but a 2-factor approximation and good heuristics exist.
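For tiny instances the problem statement can be checked by brute force, which also makes the objective precise. The sketch below tries every way of choosing one position per enzyme and keeps the choice with the smallest maximum gap. Counting the gaps to both ends of the genome is my assumption about the objective, and the exhaustive search is exponential in the number of enzymes, so this illustrates the problem rather than the approximation algorithm or heuristics mentioned above.

```python
from itertools import product

def best_site_choice(placements, genome_length):
    """Pick one position per enzyme to minimize the largest gap.

    placements    : list of lists; placements[e] holds the achievable
                    positions for enzyme e
    genome_length : total length; gaps to both ends are counted
    Returns (choice, max_gap).
    """
    best_choice, best_gap = None, None
    for choice in product(*placements):
        cuts = [0] + sorted(choice) + [genome_length]
        gap = max(b - a for a, b in zip(cuts, cuts[1:]))
        if best_gap is None or gap < best_gap:
            best_choice, best_gap = choice, gap
    return best_choice, best_gap
```

With placements [[10, 50], [30, 90], [60]] on a genome of length 100, no choice does better than a maximum gap of 40.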

Slide94

Sequences with Many Patterns

We are convinced that synthetic sequences designed to contain large sets of interesting motifs/patterns open up entirely new classes of experiments (e.g. how do transcription factors respond to large numbers of new binding sites?).
De Bruijn sequences are strings containing all patterns of length k on a given alphabet (here {A,C,G,T}) exactly once.
A readily-synthesized 4 kilobase sequence can contain all possible 6-base patterns.
To avoid recombination, we seek many such de Bruijn sequences, all of which are very dissimilar from each other.
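A single de Bruijn sequence is easy to generate with the classic Lyndon-word construction, shown below as a sketch. It returns the cyclic sequence of length 4^k; appending the first k-1 bases linearizes it. This produces one standard sequence only, and says nothing about how to build many mutually dissimilar ones; the function name and the argument order are illustrative.

```python
def de_bruijn(alphabet, k):
    """Cyclic de Bruijn sequence of order k via concatenated Lyndon words."""
    n = len(alphabet)
    a = [0] * (k + 1)
    seq = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)
```

For the DNA alphabet and k = 6 the cyclic sequence has 4^6 = 4096 bases, so the linearized form is 4101 bases long and contains every 6-base pattern exactly once, consistent with the "4 kilobase" figure above.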

Slide95

Orthogonal de Bruijn Sequences

AAACAAGCACCAGACGAGTCATACTCCCGCCTAGGCGTGATCTGTAATTTGCTTCGGTTATGGGAA
AAAGAATGAGGATAGTATCGACAGCGGGTGGCATTGTCTACGCTCAACCCTGCCGTTCCACTTTAA
AAATAACTATTACATCACGTAGATGTTTCTTGACCTCGCAGTGCGAAGGGCTGGTCCGGAGCCCAA

A set S of order-k de Bruijn sequences on an alphabet of size a are orthogonal if any string of length k+1 occurs at most once in S.

Each de Bruijn sequence corresponds to an Eulerian cycle in the de Bruijn graph.

Orthogonal sequences/tours do not share edge-pairs/turns.

Slide96

Constructing Orthogonal Sequences

We can efficiently construct sets of at least a/2 mutually orthogonal de Bruijn sequences for any k>1.
Proof idea: For the ith sequence, start with an arbitrary matching at each vertex to create a set of edge-disjoint cycles avoiding all previously used edge pairs. Enough unused pairs exist to guarantee a way to merge disconnected cycles.
Through heuristic/combinatorial search, we have constructed sets of a-1 orthogonal sequences for all reasonable a and k.
We conjecture that a-1 orthogonal sequences exist for all k, but this seems intimately connected to a well-known conjecture on Hamiltonian decompositions of line graphs (Bermond '78).

Slide97

Slide98

Genomes

Species                                                        Length (nt)        Reference
Poliovirus                                                     7,500              Cello, Paul, Wimmer 2002
Phage φX174                                                    5,386              Smith et al. 2003
Phage T7 “refactoring”                                         11,515 of 39,937   Chan, Kosuri, Endy 2005
1918 Influenza virus                                           13,500             Tumpey et al. 2005
“Phoenix” (fossil progenitor of human endogenous retrovirus)   9,472              Dewannieux et al. 2006
HERV-K (same as Phoenix)                                       9,472              Lee & Bieniasz 2007
SIVcpz                                                         9,912              Takehisa et al. 2007
Human coronavirus (SARS)                                       29,700             Donaldson et al. 2008
Mycoplasma genitalium                                          582,970            Gibson et al. 2008

Slide99

Codon-Pair Bias is conserved across species

Slide100

Codon pair bias is conserved but diverges with evolutionary distance

Slide101

Poliovirus Genome and Polyprotein Processing

Utilizes IRES in 5’ NTR to initiate translation of a single open reading frame.

Viral proteins are produced by cis-catalyzed cleavage events.

Poliovirus genome only 7.5kb in length.

[Figure: poliovirus genome map (7.5kb). Labels: 5’ NTR with cloverleaf and IRES; structural region P1 (VP4, VP2, VP3, VP1); non-structural region (P2, P3); 3’ NTR; poly(A) tail. Primary processing yields the mature structural capsid proteins and nonstructural proteins. Adapted from Wang, C.]

Slide102

Slide103

Motivation: Restriction Sites in Bacteriophages

Slide104

Why Eliminate Restriction Sites?

Restriction enzymes exist in bacteria as a defense against phages.

Phages have been proposed as an agent against bacterial infections.

A therapeutic phage might be enhanced by removing all restriction sites from its genome.

Slide105

Minimizing Sequence Patterns is Hard with Wildcards

The reduction is from 3-satisfiability. Note that every CNF clause has one pattern of literals (all false) where it fails.

The string-to-design will have n positions, each of which can be either T or G.

Each pattern to avoid is of length n, corresponding to a clause. All positions are don’t care except the three positions of the corresponding clause, cutting only when all relevant literals are negated.
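The reduction can be made concrete with a small sketch. Two conventions here are assumptions, since the slide does not fix them: T at position i means variable i is true and G means false, and "." marks a don't-care position. Clauses are written as triples of signed 1-based variable indices, and the function names are hypothetical.

```python
def clause_to_pattern(clause, n):
    """Wildcard pattern that matches exactly the assignments falsifying `clause`.

    clause: three nonzero ints; +i stands for x_i, -i for NOT x_i (1-based).
    """
    pattern = ["."] * n
    for lit in clause:
        # A positive literal is false when the variable is G (false);
        # a negated literal is false when the variable is T (true).
        pattern[abs(lit) - 1] = "G" if lit > 0 else "T"
    return "".join(pattern)


def pattern_hits(pattern, s):
    """True if the wildcard pattern matches the designed string s."""
    return all(p == "." or p == c for p, c in zip(pattern, s))


def count_hits(clauses, s):
    """Number of forbidden patterns (unsatisfied clauses) present in s."""
    return sum(pattern_hits(clause_to_pattern(c, len(s)), s) for c in clauses)
```

A string with zero hits corresponds to a satisfying assignment, so deciding whether all patterns can be avoided is as hard as 3-SAT. For instance, the clause (x1 OR NOT x2 OR x3) on n=4 variables becomes the pattern "GTG.".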

Slide106

Slide107

Determine RRT of Codons

Time period: T. The ribosome spends more time on C1 and moves faster on C2.

Get read statistics surrounding a codon.

If the ribosome spends more time on a codon, more reads should be gathered (C1 at right).

If the ribosome moves faster, fewer reads will be found when the ribosome is on the codon (C2 at right).

For 10-codon reads, only the codons being decoded inside the ribosome should show peaks/troughs. Other positions should show a stable trend.

Slide108

Auto and Anti-Correlated Sequence

CUACUACUA CUGCUGCUG

CUACUA CUG CUA CUGCUG

CUA CUG CUA CUG CUA CUG

Slide109

Slide110

Slide111

Synonymous codon usage

Synonymous codons are used at unequal frequencies in the host genome

UUU F 0.46   UCU S 0.19   UAU Y 0.44   UGU C 0.45
UUC F 0.54   UCC S 0.22   UAC Y 0.56   UGC C 0.55
UUA L 0.08   UCA S 0.15   UAA * 0.30   UGA * 0.47
UUG L 0.13   UCG S 0.05   UAG * 0.23   UGG W 1.00
CUU L 0.13   CCU P 0.28   CAU H 0.42   CGU R 0.08
CUC L 0.20   CCC P 0.33   CAC H 0.58   CGC R 0.19
CUA L 0.07   CCA P 0.28   CAA Q 0.26   CGA R 0.11
CUG L 0.40   CCG P 0.11   CAG Q 0.74   CGG R 0.20
AUU I 0.36   ACU T 0.25   AAU N 0.47   AGU S 0.15
AUC I 0.47   ACC T 0.36   AAC N 0.53   AGC S 0.24
AUA I 0.17   ACA T 0.28   AAA K 0.43   AGA R 0.21
AUG M 1.00   ACG T 0.11   AAG K 0.57   AGG R 0.21
GUU V 0.18   GCU A 0.26   GAU D 0.46   GGU G 0.16
GUC V 0.24   GCC A 0.40   GAC D 0.54   GGC G 0.34
GUA V 0.12   GCA A 0.23   GAA E 0.42   GGA G 0.25
GUG V 0.46   GCG A 0.11   GAG E 0.58   GGG G 0.25

Frequency key: 0-0.1, 0.1-0.15, 0.15-1

Adapted from www.planetgene.com

Slide112

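Calculating Codon Pair Score (CPS)

$$\mathrm{CPS} = \ln\left(\frac{\text{observed}}{\text{expected}}\right) = \ln\left(\frac{F(AB)}{\dfrac{F(A)\times F(B)}{F(X)\times F(Y)}\times F(XY)}\right)$$

Here A and B are adjacent codons encoding amino acids X and Y. F(AB) is the observed genome-wide frequency of codon pair AB, F(XY) is the observed genome-wide frequency of amino acid pair XY, and the denominator is the expected genome-wide frequency of codon pair AB.

There are 3721 possible pairs (61 x 61; excluding stop codons)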
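The codon pair score can be sketched in a few lines of Python. This is a minimal illustration, assuming that raw counts over a set of coding sequences stand in for the genome-wide frequencies F(...), and that each gene is already split into codons. The function name and the `translate` mapping are hypothetical inputs, not part of the original slides.

```python
import math
from collections import Counter


def codon_pair_scores(genes, translate):
    """Return {(A, B): CPS} for every adjacent codon pair seen in `genes`.

    genes: list of genes, each a list of codons.
    translate: dict mapping codon -> amino acid.
    """
    codon, aa = Counter(), Counter()
    pair, aa_pair = Counter(), Counter()
    for gene in genes:
        for c in gene:
            codon[c] += 1
            aa[translate[c]] += 1
        for a, b in zip(gene, gene[1:]):
            pair[(a, b)] += 1
            aa_pair[(translate[a], translate[b])] += 1

    scores = {}
    for (a, b), observed in pair.items():
        x, y = translate[a], translate[b]
        # Expected count of AB if codon choice were independent of the neighbor.
        expected = codon[a] * codon[b] / (aa[x] * aa[y]) * aa_pair[(x, y)]
        scores[(a, b)] = math.log(observed / expected)
    return scores
```

Pairs that never occur are simply absent from the result; a real pipeline would need a policy for zero counts, which this sketch does not address.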

Slide113

Slide114

Encoding Genes in Alternate Reading Frames

In theory, six coding sequences/ORFs can co-exist on a single DNA sequence.

In reality, many viruses do encode overlapping genes to:

Reduce genome size

Facilitate co-expression
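The six reading frames (three offsets on each strand) can be enumerated with a short sketch. It assumes the standard genetic code and DNA input containing only A, C, G, T; the frame labels "+1" to "-3" and the function names are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, s in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

An overlapping-gene design would look for a sequence in which two of these frames both translate to the desired proteins without an internal stop.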

Slide115

Long Overlaps Exist in Viruses

Slide116

Compression Algorithm (WPMS ’06)

Worst case quadratic time

Expected time linear because overlaps are usually short

Slide117

Two arbitrary proteins cannot be significantly interleaved…

Overlapping genes in viruses evolved by losing stop codons, not design

Slide118

… Unless we are free to replace amino acids with similar residues

Slide119

Why might we want to design overlapping genes?

Inserting new genes in a bacterial host is fundamental to biotechnology.

But the host doesn’t need these genes and deletes them.

Interleaving an antibiotic resistance gene with the target gene means we can select for hosts that retain the target.

There seems to be enough flexibility to make this work.

Slide120

Slide121

[Figure: values 332,652 H; 492,388; 98,245; 131,196; 7,501; 2,340]

Slide122

Can we debilitate virus genome translation/replication by increasing the number of unfavorable synonymous codon-pairs in the virus genome?

We seek a cumulative phenotype of many mutations, each with a small effect (difficult to revert; genetically stable): “death by a thousand cuts”.

Does large scale codon pair de-optimization of the influenza genome result in attenuation?

Slide123

RNA viruses

Poliovirus is in the Picornaviridae family: (+) stranded, non-enveloped RNA viruses.

RNA viruses are the largest virus group, containing dreaded human pathogens (HIV, Ebola, SARS, Dengue, Hanta, Influenza).

High mutation rate (1/10,000 bases) confers high adaptability to changing conditions.

Slide124

Poliovirus Experiments

We first validated the SAVE methodology in poliovirus.

We maximally scrambled (rearranged) the codons in a large hunk of the P1 region. Such viruses typically grew.

We inserted large amounts of infrequent synonymous codons. These viruses were typically weak/attenuated.

Designs deoptimized according to the codon-pair bias criteria were (surprisingly) weak/attenuated.

Our attenuated viruses functioned as vaccines in mice – paper published in Science!

The story is even better in our recent Flu work, so I will present these experimental results instead.

Slide125

Future work – Sequence design tools

Slide126

Slide127

Lethal Dose 50 (LD50) of codon-pair deoptimized viruses

Virus          LD50 (PFU)
PR8 (wt)       6.1 x 10
PR8-NP Min     5.0 x 10
PR8-HA Min     1.7 x 10
PR8-PB1 Min    3.2 x 10
PR8            7.9 x 10

All deoptimized viruses attenuated in mice

Attenuation correlates roughly with amount of deoptimization

Mueller et al., Nature Biotechnology 28(7), July 2010.

Slide128

Codon Alteration Sequence Design

To achieve maximum Hamming distance without altering codon bias, we used maximum weight bipartite matching between codon positions and codons, using as weight the number of bases changed.

Restriction sites were inserted uniquely (inserted in specific areas and then eliminated everywhere else).

Certain regions were locked to preserve secondary structure.

Evaluation of secondary structure:
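The matching step described on this slide can be sketched as an assignment problem: each codon position receives one of the sequence's own codons (so codon usage is unchanged), the amino acid at each position must be preserved, and the total number of changed bases is maximized. The code below is an illustrative assumption, not the authors' implementation: a plain Hungarian-algorithm solver stands in for whatever matching routine was actually used, and locked regions and restriction sites are ignored.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def min_cost_assignment(cost):
    """Hungarian algorithm on a square matrix; returns the column chosen per row."""
    n = len(cost)
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (n + 1)      # row and column potentials
    match, way = [0] * (n + 1), [0] * (n + 1)  # match[j] = row using column j
    for i in range(1, n + 1):
        match[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0:  # flip the augmenting path back to column 0
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    result = [0] * n
    for j in range(1, n + 1):
        result[match[j] - 1] = j - 1
    return result


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def shuffle_codons(codons):
    """Reassign the sequence's own codons to positions, maximizing bases changed."""
    big = 10**6  # forbids giving a position a codon for a different amino acid
    cost = [[-hamming(old, new) if CODE[old] == CODE[new] else big
             for new in codons] for old in codons]
    return [codons[j] for j in min_cost_assignment(cost)]
```

Because the identity assignment is always feasible, the solver never needs a forbidden edge. Negating the weights turns the maximum-weight matching into the minimum-cost form the solver expects.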

Slide129

Codon use statistics in PV(M), PV-SD, and PV-AB

[Figure panels: PV(M), PV-SD, PV-AB]

Slide130

Slide131

Degeneracy Of The Genetic Code

Most amino acids can be encoded by more than one synonymous codon.

Pentapeptide: Gly - Ala - Met - Phe - Leu

Gly: GGA, GGC, GGG, GGU
Ala: GCA, GCC, GCG, GCU
Met: AUG
Phe: UUC, UUU
Leu: CUA, CUC, CUG, CUU, UUA, UUG

4 x 4 x 1 x 2 x 6 = 192 encodings
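The count of encodings is just the product of the per-residue codon degeneracies. A minimal sketch, assuming the standard genetic code and one-letter amino acid codes (the function name is hypothetical):

```python
from collections import Counter
from math import prod

# Amino acids of the 64 codons of the standard code, listed in UCAG order.
# Counting letters gives the number of synonymous codons per amino acid.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DEGENERACY = Counter(AMINO)


def count_encodings(peptide):
    """Number of distinct coding sequences for a peptide in one-letter code."""
    return prod(DEGENERACY[aa] for aa in peptide)


print(count_encodings("GAMFL"))  # Gly-Ala-Met-Phe-Leu: 4 * 4 * 1 * 2 * 6 = 192
```

The count grows exponentially with protein length, which is why the design space for synonymous recoding is so large.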

Slide132

Data from Coleman et al., Science, 2008.

Synthetic Attenuated Virus Engineering (SAVE)

Wild Type High codon pair score Low codon pair score

Conclusion: low codon pair scores can reduce genetic function.

Slide133

Open Problem 1

Is finding the encoding which includes the minimum number of forbidden patterns hard for long patterns without wildcards?

Slide134

Open Problem 2

We say that two order-k de Bruijn sequences are d-orthogonal if they contain no (k+d)-mer in common, generalizing the notion beyond d=1.

This holds the potential to create much larger families of orthogonal sequences, whose cardinality should grow exponentially with d.

How many mutually d-orthogonal sequences can you construct?

Slide135

aa pair   Codon pair   Expected   Observed   Obs./Exp.   CPS      Row
AA        GCAGCA       2856.40    4196       1.469        0.385
AA        GCAGCC       4961.56    6033       1.216        0.196
AA        GCAGCG       1341.51    1420       1.059        0.057
AA        GCAGCT       3262.97    4711       1.444        0.367
AA        GCCGCA       4961.56    1122       0.226       -1.487
AA        GCCGCC       8618.21    5141       0.597       -0.517
AA        GCCGCG       2330.20    2042       0.876       -0.132
AA        GCCGCT       5667.77    1378       0.243       -1.414
AA        GCGGCA       1341.51    1142       0.851       -0.161
AA        GCGGCC       2330.20    4032       1.730        0.548
AA        GCGGCG       630.04     2870       4.555        1.516
AA        GCGGCT       1532.46    1472       0.961       -0.040
AA        GCTGCA       3262.97    4357       1.335        0.289
AA        GCTGCC       5667.77    7014       1.238        0.213
AA        GCTGCG       1532.46    1533       1.000        0.000
AA        GCTGCT       3727.41    5562       1.492        0.400
AC        GCATGC       1343.47    554        0.412       -0.886
AC        GCATGT       1131.59    881        0.779       -0.250
...
YW        TACTGG       1609.87    2212       1.374        0.318   3716
YW        TATTGG       1308.13    706        0.540       -0.617   3717
YY        TACTAC       2256.03    2854       1.265        0.235   3718
YY        TACTAT       1833.19    1760       0.960       -0.041   3719
YY        TATTAC       1833.19    1339       0.730       -0.314   3720
YY        TATTAT       1489.60    1459       0.979       -0.021   3721

Scoring codon pairs

“Expected” = the statistically predicted frequency in the human genome at which a given aa pair is encoded by each possible synonymous codon pair (based on the genome-wide synonymous codon usage)

“Observed” = the actual genome-wide frequency of each codon pair

“Codon Pair Score - CPS” = a measure of the degree of under- or overrepresentation of the codon pair in the genome = ln(Obs./Exp.)

Underrepresented: CPS < 0; overrepresented: CPS > 0

“Codon Pair Bias - CPB” = the average of CPS values across a protein coding sequence
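The CPB definition translates directly into code. The sketch below is illustrative: the function name is hypothetical, the score table holds only three codon pairs (computed as ln(Obs./Exp.) from the observed and expected values listed for GCAGCC, GCCGCG, and GCGGCT on this slide), and a real table would cover all 3721 pairs.

```python
import math


def codon_pair_bias(cds, cps):
    """Average CPS over all adjacent codon pairs in a coding sequence.

    cds: DNA string whose length is a multiple of 3.
    cps: dict mapping a 6-base codon pair to its score.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    pairs = [a + b for a, b in zip(codons, codons[1:])]
    return sum(cps[p] for p in pairs) / len(pairs)


# CPS = ln(observed / expected) for three codon pairs from the slide's table.
CPS = {
    "GCAGCC": math.log(6033 / 4961.56),
    "GCCGCG": math.log(2042 / 2330.20),
    "GCGGCT": math.log(1472 / 1532.46),
}

# Ala-Ala-Ala-Ala encoded as GCA GCC GCG GCT has three adjacent codon pairs.
cpb = codon_pair_bias("GCAGCCGCGGCT", CPS)
```

A sequence of n codons contributes n-1 pairs, so the average is taken over n-1 scores. Deoptimization lowers CPB by choosing synonymous codons whose adjacent pairs have negative CPS.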

Slide136

Encoding Matters: Different Codon-pair Encodings Have Different Levels Of Function

Shuqi Yan*, Sangeet Honey*, Justin Gardin, Sabine Keppler-Ross, C. Ward#, S. Mueller*, S. Skiena, E. Wimmer*, B. Futcher*

Stony Brook University, *Dept. of Molecular Genetics and Microbiology, and #Dept. of Computer Science.

Slide137