
Computational analysis of promoters - PowerPoint presentation

sherrill-nordquist
Uploaded On 2017-10-26



Presentation Transcript

Slide1

Computational analysis of promoters

Slide2

Gene regulation

Genomes usually contain several thousand different genes.

Some gene products are required by the cell under all growth conditions; the genes that encode them are called housekeeping genes.

e.g. genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …

Many other gene products are required only under specific growth conditions.

e.g. enzymes that respond to a specific environmental condition such as DNA damage

Slide3

Gene regulation

Housekeeping genes must be expressed at some level all of the time.

Frequently, as the cell grows faster, more of the housekeeping gene products are needed.

The gene products required for specific growth conditions are not needed all of the time.

These genes are frequently expressed at extremely low levels, or not at all, when they are not needed, and are switched on when they are needed.

Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or external conditions.

Slide4

Gene regulation

Gene regulation occurs at three main levels:

transcriptional regulation: transcription of the gene is regulated; control of transcription initiation is the most important control mechanism

translational regulation: translation of the gene is regulated; how often the mRNA is translated influences the amount of gene product that is made

post-transcriptional/post-translational regulation: regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)

Slide5

Transcriptional regulation

Transcription control has two key features:

protein-binding regulatory DNA sequences (control elements) are associated with genes

specific proteins that bind to these regulatory sequences determine where transcription will start, and either activate or repress it

The DNA sequence that specifies where RNA polymerase binds and initiates transcription of a gene is called a promoter.

Transcription from a particular promoter is controlled by DNA-binding proteins, termed transcription factors.

The DNA control elements that bind transcription factors may be located very far from the promoter they regulate.

Slide6

Three different polymerases

Because control elements can lie far from the promoter, transcription from a single promoter may be regulated by the binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression.

RNA polymerase I synthesizes rRNA.

RNA polymerase II synthesizes mRNA.

RNA polymerase III synthesizes tRNA and other small RNAs.

Slide7

source: Molecular Biology of the Cell. 4th edition. Alberts B.

Slide8

Three parts of a promoter

core promoter: responsible for the actual binding of the transcription apparatus; lies very close to the TSS, mostly upstream (~35 bp), but may also be downstream (see later)

proximal promoter: contains several regulatory elements; a few hundred bases upstream of the transcription start site (TSS)

distal promoter: contains enhancers (upstream or downstream) and silencers

These elements are cis-acting: a cis-element regulates a gene on the same DNA molecule.

cis-acting sequences are bound by trans-acting (i.e. acting from a different molecule) regulatory proteins.

However, the distinction between proximal elements and enhancers/silencers is not very clear.

Slide9

Core promoter

Eukaryotic RNA polymerase II (RNAPII) is not by itself capable of transcription initiation in vitro.

It needs to be supplemented by general (basal) transcription factors (GTFs).

These factors are named TFIIX, where X is a letter, e.g. TFIIA, TFIIB, …

RNAPII and the GTFs form the pre-initiation complex (PIC). Only then can transcription commence.

minimal (core) promoter – the DNA sequence sufficient for assembly of the pre-initiation complex.

Transcription initiated by the core promoter is called basal transcription.

Slide10

Core promoter elements

The core promoter is usually located proximal to or overlapping the TSS.

It contains several sequence motifs. TFs interact with them in a sequence-specific manner.

The combination of TF-binding motifs varies depending on the gene.

Slide11

Core promoter elements

TATA box … ~30 bp upstream, consensus TATA(A/T)A(A/T)

Instead of a TATA box, some eukaryotic (TATA-less) genes contain an initiator (Inr) … surrounds the TSS, extremely degenerate consensus sequence YYAN(T/A)YYY (A – TSS, N – any nucleotide, Y – C or T)

Promoters with both TATA and Inr also exist.

DPE (downstream promoter element) … present in some TATA-, Inr+ promoters, ~30 bp downstream, consensus RGWCGTG (R = A or G, W = A or T)

Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002;16(20):2583-92.

Slide12
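The degenerate consensus sequences of the core promoter elements can be searched for directly once the IUPAC ambiguity codes are expanded. The following is a minimal sketch, not a method taken from the slides: the function names and the toy sequence are made up for illustration, and only the TATA box and DPE consensus strings come from the text (written with IUPAC codes, so TATA(A/T)A(A/T) becomes TATAWAW).

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())

def find_motif(sequence, consensus):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # The lookahead makes finditer report overlapping occurrences too.
    lookahead = re.compile("(?=" + consensus_to_pattern(consensus) + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]

TATA_BOX = "TATAWAW"  # TATA(A/T)A(A/T)
DPE = "RGWCGTG"       # DPE consensus as given above

# Hypothetical toy sequence, not a real promoter.
toy = "GCGCTATAAAAGGCGCGCAGACGTGCC"
print(find_motif(toy, TATA_BOX))  # [4]
print(find_motif(toy, DPE))       # [18]
```

Only the forward strand is scanned here; a real search would also scan the reverse complement and would restrict hits to plausible distances from the TSS.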
Slide13

Promoter proximal elements

Found within 100 to 200 bp of the TSS.

CAAT (CCAAT, CAT) box … consensus GGCCAATCT

GC box … consensus (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T). It is a GC-rich segment.

A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.

Slide14

A hypothetical mammalian promoter region

[Figure: schematic of a gene (exon, intron) with the TSS at +1, a TATA box at -30, a promoter proximal element at -200, and enhancers at -10 to -50 kb and +10 to +50 kb.]

Slide15

CpG island

Transcription of genes with TATA/Inr promoters begins at a well-defined site.

However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long. As a result, such genes give rise to mRNAs with multiple alternative 5’ ends.

These are typically housekeeping genes; they do not contain a TATA box or an Inr.

Most genes of this type contain a CG-rich stretch of several hundred nucleotides – a CpG island – within ≈100 base pairs upstream of the TSS.

CpG islands are typical of vertebrates (including human). They are not common in lower eukaryotes.

Slide16

CpG island

Computational analysis is based on the CG dinucleotide imbalance.

length min. 200 bp, C+G content min. 50%, observed/expected CpG ratio min. 0.6: M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, 261-282.

length min. 500 bp, C+G content min. 55%, observed/expected CpG ratio min. 0.65: D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 99, 3740-45.
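A window-based scan is one simple way to apply such criteria. The sketch below is an illustration under stated assumptions, not the algorithm of either cited paper: it slides a fixed window along the sequence, keeps windows that satisfy a C+G threshold and an observed/expected CpG threshold, and merges overlapping windows. The default 200 bp and 50% follow the first criterion above; the observed/expected cut-off of 0.6 is the value commonly quoted for it and is an assumption here, as are all function names.

```python
def gc_content(seq):
    """Fraction of C and G bases in seq."""
    return (seq.count("C") + seq.count("G")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def cpg_windows(seq, window=200, min_gc=0.50, min_oe=0.60):
    """Start positions of all windows that pass both thresholds."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        if gc_content(w) >= min_gc and cpg_obs_exp(w) >= min_oe:
            hits.append(start)
    return hits

def merge_windows(starts, window):
    """Merge overlapping or touching windows into (start, end) intervals."""
    islands = []
    for s in starts:
        if islands and s <= islands[-1][1]:
            islands[-1][1] = max(islands[-1][1], s + window)
        else:
            islands.append([s, s + window])
    return [tuple(interval) for interval in islands]

# Toy sequence: an artificial CG-rich block flanked by AT repeats.
toy = "AT" * 150 + "CG" * 150 + "AT" * 150
print(merge_windows(cpg_windows(toy), 200))  # [(200, 700)]
```

Note that the reported interval is wider than the CG block itself (positions 300 to 600), because every window that is at least half CG-rich passes; published tools add trimming steps to sharpen the boundaries.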

 

[Figure: CpG island (~100 bp) with multiple 5’ start sites giving rise to the mRNA.]

Slide17

CpG island

Simple methods based on the frequency of CG perform remarkably well at correctly predicting regions containing TSSs.

EMBOSS CpGPlot/CpGReport - http://www.ebi.ac.uk/Tools/emboss/cpgplot/

CpG Island Searcher - http://cpgislands.usc.edu/ (IE only)

Example: len=251, #C=76, #G=101, #CG=30 gives CpG o/e = 0.98
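The example numbers can be checked with the usual observed/expected formula, CpG o/e = (#CG × N) / (#C × #G). A small sketch follows; the function name is made up, and the sequence length N = 251 is an assumption: it is the length consistent with the stated counts and with the ratio of 0.98 (the counts alone already exceed 51 bases).

```python
def cpg_obs_exp_from_counts(n_c, n_g, n_cg, length):
    """Observed/expected CpG ratio from base and dinucleotide counts."""
    return n_cg * length / (n_c * n_g)

# Counts from the example; length 251 is assumed (see the note above).
ratio = cpg_obs_exp_from_counts(n_c=76, n_g=101, n_cg=30, length=251)
gc_fraction = (76 + 101) / 251

print(round(ratio, 2))        # 0.98
print(round(gc_fraction, 3))  # 0.705
```

With a C+G content of about 70% and an o/e ratio close to 1, such a region would pass the C+G and o/e thresholds of both sets of criteria, but only the 200 bp length requirement.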

Slide18

Promoter regions in human genes

Element                  Frequency
TATA                     32%
Inr                      85%
GC box                   97%
CAAT box                 64%
located in CpG island    48%

Combination              Frequency
TATA+ Inr+               28%
TATA+ Inr-               4%
TATA- Inr+               56%
TATA- Inr-               12%

Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11(5):677-84.

Slide19

Computational analysis of promoters

Slide20

Introduction

Regulatory regions typically contain several transcription factor binding sites strung out over a large region.

Which particular factor is used depends not only on the binding site, but also on which factors are available for binding in a given cell type at a given time.

Any given gene will typically have its very own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is transcribed only in the proper cell type(s) and at the proper time during development.

Slide21

Introduction

Transcription factors themselves are also subject to similar transcriptional regulation, thereby forming transcriptional cascades and feedback control loops.

While this is all very nice and interesting from a biologist’s point of view, it spells big trouble for promoter prediction.

Slide22

Computational difficulties

There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.

Any given sequence element might be recognized by different factors in different cell types.

Core promoter regulatory elements are short and not completely conserved, so similar elements will be found purely by chance all over the genome.
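How often a short degenerate motif turns up by chance can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation under loudly stated assumptions: bases are treated as independent and equally frequent (real genomes are neither), only one strand is counted, and the genome length of 3 × 10^9 bp is a round illustrative figure for a human-sized genome. The motif TATAWAW is the TATA box consensus TATA(A/T)A(A/T) written with IUPAC codes; the function names are made up.

```python
from fractions import Fraction

# Number of bases allowed by each IUPAC symbol.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

def match_probability(consensus):
    """Probability that a random position matches the consensus."""
    p = Fraction(1)
    for symbol in consensus:
        p *= Fraction(DEGENERACY[symbol], 4)
    return p

def expected_hits(consensus, genome_length):
    """Expected number of chance matches on one strand."""
    positions = genome_length - len(consensus) + 1
    return float(match_probability(consensus) * positions)

print(match_probability("TATAWAW"))  # 1/4096
print(round(expected_hits("TATAWAW", 3_000_000_000)))
```

Under these assumptions a TATA-like heptamer is expected several hundred thousand times per strand, far more often than there are genes, which is why a motif match alone is weak evidence for a promoter.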

Slide23

What do promoter prediction methods actually predict?

The 1st nucleotide copied at the 5’ end of the corresponding mRNA – the transcription start site (TSS).

The region around the TSS is often referred to as the core promoter.

Owing to the strong link between the TSS and the core promoter, these terms are often used interchangeably.

Three distinct types of features are used for promoter prediction:

signal features

context features

structure features

Slide24

Evaluating predictions

sensitivity (Se), recall, true positive rate (TPR): the proportion of correct predictions of TSSs relative to all experimental TSSs, Se = TP / (TP + FN)

positive predictive value (PPV), precision: the proportion of correct predictions of TSSs out of all counted positive predictions, PPV = TP / (TP + FP)

Slide25

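Evaluating predictions

specificity (Sp): Sp = TN / (TN + FP)

false positive rate (FPR): FPR = FP / (FP + TN) = 1 - Sp

correlation coefficient (CC): CC = (TP·TN - FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))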
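These measures are all computed from the four confusion-matrix counts. The following sketch assumes the standard definitions, with CC taken to be the Matthews correlation coefficient; the function name and the example counts are invented for illustration and do not come from any benchmark.

```python
from math import sqrt

def ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def evaluate(tp, fp, tn, fn):
    """Standard evaluation measures from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Se": ratio(tp, tp + fn),    # sensitivity, recall, TPR
        "PPV": ratio(tp, tp + fp),   # precision
        "Sp": ratio(tn, tn + fp),    # specificity
        "FPR": ratio(fp, fp + tn),   # equals 1 - Sp
        "CC": ratio(tp * tn - fp * fn, denom),
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = evaluate(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts Se and PPV are both 0.8, Sp is 880/900, and CC is 70000/90000, about 0.778. In promoter prediction TN is hard to define because almost every genomic position is a true negative, which is one reason Se and PPV are usually the pair that gets reported.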

Slide26

Evaluating predictions

And how to obtain FP, FN, TP?

You have a gene sequence for which you know the TSS location, and you make your prediction.

If the prediction falls within the region [-2000, +2000] relative to the annotated TSS, you have a TP.

Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs.

If you predict no promoter for this gene sequence, you have an FN.

Slide27

Signal features

Recognize “conserved” signals such as the TATA box, Inr, DPE, BRE etc.

Such motifs are highly variable and degenerate. This leads to a high false positive rate.

Methods based on core promoter elements and other specific TFBSs (e.g. the CAAT box) are far from being accurate.

A much more reliable signal is the CpG-island feature. However, only ≈50% of human genes contain CpG islands.

CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.

Slide28

Context features

Extracted from the genomic context of promoters.

Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples.

n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG).
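As a rough illustration of an n-mer representation, a sequence can be turned into a fixed-length vector of n-mer frequencies, which a classifier can then be trained on. This is a minimal sketch with made-up function names, not the feature extraction of any particular tool; windows containing symbols other than A, C, G, T are counted in the total but have no slot in the vector.

```python
from collections import Counter
from itertools import product

def count_nmers(seq, n):
    """Count all overlapping n-mers in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def nmer_vector(seq, n):
    """Relative n-mer frequencies in lexicographic order (length 4**n)."""
    counts = count_nmers(seq, n)
    total = sum(counts.values())
    alphabet = ["".join(p) for p in product("ACGT", repeat=n)]
    return [counts[nmer] / total if total else 0.0 for nmer in alphabet]

# A CpG-rich toy sequence: only the CG and GC slots are non-zero.
vector = nmer_vector("CGCGCG", 2)
print(len(vector))  # 16
print(vector[6])    # frequency of "CG": 0.6
print(vector[9])    # frequency of "GC": 0.4
```

With n = 6, as used by some of the tools discussed later, the vector has 4^6 = 4096 entries per sequence window.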

The n-mer representation encodes contextual information of promoters and has the following advantages:

contextual information is independent of any biological signals

the distribution of n-mers may have biological significance (TFBS, CpG)

n-mers may reveal details of yet unknown promoter regions

n-mers reduce FPR while maintaining relatively high TPR (i.e. Se)

Slide29

Structure features

They originate from the DNA 3D structures that characterize proximal promoters.

DNA actually encodes in its sequence at least two independent levels of functional information:

the DNA sequence itself – encodes proteins and their regulatory elements

the physical and structural properties of the DNA

Examples:

dinucleotide properties – stacking energy, propeller twist

trinucleotide properties – bendability, nucleosome position preference

They have long-range interactions (up to 10 kbp), so they can exhibit properties not visible in the sequence.

Slide30

Model for cooperative assembly of an activated transcription-initiation complex.

This figure clearly shows why structural features such as flexibility are important.

Molecular Cell Biology. 4th edition. Lodish H, Berk A, Zipursky SL, et al. New York: W. H. Freeman; 2000.

Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003;17(10):1228-37.

Slide31

Software

Signal features (two leading CpG predictors)

FirstEF – different quadratic discriminant functions for CpG and non-CpG promoters; slightly improves performance by concentrating on regions around the first exon

Eponine – TATA box and G+C-rich domain, Relevance Vector Machine

Context features

PromoterInspector – IUPAC word groups with wildcards

Structure features

McPromoter – DNA sequence, bending, DNA twist, ANN

EP3 – features from [1]; prediction is based just on a threshold imposed on the structural profile

[1] Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255.

Slide32

Integrated approaches

combine sequence, context and structural features

ARTS – SVM with sophisticated kernels; combines n-mers with structure features (e.g. twist angle, stacking energies); does not distinguish CpG-related promoters from unrelated ones, so it is not clear how it performs on non-CpG promoters

SCS – sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models; their outcomes are combined by a decision tree

CoreBoost – boosting technique with stumps; integrates core promoter signals, DNA flexibility, n-mer frequency, …

CoreBoost_HM … adds experimental histone modification data

Slide33

Boosting, stumps

Boosting

Belongs among the ensemble methods, which produce a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate rules (weak learners, WLs) that are just a bit better than random guessing.

Weak classifiers are learned iteratively and added to a final strong classifier.

When a WL is added, it is weighted according to its accuracy. After a WL is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight. Thus, future WLs focus more on the examples that previous WLs misclassified.

Stump

A one-level decision tree (i.e. it has one root and two terminal nodes).
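The reweighting scheme described above can be sketched with AdaBoost, a widely used boosting algorithm, using decision stumps as weak learners. This is a generic textbook-style illustration in pure Python and makes no claim about how CoreBoost is implemented; the function names and the toy data are invented. Labels are +1 and -1.

```python
import math

def best_stump(X, y, w):
    """Stump (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, row):
    """Predict +1 or -1 for one example with a single stump."""
    _, j, thr, pol = stump
    return pol if row[j] <= thr else -pol

def adaboost(X, y, rounds=3):
    """Train an ensemble of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # accurate WLs get more say
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate (+ + - - + +).
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])  # [1, 1, -1, -1, 1, 1]
```

On this toy set the best single stump misclassifies two of the six points, while the weighted vote of three stumps classifies all six correctly, which is the point of combining weak learners.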

source: Wikipedia

Slide34

Databases

EPD – Eukaryotic Promoter Database, http://epd.vital-it.ch: a manually annotated, non-redundant collection of eukaryotic POL II promoters

DBTSS, http://dbtss.hgc.jp/

putative core promoter: e.g. -100 bp … +50 bp, -250 bp … +50 bp, -200 bp … +200 bp

Slide35

Current state of promoter prediction

CpG island promoters are easier to predict than non-CpG promoters.

CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So a biologist who wants to up- or down-regulate expression is usually not happy to hear that the gene has a CpG island promoter.

Non-CpG promoters correspond to tissue-specific expression, and they are the bottleneck in accurate promoter prediction.

The best approach is to use transcription data. Alignment of the 5’ ends of ESTs or full cDNAs can be indicative of the promoter sequence. However, conventional cDNAs are often truncated at the 5’ end and so may lack the true start of the 5’ UTR. This is overcome by new mRNA cap cloning techniques, as used in DBTSS.

Slide36

Future directions

False positives are still the main problem.

This is because information about chromatin structure is missing from prediction models.

Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and leads to more FPs (and FNs because of the extra noise).

Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA.

source: http://www.nida.nih.gov/NIDA_notes/NNvol21N4/gene.html

Slide37

PPP evaluation and comparison

articles: Whole genome and update

Slide38

Motif discovery

So far we have discussed only one of the problems in computational promoter analysis: localization of the core promoter (TSS prediction).

Another related problem is the identification of cis-regulatory elements – motif discovery.

Slide39

Motif discovery

Slide40

References

Biology of transcriptional regulation

Pedersen AG, Baldi P, Chauvin Y, Brunak S. The biology of eukaryotic promoter prediction - a review. Comput. Chem. 1999 Jun 15;23(3-4):191-207.

A comprehensive list of features and references to their models may be found in Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255-64. PMID: 16049029

Slide41

Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform. 2010;7(3):550-62. PubMed PMID: 20671324.

Its Introduction contains a very nice overview of sequence, context and structure features and lists promoter prediction software using these features.

Slide42

Large-scale software comparison

Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol. 2004 Nov;22(11):1467-73. PMID: 15529174.

Slide43

IUPAC words