Slide1
Computational analysis of promoters
Slide2
Gene regulation
Genomes usually contain several thousand different genes.
Some gene products are required by the cell under all growth conditions; the genes encoding them are called housekeeping genes.
e.g. genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …
Many other gene products are required only under specific growth conditions.
e.g. enzymes responding to a specific environmental condition such as DNA damage
Slide3
Gene regulation
Housekeeping genes must be expressed at some level all of the time.
Frequently, as the cell grows faster, more of the housekeeping gene products are needed.
The gene products required for specific growth conditions are not needed all of the time.
These genes are frequently expressed at extremely low levels, or not at all, when they are not needed, and are expressed when they are needed.
Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or external conditions.
Slide4
Gene regulation
Gene regulation occurs at three main levels:
transcriptional regulation
transcription of the gene is regulated
control of transcription initiation is the most important control mechanism
translational regulation
translation of the mRNA is regulated
How often the mRNA is translated influences the amount of gene product that is made.
post-transcriptional/post-translational regulation
regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)
Slide5
Transcriptional regulation
Transcription control has two key features:
protein-binding regulatory DNA sequences (control elements) are associated with genes
specific proteins that bind to these regulatory sequences determine where transcription will start, and either activate or repress transcription
The DNA sequence specifying where RNA polymerase binds and initiates transcription of a gene is called a promoter.
Transcription from a particular promoter is controlled by DNA-binding proteins, termed transcription factors.
DNA control elements that bind transcription factors may be located very far from the promoter they regulate.
Slide6
Three different polymerases
As a result of this arrangement, transcription from a single promoter may be regulated by binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression.
RNA polymerase I synthesizes rRNA.
RNA polymerase II synthesizes mRNA.
RNA polymerase III synthesizes small RNAs and tRNA.
Slide7
[Figure] source: Alberts B et al., Molecular Biology of the Cell, 4th edition.
Slide8
Three parts of promoter
core promoter
responsible for the actual binding of the transcription apparatus
very close upstream (~35 bp); may also be downstream, see later
proximal promoter
contains several regulatory elements
a few hundred bases upstream of the transcription start site (TSS)
distal promoter
contains enhancers (upstream/downstream) and silencers
They are cis-acting: a cis-element regulates a gene on the same DNA molecule.
cis-acting sequences are bound by trans-acting (i.e. acting from a different molecule) regulatory proteins.
However, the distinction between proximal elements and enhancers/silencers is not very clear.
Slide9
Core promoter
Eukaryotic RNAPII is not itself capable of transcriptional initiation in vitro.
It needs to be supplemented by general (basal) transcription factors (GTFs).
Factors are identified as TFIIX, where X is a letter, e.g. TFIIA, TFIIB, …
RNAPII + TFs form the pre-initiation complex (PIC). Only then can transcription commence.
minimal (core) promoter: the DNA sequence sufficient for assembly of the pre-initiation complex.
Transcription initiated by the core promoter is called basal transcription.
Slide10
Core promoter elements
The core promoter is usually located proximal to or overlapping the TSS.
It contains several sequence motifs. TFs interact with them in a sequence-specific manner.
The combination of TF-binding motifs varies depending on the gene.
Slide11
Core promoter elements
TATA box … ~30 bp upstream, consensus TATA(A/T)A(A/T)
Instead of a TATA box, some eukaryotic (TATA-less) genes contain an initiator (Inr) … surrounds the TSS, extremely degenerate consensus sequence YYAN(T/A)YYY (A = TSS, N = any nucleotide, Y = C or T)
Promoters with both TATA and Inr also exist.
DPE (downstream promoter element) in TATA-less promoters
Present in some TATA-, Inr+ promoters, 30 bp downstream. Consensus: RGWCGTG (R = A or G, W = A or T)
Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002;16(20):2583-92.
Slide12
Slide13
Promoter proximal elements
Found within 100 to 200 bp of the TSS.
CAAT (CCAAT, CAT) box … consensus GGCCAATCT
GC box … consensus (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T). It is a GC-rich segment.
A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.
Slide14
A hypothetical mammalian promoter region
[Figure labels: TATA (-30), +1 (TSS), promoter proximal element (-200), enhancers (-10 to -50 kb and +10 to +50 kb), intron, exon]
Slide15
CpG island
Transcription of genes with TATA/Inr promoters begins at a well-defined site.
However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long. As a result, such genes give rise to mRNAs with multiple alternative 5’ ends.
These are housekeeping genes; they contain neither TATA nor Inr.
Most genes of this type contain a CG-rich stretch of several hundred nucleotides, a CpG island, within ≈100 base pairs upstream of the TSS.
CpG islands are typical for vertebrates (including human). They are not common in lower eukaryotes.
Slide16
CpG island
Computational analysis is based on CG dinucleotide imbalance.
minimum length 200 bp, C+G content min 50%: M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, 261-282.
minimum length 500 bp, C+G content min 55%: D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 99, 3740-45.
[Figure labels: mRNA, multiple 5’-start sites, CpG island, ~100 bp]
Slide17
CpG island
Simple methods based on the frequency of CG perform remarkably well at correctly predicting regions containing TSSs.
EMBOSS CpGPlot/CpGReport - http://www.ebi.ac.uk/Tools/emboss/cpgplot/
CpG Island Searcher - http://cpgislands.usc.edu/ (IE only)
Example: len=251, #C=76, #G=101, #CG=30, CpGo/e=0.98
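The observed/expected CpG ratio in the example can be reproduced with a few lines of Python. This is a minimal sketch, not the code of CpGPlot or CpG Island Searcher: the function names are hypothetical, the ratio is assumed to be (#CG × length) / (#C × #G), and the 0.6 cut-off is the value commonly attributed to the Gardiner-Garden and Frommer criteria rather than something stated on the slides. The example counts assume a sequence length of 251, which is the length consistent with the reported ratio of 0.98.

```python
def obs_exp_from_counts(length, n_c, n_g, n_cg):
    """CpG observed/expected ratio from raw counts (assumed formula)."""
    return n_cg * length / (n_c * n_g)


def cpg_stats(seq):
    """Return (length, C+G fraction, CpG observed/expected) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = obs_exp_from_counts(n, c, g, cg) if c and g else 0.0
    return n, gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.50, min_obs_exp=0.60):
    """Apply length, C+G and o/e thresholds; min_obs_exp is an assumption."""
    n, gc, oe = cpg_stats(seq)
    return n >= min_len and gc >= min_gc and oe >= min_obs_exp
```

With the counts from the example, `obs_exp_from_counts(251, 76, 101, 30)` evaluates to about 0.98. Real tools scan a sliding window along the sequence; this sketch only evaluates one candidate region.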
Slide18
Promoter regions in human genes
Element                  Fraction of promoters
TATA                     32%
Inr                      85%
GC box                   97%
CAAT box                 64%
located in CpG island    48%
TATA+ Inr+               28%
TATA+ Inr-               4%
TATA- Inr+               56%
TATA- Inr-               12%
Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11(5):677-84.
Slide19
Computational analysis of promoters
Slide20
Introduction
Regulatory regions typically contain several transcription factor binding sites strung out over a large region.
Which particular factor is used depends not only on the binding site, but also on which factors are available for binding in a given cell type at a given time.
Any given gene will typically have its own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is transcribed only in the proper cell type(s) and at the proper time during development.
Slide21
Introduction
Transcription factors themselves are also subject to similar transcriptional regulation, thereby forming transcriptional cascades and feedback control loops.
While all this is very interesting from a biologist’s point of view, it spells big trouble for promoter prediction.
Slide22
Computational difficulties
There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.
Any given sequence element might be recognized by different factors in different cell types.
Core promoter regulatory elements are short and not completely conserved, so similar elements will be found purely by chance all over the genome.
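To get a feeling for how often a short degenerate motif turns up by chance, one can estimate the expected number of matches in random sequence. The sketch below is only a rough illustration: it assumes independent positions with equal base frequencies (which real genomes do not have), it counts one strand only, and the function names are hypothetical.

```python
# Number of bases each IUPAC nucleotide symbol stands for.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus):
    """Probability that a random position matches the IUPAC consensus."""
    p = 1.0
    for symbol in consensus.upper():
        p *= IUPAC_DEGENERACY[symbol] / 4.0
    return p


def expected_chance_hits(consensus, sequence_length):
    """Expected number of matches on one strand of a random sequence."""
    positions = max(sequence_length - len(consensus) + 1, 0)
    return positions * match_probability(consensus)
```

For the TATA box consensus TATA(A/T)A(A/T), written TATAWAW in IUPAC code, the match probability is 1/4096, so a random sequence of 3 billion bases would be expected to contain roughly 730,000 matches on one strand.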
Slide23
What do promoter prediction methods actually predict?
The 1st nucleotide copied at the 5’ end of the corresponding mRNA: the transcription start site (TSS).
The region around the TSS is often referred to as the core promoter.
Owing to the strong link between TSS and core promoter, these terms are often used interchangeably.
Three distinct types of promoter prediction features:
signal features
context features
structure features
Slide24
Evaluating predictions
sensitivity (Se), recall, true positive rate (TPR)
proportion of correct predictions of TSSs relative to all experimental TSSs: Se = TP / (TP + FN)
positive predictive value (PPV), precision
proportion of correct predictions of TSSs out of all counted positive predictions: PPV = TP / (TP + FP)
Slide25
Evaluating predictions
specificity (Sp)
false positive rate (FPR)
correlation coefficient (CC)
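The measures named on these slides can be computed from the four confusion-matrix counts. The formulas on the slides did not survive, so the sketch below assumes the standard definitions: Se = TP/(TP+FN), PPV = TP/(TP+FP), Sp = TN/(TN+FP), FPR = FP/(FP+TN), and CC taken to be the Matthews correlation coefficient. The function name and the dictionary keys are hypothetical.

```python
import math


def evaluation_metrics(tp, fp, tn, fn):
    """Return Se, PPV, Sp, FPR and CC from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    cc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Se": ratio(tp, tp + fn),
        "PPV": ratio(tp, tp + fp),
        "Sp": ratio(tn, tn + fp),
        "FPR": ratio(fp, fp + tn),
        "CC": ratio(tp * tn - fp * fn, cc_denominator),
    }
```

In TSS prediction the number of true negatives is often hard to define, which is one reason Se and PPV are frequently reported on their own.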
Slide26
Evaluating predictions
And how do we obtain FP, FN and TP?
You have a gene sequence for which you know the TSS location, and you make your prediction.
If it falls within the region [-2000, +2000] relative to the annotated TSS, you have a TP.
Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs.
If you predict no promoter for this gene sequence, you have an FN.
Slide27
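The counting rule from the preceding "Evaluating predictions" slide can be written down as a small function. This is a sketch under several assumptions that the slide does not spell out: positions are given relative to the annotated TSS with downstream positive, several predictions inside the window count as a single TP, predictions upstream of -2000 are ignored, and a gene with predictions but none inside the window is also counted as an FN. The function name is hypothetical.

```python
def score_gene(predictions, gene_end, window=2000):
    """Return (TP, FP, FN) for one gene.

    predictions: predicted TSS positions relative to the annotated TSS
    gene_end:    position of the end of the gene, same coordinate system
    """
    in_window = [p for p in predictions if -window <= p <= window]
    in_gene_body = [p for p in predictions if window < p <= gene_end]

    tp = 1 if in_window else 0      # assumption: at most one TP per gene
    fp = len(in_gene_body)          # every hit in the gene body is an FP
    fn = 0 if in_window else 1      # assumption: no hit in window means FN
    return tp, fp, fn
```

Summing these triples over all test genes gives the TP, FP and FN totals that enter Se and PPV.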
Signal features
Recognize “conserved” signals such as the TATA box, Inr, DPE, BRE etc.
Such motifs are highly variable and degenerate. This leads to a high false positive rate.
Methods based on core promoter elements and other specific TFBSs (e.g. CAAT box) are far from accurate.
A much more reliable signal is the CpG island feature. However, only ≈50% of human genes contain CpG islands.
CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.
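A signal feature of the simplest kind is a consensus match. The sketch below translates an IUPAC consensus into a regular expression and reports every match position, including overlapping ones. It illustrates the idea only and is not how any of the published predictors work; the function names are hypothetical, and the consensus strings in the usage note are the ones quoted earlier in these slides.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that finds overlapping hits."""
    body = "".join(IUPAC_REGEX[symbol] for symbol in consensus.upper())
    # The lookahead makes each match zero-width, so overlaps are reported.
    return re.compile("(?=(" + body + "))")


def find_motif(consensus, seq):
    """Return 0-based start positions of all consensus matches in seq."""
    pattern = consensus_to_regex(consensus)
    return [match.start() for match in pattern.finditer(seq.upper())]
```

For example, `find_motif("TATAWAW", seq)` searches for the TATA box consensus and `find_motif("YYANWYYY", seq)` for the Inr consensus. Running such a scan on any long sequence quickly shows why exact consensus matching produces so many false positives.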
Slide28
Context features
Extracted from the genomic context of promoters
Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples.
n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG)
The n-mer representation encodes contextual information of promoters and has the following advantages:
contextual information is independent of any biological signals
the distribution of n-mers may have biological significance (TFBS, CpG)
n-mers may reveal details of yet unknown promoter regions
n-mers reduce FPR while maintaining relatively high TPR (i.e. Se)
Slide29
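The n-mer representation from the context-features slide can be sketched in Python as follows. The code counts overlapping n-mers and estimates a log-odds weight for each n-mer from promoter and background training sequences. This is an illustrative scoring scheme chosen for brevity, not the method of any specific tool named in these slides; the function names and the pseudocount are assumptions.

```python
import math
from collections import Counter
from itertools import product


def nmer_counts(seq, n):
    """Count all overlapping n-mers in a DNA string."""
    s = seq.upper()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def train_log_odds(promoters, background, n, pseudocount=1.0):
    """Estimate log(P(n-mer | promoter) / P(n-mer | background))."""
    pos, neg = Counter(), Counter()
    for s in promoters:
        pos.update(nmer_counts(s, n))
    for s in background:
        neg.update(nmer_counts(s, n))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=n)]
    pos_total = sum(pos[w] for w in vocabulary) + pseudocount * len(vocabulary)
    neg_total = sum(neg[w] for w in vocabulary) + pseudocount * len(vocabulary)
    return {
        w: math.log((pos[w] + pseudocount) / pos_total)
        - math.log((neg[w] + pseudocount) / neg_total)
        for w in vocabulary
    }


def score(seq, weights, n):
    """Sum the log-odds weights of all n-mers in seq (unknown n-mers score 0)."""
    return sum(weights.get(w, 0.0) * c for w, c in nmer_counts(seq, n).items())
```

A positive score means the n-mer content of the sequence resembles the promoter training set more than the background set.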
Structure features
They originate from the DNA 3D structures that characterize proximal promoters.
DNA actually encodes in its sequence at least two independent levels of functional information:
DNA sequence: encodes proteins and their regulatory elements.
Physical and structural properties of the DNA itself.
Example:
dinucleotide properties: stacking energy, propeller twist
trinucleotide properties: bendability, nucleosome position preference
They have long-range interactions (up to 10 kbp), so they can exhibit properties not visible in the sequence.
Slide30
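A structural profile of the kind described on the structure-features slide is obtained by replacing each dinucleotide with a numeric property value and smoothing the result over a window. The sketch below shows only the mechanics. The scale is supplied by the caller; no measured stacking-energy or propeller-twist values are included here, and the function names are hypothetical.

```python
def dinucleotide_profile(seq, scale, default=0.0):
    """Map each overlapping dinucleotide to its value in `scale`.

    scale: dict from dinucleotide (e.g. "AT") to a numeric property value,
           to be taken from a published table; `default` is used for
           dinucleotides missing from the table (e.g. those containing N).
    """
    s = seq.upper()
    return [scale.get(s[i:i + 2], default) for i in range(len(s) - 1)]


def smooth(values, window):
    """Sliding-window mean; returns one value per full window."""
    if window <= 0 or window > len(values):
        return []
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        out.append(running / window)
    return out
```

With a toy scale such as `{"AT": 1.0, "TA": 2.0}` (arbitrary numbers, not physical measurements), `smooth(dinucleotide_profile("ATAT", scale), 2)` gives the smoothed profile; a threshold on such a profile is the kind of decision rule the slides attribute to EP3.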
Model for cooperative assembly of an activated transcription-initiation complex.
This figure clearly shows why structural features such as flexibility are important.
Lodish H, Berk A, Zipursky SL, et al. Molecular Cell Biology. 4th edition. New York: W. H. Freeman; 2000.
Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003;17(10):1228-37.
Slide31
Software
Signal features (two leading CpG predictors)
FirstEF: different quadratic discriminant functions for CpG and non-CpG promoters; slightly improves performance by concentrating on regions around the first exon
Eponine: TATA box and G+C-rich domain, Relevance Vector Machine
Context features
PromoterInspector: IUPAC word groups with wildcards
Structure features
McPromoter: DNA sequence, bending, DNA twist, ANN
EP3: features from [1]; prediction is based solely on a threshold imposed on the structural profile.
[1] Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255
Slide32
Integrated approaches
combine sequence, context and structural features
ARTS: SVM with sophisticated kernels; combines n-mers with structure features (e.g. twist angle, stacking energies)
does not distinguish CpG-related promoters from unrelated ones; it is not clear how it performs on non-CpG promoters
SCS: sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models; their outcomes are combined by a decision tree
CoreBoost: boosting technique with stumps; integrates core promoter signals, DNA flexibility, n-mer frequency, …
CoreBoost_HM … adds experimental histone modification data
Slide33
Boosting, stumps
Boosting
Belongs among the ensemble methods, which produce a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate rules (i.e. just a bit better than random guessing), called weak learners (WL).
Iteratively learn weak classifiers and add them to a final strong classifier.
When a WL is added, it is weighted based on its accuracy. After a WL is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight. Thus, future WLs focus more on the examples that previous WLs misclassified.
Stump
One-level decision tree (i.e. it has one root and two terminal nodes)
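The reweighting scheme described above can be illustrated with a small AdaBoost implementation that uses stumps as weak learners. This is a teaching sketch in plain Python, assuming labels of +1 and -1 and an exhaustive threshold search; it is not the CoreBoost implementation, and all function names are hypothetical.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """One-level tree: `polarity` if the feature is >= threshold, else the opposite."""
    return polarity if row[feature] >= threshold else -polarity


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best


def adaboost(X, y, rounds=10):
    """Train an ensemble of weighted stumps; y must contain +1 / -1 labels."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # accurate stumps get more say
        ensemble.append((alpha, j, t, pol))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the weighted vote of all stumps."""
    vote = sum(a * stump_predict(row, j, t, pol) for a, j, t, pol in ensemble)
    return 1 if vote >= 0 else -1
```

A single stump cannot separate the labels +1, -1, +1 at x = 1, 2, 3, but three boosting rounds on that toy data set already classify all three points correctly, which shows how weak rules combine into a stronger one.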
source: Wikipedia
Slide34
Databases
EPD (Eukaryotic Promoter Database): http://epd.vital-it.ch
manually annotated non-redundant collection of eukaryotic POL II promoters
DBTSS: http://dbtss.hgc.jp/
putative core promoter: e.g. -100 bp … +50 bp, -250 bp … +50 bp, -200 bp … +200 bp
Slide35
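Turning a window such as -100 … +50 from the databases slide into genomic coordinates needs a little care with strand and numbering. The sketch below assumes the common convention that the TSS itself is position +1 and that there is no position 0; coordinates are 1-based and inclusive, and the function names are hypothetical rather than part of EPD or DBTSS.

```python
def relative_to_genomic(tss, strand, rel):
    """Convert a TSS-relative position to a genomic coordinate.

    tss:    genomic coordinate of the TSS (position +1)
    strand: "+" or "-"
    rel:    TSS-relative position, e.g. -100 or +50 (never 0)
    """
    if rel == 0:
        raise ValueError("TSS-relative numbering has no position 0")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    offset = rel - 1 if rel > 0 else rel
    return tss + offset if strand == "+" else tss - offset


def core_promoter_window(tss, strand, upstream=-100, downstream=50):
    """Return (start, end) genomic coordinates of the window, start <= end."""
    a = relative_to_genomic(tss, strand, upstream)
    b = relative_to_genomic(tss, strand, downstream)
    return min(a, b), max(a, b)
```

For a TSS at 1000 the default window covers 900 to 1049 on the plus strand and 951 to 1100 on the minus strand, 150 bases in both cases. Sequence taken from the minus strand still has to be reverse-complemented before scanning.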
Current state of promoter prediction
CpG island promoters are easier to predict than non-CpG promoters.
CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So a biologist who wants to up- or down-regulate expression is usually not happy to hear that the gene has a CpG island promoter.
Non-CpG promoters correspond to tissue-specific expression, and they are the bottleneck in accurate promoter prediction.
The best approach is to use transcription data. Alignment of the 5’ ends of ESTs or full-length cDNAs can be indicative of the promoter sequence. However, cDNAs are often truncated at the 5’ end and miss part of the 5’ UTR. This is overcome by new mRNA cap cloning techniques (DBTSS).
Slide36
Future directions
False positives are still the main problem.
This is because information about chromatin structure is missing from prediction models.
Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and leads to more FPs (and FNs because of the extra noise).
Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA.
source: http://www.nida.nih.gov/NIDA_notes/NNvol21N4/gene.html
Slide37
PPP evaluation and comparison
Papers: whole-genome comparison and its update
Slide38
Motif discovery
So far we have discussed only one of the problems in computational promoter analysis: localization of the core promoter (TSS prediction).
Another related problem is the identification of cis-regulatory elements: motif discovery.
Slide39
Motif discovery
Slide40
References
Biology of transcriptional regulation
Pedersen AG, Baldi P, Chauvin Y, Brunak S. The biology of eukaryotic promoter prediction - a review. Comput. Chem. 1999 Jun 15;23(3-4):191-207.
A comprehensive list of features and references to their models may be found in Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255-64. PMID: 16049029
Slide41
Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform. 2010;7(3):550-62. PubMed PMID: 20671324.
Its Introduction contains a very nice overview of sequence, context and structure features and lists promoter prediction software using these features.
Slide42
Large-scale software comparison
Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol. 2004 Nov;22(11):1467-73. PMID: 15529174.
Slide43
IUPAC words