Slide1
An Overview of Gene Structure & Function Prediction
Marcus Chibucos, Ph.D.
University of Maryland School of Medicine
June 2014
Slide2
Overview & goals
Understand
1. How we predict the presence & structure of coding & non-coding genes in the genome
2. How we know what a gene product does & how evidence is used to support this
When searching databases like FungiDB or InterPro, understand the meaning of terms like: protein motif, domain, ortholog, HMM, EC, GO annotation, and so forth
Learn fundamentals with prokaryotes
Overview of eukaryotes
Slide3
Gene Structural Annotation
Slide4
What is a gene model?
Yandell and Ence (2012) Nature Reviews Genetics. 13:329-342.
Slide5
Fundamental methods of pattern detection
Intrinsic (ab initio / de novo, “from the beginning”)
Uses only DNA sequence & the inherent patterns within it
Canonical features like start & stop codons
Extrinsic
Uses additional sources of evidence
Homologous proteins
mRNA (ESTs, RNA-Seq)
Synteny
Slide6
Prokaryotic Structural Annotation
Slide7
Prokaryotic gene structure
[Figure: prokaryotic gene structure. DNA: promoter, RBS, start codon (ATG), open reading frame (ORF), stop codon (TAG). mRNA: RBS, start codon (AUG), stop codon (UAG).]
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences,
University of Maryland School of Medicine, 2013Slide8
Start with DNA sequenceSlide9
DNA sequence has 6 translation frames
3 on forward strand, 3 on reverse strand
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences,
University of Maryland School of Medicine, 2013Slide10
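The six translation frames described above can be enumerated with a few lines of Python. This is a minimal sketch, not the code behind the figure; the function names and the labelling of reverse-strand frames as -1 to -3 are assumptions made for illustration.

```python
# Minimal sketch: split a DNA string into codons for all six reading frames.
# Function names and the frame-labelling convention are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Map frame labels (+1..+3, -1..-3) to lists of complete codons."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping any
            # incomplete codon at the end.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


frames = six_frames("ATGAAATAGC")
for label, codons in frames.items():
    print(label, " ".join(codons))
```

Three frames come from the forward strand and three from its reverse complement, which is why every gene finder has to consider both strands.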
Graphical display of 6-frame translation
Each horizontal bar represents one of the translation frames.
Tall vertical lines represent translation stops (TAG, TAA, TGA).
Short vertical lines represent translation starts (ATG, GTG, TTG).
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences,
University of Maryland School of Medicine, 2013Slide11
Graphical display of 6-frame translation
What is an “ORF”?
These are examples of the many ORFs in this graphic.
[Figure labels: start, stop]
Slide12
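An ORF, as pictured in the 6-frame display, runs from a start codon to the next in-frame stop codon. The sketch below scans only the three forward frames, using the start codons (ATG, GTG, TTG) and stop codons (TAG, TAA, TGA) listed earlier. The function name, the `min_codons` parameter, and the choice to keep the first start after each stop are illustrative assumptions, not the behaviour of any particular gene finder.

```python
# Minimal sketch of forward-strand ORF detection (illustrative only).
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAG", "TAA", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ORFs in the three forward frames.

    Each ORF begins at the first start codon seen after the previous stop and
    ends at the next in-frame stop codon. `end` is exclusive and includes the
    stop codon; `min_codons` counts both the start and the stop codon.
    """
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


orfs = find_orfs("CCATGAAATTTTAGCC")
print(orfs)
```

Running the same function on the reverse complement would cover the other three frames.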
Prokaryotic gene finders
Glimmer: http://www.cbcb.umd.edu/software/glimmer (prok and euk versions)
Prodigal: http://prodigal.ornl.gov
GeneMark: http://exon.gatech.edu (prok and euk versions)
EasyGene: http://www.cbs.dtu.dk/services/EasyGene
Many others exist (or have existed...)
Slide13
Glimmer
Glimmer uses interpolated Markov models (IMMs) to predict which ORFs in a genome contain real genes.
Glimmer compares the nucleotide patterns in a training set of genes known (or believed) to be real with the nucleotide patterns of ORFs in the whole genome. ORFs whose patterns resemble those of the training genes are predicted to be real genes themselves.
Using Glimmer is a two-part process
Train Glimmer with genes from the sequenced organism that are known, or strongly believed, to be real genes.
Run trained Glimmer against the entire genome sequence.
This is how most ab initio gene predictors work, including eukaryotic predictors like Augustus, GeneID, SNAP, and others.
Slide14
Gathering the training set
[Figure labels: “these”, “not these”]
Using verified, published sequences is ideal… not always possible
Minimum needed is 250 kb of total sequence
BLAST translated ORFs against a protein database (slow)
Keep only very strong matches
Gather long non-overlapping ORFs (fast)
Many more complex strategies exist, especially for eukaryotes
Slide15
Training Glimmer
All k-mers of size 5-8 in the sequence are tracked
Frequency of each nucleotide following any given k-mer is recorded
This data set is used to build a statistical model that provides the probability that any given nucleotide will follow any given k-mer
This model is used to score the ORFs in the genome
Those where the patterns of nucleotides/k-mers match the model are predicted to be real genes
Slide16
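The train-then-score idea above can be sketched with a plain fixed-order Markov model. This is a deliberate simplification and not Glimmer's interpolated model: it uses a single k rather than combining k = 5-8, and the add-one pseudocount, the uniform 0.25 background, the toy training sequence, and all function names are assumptions made for illustration.

```python
# Simplified fixed-order Markov sketch of "train on known genes, score ORFs".
# NOT Glimmer's interpolated Markov model; names and constants are assumptions.
import math
from collections import defaultdict


def train(seqs: list[str], k: int):
    """Count how often each nucleotide follows each k-mer in the training genes."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def log_odds(seq: str, counts, k: int) -> float:
    """Log-likelihood ratio of seq under the model versus a uniform background."""
    score = 0.0
    for i in range(len(seq) - k):
        following = counts.get(seq[i:i + k], {})
        total = sum(following.values())
        # Add-one pseudocount so unseen contexts do not give probability zero.
        p = (following.get(seq[i + k], 0) + 1) / (total + 4)
        score += math.log(p / 0.25)
    return score


# Toy training "gene" (hypothetical) and two candidate sequences.
model = train(["ATGGCGGCGGCGGCGTAA"], k=2)
gene_like = log_odds("GCGGCGGCG", model, k=2)
other = log_odds("TTTTTTTTT", model, k=2)
print(gene_like, other)
```

A candidate whose k-mer transitions resemble the training genes scores above zero, while one made of unseen contexts scores about zero, which mirrors how ORFs are ranked against the model.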
Candidate ORFs
[Figure: ORFs drawn in the six reading frames, +1 to +3 and -1 to -3]
Choose a minimum length cut-off
Blue ORFs meet this minimum
Each blue ORF will be scored against the model built from the training genes
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences,
University of Maryland School of Medicine, 2013Slide17
Categorizing ORFs as genes or not
Some ORFs will score well to the model (green)
Some will not (red)
Green ORFs will be retained as predicted genes (blue arrows depicted along the DNA molecule in black at the bottom of the figure)
[Figure: six reading frames, +1 to +3 and -1 to -3]
Slide18
Potential problems to watch for
False Positives
An ORF is predicted to be a gene, but really isn’t
May result in overlaps
False Negatives
An ORF is not predicted to be a gene, but really is
May result in “gaps” in feature predictions
Wrong start site chosen
Most genes have multiple start codons near the beginning – it can be hard to determine which is the true oneSlide19
[Figure: six reading frames, +1 to +3 and -1 to -3]
Is one of these a False Positive?
Probably. Genes don’t generally overlap to this extent in prokaryotes.
What about eukaryotes?Slide20
[Figure: six reading frames, +1 to +3 and -1 to -3]
Is this a false negative?
Probably. There are not large regions without gene content in prokaryotes.
Why might this happen?
If a region of DNA is different in composition than the rest of the genome then the gene finders will score the ORFs poorly when in fact they are real genes. Different composition may come about in many ways – one common way is through lateral (or horizontal) transfer, e.g. things such as phage integration, transposition, et cetera.
What about eukaryotes?Slide21
Translation start sites
Start site frequency: ATG >> GTG >> TTG
Ribosome binding site (RBS): AG-rich sequence 5-11 bp upstream of the start codon
Similarity to match proteins, in BER & multiple alignments
Example below shows the beginning of a BER alignment. (DNA sequence reads down in columns for each codon.) Homology starts exactly at the first atg (current chosen start, aa #1). There is a favorable RBS (gagggaga) beginning 9 bp upstream of this atg. There is no reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity and it does not have an RBS).
[Figure labels: RBS upstream of chosen start; 3 possible start sites; this ORF’s upstream boundary; BER match]
Slide22
Overlap analysis
When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If neither matches anything, other considerations such as presence in a putative operon and potential start codon quality are considered. Small regions of overlap are allowed (circle).
Slide23
Interevidence regions
Areas of the genome with no genes, and areas within genes without any kind of evidence (no match to another protein, HMM, etc.), are translated in all 6 frames and searched against a non-redundant protein database. Such regions may include an entire gene in the case of “hypothetical proteins”.
Slide24
It’s not just about proteins
Can predict many genes beyond protein-coding ones
Slide25
Slide26
Slide27
Manatee genome viewer
http://manatee.igs.umaryland.edu/
http://manatee.sourceforge.net/igs/index.shtml
Slide28
Artemis gene model curation tool
http://www.sanger.ac.uk/resources/software/artemis/
Slide29
Eukaryotic Structural Annotation
29Slide30
…now things get more complicated
Eukaryotic gene structure predictionSlide31
Gene finder evaluation
Sensitivity (Sn) measures false negatives
The fraction of a known reference feature that is predicted by a gene predictor
= TP / (TP + FN)
Specificity (Sp) measures false positives
The fraction of the prediction that overlaps a known reference feature
= TP / (TP + FP)
Slide32
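The two formulas above are simple enough to check by hand, but a short sketch makes the bookkeeping explicit. The function names are illustrative, and the example counts (3 true positives with 1 false positive; 2 true positives with 1 false negative) are assumed values chosen to illustrate the ratios. Note that Sp as defined here, TP / (TP + FP), is the convention of the gene-finding literature; in general statistics the same ratio is usually called precision.

```python
# Sketch of the Sn/Sp definitions used for gene finder evaluation.
# Example counts below are assumptions chosen for illustration.
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reference features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Fraction of predictions that match the reference: TP / (TP + FP)."""
    return tp / (tp + fp)


# One extra (false positive) exon predicted alongside three correct ones.
print(sensitivity(tp=3, fn=0), specificity(tp=3, fp=1))
# One real exon missed (false negative), two predicted correctly.
print(round(sensitivity(tp=2, fn=1), 2), specificity(tp=2, fp=0))
```

The same functions apply at base, exon, transcript, or gene level; only what counts as a "feature" changes.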
[Figure: predicted exons compared with the real gene model; labels: true positives, true negatives, false positive, false negative]
Sensitivity (Sn), false negatives = TP / (TP + FN)
Specificity (Sp), false positives = TP / (TP + FP)
Prediction with a false positive: Sn = 1.0, Sp = 0.75
Prediction with a false negative: Sn = 0.67, Sp = 1.0
Prediction with only true positives: Sn = 3/(3+0) = 1.0, Sp = 3/(3+0) = 1.0
Assessed at different levels
Base
Exon (pictured above)
Transcript
Gene
Slide33
Intrinsic (ab initio) success rates
Prokaryotic – very good >95% correct
Eukaryotic – not so good ~50% correct (shown below)
http://bioinf.uni-greifswald.de/augustus/accuracy (accessed May 2013)
Slide34
Complexities of eukaryotic gene finding
Large eukaryote genomes have low coding density compared to prokaryotes where all long ORFs encode genes
Genomic repeats
Non-canonical (non-ATG) start codons
Splicing (exons & introns) - alternative splicing (40-50% genes)
Pseudogenes
Long genes or short genes
Long introns
Non-canonical introns
UTR introns
Overlapping genes on opposite strands
Nested genes, overlapping on the same strand or within an intron
Polycistronic peptide coding genes
One mRNA codes for several very short (~11 aa) peptides… regulatory function
Even if you have some RNA (helpful), transcription is not always active
Require multiple biological conditions
Slide35
Masking repeats is essential
RepeatMasker (http://www.repeatmasker.org) finds interspersed repeats & low complexity DNA sequences by comparing DNA sequence to curated, genome-specific repeat libraries
Simple Repeats – 1-5 bp duplications such as A, CA, CGG
Tandem Repeats - 100-200 bases found at centromeres & telomeres
Segmental Duplications - 10-300 kilobase blocks copied to another genomic region
Interspersed Repeats
Processed pseudogenes, retrotranscripts (short interspersed elements - SINEs): non-functional copies of RNA genes reintegrated into the genome via reverse transcriptase
DNA transposons
Retrovirus retrotransposons
Non-retrovirus retrotransposons (long interspersed elements - LINEs)
~50% of human genomic DNA currently will be masked
RepeatModeler searches for repeats ab initio and can find previously uncharacterized repeats
Slide36
Repeats yield similarities in non-homologous regions
Alkes L. Price, Neil C. Jones and Pavel A. Pevzner (June 28, 2005)
http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
[Figure: GENE1 and GENE2 compared using unmasked genomic DNA and using masked genomic DNA]
Slide37
Predicted genes that are actually repeats
[Figure: gene predictors run over repeats. Using unmasked genomic DNA: predicted models. Using masked genomic DNA: no models.]
Slide38
Factors affecting gene predictor results
Underlying algorithm
Program parameters
Training set (number and quality of models)
Extrinsic data (expression data, protein/genome alignment)
                                      Training set 1   Training set 2
GeneMark-ES (self training)           9,024            9,024
Augustus trained on Fungus            8,694            9,011
Augustus with “optimize” step         8,503            8,920
SNAP trained on Fungus                7,335            7,955
GlimmerHMM trained on Fungus          10,313           11,894
Scipio alignments with other Fungi    10,691           10,691
Trinity assemblies GMAP aligned       9,527            9,527
Trinity (Jaccard clip option on)      10,023           10,023
GLEAN consensus                       8,705            9,123
Slide39
Which model is “correct”?
Protein alignments
Consensus model
Models from three different predictors/conditionsSlide40
We rely on certain conventions
Rules are based on gene composition & signal
First, what is the basic structure of a gene?
Coding region (exon) is inside ORF of one reading frame
All exons on same strand for a given gene
Exons within a gene can have different reading frame
Inherent frequency patterns exist…Slide41
Dimer frequency distribution
Dimer frequency in protein sequence is not evenly distributed and is organism specific
Some amino acids “prefer” to be next to one another
Most dicodons are biased toward either coding or non-coding, not neutral
Expected frequency of a dimer if random = 0.25% (1/20 * 1/20)
If a dimer has lower than expected frequency, a protein is less likely to contain it… and the reasoning follows that if a sequence does contain it, that sequence is less likely to lie in a coding region
Example: In the human genome, AAA AAA appears 1% of the time in coding regions and 5% of the time in non-coding regions
Slide42
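The arithmetic behind the dimer example can be written out as a small sketch. The 1% and 5% figures are the ones quoted for AAA AAA; expressing the bias as a base-2 log ratio, and the function name, are assumptions made for illustration rather than how any specific gene finder computes its coding score.

```python
# Sketch of the dimer/dicodon bias arithmetic (illustrative assumptions noted above).
import math

# Expected frequency of any amino acid dimer if residues were random and uniform.
expected_random = (1 / 20) * (1 / 20)  # 0.0025, i.e. 0.25%


def coding_log_odds(freq_coding: float, freq_noncoding: float) -> float:
    """Positive values favour coding sequence, negative values favour non-coding."""
    return math.log2(freq_coding / freq_noncoding)


# AAA AAA: 1% of the time in coding regions, 5% in non-coding regions.
aaa_aaa = coding_log_odds(0.01, 0.05)
print(f"expected random dimer frequency: {expected_random:.2%}")
print(f"AAA AAA log2 odds (coding vs non-coding): {aaa_aaa:.2f}")
```

A negative value means that seeing the dicodon is evidence against a coding region; summing such values along a candidate exon gives a simple coding score.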
Splicing
Find all GT/AG donor/acceptor sites
Score with position-specific scoring matrix (PSSM) model
[Figure labels: splice donor, splice acceptor, poly-pyrimidine tract, branch point]
Modified from: http://en.wikipedia.org/wiki/File:Intron_miguelferig.jpg
http://en.wikipedia.org/wiki/File:Pre-mRNA_to_mRNA.svg
Slide43
Position Specific Scoring Matrix (PSSM)
[Table: counts of A, G, C, U (rows) at positions 1-8 (columns) across the five splice donor sites listed below; for example, row A = 1 1 1 0 0 0 1 1]
Let’s say you look at 5 splice donor (GU) sites:
ATCGUCGC
UCAGUGGC
CUCGUCCC
GUCGUUAC
CACGUCUA
Gene finders use this information to predict where gene features are.
For this to work, one must have confirmed splice sites to use for training. These are not always available for new genomes… and some splice sites are non-canonical… and some genes are alternatively spliced… so it can become somewhat complex.
Slide44
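The count table behind a PSSM can be tallied directly from the five example donor sites. The sketch below does only that counting step; the function name is illustrative, and treating the single T in the first site as U is an assumption (the other sites are written as RNA).

```python
# Sketch: tally a position-specific count matrix from the five example sites.
# Treating the lone "T" in the first site as "U" is an assumption.
SITES = ["ATCGUCGC", "UCAGUGGC", "CUCGUCCC", "GUCGUUAC", "CACGUCUA"]


def count_matrix(sites: list[str]) -> dict[str, list[int]]:
    """Count each nucleotide at each position; DNA 'T' is treated as RNA 'U'."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "AGCU"}
    for site in sites:
        for pos, base in enumerate(site.replace("T", "U")):
            counts[base][pos] += 1
    return counts


counts = count_matrix(SITES)
for base, row in counts.items():
    print(base, " ".join(str(n) for n in row))
```

Positions 4 and 5 are invariant (G and U in all five sites), which is exactly the GU donor signal; the flanking positions carry weaker preferences that a gene finder can still exploit.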
Translation start prediction
Position-specific scoring matrix (PSSM)
Certain nucleotides tend to be in position around start site (ATG), and others not so
Such biased nucleotide distribution is basis for translation start prediction
Figure courtesy of Sucheta Tripathy
http://www.slideshare.net/tsucheta/29th-june2011
Slide45
Mathematical model
F_i(X): frequency of nucleotide X (A, G, C, T) at position i
Score a string by Σ_i log(F_i(X_i) / 0.25), where X_i is the nucleotide observed at position i
Figure courtesy of Sucheta Tripathy
http://www.slideshare.net/tsucheta/29th-june2011
Slide46
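The scoring rule from the mathematical model, a sum over positions of log(F_i(X)/0.25), can be sketched as follows. The three-position frequency table is hypothetical and exists only to make the example runnable. Real tables are estimated from confirmed sites, and they need pseudocounts so that no frequency is exactly zero; that step is omitted here.

```python
# Sketch of log-odds scoring against a position-specific frequency table.
# FREQS is a hypothetical table, not data from the slides.
import math


def log_odds_score(seq: str, freqs: list[dict[str, float]]) -> float:
    """Sum over positions i of log(F_i(X_i) / 0.25)."""
    return sum(math.log(freqs[i][base] / 0.25) for i, base in enumerate(seq))


# Hypothetical per-position frequencies (each position sums to 1).
FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

best = log_odds_score("ATG", FREQS)
worst = log_odds_score("CCC", FREQS)
print(round(best, 3), round(worst, 3))
```

Dividing by 0.25 compares each position with a uniform background, so a score above zero means the string looks more like the modelled site than like random sequence.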
Pattern-based exon & gene prediction
Assess different criteria
Coding region inside ORF (start & stop, no interrupting stops)
Dimer frequency
Coding score
Donor site score
Acceptor site score
Other factors to consider
GC content
Exon length distribution
Polymerase II promoter elements (GC box, CCAAT box, TATA region)
Ribosome binding site
Polyadenylation signal upstream of the poly-A cleavage site
Termination signal downstream of the poly-A cleavage site
Slide47
Example of ab initio gene predictor flow
http://genome.crg.es/software/geneid/
Slide48
Confirming a predicted gene with cDNA
http://pasa.sourceforge.net/
26 exons!
Slide49
Extrinsic evidence & manual curation
Expression data
EST (expressed sequence tag) sequences
RNA-seq reads
mRNA
cDNA
High throughput sequencing
Align reads to genome sequence
Homology based approaches
Protein (or expression data) sequences from other organisms
Nucleic acid conservation via tblastx or many other methods
Ortholog mapping/synteny
Experimentally confirmed gene products & gene families
Manual curation is often done by experts in a domain
Slide50
RNA-seq of transcripts as evidence for gene models
mRNA → cDNA → sequencing reads:
GCTAATGCGAAGTCCTAGACCAGATTGAC
ATGCGATGCAGCTGACGCTGGCTAATGCG
CGCATAGCCAGATGACCATGATGCGATGC
TGACAGATTAGACAGTAGGACAGATAGAC
……..many millions of reads
Reads mapped to genome with gene models
1. Gene model is confirmed by transcript information
2. Part of the gene model is confirmed but the exons predicted in the middle do not have transcript evidence. Does this mean they are not real? Not necessarily.
3. Transcript sequencing allows for novel gene detection. There is transcript evidence for the presence of a gene (or at least transcription) in an area of the genome without a gene model currently predicted.
Slide51
Splice boundaries and alternate transcripts
Some reads will span the intron/exon boundaries
Allows for verification of gene models
Observation of alternate transcripts
Intron
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.pngSlide52
Multiple genome alignment & conservationSlide53
Experimentally based manual curation
We have experimentally characterized protein
What do I know about this gene family?
What do I know about genes in general?
No introns in multiples of three, short introns, et ceteraSlide54
Leverage comparative genomics
Arnaud, et al. (2010) Nucleic Acids Res. 38(Database issue): D420-7.
Slide55
Gather models for ab initio training set
Get models verified via expression, homology, or manual curation
Use manually curated genes from your organism
Generate preliminary ab initio model set and then do a homology search at Swiss-Prot, retaining most-conserved genes
Use CEGMA (Core Eukaryotic Genes Mapping Approach) to predict highly conserved genes
Align proteins from related organisms to your genome with a splice-aware aligner, thus creating models with exon boundaries that have homologs
Align RNA-seq reads or ESTs to your genome to create or update existing models.
Use models with multiple sources & remove highly similar ones
OR
Use pre-existing training set related to your organism
For example, I could use chicken if I am studying finch
Many software packages provide parameter files for common organisms
Slide56
Run gene finder online or stand-alone
Augustus web has text & graphical output
Click!
Predictions stored in GFF3 or GFF2 or GTF format
Slide57
RNA-seq can show differential expression of alternative transcripts
Slide58
Combiners
Incorporate multiple evidence types including ab initio predictions, expression data, and homology; these usually perform the best
Glean
Evidence Modeler (EVM)
Jigsaw
Maker (actually a whole pipeline that can be used online)
PASA (combines predicted structures with expression data)
And more…
Note that many ab initio predictors, for example Augustus, incorporate other data types such as protein alignments or expression data
Slide59
One example, the Glean combiner
Glean paper at http://genomebiology.com/2007/8/1/R13
Top track is a statistically derived combination of the ones below it
Slide60
Example of annotation pipeline
Fungal Genome Annotation Standard Operating Procedure (SOP) at JGI
Repeat masking
Mapping ESTs (BLAT) from organism and publicly available proteins from related taxa (BLASTx)
Ab initio (FGENESH, GeneMark), homology-based (FGENESH+, Genewise seeded by BLASTx against nr), EST-based (EST_map) gene prediction
EST clustering to improve gene models
Filtering overlapping gene models based on protein homology and EST support to derive “best” model
Non-coding genes with tRNAscan-SE
…ready for functional annotation
http://genome.jgi.doe.gov/programs/fungi/FungalGenomeAnnotationSOP.pdf
Slide61
nGASP – the nematode genome annotation assessment project
http://www.biomedcentral.com/1471-2105/9/549
Slide62
Take home message
Intrinsic & extrinsic prediction methods
Intrinsic gene finders need high-quality training datasets in order to produce good predictions
“Correct” gene predictions are a moving target
Note the steady decrease in the number of predicted genes as the human genome is further curated
Gene finders & gene finding pipelines produce predictions, which must be verified and refined – do not take them at face value
The more pieces of high-quality evidence you add to the process the better
In eukaryotes especially, there is not necessarily only one correct model
Slide63
Protein Functional Annotation
63Slide64
Annotation defined
annotate – to make or furnish critical or explanatory notes or comment.
-- Merriam-Webster dictionary
genome annotation – the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes.
-- Lincoln Stein, PMID 11433356
Gene Ontology (GO) annotation – the process of assigning GO terms to gene products… according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based.
-- http://www.geneontology.org
Slide65
What do our predicted genes do?
What we would like:
Experimental knowledge of function
Literature curation
Perform experiment
Not possible for all proteins in most organisms (not even close in most)
What we actually have:
Sequence similarity
Similarity to motifs, domains, or whole sequences
Protein not DNA for finding function
Shared sequence can imply shared function
All sequence-based annotations are putative until proven experimentally
Slide66
Basic set of protein annotations
protein name - descriptive common name for the protein
e.g. “ribokinase”
gene symbol - mnemonic abbreviation for the gene
e.g. “recA”
EC number - only applicable to enzymes
e.g. 1.4.3.2
role - what the protein is doing in the cell and why
e.g. “amino acid biosynthesis”
supporting evidence
accession numbers of BER and HMM matches
TmHMM, SignalP, LipoP
whatever information you used to make the annotation
unique identifier
e.g. locus ids
Slide67
Alignments/Families/Motifs
pairwise alignments
two proteins’ amino acid sequences aligned next to each other so that the maximum number of amino acids match
multiple alignments
3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column
more meaningful than pairwise alignments, since several proteins are much less likely than two to share sequence similarity by chance alone. Such shared similarity is therefore more likely to indicate shared function.
protein families
clusters of proteins that all share sequence similarity and presumably similar function
may be modeled by various statistical techniques
motifs
short regions of amino acid sequence shared by many proteins
transmembrane regions
active sites
signal peptides
Slide68
Important terms to understand
homologs
two sequences have evolved from the same common ancestor
they may or may not share the same function
two proteins are either homologs of each other or they are not. A protein cannot be more, or less, homologous to one protein than to another.
orthologs
a type of homolog where the two sequences are in different species that arose from a common ancestor. The speciation event itself created the two copies of the sequence.
orthologs often, but not always, share the same function
paralogs
a type of homolog where the two sequences have arisen due to a gene duplication within one species
paralogs will initially have the same function (just after the duplication) but as time goes by, one copy will be free to evolve new functions, as the other copy will maintain the original function. This process is called “neofunctionalization”.
xenologs
a type of homolog where the two sequences have arisen due to lateral (or horizontal) transfer
Slide69
[Figure: an ancestor gene; speciation leads to orthologs; duplication leads to paralogs; one paralog evolves a new function; lateral transfer to a different species makes xenologs]
“neofunctionalization” – the duplicated gene/protein develops a new function
Slide70
Pairwise alignments
There are numerous tools available for pairwise alignments
NCBI BLAST resources
FASTA searches
Many more
At IGS we use a tool called BER (BLAST-extend-repraze) that combines BLAST and Smith-Waterman approaches
Actually much of bioinformatics is based on reusing tools in new and creative ways…
Slide71
BER
[Figure: BER workflow]
BLAST: genome’s protein set vs. non-redundant protein database
Significant hits (using a liberal cutoff) put into mini-dbs for each protein (mini-db for protein #1, mini-db for protein #2, mini-db for protein #3, ... mini-db for protein #3000)
Modified Smith-Waterman alignment: query protein, extended by 300 nt, vs. mini-database from BLAST search
Result: BER alignment
Slide72
BER Alignment
…to look through in-frame stop codons and across frameshifts to determine if similarity continues
Slide73
73Slide74
Extensions in BER
The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence.
Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.
[Figure labels: ORFxxxxx, end5, end3, 300 bp extension on each side, search protein, match protein, FS, PM, “FS or PM?”]
normal full length match
similarity extending through a frameshift upstream or downstream into extensions
similarity extending in the same frame through a stop codon
two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs
Slide75
How do you know when an alignment is good enough to determine function?
Good question! No easy answer…
Generally, you want a minimum of 40%-50% identity over the full lengths of both query and match with conservation of all important structural and catalytic sites
However, some information can be gained from partial alignments
Domains
Motifs
BEWARE OF TRANSITIVE ANNOTATION ERRORS
75Slide76
Pitfalls of transitive annotation
Current public datasets are full of such errors
A good way to avoid transitive annotation errors is to require that in a pairwise match, the match annotation must be trusted
Be conservative
Err on the side of not making an annotation, when possibly you should, rather than making an annotation when probably you shouldn’t.
Transitive Annotation is the process of passing annotation from one protein (or gene) to another based on sequence similarity:
A → B, B → C, C → D
A’s name has passed to D from A through several intermediates.
- This is fine if A is similar to D.
- This is NOT fine if A is NOT similar to D
Transitive annotation errors are easy to make and happen often.
Slide77
Trusted annotations
It is important to know what proteins in our search database are characterized.
proteins marked as characterized from public databases
Gene Ontology repository (more on this later)
GenBank (only recently began)
UniProt proteins at “protein existence level 1”
Proteins with literature reference tags indicating characterization
Slide78
UniProt http://www.uniprot.org
Swiss-Prot
European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)
all entries manually curated
http://www.expasy.ch/sprot
annotation includes
links to references
coordinates of protein features
links to cross-referenced databases
TrEMBL
EBI and SIB
entries have not been manually curated
once they are, accessions remain the same but entries move into Swiss-Prot
http://www.expasy.ch/sprot
Protein Information Resource (PIR)
http://pir.georgetown.edu
Slide79
UniProt
79Slide80
80Slide81
81Slide82
82Slide83
83Slide84
Enzyme Commission
not sequence based
categorized collection of enzymatic reactions
reactions have accession numbers indicating the type of reaction, for example EC 1.2.1.5
http://www.chem.qmul.ac.uk/iubmb/enzyme/
http://www.expasy.ch/enzyme/
Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse
Slide85
EC number hierarchy
All ECs starting with 1 are some kind of oxidoreductase
Further numbers narrow specificity of the type of enzyme
A four-position EC number describes one particular reaction
Slide86
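Because each further position of an EC number narrows the type of enzyme, the hierarchy can be read straight off the dotted string. The helper names below are illustrative assumptions, and the sketch handles only complete numeric EC numbers, not partial ones.

```python
# Sketch: reading the hierarchy out of an EC number string.
# Helper names are illustrative; partial EC numbers are not handled.
def ec_levels(ec: str) -> list[str]:
    """Return the progressively more specific prefixes of an EC number."""
    parts = ec.split(".")
    return [".".join(parts[:n]) for n in range(1, len(parts) + 1)]


def is_oxidoreductase(ec: str) -> bool:
    """All EC numbers whose first position is 1 are oxidoreductases."""
    return ec.split(".")[0] == "1"


print(ec_levels("1.2.1.5"))
print(is_oxidoreductase("1.2.1.5"), is_oxidoreductase("2.8.1.6"))
```

Comparing prefixes like this is one way to express "partial EC" knowledge, where only the first two or three positions are supported by the evidence.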
Example entry for one specific enzyme
86Slide87
Metabolic pathway databases
KEGG
http://www.genome.jp/kegg/
MetaCyc / BioCyc
http://metacyc.org/
http://www.biocyc.org/
BRENDA
http://www.brenda-enzymes.info/
Slide88
88Slide89
89Slide90
Hidden Markov models (HMMs)
Statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and, presumably, functional similarity
Two sets routinely used for protein functional annotation
TIGRFAMs (www.tigr.org/TIGRFAMs/)
Pfam (pfam.sanger.ac.uk)
Each TIGRFAM model is assigned to a category which describes the type of functional relationship the proteins in the model have to each other
Equivalog - one specific function, e.g. “ribokinase”
Subfamily - group of related functions generally with different substrate specificities, e.g. “carbohydrate kinase”
Superfamily - different specific functions that are related in a very general way, e.g. “kinase”
Domain - not necessarily full-length of the protein, contains one functional part or structural feature of a protein, may be fairly specific or may be very general, e.g. “ATP-binding domain”
Slide91
Annotation attached to HMMs
Functionally specific HMMs have specific annotations
TIGR00433 (accession number for the model)
name: biotin synthase
category: equivalog
EC: 2.8.1.6
gene symbol: bioB
Roles:
biotin biosynthesis (TIGR 77/GO:0009102)
biotin synthase activity (GO:0004076)
Functionally general HMMs have general annotations
PF04055
name: radical SAM domain protein
category: domain
EC: not applicable
gene symbol: not applicable
Roles:
enzymes of unknown specificity (TIGR role 703)
catalytic activity (GO:0003824)
metabolism (GO:0008152)
Slide92
HMM building
Alignments of functionally related proteins act as training sets for HMM building
Statistical Model
Model specific to a family of proteins, generally found across many species
Proteins from many species
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences,
University of Maryland School of Medicine, 2013Slide93
HMM scores
When a protein is searched against an HMM it receives a bit score and an e-value indicating the significance of the match
The search protein’s score is compared with the trusted and noise cutoff scores attached to the HMM
proteins scoring above the trusted cutoff can be assumed to be members of the family
proteins scoring below the noise cutoff can be assumed NOT to be members of the family
when a protein scores between the trusted and noise cutoffs, it may or may not be a member of the family
[Figure labels: statistical model with noise (N) and trusted (T) cutoffs]
The person building the HMM will search the new HMM against a protein database and decide on the trusted and noise cutoff scores
Slide94
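The three-way decision against the trusted and noise cutoffs can be written as a small function. This is a sketch of the rule as stated, not code from any HMM package; the function name, the return labels, the example cutoffs, and the handling of scores exactly equal to a cutoff are assumptions.

```python
# Sketch of the trusted/noise cutoff rule. Boundary handling (>= trusted,
# < noise) and the example cutoff values are assumptions.
def classify_hit(score: float, trusted: float, noise: float) -> str:
    """Classify a protein's HMM bit score against the model's cutoffs."""
    if score >= trusted:
        return "member"
    if score < noise:
        return "not a member"
    return "uncertain"


print(classify_hit(score=120.0, trusted=100.0, noise=40.0))
print(classify_hit(score=60.0, trusted=100.0, noise=40.0))
print(classify_hit(score=10.0, trusted=100.0, noise=40.0))
# Cutoffs may be negative; only the score relative to them matters.
print(classify_hit(score=-10.0, trusted=-20.0, noise=-50.0))
```

The last call shows why the absolute value of a cutoff carries no meaning: a negative score can still sit above a negative trusted cutoff.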
HMM databases
Alignments of functionally related proteins act as training sets for HMM building
[Figure labels: statistical model with noise (N) and trusted (T) cutoffs]
Database of HMM models, each specific to one protein family and/or functional level
Add this model to the database
Model specific to a family of proteins, generally found across many species
Proteins from many species
Examples: Pfam and TIGRFAM
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences,
University of Maryland School of Medicine, 2013
Slide95
[Figure: four score scales (ticks at -50, 0, 100), each marking a protein’s score (P) relative to the noise (N) and trusted (T) cutoffs]
…above trusted: the protein is a member of the family the HMM models
…below noise: the protein is not a member of the family the HMM models
…in-between noise and trusted: the protein MAY be a member of the family the HMM models
...above trusted and some or all scores are negative: the protein is a member of the family the HMM models
The cutoff scores attached to HMMs are sometimes high, sometimes low, and sometimes even negative. There is no inherent meaning in how high or low a cutoff score is; the important thing is the query protein’s score relative to the trusted and noise scores.
Slide96
Orthologous groups
COGs – have not been updated in a long time
eggNOG – newer, more complete
[Figure labels: A, B, C; 1, 2, 3]
Bi-directional best BLAST
Slide97
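Bi-directional (reciprocal) best BLAST hits between two genomes can be sketched as below. The input format (query mapped to subject mapped to bit score), the protein identifiers, and the scores are all hypothetical; real orthologous-group resources such as COGs and eggNOG combine this idea across many genomes with additional rules.

```python
# Sketch of bi-directional best hits between two genomes.
# Input format, identifiers, and scores are hypothetical.
Hits = dict[str, dict[str, float]]  # query -> {subject: bit score}


def reciprocal_best_hits(a_vs_b: Hits, b_vs_a: Hits) -> list[tuple[str, str]]:
    """Return (a, b) pairs where each protein is the other's best-scoring hit."""
    best_ab = {q: max(hits, key=hits.get) for q, hits in a_vs_b.items() if hits}
    best_ba = {q: max(hits, key=hits.get) for q, hits in b_vs_a.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical BLAST bit scores for two small genomes A and B.
a_vs_b = {"A1": {"B1": 200.0, "B2": 50.0}, "A2": {"B1": 80.0, "B2": 60.0}}
b_vs_a = {"B1": {"A1": 190.0, "A2": 75.0}, "B2": {"A2": 65.0}}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here A2's best hit is B1, but B1's best hit is A1, so only the A1/B1 pair is reported as a putative ortholog pair.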
Motif searches
PROSITE - http://www.expasy.ch/prosite/
“consists of documentation entries describing protein domains, families and functional sites as well as associated patterns to identify them.”
Center for Biological Sequence Analysis - http://www.cbs.dtu.dk/
Protein Sorting (7 tools)
SignalP finds potential secreted proteins
LipoP finds potential lipoproteins
TargetP predicts subcellular location of proteins
Protein function and structure (9 tools)
TmHMM finds potential membrane spans
Post-translational modifications (14 tools)
Immunological features (9 tools)
Gene finding and splice sites (9 tools)
DNA microarray analysis (2 tools)
Small molecules (2 tools)
Slide98
One-stop shopping - InterPro
Brings together multiple databases of HMM, motif, and domain information.
Excellent annotation and documentation
http://www.ebi.ac.uk/interpro/
Slide99
Making annotations
Use the information from the evidence sources to decide what the gene/protein is doing
Assign annotations that are appropriate to your knowledge
Name
EC number
Role
Etc.
Slide100
Main Categories:
Amino acid biosynthesis
Purines, pyrimidines, nucleosides, and nucleotides
Fatty acid and phospholipid metabolism
Biosynthesis of cofactors, prosthetic groups, and carriers
Central intermediary metabolism
Energy metabolism
Transport and binding proteins
DNA metabolism
Transcription
Protein synthesis
Protein Fate
Regulatory Functions
Signal Transduction
Cell envelope
Cellular processes
Other categories
Unknown
Hypothetical
Disrupted Reading Frame
Unclassified (not a real role)
Each main category has several subcategories.
TIGR rolesSlide101
Names (and other annotations) should reflect knowledge
specific function
Example: “adenylosuccinate lyase”, purB, 4.3.2.2
varying knowledge about substrate specificity
A good example: ABC transporters
ribose ABC transporter
sugar ABC transporter
ABC transporter
Choosing the name at the appropriate level of specificity requires careful evaluation of the evidence, looking for specific characterized matches and HMMs.
family designation - no gene symbol, partial EC
“CbbY family protein”
“carbohydrate kinase, FGGY family”
hypotheticals
“hypothetical protein”
“conserved hypothetical protein”Slide102
Ontologies
Slide103
Names can be problematic….
….because humans do not always use precise and consistent terminology
Our language is riddled with
Synonyms – different names for the same thing
Homonyms – different things with the same name
This makes data mining/query difficult
What name should you assign?
What name should you use when you search UniProt or NCBI or any other database?
Slide104
Synonyms
Within any domain do people use precise & consistent language?
Take biologists, for example…
Mutually understood concepts – DNA, RNA, protein
Translation & protein synthesis
Synonym: one thing, more than one name
Enzyme Commission reactions
Standardized id, official name & alternative names
http://www.expasy.ch/enzyme/2.7.1.40Slide105
Homonyms
Different things known by same name
Common in biology
Sporulation
Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)
Endospore formation: Bacillus anthracis
Reproductive sporulation: asci & ascospores, Morchella elata (morel)
http://en.wikipedia.org/wiki/File:Morelasci.jpg
©PG Warner 2008 (accessed 17-Sep-09)
http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang=
©L Stauffer 2003 (accessed 17-Sep-09)
“Sporulation”Slide106
Standardization with controlled vocabularies (CVs)
An official list of precisely defined terms used to classify information & facilitate its retrieval
Flat list
Thesaurus
Catalog
http://www.nlm.nih.gov/nichsr/hta101/ta101014.html
A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.”
Benefits of CVs
Allow standardized descriptions
Synonyms & homonyms addressed
Can be cross-referenced externally
Facilitate electronic searchingSlide107
Ontology: CV with defined relationships
Formalizes knowledge of subject with precise textual definitions
Networked terms; child more specific (“granular”) than parent
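The parent/child structure can be sketched as a small graph in which each term lists its more general parents. The terms and relationships below are a toy example built from the ABC transporter names used earlier in these slides, not real ontology entries, and the function name `ancestors` is hypothetical.

```python
# Toy ontology: each term maps to (relationship, parent) pairs.
# A term may have more than one parent, so this is a network, not a tree.
ONTOLOGY = {
    "ribose ABC transporter": [("is_a", "sugar ABC transporter"),
                               ("is_a", "ribose transporter")],
    "sugar ABC transporter": [("is_a", "ABC transporter")],
    "ribose transporter": [("is_a", "transporter")],
    "ABC transporter": [("is_a", "transporter")],
    "transporter": [],
}


def ancestors(term, ontology):
    """Collect every less specific term reachable from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in ontology.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("ribose ABC transporter", ONTOLOGY)))
```

Because an annotation to a child term implies all of its ancestors, a search for "ABC transporter" can also retrieve proteins annotated with the more granular "ribose ABC transporter".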
National Drug FileSlide108
An example is the Gene Ontology with three controlled vocabularies
Molecular Function
What the gene product is doing
Biological Process
Why the gene product is doing what it does
Cellular component
Where a gene product is doing what it does
Slide109
The Gene Ontology
A good example of a biological ontology
Relationships among networked, defined terms
Vascular terms shown with relationships
Less specific
More granularSlide110
Example: a GO annotation
Associating GO term with gene product (GP)
GP has function (6-phosphofructokinase activity)
GP participates in process (glycolysis)
GP is located in part of cell (cytoplasm)
Linking GO term to GP asserts it has that attribute
Based on literature or computational methods
Always involves:
Learning something about gene product
Selecting appropriate GO term
Providing appropriate evidence code
Citing reference [preferably open access]
Entering information into GO annotation file
Slide111
Annotation becomes a series of ids linked to other proteins/genes/features
GO:0005887
GO:0008272
GO:0015419
GO:0043190
= This protein is integral to the plasma membrane and is part of an ATP-binding cassette (ABC) transporter complex. It functions as part of a transporter to accomplish the transport of sulfate across the plasma membrane using ATP hydrolysis as an energy source.Slide112
Term name
GO ID (unique numerical identifier)
Precise textual definition that describes some aspect of the biology of the gene product
Synonyms for searching, alt. names, misspellings…
GO slim
Ontology relationships (next page)
Definition referenceSlide113
Genomes can be compared
High-level biological process terms used to compare Plasmodium and Saccharomyces (made by “slimming”)
MJ Gardner, et al. (2002) Nature 419:498-511Slide114
Evidence
Slide115
The importance of evidence tracking
I conclude that you are a cat.
Why?
You look like other cats I know
I heard you meow and purr
Why?
You look like other protein kinases I know
You have been observed to add phosphate to proteins
I conclude that you code for a protein kinase.
The process of functional annotation involves assessing available evidence and reaching a conclusion about what you think the protein is doing in the cell and why.
Functional annotations should only be as specific as the supporting evidence allows.
All evidence that led to an annotation must be stored.
In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.Slide116
Knowledge & annotation specificity
Available evidence for three genes
Corresponding GO annotations
How much can we accurately say?Slide117
Types of Evidence
Experiments (often considered the best evidence)
Pairwise/multiple alignments
HMM/domain matches scoring above trusted cutoff
Metabolic pathway analysis
Match to an ortholog group (COG, eggNOG)
Motifs
Slide118
The Evidence Ontology (ECO)
ECO terms have standardized definitions, references & synonyms
Allows standardizing evidence description and searching by evidence type
Can filter by evidence type & do other things!
GO evidence codes are a subset of ECOSlide119
ECO roots and combinatorial termSlide120
The big picture: Evidence and sequence repositoriesSlide121
Experimental pathway
Can be described by particular evidence term
(Many ECO terms describe outputs of processes in aggregate)
Researcher performs experiment...Slide122
Publish…
Author does analysis & arrives at conclusions, publishes these in paper
A different person reads about the conclusion, interprets & makes annotation
Term from ontology or other descriptive vocabulary
Protein name, id, et cetera
How associating a piece of information with a protein was performed. Who did the work, a person or a computer?
...annotateSlide123
Annotation stored in sequence repository
The annotation with a term from a descriptive vocabulary/ontology such as the Gene Ontology
The protein sequence
The evidence term used to support the decision to associate the term from the descriptive vocabulary with that particular protein
Not shown: protein ID, date, name, et ceteraSlide124
The overall experimental annotation flow
Pure evidence term
Pure assertion method term
Evidence x assertion method cross productSlide125
Similarity pathway
…researcher or computer performs search
Protein sequence of interest
Sequence database
Resulting alignment with match protein
Evidence termSlide126
An interpretation is made and an annotation is created
Assertion method
“Protein 1 & 2 look same, so protein 1 has same function as protein 2…”
The assertionSlide127
Store in sequence repository
Evidence x assertion method cross product
Same evidence, but we know who or what did the asserting…Slide128
The overall similarity annotation flow
Pure evidence term
Pure assertion method term
Evidence x assertion method cross productSlide129
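The two annotation flows differ in the kind of evidence and in who or what made the assertion, and a stored record can keep both. The sketch below models that with one field each and combines them into a cross-product label. The class names, protein identifiers, and label strings are illustrative assumptions that follow the pattern described in these slides; they are not verified ECO term names or IDs.

```python
from dataclasses import dataclass
from enum import Enum


class AssertionMethod(Enum):
    MANUAL = "manual assertion"        # a person made the call
    AUTOMATIC = "automatic assertion"  # a computer made the call


@dataclass(frozen=True)
class EvidenceRecord:
    protein_id: str          # hypothetical identifier
    term: str                # term from an ontology or other vocabulary
    evidence: str            # pure evidence term
    method: AssertionMethod  # pure assertion method term

    def cross_product(self):
        """Combine evidence and assertion method into a single label."""
        return f"{self.evidence} used in {self.method.value}"


records = [
    EvidenceRecord("prot1", "protein kinase activity",
                   "sequence similarity evidence", AssertionMethod.AUTOMATIC),
    EvidenceRecord("prot2", "protein kinase activity",
                   "sequence similarity evidence", AssertionMethod.MANUAL),
    EvidenceRecord("prot3", "protein kinase activity",
                   "experimental evidence", AssertionMethod.MANUAL),
]

# Same evidence type, but we can still tell who or what did the asserting.
manual_similarity = [r.protein_id for r in records
                     if r.evidence == "sequence similarity evidence"
                     and r.method is AssertionMethod.MANUAL]
print(manual_similarity)
print(records[0].cross_product())
```

Keeping the two fields separate is what makes it possible to filter annotations by evidence type, by assertion method, or by both.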
Concluding remarks
Slide130
The big picture: an example pipeline
DNA Sequence (assembly, masking)
Automatic Annotation using the evidence hierarchy of Pfunc
Searches:
Pairwise BER searches against UniRef100
HMM searches against Pfam and TIGRFAM
Motif searches with LipoP, TMHMM, PROSITE
NCBI COGs
PRIAM profiles
Automated start site and gene overlap correction
translation
RNA finding: tRNAscan-SE, RNAmmer, homology searches
Predicted RNA Genes
Gene Prediction
Predicted protein coding genes
MySQL database using the Chado schema
Genome viewer/editor
Flat files of annotation informationSlide131
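An evidence hierarchy means that when several searches produce a hit for the same protein, the annotation is taken from the strongest kind of evidence available. The sketch below shows the idea only. The ranking, the evidence labels, and the function name are hypothetical and are not Pfunc's actual hierarchy.

```python
# Hypothetical ranking, strongest evidence first (not Pfunc's real order).
EVIDENCE_RANK = [
    "characterized pairwise match",
    "HMM above trusted cutoff",
    "ortholog group",
    "motif",
]


def pick_annotation(hits):
    """Choose the name backed by the highest-ranked evidence type.

    hits: list of (evidence_type, proposed_name) tuples for one protein.
    Falls back to "hypothetical protein" when no ranked evidence exists.
    """
    ranked = [hit for hit in hits if hit[0] in EVIDENCE_RANK]
    if not ranked:
        return ("none", "hypothetical protein")
    return min(ranked, key=lambda hit: EVIDENCE_RANK.index(hit[0]))


hits = [
    ("motif", "putative lipoprotein"),
    ("HMM above trusted cutoff", "carbohydrate kinase, FGGY family"),
]
print(pick_annotation(hits))
print(pick_annotation([]))
```

Falling back to "hypothetical protein" when nothing qualifies keeps the automatic step on the side of under-annotating.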
Some concluding themes…
The best annotation comes from looking at multiple sources of evidence
It is important to track and check the evidence used in an annotation
Do not assume the annotation you see on a protein is correct unless it comes from a trusted source
Always err on the side of under-annotating rather than over-annotating
Consider using UniProt (UniRef) for searches, not NCBI nr, simply for the depth of information it provides.