Slide1

An Overview of Gene Structure & Function Prediction

Marcus Chibucos, Ph.D.

University of Maryland School of Medicine

June 2014

Slide2

Overview & goals

Understand

1. How we predict presence & structure of coding & non-coding genes in the genome

2. How we know what a gene product does & how evidence is used to support this

When searching databases like FungiDB or InterPro, understand the meaning of terms like: protein motif, domain, ortholog, HMM, EC, GO annotation, and so forth

Learn fundamentals with prokaryotes

Overview of eukaryotes

Slide3

Gene Structural Annotation

Slide4

What is a gene model?

Yandell and Ence (2012) Nature Reviews Genetics. 13:329-342.

Slide5

Fundamental methods of pattern detection

Intrinsic (ab initio / de novo, “from the beginning”)

Uses only DNA sequence & the inherent patterns within it

Canonical features like start & stop codons

Extrinsic

Uses additional sources of evidence

Homologous proteins

mRNA (ESTs, RNA-Seq)

Synteny

Slide6

Prokaryotic Structural Annotation

Slide7

Prokaryotic gene structure

[Figure: a prokaryotic gene on DNA and its mRNA. DNA: promoter, RBS, start (ATG), open reading frame (ORF), stop (TAG). mRNA: RBS, start (AUG), stop (UAG).]

Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Slide8

Start with DNA sequence

Slide9

DNA sequence has 6 translation frames

3 on forward strand, 3 on reverse strand

Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Slide10

Graphical display of 6-frame translation

Each horizontal bar represents one of the translation frames.

Tall vertical lines represent translation stops (TAG, TAA, TGA).

Short vertical lines represent translation starts (ATG, GTG, TTG).

Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Slide11
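The six-frame display of starts and stops can be reproduced in a few lines. The sketch below is only an illustration, not how any particular gene finder is implemented: it treats an ORF as a start codon followed in frame by the first stop codon, always keeps the first start after the previous stop, and uses the start and stop codons listed on the slide. The function names and the `min_codons` parameter are assumptions made for this example.

```python
# Sketch: enumerate ORFs in all six translation frames of a DNA sequence.
STARTS = {"ATG", "GTG", "TTG"}   # translation starts listed on the slide
STOPS = {"TAG", "TAA", "TGA"}    # translation stops listed on the slide
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) index pairs for one frame; end is exclusive
    and the ORF includes its stop codon."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in STARTS:
            start = i  # first start codon since the previous stop
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Return (strand, frame, orf_sequence) for every ORF found."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset, min_codons):
                found.append((strand, offset + 1, s[start:end]))
    return found
```

Raising `min_codons` plays the role of a minimum length cut-off: short ORFs that arise by chance are dropped before any scoring takes place.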

What is an “ORF”?

[Figure: graphical display of 6-frame translation, with a start and a stop marked.]

These are examples of the many ORFs in this graphic.

Slide12

Prokaryotic gene finders

Glimmer http://www.cbcb.umd.edu/software/glimmer (prok and euk versions)

Prodigal http://prodigal.ornl.gov

GeneMark http://exon.gatech.edu (prok and euk versions)

EasyGene http://www.cbs.dtu.dk/services/EasyGene

Many others exist (or have existed...)

Slide13

Glimmer

Tool uses interpolated Markov models (IMMs) to predict which ORFs in a genome contain real genes.

Glimmer compares nucleotide patterns it finds in a training set of genes known (or believed) to be real to nucleotide patterns of ORFs in the whole genome. ORFs with patterns similar to the patterns in the training genes are considered real themselves.

Using Glimmer is a two-part process

Train Glimmer with genes from the organism that was sequenced, which are known, or strongly believed, to be real genes.

Run trained Glimmer against the entire genome sequence.

This is actually how most ab initio gene predictors work, including eukaryotic predictors like Augustus, GeneID, SNAP, and others.

Slide14

Gathering the training set

[Figure: ORFs to use for training (“these”) and ORFs to leave out (“not these”).]

Using verified, published sequences is ideal… not always possible

Minimum needed is 250 kb of total sequence

BLAST translated ORFs against a protein database (slow)

Keep only very strong matches

Gather long non-overlapping ORFs (fast)

Many more complex strategies exist, especially for eukaryotes

Slide15

Training Glimmer

All k-mers from size 5-8 in the sequence are tracked

Frequency of each nucleotide following any given k-mer is recorded

This data set is used to build a statistical model that provides the probability that any given nucleotide will follow any given k-mer

This model is used to score the ORFs in the genome

Those where the patterns of nucleotides/k-mers match the model are predicted to be real genes

Slide16
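The k-mer bookkeeping described for Glimmer training can be sketched as follows. This is a deliberately simplified, fixed-order illustration and not Glimmer's interpolated Markov model: it counts which nucleotide follows each k-mer of length 5 to 8, then scores a sequence with a single context length. The function names, the add-one smoothing, and the 0.25 background are assumptions made for this example.

```python
import math
from collections import defaultdict


def train(sequences, k_min=5, k_max=8):
    """Count, for every k-mer of length k_min..k_max, which base follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for k in range(k_min, k_max + 1):
            for i in range(len(seq) - k):
                counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def next_base_probability(counts, context, base, pseudocount=1.0):
    """P(base | context) with add-one smoothing over A, C, G, T (an assumption)."""
    seen = counts.get(context, {})
    total = sum(seen.values()) + 4 * pseudocount
    return (seen.get(base, 0) + pseudocount) / total


def log_odds_score(counts, seq, k=5):
    """Sum of log2(P(base | previous k bases) / 0.25) along the sequence.
    Positive totals mean the sequence resembles the training genes."""
    score = 0.0
    for i in range(k, len(seq)):
        p = next_base_probability(counts, seq[i - k:i], seq[i])
        score += math.log2(p / 0.25)
    return score
```

An ORF whose score is clearly positive follows the nucleotide patterns of the training genes; an ORF with a score near or below zero does not.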

Candidate ORFs

[Figure: candidate ORFs drawn in the six translation frames (+1, +2, +3, -1, -2, -3).]

Choose a minimum length cut-off

Blue ORFs meet this minimum

Each blue ORF will be scored against the model built from the training genes

Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Slide17

Categorizing ORFs as genes or not

Some ORFs will score well to the model (green)

Some will not (red)

Green ORFs will be retained as predicted genes (blue arrows depicted along the DNA molecule in black at the bottom of the figure)

[Figure: scored ORFs in the six translation frames (+1, +2, +3, -1, -2, -3).]

Slide18

Potential problems to watch for

False Positives

An ORF is predicted to be a gene, but really isn’t

May result in overlaps

False Negatives

An ORF is not predicted to be a gene, but really is

May result in “gaps” in feature predictions

Wrong start site chosen

Most genes have multiple candidate start codons near the beginning, so it can be hard to determine which is the true one

Slide19

[Figure: predicted genes in the six translation frames (+1, +2, +3, -1, -2, -3), two of which overlap extensively.]

Is one of these a False Positive?

Probably. Genes don’t generally overlap to this extent in prokaryotes.

What about eukaryotes?

Slide20

[Figure: predicted genes in the six translation frames (+1, +2, +3, -1, -2, -3), with a region lacking any prediction.]

Is this a false negative?

Probably. There are not large regions without gene content in prokaryotes.

Why might this happen?

If a region of DNA differs in composition from the rest of the genome, the gene finders will score its ORFs poorly even when they are real genes. Different composition may come about in many ways; one common way is lateral (or horizontal) transfer, e.g. phage integration, transposition, et cetera.

What about eukaryotes?

Slide21

Translation start sites

Start site frequency: ATG >> GTG >> TTG

Ribosome binding site (RBS): AG-rich sequence 5-11 bp upstream of the start codon

Similarity to match proteins, in BER & multiple alignments

Example below shows the beginning of a BER alignment. (DNA sequence reads down in columns for each codon.) Homology starts exactly at the first atg (current chosen start, aa #1). There is a favorable RBS (gagggaga) beginning 9 bp upstream of this atg. No reason to consider the ttg, and no justification for moving to the second atg (this would cut off some similarity and it does not have an RBS).

[Figure: BER alignment. Labels: RBS upstream of chosen start; 3 possible start sites; this ORF’s upstream boundary; BER match.]

Slide22

Overlap analysis

When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If neither matches anything, other considerations such as presence in a putative operon and potential start codon quality are considered. Small regions of overlap are allowed (circle).

Slide23

Interevidence regions

Areas of the genome with no genes, and areas within genes without any kind of evidence (no match to another protein, HMM, etc.; such regions may include an entire gene in the case of “hypothetical proteins”), are translated in all 6 frames and searched against a non-redundant protein database.

Slide24

It’s not just about proteins

Can predict many genes beyond protein-coding ones

Slide25

Slide26

Slide27

Manatee genome viewer

http://manatee.igs.umaryland.edu/

http://manatee.sourceforge.net/igs/index.shtml

Slide28

Artemis gene model curation tool

http://www.sanger.ac.uk/resources/software/artemis/

Slide29

Eukaryotic Structural Annotation

Slide30

Eukaryotic gene structure prediction

…now things get more complicated

Slide31

Gene finder evaluation

Sensitivity (Sn) measures false negatives

The fraction of a known reference feature that is predicted by a gene predictor

Sn = TP / (TP + FN)

Specificity (Sp) measures false positives

The fraction of the prediction that overlaps a known reference feature

Sp = TP / (TP + FP)

(Outside gene prediction, this ratio is usually called precision or positive predictive value.)

Slide32
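The two ratios are easy to compute once predictions have been compared with a reference. The sketch below works at the exon level and counts an exon as a true positive only when both coordinates match exactly; that matching rule, the function names, and the coordinates in the usage note are assumptions made for illustration.

```python
def exon_level_counts(reference, predicted):
    """Compare exons as exact (start, end) pairs and return (TP, FN, FP)."""
    ref, pred = set(reference), set(predicted)
    tp = len(ref & pred)   # predicted and in the reference
    fn = len(ref - pred)   # in the reference but missed
    fp = len(pred - ref)   # predicted but not in the reference
    return tp, fn, fp


def sensitivity(tp, fn):
    """Sn = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP), as defined for gene finder evaluation."""
    return tp / (tp + fp)
```

With a hypothetical reference of three exons, a prediction that adds one extra exon gives TP = 3, FP = 1 and therefore Sn = 1.0 and Sp = 0.75, while a prediction that misses one exon gives TP = 2, FN = 1 and therefore Sn of about 0.67 and Sp = 1.0.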

[Figure: three predictions compared with the real gene model, with true positives, true negatives, a false positive, and a false negative marked.]

Sensitivity (Sn) false negatives = TP / (TP + FN)

Specificity (Sp) false positives = TP / (TP + FP)

Sn = 3/(3+0) = 1.0, Sp = 3/(3+0) = 1.0 (true positives only)

Sn = 1.0, Sp = 0.75 (with a false positive)

Sn = 0.67, Sp = 1.0 (with a false negative)

Assessed at different levels

Base

Exon (pictured above)

Transcript

Gene

Slide33

Intrinsic (ab initio) success rates

Prokaryotic – very good, >95% correct

Eukaryotic – not so good, ~50% correct (shown below)

http://bioinf.uni-greifswald.de/augustus/accuracy (accessed May 2013)

Slide34

Complexities of eukaryotic gene finding

Large eukaryote genomes have low coding density compared to prokaryotes where all long ORFs encode genes

Genomic repeats

Non-canonical (non-ATG) start codons

Splicing (exons & introns) - alternative splicing (40-50% of genes)

Pseudogenes

Long genes or short genes

Long introns

Non-canonical introns

UTR introns

Overlapping genes on opposite strands

Nested genes overlapping on the same strand or in an intron

Polycistronic peptide coding genes

One mRNA codes for several very short (~11 aa) peptides… regulatory function

Even if you have some RNA (helpful), transcription is not always active

Requires multiple biological conditions

Slide35

Masking repeats is essential

RepeatMasker (http://www.repeatmasker.org) finds interspersed repeats & low complexity DNA sequences by comparing DNA sequence to curated genomic-specific libraries

Simple Repeats – 1-5 bp duplications such as A, CA, CGG

Tandem Repeats – 100-200 bases found at centromeres & telomeres

Segmental Duplications – 10-300 kilobase blocks copied to another genomic region

Interspersed Repeats

Processed pseudogenes, retrotranscripts (short interspersed elements, SINEs): non-functional copies of RNA genes reintegrated into the genome via reverse transcriptase

DNA transposons

Retrovirus retrotransposons

Non-retrovirus retrotransposons (long interspersed elements, LINEs)

~50% of human genomic DNA will currently be masked

RepeatModeler searches for repeats ab initio and can find repeats not previously characterized

Slide36

Repeats yield similarities in non-homologous regions

Alkes L. Price, Neil C. Jones and Pavel A. Pevzner (June 28, 2005)

http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt

[Figure: GENE1 and GENE2 compared using unmasked genomic DNA and using masked genomic DNA.]

Slide37

Predicted genes that are actually repeats

[Figure: gene predictors run over repeats. Labels: using masked genomic DNA; using unmasked genomic DNA; gene predictors; repeats; predicted models; no models.]

Slide38

Factors affecting gene predictor results

Underlying algorithm

Program parameters

Training set (number and quality of models)

Extrinsic data (expression data, protein/genome alignment)

Predictor / evidence                   Training set 1   Training set 2
GeneMark-ES (self training)            9,024            9,024
Augustus trained on Fungus             8,694            9,011
Augustus with “optimize” step          8,503            8,920
SNAP trained on Fungus                 7,335            7,955
GlimmerHMM trained on Fungus           10,313           11,894
Scipio alignments with other Fungi     10,691           10,691
Trinity assemblies GMAP aligned        9,527            9,527
Trinity (Jaccard clip option on)       10,023           10,023
GLEAN consensus                        8,705            9,123

Slide39

Which model is “correct”?

[Figure: tracks showing protein alignments, a consensus model, and models from three different predictors/conditions.]

Slide40

We rely on certain conventions

Rules are based on gene composition & signal

First, what is the basic structure of a gene?

Coding region (exon) is inside ORF of one reading frame

All exons on same strand for a given gene

Exons within a gene can have different reading frames

Inherent frequency patterns exist…

Slide41

Dimer frequency distribution

Dimer frequency in protein sequence is not evenly distributed and is organism specific

Some amino acids “prefer” to be next to one another

Most dicodons are biased toward either coding or non-coding, not neutral

Expected frequency of a dimer if random = 0.25% (1/20 * 1/20)

If a dimer has lower than expected frequency, a protein is less likely to contain it… and the reasoning follows that if a sequence does contain it, that sequence is less likely to lie in a coding region

Example: In the human genome, AAA AAA appears 1% of the time in coding regions and 5% of the time in non-coding regions

Slide42
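The arithmetic behind the dimer argument can be made explicit. The sketch below computes the expected frequency of a random amino acid dimer, the observed dimer frequencies in a protein, and a log-odds value comparing how often a dicodon is seen in coding versus non-coding sequence. The function names and the use of log base 2 are assumptions made for this illustration; the 1% and 5% figures are the ones quoted for AAA AAA.

```python
import math
from collections import Counter


def expected_dimer_frequency(alphabet_size=20):
    """Chance of one specific dimer if residues were independent and uniform."""
    return (1 / alphabet_size) ** 2


def dimer_frequencies(protein):
    """Observed frequency of each overlapping dimer in a protein sequence."""
    pairs = [protein[i:i + 2] for i in range(len(protein) - 1)]
    return {dimer: n / len(pairs) for dimer, n in Counter(pairs).items()}


def coding_log_odds(freq_coding, freq_noncoding):
    """log2 ratio; negative values mean the pattern is more typical of non-coding DNA."""
    return math.log2(freq_coding / freq_noncoding)
```

For AAA AAA, `coding_log_odds(0.01, 0.05)` is negative, so each occurrence counts as evidence against a coding region; a gene finder adds up such contributions over every dicodon in a candidate exon.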

Splicing

Find all GT/AG donor/acceptor sites

Score with position-specific scoring matrix (PSSM) model

[Figure: intron features. Labels: splice donor, branch point, poly-pyrimidine tract, splice acceptor.]

Modified from: http://en.wikipedia.org/wiki/File:Intron_miguelferig.jpg and http://en.wikipedia.org/wiki/File:Pre-mRNA_to_mRNA.svg

Slide43

Position Specific Scoring Matrix (PSSM)

Let’s say you look at 5 splice donor (GU) sites:

ATCGUCGC

UCAGUGGC

CUCGUCCC

GUCGUUAC

CACGUCUA

[Table: count of each nucleotide (rows A, G, C, U) at each of positions 1-8 across these sites.]

Gene finders use this information to predict where gene features are.

For this to work, one must have confirmed splice sites to use for training. These are not always available for new genomes… and some splice sites are non-canonical… and some genes are alternatively spliced… so it can become somewhat complex.

Slide44
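A position-specific count matrix can be tallied directly from aligned sites. The sketch below uses the five splice donor sites listed for the PSSM example and counts each nucleotide at each of the eight positions. One assumption is made loudly here: the first site is written with a T, and the code treats T as U so that every site uses the RNA alphabet. The function name is hypothetical.

```python
# The five example splice donor sites, as transcribed.
SITES = ["ATCGUCGC", "UCAGUGGC", "CUCGUCCC", "GUCGUUAC", "CACGUCUA"]


def count_matrix(sites, alphabet="AGCU"):
    """Return {base: [count at position 1, position 2, ...]}.
    Assumption: T is read as U so DNA-style letters do not break the tally."""
    length = len(sites[0])
    counts = {base: [0] * length for base in alphabet}
    for site in sites:
        for pos, base in enumerate(site.upper().replace("T", "U")):
            counts[base][pos] += 1
    return counts
```

Positions 4 and 5 come out as G in all five sites and U in all five sites, which is the invariant GU of the donor; the flanking positions show weaker preferences, and those preferences are what the scoring matrix captures.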

Translation start prediction

Position-specific scoring matrix (PSSM)

Certain nucleotides tend to occur at particular positions around the start site (ATG), and others do not

Such biased nucleotide distribution is the basis for translation start prediction

Figure courtesy of Sucheta Tripathy http://www.slideshare.net/tsucheta/29th-june2011

Slide45

Mathematical model

$F_i(X)$: frequency of nucleotide $X$ (A, G, C, T) at position $i$

Score a string $x_1 \dots x_n$ by $\sum_{i=1}^{n} \log\left(F_i(x_i)/0.25\right)$

Figure courtesy of Sucheta Tripathy http://www.slideshare.net/tsucheta/29th-june2011

Slide46
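The log-odds score, a sum over positions of log(F_i(X)/0.25), can be sketched in a few lines. Two choices below are assumptions for the example rather than part of the formula: a pseudocount of 1 is added so that a nucleotide never seen at a position does not produce log(0), and the logarithm is taken base 2. The function names are hypothetical.

```python
import math


def frequency_matrix(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position nucleotide frequencies F_i(X) from aligned training sites.
    The pseudocount is an assumption that avoids zero frequencies."""
    total = len(sites) + pseudocount * len(alphabet)
    freqs = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        freqs.append({b: (column.count(b) + pseudocount) / total for b in alphabet})
    return freqs


def pssm_score(freqs, candidate, background=0.25):
    """Sum over positions of log2(F_i(x_i) / background)."""
    return sum(math.log2(f[base] / background) for f, base in zip(freqs, candidate))
```

A candidate site that matches the training sites at most positions gets a positive score, and one that disagrees at most positions gets a negative score, so a threshold on the score separates likely sites from background.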

Pattern-based exon & gene prediction

Assess different criteria

Coding region inside ORF (start & stop, no interrupting stops)

Dimer frequency

Coding score

Donor site score

Acceptor site score

Other factors to consider

GC content

Exon length distribution

Polymerase II promoter elements (GC box, CCAAT box, TATA region)

Ribosome binding site

Polyadenylation signal upstream of the poly-A cleavage site

Termination signal downstream of the poly-A cleavage site

Slide47

Example of ab initio gene predictor flow

http://genome.crg.es/software/geneid/

Slide48

Confirming a predicted gene with cDNA

http://pasa.sourceforge.net/

26 exons!

Slide49

Extrinsic evidence & manual curation

Expression data

EST (expressed sequence tag) sequences

RNA-seq reads

mRNA

cDNA

High throughput sequencing

Align reads to genome sequence

Homology based approaches

Protein (or expression data) sequences from other organisms

Nucleic acid conservation via tblastx or many other methods

Ortholog mapping/synteny

Experimentally confirmed gene products & gene families

Manual curation is often done by experts in a domain

Slide50

RNA-seq of transcripts as evidence for gene models

mRNA

cDNA

GCTAATGCGAAGTCCTAGACCAGATTGAC

ATGCGATGCAGCTGACGCTGGCTAATGCG

CGCATAGCCAGATGACCATGATGCGATGC

TGACAGATTAGACAGTAGGACAGATAGAC

……..many millions of reads

Reads mapped to genome with gene models

1. Gene model is confirmed by transcript information

2. Part of the gene model is confirmed but the exons predicted in the middle do not have transcript evidence. Does this mean they are not real? Not necessarily.

3. Transcript sequencing allows for novel gene detection. There is transcript evidence for the presence of a gene (or at least transcription) in an area of the genome without a gene model currently predicted.

Slide51

Splice boundaries and alternate transcripts

Some reads will span the intron/exon boundaries

Allows for verification of gene models

Observation of alternate transcripts

Intron

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Slide52

Multiple genome alignment & conservation

Slide53

Experimentally based manual curation

We have an experimentally characterized protein

What do I know about this gene family?

What do I know about genes in general?

No introns in multiples of three, short introns, et cetera

Slide54

Leverage comparative genomics

Arnaud, et al. (2010) Nucleic Acids Res. 38 (Database issue): D420-7.

Slide55

Gather models for ab initio training set

Get models verified via expression, homology, or manual curation

Use manually curated genes from your organism

Generate preliminary ab initio model set and then do a homology search at Swiss-Prot, retaining most-conserved genes

Use CEGMA (Core Eukaryotic Genes Mapping Approach) to predict highly conserved genes

Align proteins from related organisms to your genome with a splice-aware aligner, thus creating models with exon boundaries that have homologs

Align RNA-seq reads or ESTs to your genome to create or update existing models.

Use models with multiple sources & remove highly similar ones

OR

Use pre-existing training set related to your organism

For example, I could use chicken if I am studying finch

Many software packages provide parameter files for common organisms

Slide56

Run gene finder online or stand-alone

Augustus web has text & graphical output

Click!

Predictions stored in GFF3 or GFF2 or GTF format

Slide57
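GFF3, one of the formats predictions are stored in, is a plain tab-separated text format with nine columns per feature, so predictions are easy to read programmatically. The sketch below parses a single feature line; the function name and the example line in the usage note are made up for illustration, and header, comment, and FASTA sections of a real file are not handled.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary.
    Coordinates are 1-based and inclusive; '.' marks an undefined field."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # values are URL-escaped
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }
```

For a hypothetical line such as `chr1<TAB>AUGUSTUS<TAB>CDS<TAB>1000<TAB>1200<TAB>0.93<TAB>+<TAB>0<TAB>ID=cds1;Parent=mRNA1`, the feature length is `end - start + 1`, and the `Parent` attribute links the CDS to its transcript.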

RNA-seq can show differential expression of alternative transcripts

Slide58

Combiners

Incorporate multiple evidence types including ab initio predictions, expression data, and homology; these usually perform the best

Glean

Evidence Modeler (EVM)

Jigsaw

Maker (actually a whole pipeline that can be used online)

PASA (combines predicted structures with expression data)

And more…

Note that many ab initio predictors, for example Augustus, incorporate other data types such as protein alignments or expression data

Slide59

One example, the Glean combiner

Glean paper at http://genomebiology.com/2007/8/1/R13

Top track below is a statistically derived combination of the ones below it

Slide60

Example of annotation pipeline

Fungal Genome Annotation Standard Operating Procedure (SOP) at JGI

Repeat masking

Mapping ESTs (BLAT) from organism and publicly available proteins from related taxa (BLASTx)

Ab initio (FGENESH, GeneMark), homology-based (FGENESH+, Genewise seeded by BLASTx against nr), EST-based (EST_map) gene prediction

EST clustering to improve gene models

Filtering overlapping gene models based on protein homology and EST support to derive “best” model

Non-coding genes with tRNAscan-SE

…ready for functional annotation

http://genome.jgi.doe.gov/programs/fungi/FungalGenomeAnnotationSOP.pdf

Slide61

nGASP – the nematode genome annotation assessment project

http://www.biomedcentral.com/1471-2105/9/549

Slide62

Take home message

Intrinsic & extrinsic prediction methods

Intrinsic gene finders need high-quality training datasets in order to produce good predictions

“Correct” gene predictions are a moving target

Note the steady decrease in the number of predicted genes as the human genome is further curated

Gene finders & gene finding pipelines produce predictions, which must be verified and refined – do not take them at face value

The more pieces of high-quality evidence you add to the process the better

In eukaryotes especially, there is not necessarily only one correct model

Slide63

Protein Functional Annotation

Slide64

Annotation defined

annotate – to make or furnish critical or explanatory notes or comment. -- Merriam-Webster dictionary

genome annotation – the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes. -- Lincoln Stein, PMID 11433356

Gene Ontology (GO) annotation – the process of assigning GO terms to gene products… according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based. -- http://www.geneontology.org

Slide65

What do our predicted genes do?

What we would like:

Experimental knowledge of function

Literature curation

Perform experiment

Not possible for all proteins in most organisms (not even close in most)

What we actually have:

Sequence similarity

Similarity to motifs, domains, or whole sequences

Use protein, not DNA, sequence for finding function

Shared sequence can imply shared function

All sequence-based annotations are putative until proven experimentally

Slide66

Basic set of protein annotations

protein name - descriptive common name for the protein, e.g. “ribokinase”

gene symbol - mnemonic abbreviation for the gene, e.g. “recA”

EC number - only applicable to enzymes, e.g. 1.4.3.2

role - what the protein is doing in the cell and why, e.g. “amino acid biosynthesis”

supporting evidence

accession numbers of BER and HMM matches

TmHMM, SignalP, LipoP

whatever information you used to make the annotation

unique identifier, e.g. locus ids

Slide67

Alignments/Families/Motifs

pairwise alignments

two proteins’ amino acid sequences aligned next to each other so that the maximum number of amino acids match

multiple alignments

3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column

more meaningful than pairwise alignments, since it is much less likely that several proteins will share sequence similarity by chance alone than that 2 will. Such shared similarity is therefore more likely to indicate shared function.

protein families

clusters of proteins that all share sequence similarity and presumably similar function

may be modeled by various statistical techniques

motifs

short regions of amino acid sequence shared by many proteins

transmembrane regions

active sites

signal peptides

Slide68

Important terms to understand

homologs

two sequences have evolved from the same common ancestor

they may or may not share the same function

two proteins are either homologs of each other or they are not. A protein cannot be more, or less, homologous to one protein than to another.

orthologs

a type of homolog where the two sequences are in different species that arose from a common ancestor. The speciation event itself created the two copies of the sequence.

orthologs often, but not always, share the same function

paralogs

a type of homolog where the two sequences have arisen due to a gene duplication within one species

paralogs will initially have the same function (just after the duplication), but as time goes by one copy is free to evolve new functions while the other copy maintains the original function. This process is called “neofunctionalization”.

xenologs

a type of homolog where the two sequences have arisen due to lateral (or horizontal) transfer

Slide69

[Figure: an ancestor gene; speciation leads to orthologs; duplication leads to paralogs; one paralog evolves a new function; lateral transfer to a different species makes xenologs.]

“neofunctionalization” – the duplicated gene/protein develops a new function

Slide70

Pairwise alignments

There are numerous tools available for pairwise alignments

NCBI BLAST resources

FASTA searches

Many more

At IGS we use a tool called BER (BLAST-extend-repraze) that combines BLAST and Smith-Waterman approaches

Actually much of bioinformatics is based on reusing tools in new and creative ways…

Slide71

BER

[Figure: BER workflow. BLAST: the genome’s protein set vs. a non-redundant protein database; significant hits (using a liberal cutoff) are put into mini-dbs for each protein (mini-db for protein #1, mini-db for protein #2, mini-db for protein #3, ... mini-db for protein #3000). Modified Smith-Waterman alignment: the query protein, extended by 300 nt, vs. its mini-database from the BLAST search, giving the BER alignment.]

Slide72

BER Alignment

…to look through in-frame stop codons and across frameshifts to determine if similarity continues

Slide73

Slide74

Extensions in BER

The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence.

Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.

[Figure: ORFxxxxx (end5 to end3) with 300 bp extensions on each side, showing the search protein against the match protein. Cases shown: normal full length match; similarity extending through a frameshift upstream or downstream into extensions (FS); similarity extending in the same frame through a stop codon (PM); FS or PM?; two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs.]

Slide75

How do you know when an alignment is good enough to determine function?

Good question! No easy answer…

Generally, you want a minimum of 40%-50% identity over the full lengths of both query and match with conservation of all important structural and catalytic sites

However, some information can be gained from partial alignments

Domains

Motifs
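The identity and full-length criteria can be checked mechanically once a pairwise alignment is in hand. The sketch below takes two gapped, equal-length alignment strings and reports percent identity over aligned columns plus the fraction of each full sequence that takes part in the alignment. The function names are hypothetical; the 40% default comes from the guideline above, while the 90% coverage default is purely an assumption standing in for “over the full lengths”. Conservation of structural and catalytic sites still has to be checked separately.

```python
def alignment_stats(aligned_query, aligned_match, query_length, match_length):
    """Percent identity over aligned (ungapped) columns, and percent of each
    full-length sequence covered by those columns."""
    if len(aligned_query) != len(aligned_match):
        raise ValueError("aligned strings must have the same length")
    pairs = [(q, m) for q, m in zip(aligned_query, aligned_match)
             if q != "-" and m != "-"]
    if not pairs:
        raise ValueError("no aligned residues")
    identical = sum(1 for q, m in pairs if q == m)
    return {
        "identity": 100 * identical / len(pairs),
        "query_coverage": 100 * len(pairs) / query_length,
        "match_coverage": 100 * len(pairs) / match_length,
    }


def passes_guideline(stats, min_identity=40.0, min_coverage=90.0):
    """True when identity and both coverages reach the chosen thresholds.
    min_coverage is an assumed stand-in for 'full length'."""
    return (stats["identity"] >= min_identity
            and stats["query_coverage"] >= min_coverage
            and stats["match_coverage"] >= min_coverage)
```

A hit with high identity but low coverage of the match protein is the typical signature of a shared domain rather than a shared overall function, which is where partial information such as domains and motifs comes in.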

BEWARE OF TRANSITIVE ANNOTATION ERRORS

Slide76

Pitfalls of transitive annotation

Current public datasets full of such errors

A good way to avoid transitive annotation errors is to require that in a pairwise match, the match annotation must be trusted

Be conservative

Err on the side of not making an annotation, when possibly you should, rather than making an annotation when probably you shouldn’t.

[Figure: annotation passed along a chain of pairwise matches: A to B, B to C, C to D.]

Transitive annotation is the process of passing annotation from one protein (or gene) to another based on sequence similarity:

A’s name has passed to D through several intermediates.

- This is fine if A is similar to D.

- This is NOT fine if A is NOT similar to D

Transitive annotation errors are easy to make and happen often.

Slide77

Trusted annotations

It is important to know what proteins in our search database are characterized.

proteins marked as characterized from public databases

Gene Ontology repository (more on this later)

GenBank (only recently began)

UniProt proteins at “protein existence level 1”

Proteins with literature reference tags indicating characterization

Slide78

UniProt http://www.uniprot.org

Swiss-Prot

European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)

all entries manually curated

http://www.expasy.ch/sprot

annotation includes

links to references

coordinates of protein features

links to cross-referenced databases

TrEMBL

EBI and SIB

entries have not been manually curated

once they are, accessions remain the same but move into Swiss-Prot

http://www.expasy.ch/sprot

Protein Information Resource (PIR)

http://pir.georgetown.edu

Slide79

UniProt

Slide80

Slide81

Slide82

Slide83

Slide84

Enzyme Commission

not sequence based

categorized collection of enzymatic reactions

reactions have accession numbers indicating the type of reaction, for example EC 1.2.1.5

http://www.chem.qmul.ac.uk/iubmb/enzyme/

http://www.expasy.ch/enzyme/

Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse

Slide85

EC number hierarchy

All ECs starting with #1 are some kind of oxidoreductase

Further numbers narrow specificity of the type of enzyme

A four-position EC number describes one particular reaction

Slide86
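The hierarchy in an EC number can be read off by splitting it into its four positions. The sketch below maps the first position to its top-level enzyme class and reports whether all four positions are filled in; partially specified numbers are commonly written with a dash, as in 2.7.1.-. The function name and the returned dictionary layout are assumptions made for this example.

```python
# Top-level enzyme classes, keyed by the first position of the EC number.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def parse_ec(ec):
    """Split an EC number such as 'EC 1.2.1.5' into its hierarchy levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError("an EC number has four dot-separated positions")
    levels = [None if f == "-" else int(f) for f in fields]
    if levels[0] is None:
        raise ValueError("the first position must be specified")
    return {
        "levels": levels,
        "class": EC_CLASSES.get(levels[0], "unknown class"),
        "fully_specified": None not in levels,
    }
```

Only a fully specified number names one particular reaction; a number such as 2.7.1.- says no more than that the enzyme belongs somewhere in that sub-subclass of transferases.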

Example entry for one specific enzyme

Slide87

Metabolic pathway databases

KEGG http://www.genome.jp/kegg/

MetaCyc / BioCyc http://metacyc.org/ http://www.biocyc.org/

BRENDA http://www.brenda-enzymes.info/

Slide88

Slide89

Slide90

Hidden Markov models (HMMs)

Statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and, presumably, functional similarity

Two sets routinely used for protein functional annotation

TIGRFAMs (www.tigr.org/TIGRFAMs/)

Pfam (pfam.sanger.ac.uk)

Each TIGRFAM model is assigned to a category which describes the type of functional relationship the proteins in the model have to each other

Equivalog - one specific function, e.g. “ribokinase”

Subfamily - group of related functions generally with different substrate specificities, e.g. “carbohydrate kinase”

Superfamily - different specific functions that are related in a very general way, e.g. “kinase”

Domain - not necessarily full-length of the protein, contains one functional part or structural feature of a protein, may be fairly specific or may be very general, e.g. “ATP-binding domain”

Slide91

Annotation attached to HMMs

Functionally specific HMMs have specific annotations

TIGR00433 (accession number for the model)

name: biotin synthase

category: equivalog

EC: 2.8.1.6

gene symbol: bioB

Roles:

biotin biosynthesis (TIGR 77/GO:0009102)

biotin synthase activity (GO:0004076)

Functionally general HMMs have general annotations

PF04055

name: radical SAM domain protein

category: domain

EC: not applicable

gene symbol: not applicable

Roles:

enzymes of unknown specificity (TIGR role 703)

catalytic activity (GO:0003824)

metabolism (GO:0008152)

Slide92

HMM building

Alignments of functionally related proteins act as training sets for HMM building

Statistical Model

Model specific to a family of proteins, generally found across many species

Proteins from many species

Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Slide93

HMM scores

When a protein is searched against an HMM it receives a bit score and an e-value indicating the significance of the match

The search protein’s score is compared with the trusted and noise cutoff scores attached to the HMM

proteins scoring above the trusted cutoff can be assumed to be members of the family

proteins scoring below the noise cutoff can be assumed NOT to be members of the family

when a protein scores in-between the trusted and noise cutoffs, it may or may not be a member of the family.

[Figure: a statistical model with its noise (N) and trusted (T) cutoffs.]

The person building the HMM will search the new HMM against a protein database and decide on the trusted and noise cutoff scores

Slide94
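The comparison against trusted and noise cutoffs amounts to a three-way decision. The sketch below encodes it; the function name and return labels are made up for illustration, and treating a score exactly equal to the trusted cutoff as a member is an assumption about how the boundary is handled.

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Place a query protein's bit score relative to an HMM's cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:   # boundary handling is an assumption
        return "member"
    if score < noise_cutoff:
        return "not a member"
    return "uncertain"
```

Only the position of the score relative to the two cutoffs matters, so the same function works when the cutoffs are large, small, or negative: a score of -10 against a trusted cutoff of -20 still counts as a member.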

HMM databases

Alignments of functionally related proteins act as training sets for HMM building

[Figure: proteins from many species are aligned to build a statistical model, specific to a family of proteins generally found across many species, with noise (N) and trusted (T) cutoffs. The model is then added to a database of HMM models, each specific to one protein family and/or functional level.]

Examples: Pfam and TIGRFAM

Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

Slide95

Figure: score scales (axis from -50 to 100) showing a query protein’s score (P) relative to the noise (N) and trusted (T) cutoffs

…above trusted: the protein is a member of the family the HMM models

…below noise: the protein is not a member of the family the HMM models

…in-between noise and trusted: the protein MAY be a member of the family the HMM models

...above trusted and some or all scores are negative: the protein is a member of the family the HMM models

The cutoff scores attached to HMMs are sometimes high, sometimes low, and sometimes even negative. There is no inherent meaning in how high or low a cutoff score is; the important thing is the query protein’s score relative to the trusted and noise scores.

Slide96

Orthologous groups

COGs – have not been updated in a long time

eggNOG – newer, more complete
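Orthologous groups of this kind are commonly seeded with the bi-directional best BLAST criterion named on this slide: gene a in genome A and gene b in genome B are paired when each is the other's top-scoring hit. The sketch below shows only that pairing logic. It assumes the BLAST results have already been parsed into dictionaries of bit scores; the gene IDs, scores, and function names are made up for illustration.

```python
def best_hits(scores):
    """Return {query: best subject} from a {(query, subject): bit_score} dict."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b AND b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical parsed results: genome A searched against B, and B against A.
a_vs_b = {("A1", "B1"): 250, ("A1", "B2"): 90, ("A2", "B2"): 180, ("A3", "B2"): 60}
b_vs_a = {("B1", "A1"): 245, ("B2", "A2"): 175, ("B3", "A3"): 40}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # A3 is dropped: its best hit B2 prefers A2
```

Real pipelines add further filters (e-value thresholds, alignment coverage, handling of ties and paralogs) that are left out here.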

Figure labels: A, B, C; 1, 2, 3

Bi-directional best BLASTSlide97

Motif searches

PROSITE - http://www.expasy.ch/prosite/

“consists of documentation entries describing protein domains, families and functional sites as well as associated patterns to identify them.”
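To make the idea of a "pattern" concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the common syntax elements (`x` for any residue, `[ABC]` for allowed residues, `{ABC}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminus anchors) and is a simplification rather than a full parser. The function name and the test sequence are hypothetical; the example pattern is written in the style of the N-glycosylation site entry.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # anchored to the N-terminus
        at_end = element.endswith(">")       # anchored to the C-terminus
        element = element.strip("<>")
        repeat = ""
        if element.endswith(")") and "(" in element:
            # Split off a repeat count such as x(2) or x(2,4).
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                            # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"     # any residue except these
        else:
            core = element                        # one residue or an [ABC] class
        parts.append(("^" if at_start else "") + core + repeat + ("$" if at_end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
match = re.search(regex, "MKNGSAA")  # made-up sequence
print(regex, match.group(), match.start())
```

Short patterns like this one match by chance quite often, which is one reason motif hits are usually weighed together with other evidence.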

Center for Biological Sequence Analysis - http://www.cbs.dtu.dk/

Protein Sorting (7 tools)

SignalP: finds potential secreted proteins

LipoP: finds potential lipoproteins

TargetP: predicts subcellular location of proteins

Protein function and structure (9 tools)

TMHMM: finds potential membrane spans

Post-translational modifications (14 tools)

Immunological features (9 tools)

Gene finding and splice sites (9 tools)

DNA microarray analysis (2 tools)

Small molecules (2 tools)Slide98

One-stop shopping - InterPro

InterPro brings together multiple databases of HMM, motif, and domain information.

Excellent annotation and documentation

http://www.ebi.ac.uk/interpro/

Slide99

Making annotations

Use the information from the evidence sources to decide what the gene/protein is doing

Assign annotations that are appropriate to your knowledge

Name

EC number

Role

Etc.

Slide100

Main Categories:

Amino acid biosynthesis

Purines, pyrimidines, nucleosides, and nucleotides

Fatty acid and phospholipid metabolism

Biosynthesis of cofactors, prosthetic groups, and carriers

Central intermediary metabolism

Energy metabolism

Transport and binding proteins

DNA metabolism

Transcription

Protein synthesis

Protein Fate

Regulatory Functions

Signal Transduction

Cell envelope

Cellular processes

Other categories

Unknown

Hypothetical

Disrupted Reading Frame

Unclassified (not a real role)

Each main category has several subcategories.

TIGR

rolesSlide101

Names (and other annotations) should reflect knowledge

specific function

Example: “adenylosuccinate lyase”, purB, 4.3.2.2

varying knowledge about substrate specificity

A good example: ABC transporters

ribose ABC transporter

sugar ABC transporter

ABC transporter

choosing the name at the appropriate level of specificity requires careful evaluation of the evidence, looking for specific characterized matches and HMMs.

family designation - no gene symbol, partial EC

“Cbby family protein”

“carbohydrate kinase, FGGY family”

hypotheticals

“hypothetical protein”

“conserved hypothetical protein”Slide102

Ontologies

Slide103

Names can be problematic….

….because humans do not always use precise and consistent terminology

Our language is riddled with

Synonyms – different names for the same thing

Homonyms – different things with the same name

This makes data mining/query difficult

What name should you assign?

What name should you use when you search UniProt or NCBI or any other database?

Slide104

Synonyms

Within any domain do people use precise & consistent language?

Take biologists, for example…

Mutually understood concepts – DNA, RNA, protein

Translation & protein synthesis

Synonym: one thing, more than one name

Enzyme Commission reactions

Standardized id, official name & alternative names

http://www.expasy.ch/enzyme/2.7.1.40Slide105

Homonyms

Different things known by same name

Common in biology

Sporulation

Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)

Endospore formation: Bacillus anthracis

http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang=

©L Stauffer 2003 (accessed 17-Sep-09)

Reproductive sporulation: asci & ascospores, Morchella elata (morel)

http://en.wikipedia.org/wiki/File:Morelasci.jpg

©PG Warner 2008 (accessed 17-Sep-09)

“Sporulation”Slide106

Standardization with controlled vocabularies (CVs)

An official list of precisely defined terms used to classify information & facilitate its retrieval

Flat list

Thesaurus

Catalog

http://www.nlm.nih.gov/nichsr/hta101/ta101014.html

A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.”

Benefits of CVs

Allow standardized descriptions

Synonyms & homonyms addressed

Can be cross-referenced externally

Facilitate electronic searchingSlide107

Ontology: CV with defined relationships

Formalizes knowledge of subject with precise textual definitions

Networked terms; child more specific (“granular”) than parent
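The parent/child structure can be sketched as a small graph in which each term lists its direct parents, so that every ancestor of a term is a less specific description of the same thing. The terms below are a toy hierarchy chosen for illustration, with intermediate terms omitted; they are not a faithful copy of any real ontology, and the function name is hypothetical.

```python
# Toy ontology: term -> list of direct parents (a term may have several).
PARENTS = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


found = ancestors("protein kinase activity", PARENTS)
print(sorted(found))
```

Because an annotation to a child term implies all of its ancestors, a search for "catalytic activity" can also retrieve gene products annotated only to "protein kinase activity".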

National Drug FileSlide108

An example is the Gene Ontology with three controlled vocabularies

Molecular Function

What the gene product is doing

Biological Process

Why the gene product is doing what it does

Cellular component

Where a gene product is doing what it does

Slide109

The Gene Ontology

A good example of a biological ontology

Relationships among networked, defined terms

Vascular terms shown with relationships

Less specific

More granularSlide110

Example: a GO annotation

Associating GO term with gene product (GP)

GP has function (6-phosphofructokinase activity)

GP participates in process (glycolysis)

GP is located in part of cell (cytoplasm)

Linking GO term to GP asserts it has that attribute

Based on literature or computational methods

Always involves:

Learning something about gene product

Selecting appropriate GO term

Providing appropriate evidence code

Citing reference [preferably open access]

Entering information into GO annotation file
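The pieces listed above can be pictured as one record per annotation. The sketch below is a simplified stand-in, not the actual GO annotation file format: the class name, field names, gene product ID, and reference are hypothetical placeholders, and the GO IDs shown for the three terms on this slide should be checked against the current GO release before reuse. IDA and ISS are standard GO evidence codes (inferred from direct assay, inferred from sequence or structural similarity).

```python
import re
from dataclasses import dataclass

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class GoAnnotation:
    gene_product_id: str  # hypothetical identifier
    go_id: str
    go_term: str
    aspect: str           # function, process, or component
    evidence_code: str    # how the assertion is supported
    reference: str        # where the evidence is described

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"malformed GO ID: {self.go_id!r}")

    def to_row(self) -> str:
        """Serialize as one tab-separated line (simplified layout)."""
        return "\t".join([self.gene_product_id, self.go_id, self.go_term,
                          self.aspect, self.evidence_code, self.reference])


REF = "PMID:0000000"  # placeholder, not a real citation
annotations = [
    GoAnnotation("EXAMPLE:pfkA", "GO:0003872", "6-phosphofructokinase activity",
                 "function", "IDA", REF),
    GoAnnotation("EXAMPLE:pfkA", "GO:0006096", "glycolysis", "process", "IDA", REF),
    GoAnnotation("EXAMPLE:pfkA", "GO:0005737", "cytoplasm", "component", "ISS", REF),
]
rows = [a.to_row() for a in annotations]
print("\n".join(rows))
```

Keeping the evidence code and reference in the same record as the term is what lets a later reader judge how far to trust each assertion.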

Slide111

Annotation becomes a series of ids linked to other proteins/genes/features

GO:0005887

GO:0008272

GO:0015419

GO:0043190

This protein is integral to the plasma membrane and is part of an ATP-binding cassette (ABC) transporter complex. It functions as part of a transporter to accomplish the transport of sulfate across the plasma membrane using ATP hydrolysis as an energy source.

=Slide112

Term name

GO ID (unique numerical identifier)

Precise textual definition that describes some aspect of the biology of the gene product

Synonyms for searching, alt. names, misspellings…

GO slim

Ontology relationships (next page)

Definition referenceSlide113

Genomes can be compared

High-level biological process terms used to compare Plasmodium and Saccharomyces (made by “slimming”)
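"Slimming" can be sketched as mapping each specific annotation up to the nearest high-level terms in a chosen slim set and then counting gene products per slim term, which gives numbers that can be compared across genomes. Everything below is toy data: the term hierarchy, the slim set, the gene names, and the function names are hypothetical and are not taken from the cited comparison.

```python
from collections import Counter

# Toy hierarchy: term -> direct parents.
PARENTS = {
    "glycolysis": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
    "sulfate transport": ["transport"],
    "metabolism": [],
    "transport": [],
}
SLIM = {"metabolism", "transport"}  # chosen high-level terms


def slim_terms(term, parents, slim):
    """Slim terms that are the term itself or one of its ancestors."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


def slim_counts(annotations, parents, slim):
    """Count gene products per slim term; each gene counts once per term."""
    counts = Counter()
    for terms in annotations.values():
        gene_slims = set()
        for term in terms:
            gene_slims |= slim_terms(term, parents, slim)
        counts.update(gene_slims)
    return counts


genome_x = {"geneA": ["glycolysis"], "geneB": ["sulfate transport"],
            "geneC": ["glycolysis", "carbohydrate metabolism"]}
counts = slim_counts(genome_x, PARENTS, SLIM)
print(dict(counts))
```

Running the same function over a second genome's annotations with the same slim set gives directly comparable category counts.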

MJ Gardner, et al. (2002) Nature 419:498-511Slide114

Evidence

Slide115

The importance of evidence tracking

I conclude that you are a cat.

Why?

You look like other cats I know

I heard you meow and purr

I conclude that you code for a protein kinase.

Why?

You look like other protein kinases I know

You have been observed to add phosphate to proteins

The process of functional annotation involves assessing available evidence and reaching a conclusion about what you think the protein is doing in the cell and why.

Functional annotations should only be as specific as the supporting evidence allows

All evidence that led to the annotation conclusions that were made must be stored.

In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.Slide116

Knowledge & annotation specificity

Available evidence for three genes

Corresponding GO annotations

How much can we

accurately

say?Slide117

Types of Evidence

Experiments (often considered the best evidence)

Pairwise/multiple alignments

HMM/domain matches scoring above trusted cutoff

Metabolic Pathway analysis

Match to an ortholog group (COG, eggNOG)

Motifs

Slide118

The Evidence Ontology (ECO)

ECO terms have standardized definitions, references & synonyms

Allows standardizing evidence description and searching by evidence type

Can filter by evidence type & do other things!

GO evidence codes are a subset of ECOSlide119

ECO roots and combinatorial termSlide120

The big picture: Evidence and sequence repositoriesSlide121

Experimental pathway

Can be described by particular evidence term

(Many ECO terms describe outputs of processes in aggregate)

Researcher performs experiment...Slide122

Publish…

Author does analysis & arrives at conclusions,

publishes these in paper

A different person reads about the conclusion, interprets & makes annotation

Term from ontology or other descriptive vocabulary

Protein name, id, et cetera

How associating a piece of information with a protein was performed. Who did the work, a person or a computer?

...annotateSlide123

Annotation stored in sequence repository

The annotation with a term from a descriptive vocabulary/ontology such as the Gene Ontology

The protein sequence

The evidence term used to support the decision to associate the term from the descriptive vocabulary with that particular protein

Not shown: protein ID, date, name, et ceteraSlide124

The overall experimental annotation flow

Pure evidence term

Pure assertion method term

Evidence x assertion method cross productSlide125

Similarity pathway

…researcher or computer performs search

Protein sequence of interest

Sequence database

Resulting alignment with match protein

Evidence termSlide126

Interpretation is made and annotation is made

Assertion method

“Protein 1 & 2 look same, so protein 1 has same function as protein 2…”

The assertionSlide127

Store in sequence repository

Evidence x assertion method cross product

Same evidence, but we know who or what did the asserting…Slide128

The overall similarity annotation flow

Pure evidence term

Pure assertion method term

Evidence x assertion method cross productSlide129

Concluding remarks

Slide130

The big picture: an example pipeline

DNA Sequence

(assembly, masking)

Automatic Annotation using the evidence hierarchy of Pfunc

Searches:

Pairwise BER searches against UniRef100

HMM searches against Pfam and TIGRFAM

Motif searches with LipoP, TMHMM, PROSITE

NCBI COGs

PRIAM profiles

Automated start site and gene overlap correction

translation

RNA finding: tRNAscan-SE, RNAmmer, homology searches

Predicted RNA Genes

Gene Prediction

Predicted protein coding genes

MySQL database using the Chado schema

Genome viewer/editor

Flat files of annotation informationSlide131

Some concluding themes…

The best annotation comes from looking at multiple sources of evidence

It is important to track and check the evidence used in an annotation

Do not assume the annotation you see on a protein is correct unless it comes from a trusted source

Always err on the side of under-annotating rather than over-annotating

Consider using UniProt (UniRef) for searches, not NCBI nr, simply for the depth of information it provides.
