Gene Finding and Sequence Annotation

bitsy
Uploaded On 2022-06-13

Presentation Transcript

Slide1

Gene Finding and Sequence Annotation

Lecture 3. Gene Finding and Sequence Annotation

Slide2

Objectives of this lecture

Introduce you to basic concepts and approaches of gene finding

Show you differences between gene prediction for prokaryotic and eukaryotic genomes

Show you which sequence features can be used to identify genes

Introduce you to gene finding methods

Briefly discuss the evaluation of gene finding methods

This lecture will familiarize you with several important concepts of gene prediction, which will help you recognize some important pitfalls and make an informed choice of software for a specific application.

Slide3

Gene Prediction: Computational Challenge

>Genomics DNA…….. atgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Where is the gene?

Slide4

Gene identification (or finding, or prediction, or annotation) is about finding the location and structure of genes on (full) genomic DNA sequences.

This is generally a complicated process, which can be facilitated by data obtained from sequencing, gene expression and proteomics experiments, because these provide a first source of information about the genes that are expressed and thus must be present on the genome.

Slide5

Gene prediction

Expression data may facilitate gene prediction

Genomics, Transcriptomics, Proteomics and Metabolomics

Slide6

With the advent of next-generation sequencing it has become fairly easy to generate full genome sequences. The real challenge is the annotation of these sequences (see next slide), i.e., providing a full description of the genome that lists all genes and other structures on the genome.

Why Gene Prediction/finding/searching?

Slide7

Genome (annotation) projects

According to the National Center for Biotechnology Information (NCBI; February 2012; http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html)

Slide8

Look for an ORF (Open Reading Frame)

begins with a start codon, ends with a stop codon, no internal stops

long (usually > 60-100 aa)

if homologous to a “known” protein, more likely to be a gene

Look for basal signals

Transcription, splicing, translation

Look for regulatory signals

Depends on organism

Prokaryotes vs Eukaryotes

Vertebrates vs fungi

Protein Coding Genes in Genome!

Slide9

Why and How Annotation?

The increase in the number of whole-genome sequences makes annotation necessary

These sequences are analyzed to identify protein-coding genes AND other genetic elements

Often some experimental data are available to assist in this task

E.g., previously characterized genes, gene products, ESTs

Sequences of genes and products (from other organisms) can be aligned to identify translated regions

A set of genes derived from alignment only will be incomplete

Features such as repeats and control sequences will be missing

Therefore, computational methods have been developed to characterize genes and other features: ANNOTATION

Slide10

Prediction of genes & Genome annotation

Use and development of computational approaches to accurately predict gene structure and annotate genomes

Ultimate goal: near 100% accuracy.

Reduce amount of experimental verification work.

Genome sequencing

Slide11

Gene prediction in prokaryotic genomes is much simpler than for Eukaryotic genomes

Eukaryotes

Genome: 10Mbp-670Gbp

H: 3Gbp, 1% protein coding

Many repetitive sequences

Gene: exon structure

Prokaryotes

Genome: 0.5-10Mbp

>90% protein coding

Few repetitive sequences

Gene: single contiguous stretch

Slide12

There exist several classes of gene prediction methods:

Some methods are based on homology. Homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can have shared ancestry because of either a speciation event (orthologs) or a duplication event (paralogs). In gene identification you can compare known DNA/mRNA sequences to a newly obtained genome sequence to obtain information about the location of a gene (and its structure) on the genome.

Other methods are ‘ab initio’. These methods don’t use existing experimental data (e.g., sequence data as in homology searching) but apply algorithms to identify gene signals in the DNA which may indicate the presence of a gene, or they determine the composition (gene content) of a piece of DNA, which may also give clues about the existence of a gene in a particular region of DNA.

Gene prediction methods

Slide13

Categories of gene prediction programs

Gene prediction methods

Ab initio

Gene signals: start/stop codons, intron splice signals, transcription factor binding sites, ribosomal binding sites, poly-adenylation sites

Gene content: statistical description of coding regions, difference between coding and non-coding regions

Homology

translated DNA matches known protein sequence

exons of genomic DNA match a sequenced cDNA

Intrinsic methods: without reference to known sequences

Extrinsic methods: with reference to known sequences

Slide14

Protein-coding gene prediction in prokaryotes

Note: we won’t look at the prediction of non-protein-coding genes in this lecture

The interaction of components of the transcription/translation machinery with the nucleotide sequence, and constraints imposed on protein-coding nt-sequences, have resulted in distinct features that can be used to identify genes

Slide15

Gene annotation in prokaryotes

Prokaryotes stack multiple genes together for expression (“operons”)

[Figure: an operon. DNA: Promoter, Gene 1, Gene 2, ..., Gene N, Terminator. RNA polymerase transcribes the operon into a single mRNA (5’ to 3’), which is translated into separate polypeptides 1, 2, ..., N, each running from N- to C-terminus.]

Slide16

Gene annotation in prokaryotes

Gene structure of prokaryotes

[Figure: prokaryotic gene structure. Labels: transcription start, ribosomal binding site, translation start (start codon ATG), coding region, stop (stop codon TAA, TAG, TGA), ρ-independent transcription termination signal.]

Identifying these sequence features helps to identify the gene

rho-independent transcription termination: causes the transcribed mRNA to form a hairpin and terminate transcription

Slide17

Readings

For prokaryotes we can determine the open reading frame from the DNA sequence (and from the mRNA sequence). The ORF is the part of the sequence that codes for the protein. The ORF starts with an ATG (start codon) and ends with a stop codon (see next slide). Every triplet of nucleotides (codon) is translated to its corresponding amino acid according to the genetic code (see next slide). In this example we observe an “ATG” in the middle of the sequence. This is not a start codon: it is out of frame, divided over two neighboring codons.

Slide18

Gene annotation in prokaryotes

Genetic code: translation of codons to amino acids

64 codons

Synonymous codons

ATG>AUG – DNA>RNA
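The "64 codons" and "synonymous codons" points can be checked with a small sketch. It assumes the standard genetic code, written as DNA codons (T instead of U) in the conventional T, C, A, G order; the names `CODON_TABLE` and `synonymous_codons` are made up for this illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every DNA codon (e.g. "ATG") to its one-letter amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def synonymous_codons(amino_acid):
    """Return all DNA codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)

# How many codons encode each amino acid (or stop).
degeneracy = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))        # 64
print(synonymous_codons("L"))  # the six leucine codons
print(synonymous_codons("*"))  # ['TAA', 'TAG', 'TGA']
print(CODON_TABLE["ATG"])      # M
```

Because most amino acids have several synonymous codons, organisms can differ in which codon they prefer, which is what makes codon usage informative for gene prediction.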

Slide19

Gene Prediction: Computational Challenge

>Genomics DNA…….. atgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggct

atgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene!

Slide20

Microbial Gene Finding

Microbial genomes tend to be gene rich (80%-90% of the sequence is coding)

The most reliable method – homology searches (e.g. using BLAST and/or FASTA)

Major problem – finding genes without a known homologue.

Slide21

Open Reading Frame

An Open Reading Frame (ORF) is a sequence of codons which starts with a start codon, ends with a stop codon, and has no stop codons in between.

Searching for ORFs – consider all 6 possible reading frames: 3 forward and 3 reverse

Is the ORF a coding sequence?

Must be long enough (roughly 300 bp or more)

Should have an average amino-acid composition specific for the given organism.

Should have codon usage specific for the given organism.
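The ORF search described on this slide can be sketched as follows. This is a minimal illustration, not the method of any particular gene finder: it assumes ATG is the only start codon, takes the first ATG in a frame as the start (so nested starts are ignored), and uses the slide's rough 300 bp threshold as the default. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_length=300):
    """Scan all six reading frames for ORFs of at least min_length bp.

    Returns (strand, frame, start, end, orf_sequence) tuples, where start and
    end are 0-based, end-exclusive offsets on the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand, frame + 1, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs

# Toy example with a lowered length threshold.
toy = "ATGACAGATTACAGATTACAGATTACAGGATAG"
print(find_orfs(toy, min_length=30))
```

The length filter is only the first of the three criteria above; amino-acid composition and codon usage would have to be checked separately.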

Slide22

Gene annotation in prokaryotes

Open Reading Frames (ORF): 6 reading frames

[Figure: the three forward reading frames (Frame 1, Frame 2, Frame 3) of the sequence ATGACAGATTACAGATTACAGATTACAGGA followed by TAG, showing the transcription start and the ORF (open reading frame) between the start codon and the stop codon.]

Next slide for detail

Slide23

Gene annotation in prokaryotes

Reading

Each sequence has 6 possible reading frames that could potentially encode a protein: 3 in the sense direction and 3 in the anti-sense direction. To identify the open reading frame (starting with an ATG and ending with a stop codon) we must in principle inspect each of these 6 reading frames. The ORF with the largest number of codons is often the correct one.

Six frames in a DNA sequence look like this:

[Figure: the double-stranded sequence below shown in all six reading frames, with the codons of each frame marked.]

5' GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG 3'

3' CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC 5'

stop codons – TAA, TAG, TGA

start codons - ATG

Slide24

A reading frame refers to one of three possible ways of reading a nucleotide sequence.

Let's say we have a stretch of 15 DNA base pairs:

acttagccgggacta

You can start translating the DNA from the first letter, 'a,' which would be referred to as the first reading frame. Or you can start reading from the second letter, 'c,' which is the second reading frame. Or you can start reading from the third letter, 't,' which is the third reading frame.

The reading frame affects which protein is made.

[Illustration: the three forward reading frames of this sequence, with upper case letters representing the amino acids encoded by each codon.]

The illustration shows three reading frames. However, there are actually six reading frames: three on the positive strand, and three (which are read in the reverse direction) on the negative strand.
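The effect of the reading frame can be reproduced for the 15-base example with a short sketch. It assumes the standard genetic code and simply drops incomplete trailing codons; `translate` and `six_frame_translation` are illustrative names, not functions from a specific library.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return the translations of the three forward and three reverse frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # negative strand, read 5' to 3'
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in sorted(six_frame_translation("acttagccgggacta").items()):
    print(name, protein)
# +1 T*PGL
# +2 LSRD
# +3 LAGT
# -1 *SRLS
# -2 SPG*
# -3 VPAK
```

Each frame yields a different peptide, and the first forward frame hits a stop codon (TAG) at its second codon.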

Reading frame

Slide25

Problems:

There will be many "ORFs" occurring by chance

Some will be short - how do we know which are true?

Introns make this useless in eukaryotic DNA

Slide26

Gene annotation in prokaryotes

Finding ORFs

Many more ORFs than genes

In E. coli one finds 6500 ORFs while there are 4290 genes.

In random DNA, one stop codon every 64/3 ≈ 21 codons on average.

Average protein is ~300 codons long. => search for long ORFs.

Problem

Short genes

[Figure: genomic sequence with an open reading frame running from ATG to TGA.]
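The "one stop codon every 64/3 codons" estimate, and the reason long ORFs are unlikely to arise by chance, can be worked through numerically. The sketch assumes random DNA with all four bases equally likely and independent, which real genomes only approximate; the function names are illustrative.

```python
# 3 of the 64 codons (TAA, TAG, TGA) are stops.
P_STOP = 3 / 64

def expected_codons_between_stops():
    """Mean spacing of stop codons in a random frame (geometric distribution)."""
    return 1 / P_STOP

def prob_open_run(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - P_STOP) ** n_codons

print(expected_codons_between_stops())  # about 21.3
for n in (21, 100, 300):
    print(n, prob_open_run(n))
```

Under this model a stop-free run of 100 codons occurs with a probability below 1%, and a run of 300 codons with a probability below one in a million, which is why long ORFs are good gene candidates while short genes remain hard to distinguish from chance ORFs.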

Slide27

Gene annotation in prokaryotes

Basic statistics (base statistics)

Codon frequency can be used as a gene prediction feature

Figure from: Zvelebil M, Baum JO (2008) Chapter 10, Gene Detection and Genome Annotation, in Understanding Bioinformatics, Garland Science, New York

[Figure labels: clear difference; similar codon usage]
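One simple way to turn codon frequency into a prediction feature is to compare the codon profile of a candidate ORF with that of known genes from the same organism. The sketch below is only an illustration of that idea: the distance measure (sum of absolute frequency differences) is an assumption chosen for simplicity, not the statistic used in the cited figure.

```python
from collections import Counter

def codon_frequencies(coding_seq):
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = coding_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def usage_distance(freq_a, freq_b):
    """Sum of absolute frequency differences (0 = identical, 2 = no shared codons)."""
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)

reference = codon_frequencies("ATGAAAAAAGCTGCTTAA")   # stand-in for known genes
candidate = codon_frequencies("ATGAAAGCTAAAGCTTAA")   # same codons, different order
print(usage_distance(reference, candidate))           # 0.0
```

In practice the reference profile would be built from many confirmed genes, and a candidate ORF read in the wrong frame would typically show a clearly different profile.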

Slide28

Gene annotation in prokaryotes

Ribosomal binding site: Shine-Dalgarno sequence

The ribosome binding site for bacterial translation. In Escherichia coli, the ribosome binding site has the consensus sequence 5′-AGGAGGU-3′.

Location: between 3 and 10 nucleotides upstream of the initiation codon.

[Figure: mRNA, 5’ to 3’: ribosome binding site (AGGAGGU), 3-10 nucleotides, initiation codon (AUG).]
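A candidate start codon can be checked for an upstream Shine-Dalgarno-like motif along the following lines. This is a sketch under stated assumptions: it works on the DNA sequence (AGGAGGT instead of AGGAGGU), interprets "3 to 10 nucleotides upstream" as the gap between the end of the motif and the start codon, and tolerates one mismatch by default because real sites deviate from the consensus. `find_rbs` and its parameters are hypothetical.

```python
def find_rbs(dna, start_codon_pos, motif="AGGAGGT",
             min_gap=3, max_gap=10, max_mismatches=1):
    """Look for a Shine-Dalgarno-like motif upstream of a start codon.

    Returns (motif_start, gap, mismatches) for the best hit, or None.
    gap is the number of nucleotides between the motif end and the start codon.
    """
    dna = dna.upper()
    best = None
    for gap in range(min_gap, max_gap + 1):
        motif_start = start_codon_pos - gap - len(motif)
        if motif_start < 0:
            continue  # not enough upstream sequence for this spacing
        window = dna[motif_start:motif_start + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
            best = (motif_start, gap, mismatches)
    return best

dna = "CCAGGAGGTAACAGCTATGAAA"
start = dna.find("ATG")
print(start, find_rbs(dna, start))  # 16 (2, 7, 0)
```

An ORF whose start codon is preceded by such a motif is a stronger gene candidate than one without it.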

Slide29

Gene annotation in prokaryotes

Sequence homology (mRNA-Protein)

[Figure: (BLAST) alignment of an mRNA (or protein) sequence to an uncharacterized genome gives evidence for the presence of a gene.]

Readings

Sequence homology is a powerful method to detect genes in a genome. However, it assumes that an mRNA sequence is available, which could have been obtained in other (transcriptomics) experiments.

An mRNA is the product of an expressed gene. Thus, if we are able to align the mRNA to the genome, then we know the location of the gene. Since the mRNA does not contain introns while the gene on the DNA may contain introns, the alignment can even provide information about the intron-exon structure of the gene.

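Note that if we have a protein sequence, we can in principle translate it back into an mRNA sequence and use this in a homology search. Because the genetic code is degenerate, this back-translation is not unique, so in practice the protein is usually compared against the translated genome sequence instead.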

Slide30

Alignment of ESTs against a genome

mRNA / EST sequences from GenBank (NCBI)

Alignments of these sequences to the genome (UCSC)

[Figure: alignments of mRNA/ESTs against genomic DNA.]

Intron in DNA (thus missing in mRNA). You will see a ‘gapped’ alignment.

An EST (expressed sequence tag) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination.

EST2Genome is one of the programs that align Expressed Sequence Tags (ESTs; small parts of mRNA sequences) to a genome sequence.

Slide31

Alignment of ESTs against a genome

[Figure: ESTs aligned to the + strand and the - strand of the DNA. Assign orientation (polyA signal/tail, exon boundaries, annotation).]

After alignment you must determine the correct strand on which the gene is located. Sometimes this is straightforward. If not, you can use information about polyA signal/tail, exon/intron structure or other annotation.

Slide32

Alignment of ESTs against a genome

[Figure: EST alignments on the + strand and the - strand of the DNA. Determine overlap: 3 genes.]

Overlapping alignments are considered to belong to the same gene and can be grouped to obtain a more complete ‘model’ of the gene.
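Grouping overlapping alignments into gene models amounts to merging overlapping intervals per strand. The sketch below assumes each alignment has already been reduced to a (start, end, strand) triple with 0-based, end-exclusive genome coordinates; the coordinates in the example are invented.

```python
def group_alignments(alignments):
    """Group overlapping alignments on the same strand into gene models.

    alignments: iterable of (start, end, strand) tuples.
    Returns a list of (start, end, strand, n_alignments) tuples.
    """
    alignments = list(alignments)
    models = []
    for strand in ("+", "-"):
        spans = sorted((s, e) for s, e, st in alignments if st == strand)
        current = None  # [start, end, count] of the model being built
        for start, end in spans:
            if current and start < current[1]:
                # Overlaps the current model: extend it.
                current[1] = max(current[1], end)
                current[2] += 1
            else:
                if current:
                    models.append((current[0], current[1], strand, current[2]))
                current = [start, end, 1]
        if current:
            models.append((current[0], current[1], strand, current[2]))
    return models

ests = [(100, 400, "+"), (350, 700, "+"), (1200, 1500, "+"),
        (300, 900, "-"), (850, 1000, "-")]
print(group_alignments(ests))
# [(100, 700, '+', 2), (1200, 1500, '+', 1), (300, 1000, '-', 2)]
```

Five alignments collapse into three gene models here; alignments on opposite strands are kept apart even where their coordinates overlap.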

Slide33

Gene annotation in prokaryotes

Algorithms for Gene Detection in prokaryotes

Some of the programs available:

GeneMark

GeneMark.hmm

GLIMMER

EcoParse

ORPHEUS

Prodigal

Many programs for gene identification are available. You don’t have to memorize all these programs for the examination.

Slide34

Eukaryotic gene detection

Many principles of prokaryotic gene detection apply to eukaryotes

Similar base statistics

Equivalent transcription, translation start/stop signals

However, much larger genome sizes

Require approaches with far lower rates of false positives

Gene density is lower

Junk DNA / repetitive sequences

Crucial difference: introns

Splice sites do not have very strong signals

Slide35

Gene annotation in eukaryotes

Introns, exons and splice sites

Exons in eukaryotes are more difficult to recognize

Smaller

Variable number

Final exon may not contain coding sequence

Exons are delimited by (variable) splice signals (and not by start/stop codons as for prokaryotes)

[Figure: length distributions of prokaryotic genes and eukaryotic exons. Eukaryotic exon length is much smaller than prokaryotic gene length; large variation in exon (and intron) lengths in eukaryotes.]

Slide36

Gene annotation in eukaryotes

GC content

[Figure from Lander (2001) Nature. Panel a: higher GC content in genes. Panel b: GC vs. gene density, more genes in GC-rich areas.]

Explanation

The percentage of GC in the genome is a rough indication of the presence of genes.

a) The percentage of GC for genes (red bars) is higher than for other parts of the genome (blue bars).

b) The percentage of GC correlates with gene density.

Thus, GC content gives a first indication but tells you nothing about the precise location of a gene or its structure.
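GC content along a sequence is typically computed in sliding windows. The following sketch shows one way to do that; the window and step sizes are arbitrary defaults chosen for illustration, not values from the lecture.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, window=1000, step=500):
    """Yield (start, gc_fraction) for sliding windows along the sequence."""
    for start in range(0, max(len(seq) - window + 1, 1), step):
        yield start, gc_content(seq[start:start + window])

print(list(gc_windows("AAAAGGGG", window=4, step=4)))  # [(0, 0.0), (4, 1.0)]
```

Windows with a GC fraction well above the genome average mark regions worth a closer look, in line with the caveat that this gives neither exact gene positions nor gene structure.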

Slide37

Gene annotation in eukaryotes

Complexity

Eukaryotes

Finding genes in eukaryotes is difficult due to variation in gene structure

Average vertebrate gene is 30kb long, of which the coding sequence is only about 1kb

Average coding region consists of 6 exons of about 150bp

BUT

Dystrophin: 2.4Mb long

Blood coagulation factor VIII: 26 exons (69bp to 3106bp)

Intron 22 produces 2 transcripts unrelated to this gene.

Gene finding algorithms are often capable of detecting an ‘average’ gene. However, genes that somehow deviate in length, structure, etc., can be missed by gene finding programs.

Slide38

Gene annotation in eukaryotes

Eukaryotic genome structure

[Figure: a stretch of DNA containing Gene A and Gene B, with the following features marked:]

CpG island (higher G+C content, gene marker)

Tandemly repeated DNA elements

Dispersed repeats (SINEs (e.g., Alu), LINEs)

Slide39

Gene annotation in eukaryotes

Eukaryotic genome structure

[Figure: DNA with Gene A and Gene B. Labels: regulatory sequences (e.g., enhancers), promoter elements, transcription start site, exons, introns, transcription end site. RNA polymerase II transcribes the DNA into pre-mRNA.]

Slide40

Gene annotation in eukaryotes

Eukaryotic genome structure

[Figure: splicing turns the pre-mRNA into mRNA with a 5' UTR, coding sequence, 3' UTR and poly-A tail (AAAAAAAAAAAAAAAAAAAA). Translation of the codons in the coding sequence yields the protein.]

Slide41

Gene annotation in eukaryotes

Exon – Intron structure

[Figure: exon, intron, exon, intron, with the splice sites marked.]

Donor: (C,A)AG/GT(A,G)AGT

Acceptor: CAG/G

Branch point signal: CT(G,A)A(C,T) (10-50bp upstream from acceptor)

Readings

The boundaries between exons and introns are characterized by certain sequence features.

An exon will start with a G and end with an AG. An intron will start with a GT and will end with a CAG.

The full sequence feature of the exon/intron boundary is (C,A)AG/GT(A,G)AGT. This means that the last 3 nucleotides of an exon are CAG or AAG and the first 6 nucleotides of the intron are GTAAGT or GTGAGT.

Note that these are all very short sequences which may also occur by chance in a DNA sequence and which may mislead gene finding programs.
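The consensus patterns on this slide translate directly into regular expressions. The sketch below scans for exact matches to the donor and acceptor consensus only; real splice sites often deviate from the consensus, and real gene finders score sites statistically rather than by exact matching. The test sequence is invented.

```python
import re

# Lookaheads allow overlapping candidates; group 2 starts right after the "/".
DONOR = re.compile(r"(?=([CA]AG)(GT[AG]AGT))")   # (C,A)AG/GT(A,G)AGT
ACCEPTOR = re.compile(r"(?=(CAG)(G))")           # CAG/G

def donor_sites(dna):
    """0-based positions of the first intron base at each candidate donor site."""
    return [m.start(2) for m in DONOR.finditer(dna.upper())]

def acceptor_sites(dna):
    """0-based positions of the first exon base after each candidate acceptor site."""
    return [m.start(2) for m in ACCEPTOR.finditer(dna.upper())]

# Invented sequence: exon, intron (GTAAGT ... CTGAC ... CAG), exon.
dna = "ATGGCC" + "AAG" + "GTAAGT" + "CTTTT" + "CTGAC" + "T" * 12 + "CAG" + "GCTTAA"
donors, acceptors = donor_sites(dna), acceptor_sites(dna)
print(donors, acceptors)             # [9] [40]
print(dna[donors[0]:acceptors[0]])   # the candidate intron
```

Running the same patterns over random DNA produces many spurious hits, which illustrates why such short signals can mislead gene finding programs.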

Slide42

Gene annotation in eukaryotes

Polyadenylation signal

Eukaryotic mRNAs are polyadenylated, i.e., have up to 250 A’s added to their 3’ end after transcription terminates (T). Signals:

The polyA signal is another example of a signal (sequence feature) that signals the end of transcription.

For Detail: http://themedicalbiochemistrypage.org/rna.php#processing

Slide43

Gene annotation in eukaryotes

Anatomy of a Eukaryotic Gene

[Figure: anatomy of a eukaryotic gene. Labels: TATA box and CAAT box (http://en.wikipedia.org/wiki/CAAT_box), where Pol II and basal TFs bind; cis-regulatory elements, which may be located thousands of bases away and where regulatory TFs bind.]

The structure of a human gene. It is the task of gene finding algorithms to elucidate this structure.

Slide44

Gene annotation in eukaryotes

Promoter sequences and binding sites for transcription factors

Further differences between prokaryotic and eukaryotic gene structures:

Sequence signals in upstream regions are much more variable in eukaryotes

Both in position and composition

Control of gene expression is more complex in eukaryotes

Can be affected by many molecules binding the DNA in the gene region

This leads to many more potential promoter binding sites

These binding sites may be spread over a much larger region (several thousand bases)

Strict control of gene expression

Some genes are known to be poorly expressed because high levels would be damaging (e.g., genes for growth factors)

Such genes sometimes lack the TATA box characteristic of promoters.

This complicates the identification of such genes

Slide45

Methods to detect eukaryotic gene signals

Promoters

Transcription start/stop signals

e.g. TATA box (30% of genes don’t have a TATA box)

e.g. polyA signal

Translation start/stop signals

no defined ribosome-binding site in eukaryotic genes

Slide46

Methods to predict the intron/exon structure

ORF identification methods for prokaryotes don’t work

If exons are long enough then base statistics can be used.

Signals for splice sites are not well defined

Initial/terminal exons also contain non-coding sequence

Slide47

Complete Eukaryotic gene models

Programs that use and combine all features of a gene to make a prediction about the complete gene structure (=model)

E.g., GenScan

Slide48

Beyond gene prediction

Functional annotation

determine the function of a predicted gene

Genome comparison

use other organisms to refine the gene model

Use of experimental data to evaluate the gene model

e.g. gene expression

Slide49

Gene identification programs based on comparison with related genome sequences:

TWAIN

TWINSCAN

Ab initio gene identification programs including those which use homologous gene sequences:

GAZE

The GeneMark set of programs

Genie

GenomeScan

GenScan

GLIMMER, GlimmerM and GlimmerHMM

GrailEXP

ORPHEUS

Wise2 including GeneWise

Slide50

Identifying tRNA genes:

tRNAscan-SE program and web server

Promoter prediction programs:

CorePromoter

Exon prediction programs:

FirstEF

JTEF

MZEF

Splice site prediction programs:

GeneSplicer

SplicePredictor

Genome annotation visualization programs:

Apollo

Artemis and Artemis Comparison Tool (ACT)

VISTA

Slide51

Web Servers:

The following web sites provide on-line access to gene annotation tools:

Analysis and annotation tool (AAT)

FirstEF

FGENES family of programs

FunSiteP

GAP2, NAP and other DNA alignment programs

GeneBuilder

GeneSplicer

GeneWalker

GeneWise (part of the Wise2 suite)

GenScan

GrailEXP

HMMGene

McPromoter

NetPlantGene

NNPP

ProScan