Slide1
Genome Annotation
BCB 660
October 20, 2011
Slide2
From Carson Holt
Slide3
Annotations
Automated
Ab initio (based on genomic sequence alone)
Involves comparisons to known proteins (BLAST similarity)
Sequence motifs such as start/stop codons, intron/exon boundaries
Evidence-based (ESTs)
Involves alignment of experimental EST (cDNA) data to a gene prediction
Manual
Manual curation of genes predicted automatically
Check gene structure, presence of conserved domains, match of ESTs to gene prediction
Align to related genes/proteins and look for oddities (missing exons, early stop codons, etc.)
Annotation can then be manually edited
May also involve assigning function (based on sequence similarity, conserved domains) via Gene Ontology
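
One of the manual curation checks listed above, looking for early stop codons, can be sketched in a few lines. This is only an illustration, not part of any annotation tool: the function name `find_early_stops` is hypothetical, the input is assumed to be a coding sequence already in frame 0, and the standard genetic code is assumed.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_early_stops(cds):
    """Return 0-based codon indices of in-frame stops before the final codon.

    `cds` is assumed to be a coding sequence that starts in frame 0
    (a hypothetical input for illustration).
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    early = []
    # The final codon is skipped, because a stop is expected there.
    for i in range(n_codons - 1):
        codon = seq[3 * i: 3 * i + 3]
        if codon in STOP_CODONS:
            early.append(i)
    return early


# Made-up sequences: the second one has a TAG at codon index 1.
print(find_early_stops("ATGAAATAA"))
print(find_early_stops("ATGTAGAAATAA"))
```

A non-empty result would flag the gene model for a closer look, for example at the exon boundaries that produced the reading frame.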
Structural: exons, introns, UTRs, splice forms, etc.
Functional: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.
Slide4
Classic strategy
Combine ab initio and evidence-based gene predictors to come up with a consensus predicted gene set
Ask community to pitch in and manually annotate as many genes as possible
Leads to great variability in quality of different genome annotations, often many versions of official gene sets
Slide5
NGS and the future of genome annotation
In 2010, 1300 eukaryotic genome projects were underway
-- assuming 10,000 genes per genome, that means 13,000,000 new annotations will be needed
-- quality control and maintenance become an issue
Some organizations are dedicated to genome annotation (e.g., ENSEMBL and VectorBase), but handling 1300 genomes this way will not be feasible
Need for high-quality, automated annotation pipelines that are easy to use by small research groups without extensive bioinformatics expertise
Slide6
MAKER Pipeline:
Especially effective for Emerging Eukaryote Model Organisms
Incorporates ab initio and evidence-based gene predictors
Gene predictions are run a first time
Then a small subset of the genome assembly is used to train gene predictors (building genome-specific HMMs)
Then trained gene predictors are run again on whole genome
** Really nice if you don’t have a basis to start from (e.g., de novo gene prediction)
Slide7
What does MAKER do?
* Identifies and masks out repeat elements
* Aligns ESTs to the genome
* Aligns proteins to the genome
* Produces ab initio gene predictions
* Synthesizes these data into final annotations
* Produces evidence-based quality values for downstream annotation management
Slide8
MAKER Steps involved
1. Compute phase
RepeatMasker
BLAST
Exonerate
SNAP (and other gene predictors)
2. Filter/cluster phase
Identify/remove marginal predictions and alignments based on quality scores/cutoffs, etc.
Cluster to identify overlapping alignments/predictions, to remove redundancy and assess weight of evidence
3. Polish
Realigns BLAST hits to obtain greater precision at exon boundaries (Exonerate)
4. Synthesis
Collect evidence for each annotation, using EST evidence
Evidence scores plus sequences (genomic, EST, coding, intron) passed to SNAP
SNAP then uses this evidence to retrain and alter its internal HMM
5. Annotate
Post-processing of SNAP prediction, recombine with evidence to generate complete annotations
Output is a GFF3 annotation that can be imported into genome browsers
Slide9
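
The filter/cluster phase (step 2 of the MAKER steps listed above) can be illustrated with a small sketch: drop hits below a score cutoff, then group the remaining hits whose genomic coordinates overlap. This is not MAKER's actual implementation. The `Hit` type, the score scale, the cutoff value, and the assumption that all hits lie on the same contig and strand are hypothetical choices made for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    start: int    # 1-based inclusive start on the contig
    end: int      # 1-based inclusive end on the contig
    score: float  # quality score (hypothetical scale)
    source: str   # label for the program that produced the hit


def filter_hits(hits: List[Hit], min_score: float) -> List[Hit]:
    """Keep only hits whose score reaches the cutoff."""
    return [h for h in hits if h.score >= min_score]


def cluster_hits(hits: List[Hit]) -> List[List[Hit]]:
    """Group hits with overlapping coordinates into clusters."""
    clusters: List[List[Hit]] = []
    current_end = 0
    for h in sorted(hits, key=lambda x: (x.start, x.end)):
        if clusters and h.start <= current_end:
            # Overlaps the running cluster: add it and extend the cluster end.
            clusters[-1].append(h)
            current_end = max(current_end, h.end)
        else:
            # No overlap: start a new cluster.
            clusters.append([h])
            current_end = h.end
    return clusters


# Made-up hits for illustration only.
hits = [
    Hit(100, 200, 50.0, "protein"),
    Hit(150, 300, 80.0, "est"),
    Hit(400, 500, 90.0, "ab_initio"),
    Hit(120, 130, 5.0, "est"),
]
clusters = cluster_hits(filter_hits(hits, min_score=10.0))
for c in clusters:
    print(len(c), [h.source for h in c])
```

The number of hits in a cluster, and how many different sources they come from, gives a rough sense of the weight of evidence at that locus.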
Inputs to MAKER
Genomic sequence
Config files
External executables
Sequence database locations
Compute parameters
Sequence database files (choice of these turns out to be extremely important)
Transposons file (default plus known organism-specific)
RepeatMasker database file (organism-specific, optional)
Proteins file (known proteins from related organisms you want to align to the genome)
ESTs/mRNAs file (the evidence)
Slide10
MAKER Output (Apollo browser)
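
Besides viewing the output in a genome browser, the GFF3 file can be read directly, since GFF3 is a tab-delimited text format with nine columns per feature line. Below is a minimal sketch of parsing one such line. The function name `parse_gff3_line` and the example line are made up for illustration, and the sketch ignores details such as URL-escaped characters and multi-valued attributes.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines
    (those starting with '#').
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    if attrs != ".":
        # Column 9 holds semicolon-separated key=value pairs.
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# Made-up feature line for illustration only.
example = "contig1\tmaker\tgene\t1000\t2500\t.\t+\t.\tID=gene1;Name=example-gene"
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"], record["attributes"]["ID"])
```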