Microbial Genome Annotation

Uploaded On 2024-02-02


Nikos Kyrpides, DOE Joint Genome Institute. Two main goals of genome analysis: evolutionary analysis (how does an organism compare to the rest?) and metabolic reconstruction (what can an organism do, and how?).



Presentation Transcript

1. Microbial Genome Annotation
   Nikos Kyrpides, DOE Joint Genome Institute

2. Two main goals of genome analysis:
   Evolutionary analysis: How does an organism compare to the rest?
   Metabolic reconstruction: What can an organism do and how?

3. Overview of Annotation Steps
   DNA sequence:
      >Contig1
      ataacaacacattagcggcasacacacaacaggatattaggagagagagaaagttac
   Identify genes (proteins, RNAs)
   Identify regulatory elements
   Identify repeat elements
   Gene finding
   BLAST
   Clusters (BBH, COGs, TIGRFam)
   Motifs (HMM, Pfam, InterPro)
   Gene QC
   Automatic / Manual
   Function prediction
   Gene context (fusions, operons, regulons)
   Missing genes

4. Part 1. Finding the genes in microbial genomes
   1. Introduction
   2. Tools out there
   3. Basic principles behind tools
   4. Known problems of the tools: why you may need manual curation

5. Finding the genes in microbial genomes: features
   Sequence features in prokaryotic genomes:
      stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNase P, tmRNA)
      protein-coding genes (CDSs)
      transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
      translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
      pseudogenes (tRNA and protein-coding genes)
      …

6. Tools out there: finding protein-coding genes (not ORFs!)
   Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction).
   Open reading frame (ORF): reading frame between a start and stop codon.
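The two definitions on this slide can be illustrated with a minimal six-frame ORF scan. This is a sketch under stated assumptions, not how any of the tools listed later works: the start codons (ATG, GTG, TTG), the minimum length, and the function names are illustrative choices. As the slide title stresses, the ORFs it returns are only candidates, not gene calls.

```python
# Assumed start and stop codons (a common bacterial choice); illustrative only.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames and return candidate ORFs.

    Each ORF is (strand, frame, start, end), with 0-based, end-exclusive
    coordinates on the strand that was scanned. For every stop codon the
    most upstream in-frame start is used.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):          # offsets 0, 1 and 2
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i           # first start codon after the last stop
                elif codon in STOPS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None        # a stop always closes the current frame
    return orfs
```

For example, `find_orfs("CCATGAAAAAAAAAAAAAAATAAG", min_codons=3)` reports a single forward-strand ORF in frame 2.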

7. Finding features in microbial genomes
   Well-annotated bacterial genome in the Artemis genome viewer (figure).

8. Finding the genes in microbial genomes
   1. Introduction
   2. Tools out there
   3. Basic principles behind tools
   4. Known problems of the tools: why you may need manual curation

9. Tools out there: servers for microbial genome annotation - I
   IMG-ER: http://img.jgi.doe.gov/er
   IMG-ER submission page: http://img.jgi.doe.gov/submit
   RAST: http://rast.nmpdr.org/
   JCVI Annotation Service: http://www.jcvi.org/cms/research/projects/annotation-service/overview/
   Output: rRNAs and tRNAs, CDSs, functional annotations; output in several formats
   Output: stable RNA-encoding genes, CDSs, functional annotations; output in GenBank format
   Output: CDSs, stable RNAs? functional annotations; format?

10. Tools out there: servers for microbial genome annotation - II
   AMIGENE: http://www.genoscope.cns.fr/agc/tools/amiga/Form/form.php
   RefSeq: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi
           http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
   EasyGene: http://www.cbs.dtu.dk/services/EasyGene/
   Output: CDSs, output in tbl format
   Output: CDSs, size restriction <1 Mb
   Output: CDSs, output in gff format

11. Tools out there: genome browsers for manual annotation of microbial genomes
   Artemis: http://www.sanger.ac.uk/Software/Artemis/
   Manatee: http://manatee.sourceforge.net/
   Argo: http://www.broad.mit.edu/annotation/argo/
   Major difference: viewer vs editor?
   Linux versions only; genome needs to be annotated by the JCVI Annotation Service
   Windows and Linux versions; works with files in many formats, annotated by any pipeline
   Windows and Linux; works with files in many formats

12. Tools out there: tools for finding stable (“non-coding”) RNAs - I
   Large structural RNAs (23S and 16S rRNAs):
      RNAmmer: http://www.cbs.dtu.dk/services/RNAmmer/
   Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNase P RNA component):
      Rfam database, INFERNAL search tool: http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
         Web service: sequence search is limited to 2 kb
      ARAGORN: http://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html
         Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only
      tRNAscan-SE: http://lowelab.ucsc.edu/tRNAscan-SE/
         Web service: sequence search is limited to 5 Mb, finds tRNAs only

13. Tools out there: tools for finding “non-coding” RNAs - II
   Short regulatory RNAs:
      Rfam database, INFERNAL search tool: http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
         Web service: sequence search is limited to 2 kb; provides list of pre-calculated RNAs for publicly available genomes
   Other (less popular) tools:
      Pipeline for discovering cis-regulatory ncRNA motifs: http://bio.cs.washington.edu/supplements/yzizhen/pipeline/
      RNAz: http://www.tbi.univie.ac.at/~wash/RNAz/

14. Tools out there: most popular CDS-finding tools
   CRITICA
   Glimmer family (Glimmer2, Glimmer3, RBS finder): http://glimmer.sourceforge.net/
   GeneMark family (GeneMark-hmm, GeneMarkS): http://exon.gatech.edu/GeneMark/
   EasyGene
   AMIGENE
   PRODIGAL (default JGI gene finder): http://compbio.ornl.gov/prodigal/
   Combinations and variations of the above: RAST (Glimmer2 + pre- and post-processing)

15. Basic principles: finding CDSs using evidence-based vs ab initio algorithms
   Two major approaches to prediction of protein-coding genes:
   “evidence-based” (ORFs with translations homologous to the known proteins are CDSs)
      Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
      Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools
   ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
      Advantages: finds “unique” genes; high sensitivity
      Limitations: often misses “unusual” genes; high rate of false positives
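The ab initio principle on this slide (an ORF whose composition resembles known CDSs is called a CDS) can be illustrated with a toy codon-usage score. This is only a sketch: the gene finders named in this deck use richer statistical models, and the function names, the pseudocount, and the uniform background of 1/64 per codon are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_list(seq):
    """Split a sequence into complete in-frame codons, starting at offset 0."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(training_cdss, pseudocount=1.0):
    """Estimate codon frequencies from trusted CDSs (hypothetical training set)."""
    counts = Counter()
    for cds in training_cdss:
        counts.update(c for c in codon_list(cds) if c in CODON_SET)
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(orf, model):
    """Mean log-odds per codon of the coding model vs. a uniform background.

    Positive scores mean the ORF uses codons the way the training CDSs do.
    """
    codons = [c for c in codon_list(orf) if c in model]
    if not codons:
        return 0.0
    return sum(math.log(model[c] * len(CODONS)) for c in codons) / len(codons)
```

The sketch also shows where the slide's limitations come from: a horizontally transferred gene with different codon usage scores low (an "unusual" gene is missed), while any ORF that happens to match the training composition scores high (a false positive).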

16. Finding the genes in microbial genomes
   1. Introduction
   2. Tools out there
   3. Basic principles behind tools
   4. Known problems of the tools: why you may need manual curation

17. Known problems: CDSs
   Short CDSs: many are missed, others are overpredicted
      short ribosomal proteins (30-40 aa long) are often missed
      short proteins in the promoter region are often overpredicted
   N-terminal sequences are often inaccurate (many features of the sequence around start codon are not accounted for)
      Glimmer2.0 is calling genes longer than they should be
      GeneMark, Glimmer3.0 mostly call genes shorter
   Pseudogenes and sequencing errors (artificial frameshift)
      all tools are looking for ORFs (needs valid start and stop codons)
      “unique” genes are often predicted on the opposite strand of a pseudogene or a gene with a sequencing error
   Proteins with unusual translational features (recoding, programmed frameshifts)
      these genes are often mistaken for pseudogenes
      see pseudogenes

18. Known problems: CDSs
   Lack of Standards

19. Finding unique genes
   Obligate parasite of horses
   Causes human disease in tropical areas (melioidosis)

20. Phylogenetic profiler finds 548 unique genes in B. mallei
   However, 497 of them in fact exist in B. pseudomallei, but they have not been called as real genes.
   The difference in gene models reveals 89.2% error rate in unique genes

21.

22. GenePRIMP: Gene Prediction Improvement Pipeline
   GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features.
   Applications:
      Identify gene prediction anomalies
      Benchmark the quality of gene prediction algorithms
      Benchmark the quality of combination / coverage of sequencing platforms
      Improve the sequence quality
   Pati A. et al., Nature Methods, June 2010
   GenePRIMP: http://geneprimp.jgi-psf.org

23. GenePRIMP steps

24. Find missing genes
   Intergenic regions identify missed ORFs …
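A first step toward finding missed genes is to list the intergenic regions that are long enough to hide one. The sketch below only extracts such gaps from existing gene coordinates. It is not GenePRIMP's implementation, and the function name, the coordinate convention, and the 150 bp cutoff are assumptions for illustration.

```python
def intergenic_regions(genes, contig_length, min_length=150):
    """Return gaps between annotated genes that could hide a missed ORF.

    genes: iterable of (start, end) pairs, 0-based and end-exclusive,
    on either strand. Overlapping genes are merged implicitly.
    """
    gaps = []
    cursor = 0                              # end of the rightmost gene seen so far
    for start, end in sorted(genes):
        if start - cursor >= min_length:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if contig_length - cursor >= min_length:
        gaps.append((cursor, contig_length))  # trailing gap at the contig end
    return gaps
```

Each returned region could then be searched against a protein database (for example with a translated search) to check whether it contains an unannotated CDS.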

25. … and wrong ORFs
   or2654 is unique and hides a real CDS, which is acyl carrier protein

26. Everything looks perfect in this area of Nitrobacter winogradskyi, but …

27. … hides a real ORF

28. Guinness Book of protein-coding genes
   The longest human gene is 2,220,223 nucleotides long. It has 79 exons, totaling only 11,058 nucleotides, which specify the sequence of the 3,685 amino acids of the protein dystrophin. Dystrophin is part of a protein complex located in the cell membrane, which transfers the force generated by the actin-myosin structure inside the muscle fiber to the entire fiber.
   The smallest human gene is 252 nucleotides long. It specifies a polypeptide of 67 amino acids and codes for insulin-like growth factor II.
   The longest bacterial gene is 110,418 nucleotides long and specifies a sequence of 36,805 amino acids. Its function is unknown; it is most likely a surface protein.
   The smallest bacterial gene is 54 nucleotides long. It specifies a polypeptide of 17 amino acids and codes for a regulatory protein in cyanobacteria.

29. False positives
   Genome name | CDSs with no hits < 100 aa | % with tBLASTn hit | % tBLASTn hits with frameshifts/stop codons
   Prochlorococcus AS9601 | 18 | 88.9 | 68.8
   Prochlorococcus MIT 9211 | 62 | 40.3 | 80
   Prochlorococcus MIT 9215 | 24 | 58.3 | 64.2
   Prochlorococcus MIT 9301 | 12 | 75 | 66.7
   Prochlorococcus MIT 9303 | 501 | 83 | 61.8
   Prochlorococcus MIT 9313 | 35 | 8.6 | 66.7
   Prochlorococcus MIT 9515 | 32 | 81.3 | 50
   Prochlorococcus NATL1A | 209 | 95.2 | 48.2
   Prochlorococcus CCMP1375 | 50 | 34 | 82.4
   Synechococcus PCC 7942 | 39 | 0 | 0
   Synechococcus CC9311 | 313 | 11.5 | 83.3
   Synechococcus CC9605 | 83 | 38.6 | 81.3
   Synechococcus CC9902 | 21 | 57.1 | 100
   Synechococcus JA-2-3Ba | 176 | 26.7 | 85.1
   Synechococcus JA-3-3Ab | 142 | 35.2 | 92
   Synechococcus PCC 7002 | 93 | 17.2 | 56.3
   Synechococcus RCC307 | 184 | 10.3 | 68.4
   Synechococcus WH 7803 | 32 | 18.8 | 83.3
   Synechococcus WH 8102 | 39 | 38.4 | 46.7

30. Part 2. Finding the functions in microbial genomes
   1. Introduction
   2. Tools out there
   3. Known problems

31. What is function?
   cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
   molecular/enzymatic (methyltransferase)
      Reaction (methylation)
      Substrate (cobalt-precorrin-4)
      Ligand (S-adenosyl-L-methionine)
   metabolic (cobalamin biosynthesis)
   physiological (maintenance of healthy nerve and red blood cells, through B12)

32. Functional characterization

33. Computational approaches to Functional characterization

34. Sequence Homology
   Two sequences are homologous if there existed a molecule in the past that is ancestral to both of the sequences.
   Types of homology:
      Orthology: bifurcation in molecular tree reflects speciation
      Paralogy: bifurcation in molecular tree reflects gene duplication

35. Homology & analogy
   The term homology is confounded & abused in the literature!
      Sequences are homologous if they’re related by divergence from a common ancestor.
      Analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution,
      e.g., β-barrels occur in soluble serine proteases & integral membrane porins; chymotrypsin & subtilisin share groups of catalytic residues, with near-identical spatial geometries, but no other similarities.
   Homology is not a measure of similarity & is not quantifiable.
      It is an absolute statement that sequences have a divergent rather than a convergent relationship.
      The phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!

36. Function prediction
   Function transfer by homology
   Homology implies a common evolutionary origin, not retention of similarity in any of the sequences' properties.
   Homology ≠ similarity of function.
   Punta & Ofran. PLOS Comp Biol. 2008

37. Dos and Don’ts
   Type | Don’t | Do
   Homology | Same function | Probability for same function
   Orthology | Same function | Probability for same function
   Paralogy | Same function | Probability for same function
   Sequence similarity | Same function | Probability for same function
   High sequence similarity | Same function | Probability for same function
   Same sequence | Same function | Probability for same function

38. Application areas of analysis tools
   The scale indicates % identity between aligned sequences.
   Alignment of 2 random sequences can produce ~20% identity.
   Less than 20% does not constitute a significant alignment.
   Around this threshold is the Twilight Zone, where alignments may appear plausible to the eye but can’t be proved by conventional methods.
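The % identity scale on this slide can be made concrete with a short calculation on a pairwise alignment. This is a minimal sketch: excluding gapped columns from the denominator is one of several conventions, the 20% threshold comes from the slide, and the 5-point width of the "twilight zone" band and the function names are assumptions for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue                    # skip columns with a gap in either row
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_zone(pid, threshold=20.0, margin=5.0):
    """Label a percent identity relative to the ~20% random-alignment level."""
    if pid < threshold:
        return "not significant"
    if pid < threshold + margin:        # assumed width of the band
        return "twilight zone"
    return "above twilight zone"
```

For example, `percent_identity("ACDEFG", "ACDQFG")` gives about 83.3, well above the band, while an alignment at 22% identity would be labeled "twilight zone" and should not be trusted on identity alone.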

39. Finding the functions in microbial genomes
   1. Introduction
   2. Tools out there
   3. Known problems

40. Function prediction
   Similarity searches (BLAST).
   Domain identification (Pfam).
   Small sequence identification (PROSITE).

41. What if nothing is similar?
   Subcellular localization
   Gene context
   Structure
   Prediction of binding residues (DISIS, bindN)
   Figure labels: Cytoplasm; S ~ S; S ~ S; Periplasm

42. Annotation should make sense
   Figure labels: Genome annotation; Model pathway; Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 2, Enzyme 1, Enzyme 3; Enzyme 2, ??, Enzyme 1, Enzyme 3

43. Annotation should make sense

44. Databases
   Databases used for the analysis of biological molecules.
   Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
   Why bother?
      Store information.
      Organize data.
      Predict features (genes, functions ...).
      Predict the functional role of a feature (annotation).
      Understand relationships (metabolic reconstruction).

45. Primary nucleotide databases
   EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
   Archive containing all sequences from:
      genome projects
      sequencing centers
      individual scientists
      patent offices
   The sequences are exchanged between the three centers on a daily basis.
   Database is doubling every 10 months.
   Sequences from >140,000 different species.
   1400 new species added every month.
   Database name: nt / nr
   Year | Base pairs | Sequences
   2004 | 44,575,745,176 | 40,604,319
   2005 | 56,037,734,462 | 52,016,762
   2006 | 69,019,290,705 | 64,893,747
   2007 | 83,874,179,730 | 80,388,382
   2008 | 99,116,431,942 | 98,868,465

46. Primary protein sequence databases
   Contain coding sequences derived from the translation of nucleotide sequences.
   GenBank
      Valid translations (CDS) from nt GenBank entries.
   UniProtKB/TrEMBL (1996)
      Automatic CDS translations from EMBL.
      TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.

47. RefSeq
   Curated transcripts and proteins: reviewed by NCBI staff.
   Model transcripts and proteins: generated by computer algorithms.
   Assembled genomic regions (contigs).
   Chromosome records.

48. Classification databases
   Groups (families/clusters) of proteins based on…
      Overall sequence similarity.
      Local sequence similarity.
      Presence / absence of specific features.
      Structural similarity.
      ...
   These groups contain proteins with similar properties.
      Specific function, enzymatic activity.
      Broad function.
      Evolutionary relationship.
      …

49. Overall sequence similarity

50. Clusters of orthologous groups (COGs)
   COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
   Each cluster has representatives of at least 3 lineages.
   A function (specific or broad) has been assigned to each COG.
   http://www.ncbi.nlm.nih.gov/COG/

51. How it works
   Reciprocal best hit
   Bidirectional best hit
   BLAST best hit
   Unidirectional best hit
   Figure labels: COG1; COG2
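The difference between a unidirectional and a bidirectional (reciprocal) best hit can be sketched in a few lines. This is an illustration only, not the COG construction procedure: the input is assumed to be a dictionary of similarity scores (for example BLAST bit scores) keyed by (query, subject), and all names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

A gene whose best hit does not point back to it (a unidirectional best hit) is left out of the result, which is why bidirectional best hits are the more conservative evidence for orthology.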

52. Profiles & Pfam
   A method for classifying proteins into groups exploits region similarities, which contain valuable information (domains/profiles).
   These domains/profiles can be used to detect distant relationships, where only few residues are conserved.

53. Regions similarity

54. Pfam
   http://pfam.sanger.ac.uk
   HMMs of protein alignments: local (for domains) or global (cover whole protein)

55. PROSITE
   http://au.expasy.org/prosite/
   R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
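A PROSITE pattern like the one on this slide can be read as a regular expression: `x` is any residue, `[DT]` is one of the listed residues, and `(3)` is a repeat count. The sketch below converts the common pattern elements to Python `re` syntax. It is a simplified converter written for illustration, with a hypothetical function name; it does not cover every detail of the PROSITE syntax (for instance anchors inside brackets).

```python
import re

# One pattern element: optional '<' anchor, a residue specification,
# an optional repeat count such as (3) or (2,4), and an optional '>' anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        n_anchor, residue, low, high, c_anchor = match.groups()
        if residue == "x":
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_anchor else "") + residue + repeat
                     + ("$" if c_anchor else ""))
    return "".join(parts)
```

Applied to the pattern on the slide, the converter yields `RY.[DT]W.[LIVM][ST]TP[LIVM]{3}`, which can be passed to `re.search` to scan a protein sequence. Such a match is only a hint about function; the hit still needs to be judged in context.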

56. KEGG orthology

57. Composite pattern databases
   To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
   Release 32.0 (Apr 11) contains 21516 entries
   Central annotation resource, with pointers to its satellite dbs
   http://www.ebi.ac.uk/interpro/

58. * It is up to the user to decide if the annotation is correct *

59. KEGG
   Contains information about biochemical pathways and protein interactions.
   http://www.kegg.com

60. Summary
   We have main archives (GenBank), curated databases (RefSeq, SwissProt), and protein classification databases (COG, Pfam).
   This is the tip of the iceberg of databases.
   They help predict the function, or the network of functions.
   Systems that integrate the information from several databases, visualize it, and allow handling of data in an intuitive way are required.

61. Functional annotation in IMG
   Automated protein product assignment pipeline
   Functional context in IMG:
      KEGG Pathways, Modules, KEGG Orthology
      MetaCyc Pathways
      IMG Pathways
   No longer maintained:
      TIGR Role Categories
      TIGR Genome Properties
      COG Functional Categories

62. Lack of Standards