MICROBIAL GENOME ANNOTATION - PowerPoint Presentation

Uploaded On 2022-08-03

Presentation Transcript

Slide1

MICROBIAL GENOME ANNOTATION

Loren Hauser, Miriam Land, Yun-Juan Chang, Frank Larimer, Doug Hyatt, Cynthia Jeffries

Slide2

NEB Educational Support


http://www.neb.com/nebecomm/course_support.asp?

Slide3

Why study Computational Biology and Bioinformatics?

DNA sequencing output is growing faster than Moore’s law!

1 Illumina sequencing machine = 0.5 Tbp/week

There are hundreds of these and thousands of other sequencing machines around the world.

New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!

Slide4

Why study Medical Bioinformatics?

In the near future, most cancer diagnostics will involve DNA or RNA sequencing!

In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor's ability to use that information are the only real impediments!

Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections.

Slide5

DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics


http://www.jgi.doe.gov/education

Slide6

Why Study Microbial Genomes?

Large biological mass (50% of total)

photosynthetic (Prochlorococcus)

fix N2 gas to NH3 (Rhodopseudomonas)

NH3 to NO2 (Nitrosomonas)

bioremediation (Shewanella, Burkholderia)

pathogens, BW (Yersinia pestis - plague)

food production (Lactobacillus)

CH4 production (Methanosarcina)

H2 production (Rhodopseudomonas)

Slide7

Example of Current Microbial Genome Projects

UC Davis – FDA funded 100K bacterial genomes project associated with food.

100K genomes over 5 years = 20K per year; 20K per year / 200 days/year = 100 genomes/day!
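The throughput estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch that restates the slide's own figures; the variable names are illustrative, and the reading of "200 days/year" as sequencing days is an assumption.

```python
# Back-of-the-envelope throughput for the 100K bacterial genomes project.
# All three inputs are the figures quoted on the slide.
total_genomes = 100_000
project_years = 5
sequencing_days_per_year = 200  # assumed to mean days on which sequencing runs

genomes_per_year = total_genomes / project_years
genomes_per_day = genomes_per_year / sequencing_days_per_year

print(f"{genomes_per_year:.0f} genomes/year, {genomes_per_day:.0f} genomes/day")
```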

Slide8

Web Resources and Contact Information

http://genome.ornl.gov/microbial/

http://www.jgi.doe.gov/

http://genome.jgi-psf.org/

http://www.jcvi.org/

http://www.ncbi.nlm.nih.gov/

http://www.sanger.ac.uk/

http://www.ebi.ac.uk/

ftp://ftp.lsd.ornl.gov/pub/JGI

Artemis-ready files for each scaffold (feature table plus FASTA sequence file)

Contact: landml@ornl.gov; hauserlj@ornl.gov

Slide9


Slide10

Evolution of Sequencing Throughput

Slide11

Sequenced Microbial Genomes

ARCHAEAL GENOMES: 159 FINISHED; 218 IN PROGRESS

BACTERIAL GENOMES: 3363 FINISHED; 11831 IN PROGRESS

ENVIRONMENTAL COMMUNITIES: > 50,000 samples (see MG-RAST)

as of Sept 6, 2012

http://www.expasy.ch/alinks.html

http://www.genomesonline.org

http://metagenomics.anl.gov/

Slide12

Published Genomes

Nitrosomonas europaea - J. Bac. 185(9):2759-2773 (2003)

Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003)

Synechococcus WH8102 - Nature 424:1037-1042 (2003)

Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)

Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004)

Nitrobacter winogradskyi – Appl. Envir. Micro. 72(3):2050-63 (2006)

Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006)

Burkholderia xenovorans – PNAS 103(42):15280-7 (2006)

Thiomicrospira crunogena – PLoS Biology 4(12):e383 (2006)

Nitrosomonas eutropha C91 – Env. Micro. 9(12):2993-3007 (2007)

Sulfuromonas denitrificans – Appl. Envir. Micro. 74(4):1145-56 (2008)

Nitrosospira multiformis -- Appl. Envir. Micro. 74(11):3559-72 (2008)

Nitrobacter hamburgensis -- Appl. Envir. Micro. 74(9):2852-63 (2008)

Saccharophagus degradans – PLoS Genetics 4(5):e1000087 (2008)

R. palustris – 5 strain comparison – PNAS 105(47):18543-8 (2008)

L. rubarum and L. ferrodiazotrophum – Appl. Envir. Micro. (in press)

Slide13

Basic Annotation Impacts

Design of oligonucleotide arrays

Design & prioritize protein expression constructs

Design & prioritize gene knockouts

Assessment of overall metabolic capacity

Database for proteomics

Allows visualization of whole genome

Slide14

Additional Analysis Impacts

Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile

Regulatory motif discovery

Operon and regulon discovery

Regulatory and protein association network discovery

Slide15

Microbial Genome Annotation Pipeline

[Flowchart. Boxes: Scaffolds or contigs; Prodigal; Model correction; Final Gene List; InterPro; COGs; Blast; PRIAM; TMHMM; SignalP; Function call; tRNAs; rRNA, Misc_RNAs; Complex Repeats; Simple repeats; GC Content, GC skew; Feature table; Web Pages]

Slide16

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)

Unsupervised: Automatically learns the statistical properties of the genome.

Indifferent to GC Content: Prodigal performs well irrespective of the GC content of the organism.

Draft: Prodigal can train on multiple sequences then analyze individual draft sequences.

Open Source: Prodigal is freely available under the GPL.

Reference

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)

Slide17

G+C Frame Plot Training

Takes all ORFs above a specified length in the genome.

Examines the G+C bias in each frame position of these ORFs.

Does a dynamic programming algorithm using G+C frame bias as its coding scoring function to predict genes.

Takes those predicted genes and gathers dicodon usage statistics.
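The first two training steps above can be illustrated with a short sketch. It is not Prodigal's implementation: `forward_orfs` and `gc_frame_bias` are hypothetical helpers, only the forward strand is scanned, ORFs are taken as ATG-to-stop, and the bias is reported simply as the G+C fraction at each of the three codon positions.

```python
STOPS = {"TAA", "TAG", "TGA"}

def forward_orfs(seq, min_len=90):
    """Yield ATG-to-stop ORFs (stop codon excluded) from the three forward frames."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i - start >= min_len:
                    yield seq[start:i]
                start = None

def gc_frame_bias(orfs):
    """Return the G+C fraction at codon positions 1, 2 and 3 over all ORFs."""
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for orf in orfs:
        usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
        for i in range(usable):
            total[i % 3] += 1
            if orf[i] in "GC":
                gc[i % 3] += 1
    return [g / t if t else 0.0 for g, t in zip(gc, total)]

# Toy example: one short ORF, so min_len is lowered from its default.
toy_orfs = list(forward_orfs("ATG" + "GCC" * 4 + "TAA", min_len=9))
print(gc_frame_bias(toy_orfs))
```

In a real genome, a codon position whose G+C fraction stands out from the other two gives a frame-specific signal that a coding score could build on; how Prodigal turns that signal into a score is not shown here.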

Slide18

Gene Prediction

Dicodon usage coding score.

Length factor added to coding score (GC-content-dependent).

Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores get penalized by the difference).

Dynamic programming to put genes together.

Bonuses for operon distances, larger bonus for -1/-4 overlaps.

Same strand overlap allowed (up to 60 bases).

Opposite strand -->3'r 5'f<- allowed (up to 250 bases).
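A dicodon (hexamer) coding score of the kind named in the first item can be sketched as a log-odds table. This is a generic formulation, not Prodigal's code: the function names, the add-one pseudocount, and the use of all genome hexamers as the background are assumptions made for illustration. The length factor, threshold sharpening, and overlap rules from the slide are not modelled.

```python
import math
from collections import Counter

def hexamer_counts(seqs, step):
    """Count 6-mers in each sequence, sampling every `step` bases."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts

def dicodon_log_odds(training_genes, genome, pseudocount=1.0):
    """Map hexamer -> log(P(in-frame in training genes) / P(anywhere in genome))."""
    coding = hexamer_counts(training_genes, step=3)   # in-frame dicodons
    background = hexamer_counts([genome], step=1)     # every hexamer
    n_coding = sum(coding.values()) + pseudocount * 4 ** 6
    n_background = sum(background.values()) + pseudocount * 4 ** 6
    scores = {}
    for hexamer in set(coding) | set(background):
        p_coding = (coding[hexamer] + pseudocount) / n_coding
        p_background = (background[hexamer] + pseudocount) / n_background
        scores[hexamer] = math.log(p_coding / p_background)
    return scores

def coding_score(orf, scores):
    """Sum the log-odds of the in-frame dicodons; unseen hexamers score 0."""
    orf = orf.upper()
    return sum(scores.get(orf[i:i + 6], 0.0) for i in range(0, len(orf) - 5, 3))

# Toy data: one "gene" followed by a T-rich noncoding stretch.
toy_gene = "ATGGCCGCCGCCGCCTAA"
toy_genome = toy_gene + "T" * 18
toy_scores = dicodon_log_odds([toy_gene], toy_genome)
print(coding_score(toy_gene, toy_scores), coding_score("T" * 12, toy_scores))
```

With these toy inputs the gene scores above zero and the T-rich stretch below zero, which is the separation a coding score is meant to provide.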

Slide19

Start Site Scoring – Shine-Dalgarno Motif

Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency).

Moves starts based on these discoveries.

Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations).

RBS motifs based on AGGAGG sequence, 3-6 base motifs, with one mismatch allowed in 5 base or longer motifs (e.g. GGTGG, or AGCAG).

Does a final dynamic programming with the start scoring function.
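The RBS motif rule above (3-6 base pieces of AGGAGG, one mismatch allowed at 5 bases or longer) can be sketched as a small matcher. The helper names are hypothetical, and the ranking used here (longer motif first, then fewer mismatches) is an assumption for the example, not Prodigal's documented scoring.

```python
SD_CONSENSUS = "AGGAGG"

def sd_motifs(min_len=3, max_len=6):
    """All substrings of the consensus with lengths min_len..max_len, sorted."""
    motifs = {
        SD_CONSENSUS[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(SD_CONSENSUS) - n + 1)
    }
    return sorted(motifs)

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_rbs_match(upstream):
    """Return (motif, matched_window, mismatch_count) or None if nothing matches."""
    upstream = upstream.upper()
    best_key, best = None, None
    for motif in sd_motifs():
        allowed = 1 if len(motif) >= 5 else 0  # one mismatch only for 5+ bases
        for i in range(len(upstream) - len(motif) + 1):
            window = upstream[i:i + len(motif)]
            mm = mismatches(window, motif)
            if mm > allowed:
                continue
            key = (len(motif), -mm)  # assumed ranking: longer, then cleaner
            if best_key is None or key > best_key:
                best_key, best = key, (motif, window, mm)
    return best

# The two one-mismatch examples quoted on the slide:
print(best_rbs_match("TTGGTGGTT"))
print(best_rbs_match("AGCAG"))
```

Both slide examples resolve to a 5-base piece of AGGAGG with a single mismatch: GGTGG against GGAGG, and AGCAG against AGGAG.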

Slide20

Start Site Scoring – Other Motifs

If Shine-Dalgarno scoring is strong, use it – this accounts for ~85% of genomes.

If Shine-Dalgarno scoring is weak, look for other motifs.

If a strong scoring motif is found, use it (example: GGTG in A. pernix).

If no strong scoring motif is found, use the highest score of all found motifs (example: Crenarchaea, where the transcription (Tc) and translation (Tl) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs).
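The fallback order described above can be written as a small decision function. This is a sketch only: `choose_start_model`, the score scale, and the `strong_threshold` parameter are hypothetical, since the slide does not say how "strong" is measured.

```python
def choose_start_model(sd_score, motif_scores, strong_threshold):
    """Pick the start-site scoring strategy in the order given on the slide.

    sd_score         -- genome-wide Shine-Dalgarno score (hypothetical scale)
    motif_scores     -- dict mapping alternative motif -> score
    strong_threshold -- cutoff above which a score counts as "strong" (assumed)
    """
    if sd_score >= strong_threshold:
        return ("shine_dalgarno", None)
    if motif_scores:
        best = max(motif_scores, key=motif_scores.get)
        if motif_scores[best] >= strong_threshold:
            return ("single_motif", best)
    # No strong motif: score each start with the best of all found motifs.
    return ("best_of_all_motifs", None)

print(choose_start_model(0.2, {"GGTG": 1.5}, strong_threshold=1.0))
```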

Slide21

Annotated Gene Prediction

Slide22

Prodigal Scoring

Slide23

Gene Prediction Problems – Pseudogenes

Slide24

Pseudogenes – Internal deletion

Slide25

Pseudogenes – Premature stop codon

Slide26

Pseudogenes – N-terminal deletion

Slide27

Pseudogenes – Transposon insertion

Slide28

Pseudogenes – Multiple frameshifts

Slide29

Pseudogenes – Premature Stop and Frameshift

Slide30

Pseudogenes – Dead Start Codon

Slide31

Slide32

GENE PAGE

Slide33

Slide34

Slide35

Slide36

ORGANISM’S (PSYC) COGS LIST

Slide37

Taxonomic Distribution of Top KEGG BLAST Hits

Slide38

Frequency distance distributions

Salgado et al. PNAS (2000) 97:6652, Fig. 2

Slide39

Frequency distance distributions

Salgado et al. PNAS (2000) 97:6652, Fig. 3b

Slide40

Branched Chain Amino Acid Transporter family

Slide41

Probable Ancient Gene (Liv Operon)

Slide42

Branched Chain Amino Acid Transporter family – Rhodopseudomonas palustris

Slide43

Example of Lateral Transfer

Slide44

Transporter Gene Loss in Yersinia pestis

36 genes involved in transport in Y. pseudotuberculosis (YPSE) are nonfunctional in Y. pestis (YPES)

13 lost due to frameshifts

11 lost due to deletions

6 lost due to IS element insertions

4 (2 pairs) lost due to recombination causing deletions and frameshifts

2 lost due to premature stop codons

Slide45

Slide46

Nostoc punctiforme Signal Transduction Histidine Kinases

Slide47

Nostoc punctiforme Signal Transduction Histidine Kinases

Slide48

Nostoc punctiforme Signal Transduction Histidine Kinases

Slide49

Nostoc punctiforme Signal Transduction Histidine Kinases

Slide50

Nostoc punctiforme Regulatory Proteins

Slide51

Burkholderia xenovorans Regulatory Proteins

Slide52

Regulatory Protein Identification Scheme

Slide53

Summary of automated transporter annotation --- Zymomonas

Slide54

Zymomonas transporters: complete listing

Slide55

Transcriptome Analysis Pipeline: RNA sequences to GRN

Collect RNAseq data

Map reads to genomes

Calculate reads/bp

Display frequency plot

Determine operons from frequency plot

Compare operon determinations (genome co-ordinates)

Predict operons in silico

Improve algorithm

Determine orthologous operons

Determine orthologs with OrthoMCL

Align orthologous promoters

Determine TFBS from alignments

Determine TISs with 5’ RACE.

Cluster analysis from gene expression arrays

Predict TFBS in silico

Cluster analysis of gene expression changes

GRN = genetic regulatory network
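The "calculate reads/bp" and "determine operons from frequency plot" steps can be sketched as follows. This is a simplified illustration, not the authors' pipeline: reads are assumed to be already mapped and given as 0-based half-open (start, end) intervals inside the genome, strand is ignored, and a fixed depth threshold stands in for whatever criterion is actually used to call transcribed regions.

```python
def coverage_per_base(genome_length, reads):
    """Per-base read depth from (start, end) half-open mapped intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1   # depth rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    coverage, depth = [], 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage

def expressed_regions(coverage, min_depth, min_len=1):
    """Contiguous runs with depth >= min_depth, as (start, end) half-open."""
    regions, start = [], None
    # A sentinel below the threshold closes a run that reaches the last base.
    for i, depth in enumerate(list(coverage) + [min_depth - 1]):
        if depth >= min_depth and start is None:
            start = i
        elif depth < min_depth and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

toy_cov = coverage_per_base(10, [(1, 4), (2, 5), (7, 9)])
print(toy_cov)
print(expressed_regions(toy_cov, min_depth=1))
```

Regions found this way are only candidate transcription units; comparing them with annotated gene coordinates is the step the slide lists as "Compare operon determinations (genome co-ordinates)".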

Slide56

Dynamic range and sensitivity

Slide57

New gene, wrong start, riboswitch

Slide58

Small Regulatory RNA ???

Slide59

Differential gene expression

Slide60

Operon with Internal Promoter


Slide61

Long Term Vision

Develop TPing SOPs, and an automated analysis pipeline.

Initially produce TPs and preliminary GRNs for all important DOE microbial genomes (i.e. BESC), and eventually all DOE microbial genomes.

Incorporate the TP analysis pipeline into ORNL’s automated microbial annotation pipeline, and eventually into IMG and GenBank files.

Add additional experimental methods to improve the GRN determinations.