Slide1
MICROBIAL GENOME ANNOTATION
Loren Hauser, Miriam Land, Yun-Juan Chang, Frank Larimer, Doug Hyatt, Cynthia Jeffries
Slide2
NEB Educational Support
http://www.neb.com/nebecomm/course_support.asp?
Slide3
Why study Computational Biology and Bioinformatics?
DNA sequencing output is growing faster than Moore’s law!
1 Illumina sequencing machine = 0.5 Tbp/week
There are hundreds of these and thousands of other sequencing machines around the world.
New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!
Slide4
Why study Medical Bioinformatics?
In the near future, most cancer diagnostics will involve DNA or RNA sequencing!
In the near future, every baby born in the developed world will have their genome sequenced. Protecting privacy and your doctor's ability to use that information are the only real impediments!
Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections.
Slide5
DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics
http://www.jgi.doe.gov/education
Slide6
Why Study Microbial Genomes?
Large biological mass (50% of total)
photosynthetic (Prochlorococcus)
fix N2 gas to NH3 (Rhodopseudomonas)
NH3 to NO2 (Nitrosomonas)
bioremediation (Shewanella, Burkholderia)
pathogens, BW (Yersinia pestis - plague)
food production (Lactobacillus)
CH4 production (Methanosarcina)
H2 production (Rhodopseudomonas)
Slide7
Example of Current Microbial Genome Projects
UC Davis: FDA-funded 100K bacterial genomes project associated with food.
100K genomes / 5 years = 20K per year; 20K per year / 200 days/year = 100 genomes/day!
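The throughput estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the 200 working days per year is the slide's own assumption.

```python
# Back-of-the-envelope throughput for the 100K bacterial genomes project.
total_genomes = 100_000        # project target (100K genomes)
project_years = 5              # project duration stated on the slide
working_days_per_year = 200    # working days per year assumed on the slide

genomes_per_year = total_genomes / project_years
genomes_per_day = genomes_per_year / working_days_per_year

print(f"{genomes_per_year:.0f} genomes/year, {genomes_per_day:.0f} genomes/day")
```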
Slide8
Web Resources and Contact Information
http://genome.ornl.gov/microbial/
http://www.jgi.doe.gov/
http://genome.jgi-psf.org/
http://www.jcvi.org/
http://www.ncbi.nlm.nih.gov/
http://www.sanger.ac.uk/
http://www.ebi.ac.uk/
ftp://ftp.lsd.ornl.gov/pub/JGI
Artemis-ready files for each scaffold (feature table plus FASTA sequence file)
Contact: landml@ornl.gov; hauserlj@ornl.gov
Slide9
Slide10
Evolution of Sequencing Throughput
Slide11
Sequenced Microbial Genomes
ARCHAEAL GENOMES: 159 FINISHED; 218 IN PROGRESS
BACTERIAL GENOMES: 3363 FINISHED; 11831 IN PROGRESS
ENVIRONMENTAL COMMUNITIES: > 50,000 samples (see MG-RAST)
as of Sept 6, 2012
http://www.expasy.ch/alinks.html
http://www.genomesonline.org
http://metagenomics.anl.gov/
Slide12
Published Genomes
Nitrosomonas europaea - J. Bac. 185(9):2759-2773 (2003)
Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003)
Synechococcus WH8102 - Nature 424:1037-1042 (2003)
Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)
Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004)
Nitrobacter winogradskyi - Appl. Envir. Micro. 72(3):2050-63 (2006)
Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006)
Burkholderia xenovorans - PNAS 103(42):15280-7 (2006)
Thiomicrospira crunogena - PLoS Biology 4(12):e383 (2006)
Nitrosomonas eutropha C91 - Env. Micro. 9(12):2993-3007 (2007)
Sulfuromonas denitrificans - Appl. Envir. Micro. 74(4):1145-56 (2008)
Nitrosospira multiformis - Appl. Envir. Micro. 74(11):3559-72 (2008)
Nitrobacter hamburgensis - Appl. Envir. Micro. 74(9):2852-63 (2008)
Saccharophagus degradans - PLoS Genetics 4(5):e1000087 (2008)
R. palustris, 5 strain comparison - PNAS 105(47):18543-8 (2008)
L. rubarum and L. ferrodiazotrophum - Appl. Envir. Micro. (in press)
Slide13
Basic Annotation Impacts
Design of oligonucleotide arrays
Design & prioritize protein expression constructs
Design & prioritize gene knockouts
Assessment of overall metabolic capacity
Database for proteomics
Allows visualization of whole genome
Slide14
Additional Analysis Impacts
Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile
Regulatory motif discovery
Operon and regulon discovery
Regulatory and protein association network discovery
Slide15
Microbial Genome Annotation Pipeline
[Diagram. Components shown: scaffolds or contigs; Prodigal; model correction; final gene list; Blast; InterPro; COGs; PRIAM; TMHMM; SignalP; function call; tRNAs; rRNA, misc_RNAs; complex repeats; simple repeats; GC content, GC skew; feature table; web pages]
Slide16
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
Unsupervised: Automatically learns the statistical properties of the genome.
Indifferent to GC Content: Prodigal performs well irrespective of the GC content of the organism.
Draft: Prodigal can train on multiple sequences then analyze individual draft sequences.
Open Source: Prodigal is freely available under the GPL.
Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)
Slide17
G+C Frame Plot Training
Takes all ORFs above a specified length in the genome.
Examines the G+C bias in each frame position of these ORFs.
Runs a dynamic programming algorithm, using G+C frame bias as its coding scoring function, to predict genes.
Takes those predicted genes and gathers dicodon usage statistics.
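The first two training steps can be illustrated with a short sketch. This is a simplified illustration, not Prodigal's actual implementation: the function name, the `min_length` default, and the toy sequences are all assumptions. It keeps ORFs at or above a length cutoff and reports the G+C fraction at codon positions 1, 2 and 3.

```python
from typing import Iterable, List


def gc_by_codon_position(orfs: Iterable[str], min_length: int = 90) -> List[float]:
    """Return the G+C fraction at codon positions 1, 2 and 3.

    Only ORFs of at least `min_length` bases are used (hypothetical cutoff).
    """
    gc_counts = [0, 0, 0]
    totals = [0, 0, 0]
    for orf in orfs:
        seq = orf.upper()
        if len(seq) < min_length:
            continue  # skip ORFs below the length cutoff
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(usable):
            pos = i % 3  # 0, 1, 2 = first, second, third codon position
            totals[pos] += 1
            if seq[i] in "GC":
                gc_counts[pos] += 1
    return [g / t if t else 0.0 for g, t in zip(gc_counts, totals)]


# Toy example with two very short "ORFs" and a matching low cutoff.
toy_orfs = ["ATGGCGGCCTAA", "ATGAAATAA"]
frame_bias = gc_by_codon_position(toy_orfs, min_length=9)
print(frame_bias)
```

In a real genome the position with the strongest G+C bias differs between coding and non-coding frames, which is what makes this signal usable as a first-pass coding score.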
Slide18
Gene Prediction
Dicodon usage coding score.
Length factor added to coding score (GC-content-dependent).
Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores are penalized by the difference).
Dynamic programming to put genes together.
Bonuses for operon distances, larger bonus for -1/-4 overlaps.
Same strand overlap allowed (up to 60 bases).
Opposite strand overlap (-->3'r 5'f<-) allowed (up to 250 bases).
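The two overlap limits can be expressed as a small check. This is a sketch under stated assumptions: the `Gene` class and function names are hypothetical, coordinates are taken as 1-based and inclusive, and the 250-base limit is applied to any opposite-strand pair, whereas the slide's arrow notation suggests the real tool only permits a particular orientation.

```python
from dataclasses import dataclass

MAX_SAME_STRAND_OVERLAP = 60       # bases, from the slide
MAX_OPPOSITE_STRAND_OVERLAP = 250  # bases, from the slide


@dataclass(frozen=True)
class Gene:
    start: int   # leftmost coordinate, 1-based
    end: int     # rightmost coordinate, 1-based, inclusive
    strand: str  # "+" or "-"


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of bases shared by the two genes (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def overlap_allowed(a: Gene, b: Gene) -> bool:
    """Apply the same-strand or opposite-strand limit (orientation ignored)."""
    if a.strand == b.strand:
        limit = MAX_SAME_STRAND_OVERLAP
    else:
        limit = MAX_OPPOSITE_STRAND_OVERLAP
    return overlap_length(a, b) <= limit


print(overlap_allowed(Gene(1, 100, "+"), Gene(30, 200, "+")))  # 71-base overlap, same strand
print(overlap_allowed(Gene(1, 100, "+"), Gene(30, 200, "-")))  # 71-base overlap, opposite strands
```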
Slide19
Start Site Scoring: Shine-Dalgarno Motif
Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency).
Moves starts based on these discoveries.
Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations).
RBS motifs based on AGGAGG sequence, 3-6 base motifs, with one mismatch allowed in 5 base or longer motifs (e.g. GGTGG, or AGCAG).
Runs a final dynamic programming pass with the start scoring function.
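The RBS motif rule can be sketched as a simple matcher. This is an illustration only, with hypothetical function names: it treats every 3-6 base substring of AGGAGG as a motif and allows one mismatch anywhere in motifs of 5 bases or more. The real tool may restrict where the mismatch can fall and also scores the spacer distance to the start codon, which is ignored here.

```python
SD_CONSENSUS = "AGGAGG"  # Shine-Dalgarno consensus from the slide


def sd_submotifs(min_len: int = 3, max_len: int = 6) -> set:
    """All substrings of the consensus with length min_len..max_len."""
    motifs = set()
    for length in range(min_len, max_len + 1):
        for i in range(len(SD_CONSENSUS) - length + 1):
            motifs.add(SD_CONSENSUS[i:i + length])
    return motifs


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def matches_rbs_motif(candidate: str) -> bool:
    """True if candidate is a 3-6 base piece of the consensus,
    allowing one mismatch when the candidate is 5 bases or longer."""
    candidate = candidate.upper()
    n = len(candidate)
    if not 3 <= n <= 6:
        return False
    allowed_mismatches = 1 if n >= 5 else 0
    return any(
        len(motif) == n and hamming(candidate, motif) <= allowed_mismatches
        for motif in sd_submotifs()
    )


for seq in ["GGTGG", "AGCAG", "GGTG", "GGA"]:
    print(seq, matches_rbs_motif(seq))
```

The slide's two examples, GGTGG and AGCAG, each differ from a 5-base piece of AGGAGG (GGAGG and AGGAG) at one position, so both pass under this rule, while the 4-base GGTG does not.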
Slide20
Start Site Scoring: Other Motifs
If Shine-Dalgarno scoring is strong, use it; this accounts for ~85% of genomes.
If Shine-Dalgarno scoring is weak, look for other motifs.
If a strong scoring motif is found, use it (example: GGTG in A. pernix).
If no strong scoring motif is found, use the highest score of all found motifs (example: Crenarchaea, where transcription (Tc) and translation (Tl) start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs).
Slide21
Annotated Gene Prediction
Slide22
Prodigal Scoring
Slide23
Gene Prediction Problems – Pseudogenes
Slide24
Pseudogenes – Internal deletion
Slide25
Pseudogenes – Premature stop codon
Slide26
Pseudogenes – N-terminal deletion
Slide27
Pseudogenes – Transposon insertion
Slide28
Pseudogenes – Multiple frameshifts
Slide29
Pseudogenes – Premature Stop and Frameshift
Slide30
Pseudogenes – Dead Start Codon
Slide31
Slide32
GENE PAGE
Slide33
Slide34
Slide35
Slide36
ORGANISM’S (PSYC) COGS LIST
Slide37
Taxonomic Distribution of Top KEGG BLAST Hits
Slide38
Frequency distance distributions
Salgado et al. PNAS (2000) 97:6652, Fig. 2
Slide39
Frequency distance distributions
Salgado et al. PNAS (2000) 97:6652, Fig. 3b
Slide40
Branched Chain Amino Acid Transporter family
Slide41
Probable Ancient Gene (Liv Operon)
Slide42
Branched Chain Amino Acid Transporter family – Rhodopseudomonas palustris
Slide43
Example of Lateral Transfer
Slide44
Transporter Gene Loss in Yersinia pestis
36 genes involved in transport from YPSE are nonfunctional in YPES
13 lost due to frameshifts
11 lost due to deletions
6 lost due to IS element insertions
4 (2 pair) lost due to recombination causing deletions and frameshifts
2 lost due to premature stop codons
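As a quick consistency check, the per-cause counts can be tallied against the stated total of 36. This is a minimal sketch; the dictionary keys are shorthand labels for the categories listed on the slide.

```python
# Counts of inactivated transporter genes by cause, as listed on the slide.
losses = {
    "frameshifts": 13,
    "deletions": 11,
    "IS element insertions": 6,
    "recombination (deletions and frameshifts)": 4,
    "premature stop codons": 2,
}

total_lost = sum(losses.values())

for cause, count in losses.items():
    print(f"{cause}: {count} ({count / total_lost:.0%})")
print(f"total: {total_lost}")
```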
Slide45
Slide46
Nostoc punctiforme Signal Transduction Histidine Kinases
Slide47
Nostoc punctiforme Signal Transduction Histidine Kinases
Slide48
Nostoc punctiforme Signal Transduction Histidine Kinases
Slide49
Nostoc punctiforme Signal Transduction Histidine Kinases
Slide50
Nostoc punctiforme Regulatory Proteins
Slide51
Burkholderia xenovorans Regulatory Proteins
Slide52
Regulatory Protein Identification Scheme
Slide53
Summary of automated transporter annotation: Zymomonas
Slide54
Zymomonas transporters: complete listing
Slide55
Transcriptome Analysis Pipeline: RNA sequences to GRN
Collect RNAseq data
Map reads to genomes
Calculate reads/bp
Display frequency plot
Determine operons from frequency plot
Compare operon determinations (genome co-ordinates)
Predict operons in silico
Improve algorithm
Determine orthologous operons
Determine orthologs with OrthoMCL
Align orthologous promoters
Determine TFBS from alignments
Determine TISs with 5’ RACE.
Cluster analysis from gene expression arrays
Predict TFBS in silico
Cluster analysis of gene expression changes
GRN = genetic regulatory network
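The "calculate reads/bp" and "determine operons from frequency plot" steps can be sketched as follows. This is a toy illustration under several assumptions, not the pipeline's actual method: reads are given as 0-based half-open intervals, genes are assumed to lie on one strand and be sorted by start, and the `min_ratio` rule for joining neighbouring genes is a hypothetical heuristic.

```python
from typing import List, Tuple

Read = Tuple[int, int]        # (start, end), 0-based, end exclusive
Gene = Tuple[str, int, int]   # (name, start, end), same convention


def per_base_coverage(genome_length: int, reads: List[Read]) -> List[int]:
    """Reads per base pair, computed with a difference array."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    coverage, depth = [], 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def mean_coverage(coverage: List[int], start: int, end: int) -> float:
    span = coverage[start:end]
    return sum(span) / len(span) if span else 0.0


def group_into_operons(coverage: List[int], genes: List[Gene],
                       min_ratio: float = 0.5) -> List[List[str]]:
    """Join adjacent genes when the gap between them stays covered at
    min_ratio times the lower of the two gene means (assumed heuristic)."""
    operons: List[List[Gene]] = []
    for gene in genes:
        if operons:
            prev = operons[-1][-1]
            floor = min_ratio * min(
                mean_coverage(coverage, prev[1], prev[2]),
                mean_coverage(coverage, gene[1], gene[2]),
            )
            gap = mean_coverage(coverage, prev[2], gene[1])
            # Touching or overlapping genes have no gap to test.
            if floor > 0 and (gene[1] <= prev[2] or gap >= floor):
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]


reads = [(0, 60)] * 10 + [(70, 100)] * 10
genes = [("A", 0, 20), ("B", 30, 60), ("C", 70, 100)]
cov = per_base_coverage(100, reads)
print(group_into_operons(cov, genes))
```

In this toy data, genes A and B share continuous coverage across their intergenic gap and are grouped, while the uncovered gap before C starts a new operon.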
Slide56
Dynamic range and sensitivity
Slide57
New gene, wrong start, riboswitch
Slide58
Small Regulatory RNA ???
Slide59
Differential gene expression
Slide60
Operon with Internal Promoter
Slide61
Long Term Vision
Develop TPing SOPs, and an automated analysis pipeline.
Initially produce TPs and preliminary GRNs for all important DOE microbial genomes (i.e. BESC), and eventually all DOE microbial genomes.
Incorporate the TP analysis pipeline into ORNL’s automated microbial annotation pipeline, and eventually into IMG and GenBank files.
Add additional experimental methods to improve the GRN determinations.