Slide1
Protein function and classification
Hsin-Yu Chang
www.ebi.ac.uk
Slide2
Classifying proteins into families and identifying protein homologues can help scientists to characterise unknown proteins.
Greider and Blackburn discovered telomerase in 1984 and were awarded the Nobel Prize in 2009. Which model organism did they use for this study?
1. Tetrahymena thermophila
2. Saccharomyces cerevisiae
3. Mouse
4. Human
Slide4
A single Tetrahymena thermophila cell has 40,000 telomeres, whereas a human cell only has 92.
1984 Discovery of telomerase (Greider and Blackburn)
1989 Telomere hypothesis of cell senescence (Szostak)
1995 Clone hTR
1995/1997 Clone hTERT
1997 Telomerase knockout mouse
1998 Ectopic expression of telomerase in normal human epithelial cells causes the extension of their lifespan
1999/2000… Telomerase/telomere dysfunctions and cancer
Gilson and Ségal-Bendirdjian, Biochimie, 2010.
Slide5
Can we identify human telomerase from the Tetrahymena protein sequence?
Slide6
Let’s pretend that human telomerase has not been identified and we only know the protein sequence of Tetrahymena telomerase. How can we find the human telomerase?
Slide7
BLAST (Basic Local Alignment Search Tool): compares protein sequences to sequence databases and calculates the statistical significance of matches.
Slide8
BLAST
Advantages:
Relatively fast
User friendly
Very good at recognising similarity between closely related sequences
Drawbacks:
Sometimes struggles with multi-domain proteins
Less useful for weakly similar sequences (e.g., divergent homologues)
Slide9
Using Tetrahymena telomerase protein sequences as a query in BLAST, you will find a few human proteins that have very low identity.
Slide10
The Tetrahymena telomerase and the putative human telomerase (AAC51724.1) have a poor protein sequence match.
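Percent identity, the measure behind a statement such as "very low identity", can be computed from a pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned with "-" for gaps. The function name and the toy sequences are made up for illustration, and counting only columns without gaps is just one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0   # columns with a residue in both sequences
    identical = 0  # columns where those residues are the same
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment (hypothetical sequences, not real telomerase data)
toy_a = "MKT-LLVA"
toy_b = "MRTALL-A"
print(round(percent_identity(toy_a, toy_b), 1))
```

In the toy alignment, six columns are gap-free and five of them are identical, so the function reports about 83%. BLAST reports its own identity figure for each hit, so a helper like this is only needed when working with alignments from other sources.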
Slide11
Can we presume this protein is a telomerase homologue from humans?
Can we find more information about it before pursuing it further?
Slide12
Search for protein signatures (such as domains) in AAC51724.1
Telomerase ribonucleoprotein complex - RNA binding domain
Reverse transcriptase domain
Slide13
Plan experiments and find out more!
AAC51724.1 shares 23% identity with Tetrahymena telomerase. It also contains the same domains as telomerase.
But where can we search for information about the protein domains?
Slide15
Protein databases that use signature approaches
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov Models
Fingerprints
Profiles
Patterns
HAMAP
Slide16
Construction of protein signatures
Construct a multiple sequence alignment (MSA) from characterised protein sequences.
Model the pattern of conserved amino acids at specific positions within the MSA.
Use these models to infer relationships with the characterised sequences.
Slide17
Three different protein signature approaches
Sequence alignment
Patterns: single motif methods
Fingerprints: multiple motif methods
Profiles & Hidden Markov Models (HMMs): full alignment methods
Slide18
Patterns
Sequence alignment
Motif
Pattern signature (PS00000)
Regular expression: [AC]-x-V-x(4)-{ED}
Pattern sequences:
ALVKLISG
AIVHESAT
CHVRDLSC
CPVESTIS
Patterns are usually directed against functional sequence features such as active sites, binding sites, etc.
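A pattern written in this notation can be translated into an ordinary regular expression. The sketch below is an illustration only: the function name is hypothetical, and it handles just the elements seen on these slides (single residues, x wildcards, [..] allowed sets, {..} excluded sets and (n) or (n,m) repeats), not the full PROSITE syntax such as the < and > anchors.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    # Normalise en-dashes to hyphens and drop whitespace
    cleaned = pattern.replace("\u2013", "-").replace(" ", "")
    parts = []
    for element in cleaned.split("-"):
        if not element:
            continue
        # Separate the residue specification from an optional repeat, e.g. x(4)
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            token = core                      # literal residue or [..] set
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(token)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
for seq in ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

All four pattern sequences from the slide match the translated expression. The same function turns the tubulin signature [SAG]-G-G-T-G-[SA]-G into a regex that can be used with re.search on a full protein sequence.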
Slide20
PDOC00199
Tubulin signature: [SAG]-G-G-T-G-[SA]-G
A conserved motif in tubulins
Slide21
Patterns
Advantages:
Strict - a pattern with very little variability can produce highly accurate matches
Drawbacks:
Simple but less flexible
Slide22
Fingerprints
Fingerprints: a multiple motif approach
Sequence alignment
Define motifs (Motif 1, Motif 2, Motif 3)
Motif sequences
Weight matrices
Fingerprint signature (PR00000)
Slide24
Telomerase signature (PR01365)
Motif 1
Motif 2
Motif 3
Motif 4
Slide25
The significance of motif context: order and interval
Identify small conserved regions in proteins
Several motifs characterise a family (motifs 1, 2 and 3 in the figure)
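The idea of motif context can be sketched in code: a sequence only matches if every motif is found, in the expected order, and the intervals between motifs can then be inspected. This is a simplified illustration, not how PRINTS works internally. It uses regular expressions in place of weight matrices, and the function names, motifs and sequence are all hypothetical.

```python
import re
from typing import List, Optional, Tuple


def ordered_motif_spans(sequence: str, motifs: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Return the (start, end) span of each motif if all occur in order, else None."""
    spans = []
    position = 0
    for motif in motifs:
        # Only look downstream of the previous motif, which enforces the order
        match = re.compile(motif).search(sequence, position)
        if match is None:
            return None
        spans.append((match.start(), match.end()))
        position = match.end()
    return spans


def intervals(spans: List[Tuple[int, int]]) -> List[int]:
    """Number of residues between each motif and the next one."""
    return [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]


# Toy fingerprint of three motifs and a toy sequence (both made up)
toy_motifs = ["KLM", "G[GA]H", "W.Y"]
toy_sequence = "AAKLMPQRSTGGHTTWWYVC"

spans = ordered_motif_spans(toy_sequence, toy_motifs)
print(spans)
print(intervals(spans))
print(ordered_motif_spans(toy_sequence, list(reversed(toy_motifs))))
```

With the motifs in the expected order the toy sequence matches and the intervals can be compared with those seen in known family members. With the order reversed the same motifs are no longer found in sequence, so the function returns None.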
Slide26
Fingerprints
Good at modelling the often small differences between closely related proteins
Distinguish individual subfamilies within protein families, allowing functional characterisation of sequences at a high level of specificity
Amino acids relatively well conserved across all chloride channel protein family members
Amino acids uniquely conserved in chloride channel protein 3 subfamily members
Slide27
Profiles & HMMs
Sequence alignment
Define coverage: entire domain or whole protein
Use entire alignment of domain or protein family
Build model (Profile or HMMs)
Profile or HMM signature
Slide29
Profiles
Start with a multiple sequence alignment
Amino acids at each position in the alignment are scored according to the frequency with which they occur
Scores are weighted according to evolutionary distance using a BLOSUM matrix
Good at identifying homologues
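The frequency-based scoring step can be illustrated with a small position-specific scoring matrix. This is a minimal sketch under stated assumptions: it uses plain log-odds scores against a uniform background with pseudocounts, and it leaves out the BLOSUM-based weighting mentioned above. The function names are hypothetical, the toy alignment is a set of four gap-free sequences of equal length, and scoring assumes the query has the same length and contains only the 20 standard residues.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from a gap-free alignment of equal-length sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background frequency (assumption)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment if seq[column] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum the column scores for a sequence of the same length as the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


toy_alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
profile = build_profile(toy_alignment)
print(score(profile, "ALVKLISG"))  # a member of the alignment
print(score(profile, "WWWWWWWW"))  # an unrelated sequence
```

The member sequence scores higher than the unrelated one because residues observed in a column get positive scores and unobserved residues get negative ones. Real profiles additionally weight sequences and use substitution matrices, which helps them recognise residues that were never observed but are chemically similar.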
Slide30
HMMs
Start with a multiple sequence alignment
Amino acid frequency at each position in the alignment and their transition probabilities are encoded
Insertions and deletions are also modelled
Advantages:
Very good at identifying evolutionarily distant homologues
Can model very divergent regions of alignment
Slide31
Three different protein signature approaches
Patterns: single motif methods
Fingerprints: multiple motif methods
Profiles & HMMs (hidden Markov models): full alignment methods
Slide32
www.ebi.ac.uk/interpro
Fingerprints
Patterns
Profiles & HMMs (hidden Markov models)
Slide33
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov Models
Fingerprints
Profiles
Patterns
HAMAP
Slide34
The aim of InterPro
Protein sequences
Family entry: description, proteins matched and more information.
Domain entry: description, proteins matched and more information.
Site entry: description, proteins matched and more information.
Slide35
What is InterPro?
InterPro is an integrated sequence analysis resource
It combines predictive models (known as signatures) from different databases
It provides functional analysis of protein sequences by classifying them into families and predicting domains and important sites
Slide36
Facts about InterPro
First release in 1999
11 partner databases
Adds annotation to UniProtKB/TrEMBL
Provides matches to over 80% of UniProtKB
Source of >85 million Gene Ontology (GO) mappings to >24 million distinct UniProtKB sequences
50,000 unique visitors to the web site per month
>2 million sequences searched online per month, plus offline searches with the downloadable version of the software
Slide37
InterPro signature integration process
Signatures are provided by member databases
They are scanned against the UniProt database to see which sequences they match
InterPro curators manually inspect the matches before integrating the signatures into InterPro
Slide38
InterPro signature integration process
Signatures representing the same entity are integrated together
Relationships between entries are traced, where possible
Curators add literature-referenced abstracts, cross-references to other databases, and GO terms
Slide39
http://www.ebi.ac.uk/interpro/
Slide40
Search using protein sequences
Family
Slide42
Type
Slide43
InterPro entry types
Family: Proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure. Ex. Telomerase family.
Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts. Ex. DNA binding domain.
Repeats: Short sequences typically repeated within a protein. Ex. Tubulin binding repeats in microtubule associated protein Tau.
Sites: PTM, active site, binding site, conserved site. Ex. Phosphorylation sites, ion binding sites, tubulin conserved site.
Slide44
Type
Name
Identifier
Contributing signatures
Description
GO terms
References
Slide45
Slide46
Slide47
Slide48
Slide49
Type
Name
Identifier
Contributing signatures
Description
References
Relationships
Slide50
InterPro family and domain relationships
Slide51
Family relationships in InterPro:
Interleukin-15/Interleukin-21 family (IPR003443)
  Interleukin-15 (IPR020439)
    Interleukin-15, Avian (IPR020451)
    Interleukin-15, Fish (IPR020410)
    Interleukin-15, Mammal (IPR020466)
  Interleukin-21 (IPR028151)
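Family relationships like these form a tree, which can be represented as a simple child-to-parent mapping. The sketch below is illustrative only: the mapping is typed in by hand from the entries listed above, the nesting of the avian, fish and mammal entries under Interleukin-15 is an assumption based on their names, and the function name is hypothetical. It does not query InterPro.

```python
# Child entry -> parent entry, transcribed from the slide (nesting partly assumed)
PARENT = {
    "IPR020439": "IPR003443",  # Interleukin-15 -> Interleukin-15/Interleukin-21 family
    "IPR028151": "IPR003443",  # Interleukin-21 -> Interleukin-15/Interleukin-21 family
    "IPR020451": "IPR020439",  # Interleukin-15, Avian -> Interleukin-15
    "IPR020410": "IPR020439",  # Interleukin-15, Fish -> Interleukin-15
    "IPR020466": "IPR020439",  # Interleukin-15, Mammal -> Interleukin-15
}


def lineage(entry: str) -> list:
    """Return the entry followed by its ancestors, most specific first."""
    chain = [entry]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


print(lineage("IPR020466"))
```

A protein matched by the most specific entry can then be assumed to belong to every family further up its lineage, which is what makes the hierarchy useful for annotation.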
Slide52
Relationships
Slide53
InterPro relationships: domains
Protein kinase-like domain
Protein kinase domain
Serine/threonine kinase catalytic domain
Tyrosine kinase catalytic domain
Slide54
Slide55
Gene Ontology
Allow cross-species and/or cross-database comparisons
Unify the representation of gene and gene product attributes across species
Slide56
The Concepts in GO
1. Molecular Function: protein kinase activity, insulin receptor activity
2. Biological Process: cell cycle, microtubule cytoskeleton organisation
3. Cellular Component
Slide57
GO:0003677 DNA binding
GO:0003721 telomeric template RNA reverse transcriptase activity
GO:0005634 Nucleus
Slide58
Search using keywords
Slide59
Slide60
Slide61
Summary
Protein classification can help scientists to gain information about protein functions.
BLAST is fast and easy to use but has its drawbacks.
Alternative approach: protein signature databases build models (protein signatures) using different methods (patterns, fingerprints, profiles and HMMs).
InterPro integrates these signatures from 11 member databases. It serves as a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.
Slide62
Slide63
Why use InterPro?
Large amounts of manually curated data
35,634 signatures integrated into 25,214 entries
Cites 38,877 PubMed publications
Large coverage of protein sequence space
Regularly updated
~8 week release schedule
New signatures added
Scanned against latest version of UniProtKB
Slide64
Caution
InterPro is a predictive protein signature database - results are predictions, and should be treated as such.
InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!
And one more thing...
We need your feedback!
missing/additional references
reporting problems
requests
EBI support page.
Slide65
The InterPro Team:
Amaia Sangrador
Craig McAnulla
Matthew Fraser
Maxim Scheremetjew
Siew-Yit Yong
Alex Mitchell
Sebastien Pesseat
Sarah Hunter
Gift Nuka
Hsin-Yu Chang
www.ebi.ac.uk/interpro
Twitter: @InterProDB
Slide66
Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & Domain based on conserved sequence | http://pfam.sanger.ac.uk/
Gene3D | HMM | UCL | Structure alignment | Structural Domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php
Slide67
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI
YouTube: EMBLMedia
Slide68
The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.