Slide1
Multiple alignment
The linear comparison of more than two sequences
Places residues in columns according to position-specific similarity scores
Reflects relationships among the sequences
The scores are based on indels (gaps) and substitutions.
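A minimal sketch of how a per-column score can combine substitutions and gaps, assuming a simple sum-of-pairs scheme. The MATCH, MISMATCH and GAP values are hypothetical toy scores, not taken from any real matrix or alignment program.

```python
from itertools import combinations

# Hypothetical toy scores, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return GAP  # indel
    return MATCH if a == b else MISMATCH  # substitution or identity


def column_scores(alignment):
    """Return the sum-of-pairs score of every column of an alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        sum(pair_score(a, b) for a, b in combinations(column, 2))
        for column in zip(*alignment)
    ]


example = ["ACG-T", "ACGAT", "A-GAT"]
print(column_scores(example))
```

Fully conserved columns score highest here, while columns containing a gap are penalised.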
The alignment of residues implies that they have similar roles in the proteins or DNA sequences being aligned, e.g. protein active sites or transcription factor binding sites
Strength in numbers: the structure/function message from a multiple alignment is stronger than that of a pairwise alignment.
Slide2
Uses of multiple alignments
Identification of functionally or structurally conserved domain/motif
- biological meaning, domain groups, motif matrices
- ProSite, InterPro, etc.
Classification of domains into families
- biological or structural meaning
- Pfam, SCOP
Evolutionary studies
- phylogenetic inference of gene or species evolution
Structure prediction
- homology modeling
Slide3
Multiple alignment method
Find homologous sequences
Homologous sequences share a common ancestor and usually have relatively high sequence similarity
Not all similar proteins are homologous:
Similarity may have come about due to convergent evolution or by chance.
Slide4
Homology is detectable
When there is consensus over a relatively long stretch of sequence
OR
When the conservation is high within functionally relevant regions
THUS
Statistical methods based on position-specific matrices help to provide some evidence
BUT
You usually need to check your alignment by eye to make sure it makes sense
AND
You may need structural data to recognize homology
Slide5
Finding how many sequences?
Use different BLAST algorithms.
The more seqs you have, the stronger your alignment will be …
Depends on your sequence type and your question
Beware of redundant sequences (choose a threshold relevant to your question)
Beware of pulling in unrelated sequences (take a good look at your dataset)
(More sequences means longer computational time, but this is why we have Pegasus)
Slide6
Global multiple alignment
Assumes conserved regions occur in same order
Begins by aligning them from the beginning of the sequence
Allows gaps
Builds a consensus sequence, or a profile if based on statistical calculations
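As an illustration of the consensus idea, here is a minimal sketch that takes the most frequent residue in each column of an existing alignment. The function name and the alphabetical tie-breaking rule are assumptions made for this example; real programs use more elaborate rules.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Return a majority-rule consensus for equal-length aligned sequences."""
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append(gap)  # column contains only gaps
            continue
        # Ties are broken alphabetically so the output is deterministic.
        best = max(sorted(counts), key=lambda res: counts[res])
        result.append(best)
    return "".join(result)


print(consensus(["ACGT-", "ACGA-", "TCGAA"]))
```

A profile keeps the full per-column counts rather than only the winning residue.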
Most useful for defining protein families and evolutionary work
Slide7
Local multiple alignment
Assumes conserved regions can be duplicated and can occur in different order along the seqs
[Figure: Block A, Block B, Block C, Block D]
Most useful for finding motifs (shorter sequence lengths)
Slide8
Gaps and substitutions
For protein MSA, PAM, BLOSUM, or other scoring matrices are used for gaps and substitutions, but with position-specific weighting.
Clustal default is from the BLOSUM series
MUSCLE uses 200PAM plus their own log-expectation matrix
PAM is based on the number of accepted point mutations per 100 residues: the higher the number, the less stringent, e.g. PAM250 casts a wide net
BLOSUM is based on the frequency of changes in closely conserved blocks of motifs: the higher the number, the more stringent, e.g. BLOSUM80 is biased towards finding motifs that are highly conserved (to 80%), BLOSUM62 less so, etc.
Slide9
Slide10
Gaps and substitutions
PAM, BLOSUM, or other scoring matrices are used for gaps and substitutions, but with position-specific weighting.
ClustalW default is BLOSUM
MUSCLE uses 200PAM plus their own log-expectation matrix
For protein sequences, there is more chance of having indels in the outer loops than in the inner core or catalytic domain
For non-coding DNA, repeats and transposons may occur
For structured RNAs, loop regions are more variable than stem regions
Slide11
Evolution of Algorithms
Profiles
position-specific scoring matrix based on amino acid conservation
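A minimal sketch of a position-specific scoring matrix built from an alignment, assuming log-odds scores against a uniform background and a fixed pseudocount. The function names, the pseudocount value and the DNA alphabet in the example are illustrative assumptions, not the scheme of any particular tool.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    matrix = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return matrix


def score_sequence(matrix, sequence):
    """Sum the per-position scores of an ungapped sequence of matching length."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence length must match the number of columns")
    return sum(column[res] for column, res in zip(matrix, sequence))


pssm = build_pssm(["ACGT", "ACGA", "ACGT", "TCGT"], "ACGT")
print(score_sequence(pssm, "ACGT"), score_sequence(pssm, "TTTT"))
```

Residues that are common in a column get positive scores, while rare ones get negative scores.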
PSI-BLAST
position-specific iterative scoring matrix plus BLAST
Hidden Markov Models
position-specific scoring matrix plus position-specific gap penalties
Structural information? Not trivial…
Slide12
multiple alignment
1. Exhaustive approaches -
mathematically very accurate
alignments are optimal
BUT these are very complex and take a huge amount of time
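To give a feel for why exhaustive approaches become so slow, here is a small sketch that counts the work in a full dynamic-programming alignment. It assumes the standard formulation, in which aligning N sequences fills an N-dimensional grid with one axis per sequence; the function names are made up for this example.

```python
def dp_cells(lengths):
    """Number of cells in the N-dimensional dynamic-programming grid."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra position per axis for the empty prefix
    return cells


def predecessors_per_cell(num_sequences):
    """Each cell is reached from every non-empty subset of sequences advancing."""
    return 2 ** num_sequences - 1


for n in (2, 3, 5):
    lengths = [100] * n
    print(n, dp_cells(lengths), predecessors_per_cell(n))
```

Under these assumptions, the grid size grows exponentially with the number of sequences.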
2. Heuristic methods -
slightly less accurate,
alignments are good but not optimal AND are usually enough for biological questions
Slide13
Multiple alignment method
1. Find homologous sequences
2. Place the sequences in a relevant format (usually FASTA), and edit to similar length.
Example
>ACTB
cggcctccagatggtctgggagggcagttcagctgtggctgcgcatagcagacatacaacggacggtgggcccagacccaggctgtgtagacccagcccccccgccccgcagtgcctaggtcacccactaacgccccaggccttgtcttggctgggcgtgactgttaccctcaaaagcaggcagctccagggtaaaaggtgccctgccctgtagagcccaccttccttcccagggctgcggctgggtaggtttgtagccttcatcacgggccacctccagccactggaccgctggcccctgccctgtcctggggagtgtggtcctgcgacttctaagtggccgcaagcca
>AGPAT1
tctgcctctccacagtgcccttataccagccccctcccagatctcatctgaatgtgatccatatttcctggttctccccgactcaactgatgcgtgcctcccttaacctttgtgtctcacttgtttccacctgcacagctaagacccctcacttctctggggtaaggtggctcgggtctcacattgtcctgccactccccgccccaccttctcttctcagcacatcacgtgcctcagctcctggttcctaagacctttctttccacagatctcgaccgttatactcccacccacacataccagcaaagtcttatgtctcctgtcgggcttcacctatgggaacgtgccct
You can use a list of accession numbers if you already know that the sequences are of similar lengths.
Slide14
Multiple alignment method
1. Find homologous sequences
2. Place the sequences in a relevant format (usually FASTA), and edit to similar length.
3. Run a multiple alignment program
ClustalW - oldest, flexible, robust
ClustalΩ - latest version, scalable, more accurate with addition of HMM
MUSCLE - fast, good for finding short motifs in small datasets
PRALINE - includes secondary structure information
T-Coffee - good for small datasets of shorter sequences, has a module for checking input seqs against the PDB
COBALT - uses domain conservation information (from BLAST page) which by definition has some structural information
Slide15
Clustal family
Clustal X
Uses progressive global alignment algorithm
Graphical user interface only
Clustal W and W2
Command line tool, W2 also had a web interface
Has a parallelized version, to cope with larger datasets
ClustalΩ
HMM searches added to algorithm
Command line and web interface
Scalable to very large datasets
Slide16
Input of Data
3 or more sequences are needed, nucleic or amino acids; several formats are accepted, e.g. FASTA text files
Remove any white space or empty lines
The analysis will fail if two sequences have the same name
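The input rules above can be checked before submitting a file. The following is a minimal sketch, assuming plain FASTA text; `parse_fasta` and `check_input` are hypothetical helper names written for this example and are not part of Clustal.

```python
def parse_fasta(text):
    """Parse FASTA text into (name, sequence) pairs, skipping blank lines."""
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop empty lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append("".join(line.split()))  # remove inner white space
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_input(records, minimum=3):
    """Return a list of problems: too few sequences or repeated names."""
    problems = []
    if len(records) < minimum:
        problems.append(f"need at least {minimum} sequences, found {len(records)}")
    seen, reported = set(), set()
    for name, _ in records:
        if name in seen and name not in reported:
            problems.append(f"duplicate sequence name: {name}")
            reported.add(name)
        seen.add(name)
    return problems


text = ">a\nACGT\n\n>b\nAC GT\n>a\nAAAA\n"
print(check_input(parse_fasta(text)))
```

An empty list from `check_input` means that none of the problems checked for here were found.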
Can copy/paste sequences into Clustal or upload a txt file
Slide17
Groups
Group 1: Joey (BI; G), Alan (CS; U), Chingis (CS; U)
Group 2: Wei (BI; G), Toni (BI, CS; U), William (CS; U), Federico (CS; U)
Group 3: Robert (BI; U), Yifan (BI; G), Shiv (CS; U), Travis (CS; U)