Multiple alignment


The linear comparison of more than two sequences

Places residues in columns according to position-specific similarity scores

Reflects the relationships of the sequences

The scores are based on gaps and substitutions

The alignment of residues implies that they have similar roles in the proteins or DNA sequences being aligned, for example protein active sites or transcription factor binding sites

Strength in numbers: the structure/function message from a multiple alignment is stronger than that of a pairwise alignment.

Uses of multiple alignments

Identification of functionally or structurally conserved domains/motifs

- biological meaning, domain groups, motif matrices

Classification of domains into groups

- biological or structural meaning

Evolutionary studies

- phylogenetic inference of gene or species evolution

Structure prediction

- homology modeling

Multiple alignment method

Find homologous sequences

Homologous sequences share a common ancestor and usually have relatively high sequence similarity

Not all similar proteins are homologous:

Similarity may have come about due to convergent evolution or by chance.

Homology is detectable

When there is consensus over a relatively long stretch of sequence

When the conservation is high within functionally relevant regions

Statistical methods based on position-specific matrices help to provide some evidence

You usually need to check your alignment by eye to make sure it makes sense

May need structural data to recognize homology

Finding how many sequences?

Use different BLAST algorithms.

The more sequences you have, the stronger your alignment will be …

Depends on your sequence type and your question

Beware of redundant sequences (choose a threshold relevant to your question)

Beware of pulling in unrelated sequences (take a good look at your dataset)

(More sequences means longer computational time, but this is why we have Pegasus)
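The redundancy warning above can be sketched in code. This is a minimal illustration only: it uses Python's `difflib.SequenceMatcher` ratio as a crude stand-in for true percent identity, and the threshold of 0.9, the sequence names, and the sequences themselves are assumptions chosen for the example, not values from the slides.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for true percent identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def drop_redundant(seqs, threshold=0.9):
    """Greedy filter: keep a sequence only if it is below `threshold`
    similarity to every sequence already kept."""
    kept = {}
    for name, seq in seqs.items():
        if all(similarity(seq, other) < threshold for other in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical toy dataset: seq2 is an exact duplicate of seq1.
seqs = {
    "seq1": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq2": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq3": "MSDNELLKQAEELGLSVSRSMS",
}
filtered = drop_redundant(seqs)
```

The right threshold depends on the question: a strict cutoff keeps close paralogs apart, a loose one collapses them into a single representative.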

Global multiple alignment

Assumes conserved regions occur in same order

Begins by aligning them from the beginning of the sequence

Allows gaps

Builds a consensus sequence, or a profile if based on statistical calculations

Most useful for defining protein families and evolutionary work
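To make the difference between a consensus sequence and a profile concrete, here is a small sketch that works on an already aligned set of sequences. The three toy sequences and the function names are assumptions for illustration; real tools build profiles with sequence weighting and other refinements not shown here.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column character frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        profile.append({res: n / total for res, n in counts.items()})
    return profile


def consensus(alignment):
    """Most common character in each column, gaps included."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


# Hypothetical toy alignment ("-" marks a gap).
aligned = ["ACG-T", "ACGAT", "AAG-T"]
profile = column_profile(aligned)
cons = consensus(aligned)
```

The consensus keeps one character per column, while the profile keeps the full frequency distribution, which is why a profile carries more information about variable positions.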

Local multiple alignment

Assumes conserved regions can be duplicated and can occur in different order along the sequences

(Figure: conserved blocks A, B, C and D)

Most useful for finding motifs (shorter sequence lengths)

Gaps and substitutions

For protein sequences, PAM, BLOSUM, or other scoring matrices are used for gaps and substitutions, but with position-specific weighting.

Clustal default is BLOSUM

MUSCLE uses 200 PAM plus its own log-expectation score

PAM is based on the number of accepted point mutations per 100 residues over evolutionary time: the higher the number, the less stringent, e.g. PAM250 casts a wide net

BLOSUM is based on the frequency of changes in closely conserved blocks of motifs: the higher the number, the more stringent

BLOSUM80 is biased towards finding motifs that are highly conserved (to 80%), BLOSUM62 less so, etc.
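As a rough illustration of how substitution and gap scores combine into a score for a whole alignment, the sketch below uses the sum-of-pairs scheme. The match, mismatch, and gap values are toy numbers chosen for the example, not PAM or BLOSUM entries, and no position-specific weighting is applied.

```python
from itertools import combinations

# Toy scores for illustration only (not taken from PAM or BLOSUM).
MATCH, MISMATCH, GAP = 2, -1, -2


def pair_score(a, b):
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0  # gap-gap pairs are ignored here
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment):
    """Sum pair scores over every pair of sequences in every column."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


aligned = ["ACG-T", "ACGAT", "AAG-T"]
score = sum_of_pairs(aligned)  # 14 with the toy scores above
```

Swapping the toy constants for a lookup in a real matrix changes the numbers but not the structure of the calculation.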

Gaps and substitutions

For protein sequences, gaps are more likely in the outer loops than in the inner core or catalytic domain

For non-coding DNA, repeats and transposons may occur

For structured RNAs, loop regions are more variable than stem regions

Evolution of Algorithms


PSSM: position-specific scoring matrix based on amino acid conservation

PSI-BLAST: position-specific iterative scoring matrix plus BLAST

Hidden Markov Models: position-specific scoring matrix plus position-specific gap penalties

Structural information? Not trivial…
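A position-specific scoring matrix can be sketched as log-odds scores per column. The example below is a simplified illustration under stated assumptions: a DNA alphabet (for brevity, although the slide refers to amino acids), a uniform background, a pseudocount of 1, and a made-up four-sequence motif. It is not the procedure used by any particular program.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Log2-odds score for each residue at each column, uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the column scores for a sequence of the same length as the PSSM."""
    return sum(col[res] for col, res in zip(pssm, seq))


# Hypothetical aligned motif instances.
motif = ["ACGT", "ACGA", "ACGT", "TCGT"]
pssm = build_pssm(motif)
```

A sequence that matches the conserved columns scores higher than one that does not, which is the behaviour that iterative searches build on.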

multiple alignment

1. Exhaustive approaches

- mathematically very accurate
- alignments are optimal
- BUT these are very complex and take a huge amount of time

2. Heuristic methods

- slightly less accurate
- alignments are good but not optimal
- AND are usually enough for biological questions
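To see why exhaustive approaches become impractical, consider dynamic programming, the standard exact method for two sequences. The sketch below computes the optimal global (Needleman-Wunsch) score for a pair with toy match, mismatch, and gap values chosen for illustration, then counts how many cells the same lattice would need for more sequences. Function names are assumptions for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score for two sequences (linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(length, n_sequences):
    """Cells in the full DP lattice for n sequences of equal length."""
    return (length + 1) ** n_sequences


pair_score = needleman_wunsch_score("ACGT", "AGT")
cells_two = dp_cells(100, 2)
cells_ten = dp_cells(100, 10)
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need roughly 10^20, which is why heuristic methods are used in practice.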

Multiple alignment method


1. Find homologous sequences

2. Place the sequences in a relevant format (usually FASTA), and trim to similar length.

Example

>ACTB
cggcctccagatggtctgggagggcagttcagctgtggctgcgcatagcagacatacaacggacggtgggcccagacccaggctgtgtagacccagcccccccgccccgcagtgcctaggtcacccactaacgccccaggccttgtcttggctgggcgtgactgttaccctcaaaagcaggcagctccagggtaaaaggtgccctgccctgtagagcccaccttccttcccagggctgcggctgggtaggtttgtagccttcatcacgggccacctccagccactggaccgctggcccctgccctgtcctggggagtgtggtcctgcgacttctaagtggccgcaagcca
>AGPAT1
tctgcctctccacagtgcccttataccagccccctcccagatctcatctgaatgtgatccatatttcctggttctccccgactcaactgatgcgtgcctcccttaacctttgtgtctcacttgtttccacctgcacagctaagacccctcacttctctggggtaaggtggctcgggtctcacattgtcctgccactccccgccccaccttctcttctcagcacatcacgtgcctcagctcctggttcctaagacctttctttccacagatctcgaccgttatactcccacccacacataccagcaaagtcttatgtctcctgtcgggcttcacctatgggaacgtgccct

You can use a list of accession numbers if you already know that the sequences are of similar lengths.
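Before aligning, it can help to read the FASTA file and compare sequence lengths. The sketch below is a minimal parser written for illustration; the record names `geneA` and `geneB` and their sequences are made up, and the length ratio is just one simple way to judge whether sequences are of similar length.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first word
    after '>'. Later records with a repeated name overwrite earlier ones."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def length_ratio(records):
    """Longest sequence length divided by shortest (1.0 means equal)."""
    lengths = [len(seq) for seq in records.values()]
    return max(lengths) / min(lengths)


# Hypothetical toy input.
example = """>geneA
acgtacgtac
gtac
>geneB
acgtacgtacgtacgt
"""
records = parse_fasta(example)
ratio = length_ratio(records)
```

A ratio far from 1.0 suggests that some sequences should be trimmed before alignment.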

Multiple alignment method


1. Find homologous sequences

2. Place the sequences in a relevant format (usually FASTA), and trim to similar length.

3. Run a multiple alignment program

ClustalW - oldest, flexible, robust

ClustalΩ - latest version, scalable, more accurate with addition of HMM

MUSCLE - fast, good for finding short motifs in small datasets

PRALINE - includes secondary structure information

… - good for small datasets of shorter sequences, has a module for checking input against the PDB

… - uses domain conservation information (from BLAST …), which by definition has some structural information
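Step 3 can be scripted. The sketch below builds a Clustal Omega command line and runs it only if the `clustalo` executable is found on the system. It assumes Clustal Omega is installed under that name; the file names `sequences.fasta` and `aligned.aln` are hypothetical placeholders.

```python
import shutil
import subprocess


def clustalo_command(fasta_in, aln_out, outfmt="clu"):
    """Command line for Clustal Omega: input, output, output format,
    and --force to overwrite an existing output file."""
    return ["clustalo", "-i", fasta_in, "-o", aln_out,
            f"--outfmt={outfmt}", "--force"]


def run_alignment(fasta_in, aln_out):
    """Run Clustal Omega if it is installed; return True if it was run."""
    if shutil.which("clustalo") is None:
        return False
    subprocess.run(clustalo_command(fasta_in, aln_out), check=True)
    return True


# Hypothetical file names, shown without executing anything.
command = clustalo_command("sequences.fasta", "aligned.aln")
```

Other programs in the list have their own command-line conventions, so the command builder would need to change for each.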





Uses progressive global alignment algorithm

Graphic user interface only

Clustal W and W2

- Command line tool, W2 also had a web interface
- Has a parallelized version, to cope with larger datasets

ClustalΩ

- HMM searches added to algorithm
- Command line and web interface
- Scalable to very large datasets

Input of Data

3 or more sequences are needed, nucleic or amino acids, several formats are accepted:


FASTA text files

Remove any white space or empty lines

The analysis will fail if two sequences have the same name

Can copy/paste sequences into Clustal or upload a txt file
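The input rules above (three or more sequences, no stray white space or empty lines, no duplicate names) can be checked before submitting a file. This is a minimal sketch with a hypothetical function name and toy input; it treats the first word after `>` as the sequence name, which is an assumption about how names are compared.

```python
def clean_fasta(text, min_sequences=3):
    """Drop empty lines and stray white space, then check names and count."""
    cleaned, names = [], []
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            continue  # skip empty lines
        if stripped.startswith(">"):
            fields = stripped[1:].split()
            names.append(fields[0] if fields else "")
            cleaned.append(stripped)
        else:
            # remove any white space inside sequence lines
            cleaned.append("".join(stripped.split()))
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate sequence names: {duplicates}")
    if len(names) < min_sequences:
        raise ValueError(
            f"need at least {min_sequences} sequences, found {len(names)}"
        )
    return "\n".join(cleaned) + "\n"


# Hypothetical toy input with a space, an empty line, and padding.
raw = ">seqA\nAC GT\n\n>seqB\nACGT\n>seqC\n  ACGA \n"
cleaned = clean_fasta(raw)
```

Failing early with a clear message is easier to debug than an alignment job that stops partway through.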


Group 1:

Joey (BI;G), Alan (CS;U),


(CS, U)

Group 2: Wei (BI;G), Toni (BI,CS;U), William (CS; U), Federico (CS; U)

Group 3: Robert (BI; U), Yifan (BI; G), Shiv (CS; U), Travis (CS; U)