
Multiple alignment

Slide1

Multiple alignment

The linear comparison of more than two sequences

Places residues in columns according to position-specific similarity scores

Reflects the relationships of the sequences; the scores are based on indels (gaps) and substitutions.
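
As a rough illustration of scoring columns from gaps and substitutions, here is a minimal sum-of-pairs sketch. The match, mismatch and gap values are assumed toy numbers, not taken from the slides; a real program would use a substitution matrix and more elaborate gap penalties.

```python
from itertools import combinations

# Toy scoring scheme (assumed values for illustration only).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned symbols; '-' marks a gap."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_scores(alignment: list[str]) -> list[int]:
    """Return the sum-of-pairs score of every column of an alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [
        sum(pair_score(a, b) for a, b in combinations(column, 2))
        for column in zip(*alignment)
    ]


alignment = ["ACG-T", "ACGAT", "AC--T"]
print(column_scores(alignment))  # [3, 3, -3, -4, 3]
```

Fully conserved columns score highest, while columns containing indels are pulled down by the gap penalty.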

The alignment of residues implies that they have similar roles in the proteins or DNA sequences being aligned, e.g. protein active sites or transcription factor binding sites.

Strength in numbers: the structure/function message from a multiple alignment is stronger than that of a pairwise alignment.

Slide2

Uses of multiple alignments

Identification of functionally or structurally conserved domains/motifs

- biological meaning, domain groups, motif matrices

- ProSite, InterPro, etc.

Classification of domains into families

- biological or structural meaning

- Pfam, SCOP

Evolutionary studies

- phylogenetic inference of gene or species evolution

Structure prediction

- homology modeling

Slide3

Multiple alignment method

Find homologous sequences

Homologous sequences share a common ancestor and usually have relatively high sequence similarity

Not all similar proteins are homologous: similarity may have come about due to convergent evolution or by chance.

Slide4

Homology is detectable

When there is consensus over a relatively long stretch of sequence

OR

When the conservation is high within functionally relevant regions

THUS

Statistical methods based on position-specific matrices help to provide some evidence

BUT

You usually need to check your alignment by eye to make sure it makes sense

AND

You may need structural data to recognize homology

Slide5

Finding how many sequences?

Use different BLAST algorithms.

The more sequences you have, the stronger your alignment will be …

Depends on your sequence type and your question

Beware of redundant sequences (choose a threshold relevant to your question)

Beware of pulling in unrelated sequences (take a good look at your dataset)

(More sequences means longer computational time, but this is why we have Pegasus)

Slide6

Global multiple alignment

Assumes conserved regions occur in same order

Begins by aligning them from the beginning of the sequence

Allows gaps

Builds a consensus sequence, or a profile if based on statistical calculations
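
To make the idea of a consensus sequence concrete, the sketch below applies a simple majority rule to each column of an existing alignment. The function name, the `min_fraction` threshold and the `x` placeholder are assumptions made for illustration; alignment programs use their own, more refined rules.

```python
from collections import Counter


def consensus(alignment: list[str], min_fraction: float = 0.5) -> str:
    """Majority-rule consensus; 'x' marks columns with no clear winner."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        # Most frequent symbol in this column (ties go to the first seen).
        symbol, count = Counter(column).most_common(1)[0]
        # Keep it only if it is frequent enough; a gap can win a column too.
        result.append(symbol if count / len(column) >= min_fraction else "x")
    return "".join(result)


print(consensus(["ACGT-A", "ACGTTA", "ATG-TC", "ACCTTA"]))  # ACGTTA
```

A profile keeps the per-column frequencies instead of collapsing each column to a single symbol.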

Most useful for defining protein families and evolutionary work

Slide7

Local multiple alignment

Assumes conserved regions can be duplicated and can occur in different order along the sequences

(Figure: Block A, Block B, Block C, Block D)

Most useful for finding motifs (shorter sequence lengths)

Slide8

Gaps and substitutions

For protein MSA, PAM, BLOSUM, or other scoring matrices are used to score substitutions, together with gap penalties, but with position-specific weighting.

Clustal default is the BLOSUM series

MUSCLE uses PAM200 plus their own log-expectation matrix

PAM is based on the number of accepted point mutations per 100 residues: the higher the number, the less stringent, e.g. PAM250 is casting a wide net

BLOSUM is based on the frequency of changes in conserved blocks of related sequences: the higher the number, the more stringent, e.g. BLOSUM80 is biased towards finding motifs that are highly conserved (to 80%), BLOSUM62 less so, etc.

Slide9
Slide10

Gaps and substitutions

PAM, BLOSUM, or other scoring matrices are used to score substitutions, together with gap penalties, but with position-specific weighting.

ClustalW default is BLOSUM

MUSCLE uses PAM200 plus their own log-expectation matrix

For protein sequences, there is more chance of having indels in the outer loops than in the inner core or catalytic domain

For non-coding DNA, repeats and transposons may occur

For structured RNAs, loop regions are more variable than stem regions

Slide11

Evolution of Algorithms

Profiles

position-specific scoring matrix based on amino acid conservation
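
A position-specific scoring matrix can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a DNA alphabet for brevity (pass the 20 amino acids as `alphabet` for proteins), a uniform background, and a pseudocount of 1; the names `build_pssm` and `score` are hypothetical and do not belong to any particular tool.

```python
import math
from collections import Counter

DNA = "ACGT"


def build_pssm(alignment, alphabet=DNA, pseudocount=1.0):
    """Return one {symbol: log2-odds score} dict per alignment column.

    Gaps are ignored; the background is assumed uniform over the alphabet.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            s: math.log2(((counts[s] + pseudocount) / total) / background)
            for s in alphabet
        })
    return pssm


def score(pssm, sequence):
    """Score an ungapped sequence of the same length as the profile."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence and profile must have the same length")
    return sum(col[s] for col, s in zip(pssm, sequence))


profile = build_pssm(["ACGT", "ACGA", "ACTT", "AGGT"])
# A sequence close to the aligned family scores higher than an unrelated one.
print(score(profile, "ACGT") > score(profile, "TTAA"))  # True
```

Because each column has its own scores, the same substitution is penalized more at a conserved position than at a variable one.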

PSI-BLAST

position-specific iterative scoring matrix plus BLAST

Hidden Markov Models

position-specific scoring matrix plus position-specific gap penalties

Structural information? Not trivial…

Slide12

Multiple alignment

1. Exhaustive approaches -

mathematically very accurate

alignments are optimal

BUT these are very complex and take a huge amount of time

2. Heuristic methods -

slightly less accurate,

alignments are good but not optimal AND are usually enough for biological questions

Slide13

Multiple alignment method

1. Find homologous sequences

2. Place the sequences in a relevant format (usually FASTA), and edit to similar length.

Example

>ACTB
cggcctccagatggtctgggagggcagttcagctgtggctgcgcatagcagacatacaacggacggtgggcccagacccaggctgtgtagacccagcccccccgccccgcagtgcctaggtcacccactaacgccccaggccttgtcttggctgggcgtgactgttaccctcaaaagcaggcagctccagggtaaaaggtgccctgccctgtagagcccaccttccttcccagggctgcggctgggtaggtttgtagccttcatcacgggccacctccagccactggaccgctggcccctgccctgtcctggggagtgtggtcctgcgacttctaagtggccgcaagcca
>AGPAT1
tctgcctctccacagtgcccttataccagccccctcccagatctcatctgaatgtgatccatatttcctggttctccccgactcaactgatgcgtgcctcccttaacctttgtgtctcacttgtttccacctgcacagctaagacccctcacttctctggggtaaggtggctcgggtctcacattgtcctgccactccccgccccaccttctcttctcagcacatcacgtgcctcagctcctggttcctaagacctttctttccacagatctcgaccgttatactcccacccacacataccagcaaagtcttatgtctcctgtcgggcttcacctatgggaacgtgccct

You can use a list of accession numbers if you already know that the sequences are of similar lengths.

Slide14

Multiple alignment method

1. Find homologous sequences

2. Place the sequences in a relevant format (usually FASTA), and edit to similar length.

3. Run a multiple alignment program

ClustalW - oldest, flexible, robust

ClustalΩ - latest version, scalable, more accurate with addition of HMM

MUSCLE - fast, good for finding short motifs in small datasets

PRALINE - includes secondary structure information

T-Coffee - good for small datasets of shorter sequences, has a module for checking input sequences against the PDB

COBALT - uses domain conservation information (from BLAST page), which by definition has some structural information

Slide15

Clustal family

Clustal X

Uses progressive global alignment algorithm

Graphical user interface only

Clustal W and W2

Command line tool, W2 also had a web interface

Has a parallelized version, to cope with larger datasets

ClustalΩ

HMM searches added to algorithm

Command line and web interface

Scalable to very large datasets

Slide16

Input of Data

3 or more sequences are needed (nucleic or amino acid); several formats are accepted, e.g. FASTA text files

Remove any white space or empty lines

The analysis will fail if two sequences have the same name
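
The input rules above (at least three sequences, no stray white space or empty lines, unique names) can be checked before submitting a file. The following is a minimal sketch, not part of Clustal: `read_fasta` and `check_input` are hypothetical helper names, and the whole header line is assumed to be the sequence name.

```python
def read_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs, dropping blank lines and white space."""
    records: list[tuple[str, list[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            records.append((line[1:].strip(), []))
        elif not records:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Remove any white space inside the sequence line.
            records[-1][1].append("".join(line.split()))
    return [(name, "".join(parts)) for name, parts in records]


def check_input(records: list[tuple[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    if len(records) < 3:
        problems.append(f"need 3 or more sequences, found {len(records)}")
    seen = set()
    for name, seq in records:
        if name in seen:
            problems.append(f"duplicate sequence name: {name}")
        seen.add(name)
        if not seq:
            problems.append(f"empty sequence: {name}")
    return problems


text = """>seq1
ACGT ACGT

>seq2
ACGTTCGT
>seq1
ACGAACGT
"""
print(check_input(read_fasta(text)))  # ['duplicate sequence name: seq1']
```

Renaming the second `seq1` record would make the check return an empty list.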

Can copy/paste sequences into Clustal or upload a txt file

Slide17

Groups

Group 1: Joey (BI; G), Alan (CS; U), Chingis (CS; U)

Group 2: Wei (BI; G), Toni (BI, CS; U), William (CS; U), Federico (CS; U)

Group 3: Robert (BI; U), Yifan (BI; G), Shiv (CS; U), Travis (CS; U)