Slide1
Sequence Alignment
Abhishek Niroula
Department of Experimental Medical Science
Lund University
2015-12-09
Slide2
Sequence alignment
A way of arranging two or more sequences to identify regions of similarity
Shows locations of similarities and differences between the sequences
An 'optimal' alignment exhibits the most similarities and the least differences
The aligned residues correspond to the original residue in their common ancestor
Insertions and deletions are represented by gaps in the alignment
Examples
Protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***
Nucleotide sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
*** **** **** ** ******
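The asterisks under an alignment mark columns where both sequences carry the same residue. A minimal Python sketch of that marking is given here, using the protein example from this slide; the function name `match_line` is hypothetical and the rule (identical, non-gap residues get a `*`) is an assumption about how the slide's marker line was produced.

```python
def match_line(seq_a, seq_b):
    """Return a marker string with '*' under identical, non-gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "*" if a == b and a != "-" else " "
        for a, b in zip(seq_a, seq_b)
    )


# Protein alignment from the slide (gaps are written as '-').
top = "MSTGAVLIY--TSILIKECHAMPAGNE-----"
bottom = "---GGILLFHRTHELIKESHAMANDEGGSNNS"

markers = match_line(top, bottom)
identities = markers.count("*")

print(top)
print(bottom)
print(markers.rstrip())
print("identical columns:", identities)
```

The same function works for nucleotide alignments; only the alphabet changes.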
Slide3
Sequence alignment: Purpose
Reveal structural, functional and evolutionary relationships between biological sequences
Similar sequences may have similar structure and function
Similar sequences are likely to have a common ancestral sequence
Annotation of new sequences
Modelling of protein structures
Design and analysis of gene expression experiments
Slide4
Sequence alignment: Types
Global alignment
Aligns each residue in each sequence by introducing gaps
Example: Needleman-Wunsch algorithm
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Slide5
Sequence alignment: Types
Local alignment
Finds regions with the highest density of matches locally
Example: Smith-Waterman algorithm
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
Slide6
Sequence alignment: Scoring
Scoring matrices are used to assign scores to each comparison of a pair of characters
Identities and substitutions by similar amino acids are assigned positive scores
Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores
 A  C  D  E  F  G  H  I  K
 A  C  Y  E  F  G  R  I  K
+5 +5 -5 +5 +5 +5 -5 +5 +5
Option 1
TACGGGCAG
-AC-GGC-G
Option 2
TACGGGCAG
-ACGG-C-G
Option 3
TACGGGCAG
-ACG-GC-G
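The per-column scores in the example above can be reproduced with a few lines of Python. This is only a sketch under a simplifying assumption: every identity scores +5 and every mismatch scores -5, as on the slide, rather than values looked up in a real substitution matrix. The function name `score_alignment` is hypothetical.

```python
def score_alignment(seq_a, seq_b, match=5, mismatch=-5):
    """Score two equal-length, ungapped sequences column by column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return [match if a == b else mismatch for a, b in zip(seq_a, seq_b)]


# Sequences from the slide: two substitutions (D->Y and H->R).
scores = score_alignment("ACDEFGHIK", "ACYEFGRIK")
total = sum(scores)

print(scores)
print("total score:", total)
```

With a real matrix such as PAM or BLOSUM, the `match`/`mismatch` constants would be replaced by a lookup on the residue pair.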
Slide7
Sequence alignment: Scoring
PAM matrices
PAM - Point Accepted Mutation (also expanded as Percent Accepted Mutations)
PAM gives the probability that a given amino acid will be replaced by any other amino acid
An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection
Derived from global alignments of closely related sequences
The number in the matrix name (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)
1-PAM matrix refers to the amount of evolution that would change 1% of the residues/bases (on average)
2-PAM matrix does NOT refer to a change in 2% of residues
It refers to applying 1-PAM twice
Some variations may change back to the original residue
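Why applying 1-PAM twice changes slightly less than 2% of positions can be checked numerically. The sketch below uses a toy two-letter alphabet with a 1% change probability per step; this matrix is an illustrative assumption, not the real 20 x 20 PAM1 matrix, and `mat_mul` is a hypothetical helper.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


# Toy "1-PAM" transition matrix: each residue stays the same with
# probability 0.99 and changes to the other letter with probability 0.01.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

# "2-PAM" is 1-PAM applied twice, i.e. the matrix multiplied by itself.
pam2 = mat_mul(pam1, pam1)

# Fraction of positions that differ from the start after two steps.
changed_after_two_steps = 1 - pam2[0][0]
print(round(changed_after_two_steps, 4))
```

The result is below 0.02 because a position that changed in the first step can change back in the second.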
Slide8
PAM-1
Slide9
Sequence alignment: Scoring
BLOSUM matrices
BLOSUM - Blocks Substitution Matrix
Each score is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff]
For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity
Slide10
BLOSUM62
Slide11
Which scoring matrix to use?
For global alignments use PAM matrices
Lower PAM matrices tend to find short alignments of highly similar regions
Higher PAM matrices will find weaker, longer alignments
For local alignments use BLOSUM matrices
BLOSUM matrices with a HIGH number are better for similar sequences
BLOSUM matrices with a LOW number are better for distant sequences
Slide12
Sequence alignment: Methods
Pairwise alignment
Finding the best alignment of two sequences
Often used to search sequence databases for the most similar sequences
Dot Matrix Analysis
Dynamic Programming (DP)
Short word matching
Multiple Sequence Alignment (MSA)
Alignment of more than two sequences
Often used to find conserved domains, regions or sites among many sequences
Dynamic programming
Progressive methods
Iterative methods
Structural alignments
Alignments based on structure
Slide13
Dot matrix
Method for comparing two amino acid or nucleotide sequences
Let's align two sequences using dot matrix
A: A G C T A G G A
B: G A C T A G G C
Sequence A is organized in X-axis and sequence B in Y-axis
[Figure: empty dot matrix grid with sequence A (A G C T A G G A) on the X-axis and sequence B (G A C T A G G C) on the Y-axis]
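The dot matrix for these two sequences can be built programmatically. The following Python sketch places sequence A on the columns and sequence B on the rows, as on the slide; the function names and the `.` symbol for an empty cell are illustrative choices, not part of the original material.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b, columns follow seq_a; True marks a match."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, '●' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(b + " " + " ".join("●" if hit else "." for hit in row))
    return "\n".join(lines)


def longest_diagonal(matrix):
    """Length of the longest unbroken top-left to bottom-right run of dots."""
    best = 0
    run = [[0] * (len(matrix[0]) + 1) for _ in range(len(matrix) + 1)]
    for i, row in enumerate(matrix, 1):
        for j, hit in enumerate(row, 1):
            if hit:
                run[i][j] = run[i - 1][j - 1] + 1
                best = max(best, run[i][j])
    return best


seq_a = "AGCTAGGA"
seq_b = "GACTAGGC"
matrix = dot_matrix(seq_a, seq_b)
dots_per_row = [sum(row) for row in matrix]
diagonal = longest_diagonal(matrix)

print(render(seq_a, seq_b))
print("longest diagonal:", diagonal)
```

The longest diagonal corresponds to the shared stretch C T A G G; the remaining dots are isolated matches.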
Slide14
Dot matrix
Starting from the first nucleotide in B, move along the first row placing a dot in columns with matching nucleotide
Repeat the procedure for all the nucleotides in B
Region of similarity is revealed by a diagonal row of dots
Other isolated dots represent random matches
Sequence A on the X-axis, sequence B on the Y-axis (● = match, . = no match); first row filled:
  A G C T A G G A
G . ● . . . ● ● .
A
C
T
A
G
G
C
Slide15
Dot matrix
Sequence A on the X-axis, sequence B on the Y-axis (● = match, . = no match); first two rows filled:
  A G C T A G G A
G . ● . . . ● ● .
A ● . . . ● . . ●
C
T
A
G
G
C
Slide16
Dot matrix
Sequence A on the X-axis, sequence B on the Y-axis (● = match, . = no match); all rows filled:
  A G C T A G G A
G . ● . . . ● ● .
A ● . . . ● . . ●
C . . ● . . . . .
T . . . ● . . . .
A ● . . . ● . . ●
G . ● . . . ● ● .
G . ● . . . ● ● .
C . . ● . . . . .
Slide17
Dot matrix
[Figure: completed dot matrix]
Slide18
Dot matrix
[Figure panels: Two similar, but not identical, sequences; An insertion or deletion; A tandem duplication]
Slide19
Dot matrix
[Figure panels: An inversion; Joining sequences]
Slide20
Limitations of dot matrix
Sequences with low-complexity regions give false diagonals
Low-complexity regions are sequence regions with little diversity
Noisy and space inefficient
Limited to 2 sequences
Slide21
Dotplot exercise
Use the following three tools to generate dot plots for the given two sequences
YASS: genomic similarity search tool
http://bioinfo.lifl.fr/yass/yass.php
Lalign/Palign
http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
multi-zPicture
http://zpicture.dcode.org/
Slide22
Dynamic programming
Breaks down the alignment problem into smaller problems
Example
Needleman-Wunsch algorithm: global alignment
Smith-Waterman algorithm: local alignment
Three steps
Initialization
Scoring
Traceback
Slide23
Gap penalties
Insertion of gaps in the alignment
Gaps should be penalized
Gap opening should be penalized higher than gap extension (or at least equal)
Typical gap scores used with BLOSUM62
Gap opening score = -11
Gap extension score = -1
Gap extension
AAAGAGAAA
AAA--AAAA
Gap initiation
AAAGAGAAA
AAA-A-AAA
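The difference between the two gapped alignments above can be made concrete with an affine gap score. The Python sketch below assumes one common convention: the first position of a gap receives the opening score and every further position receives the extension score. Other tools charge opening plus extension for the first position, so the absolute numbers are an assumption; the function names are hypothetical.

```python
def gap_runs(aligned):
    """Return the lengths of consecutive runs of '-' in an aligned sequence."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def affine_gap_score(aligned, gap_open=-11, gap_extend=-1):
    """Sum the gap scores: opening for the first position, extension after."""
    return sum(gap_open + (length - 1) * gap_extend for length in gap_runs(aligned))


one_long_gap = affine_gap_score("AAA--AAAA")    # one gap, extended once
two_short_gaps = affine_gap_score("AAA-A-AAA")  # two separate gap openings

print(one_long_gap, two_short_gaps)
```

Under this scheme a single extended gap is penalized less than two separately opened gaps of the same total length.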
Slide24
Needleman-Wunsch vs Smith-Waterman
Needleman-Wunsch
Match = +2
Mismatch = -1
Gap = -1
      -   A   G   T   T   A
  -   0  -1  -2  -3  -4  -5
  A  -1   2
  G  -2
  T  -3
  G  -4
  C  -5
  A  -6
Smith-Waterman
Match = +2
Mismatch = -1
Gap = -1
      -   A   G   T   T   A
  -   0   0   0   0   0   0
  A   0   2
  G   0
  T   0
  G   0
  C   0
  A   0
All negative values are replaced by 0
Traceback starts at the highest value and ends at 0
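The initialization and scoring steps of both algorithms can be sketched in one Python function. It uses the slide's scores (match +2, mismatch -1, gap -1) and the same two sequences, AGTTA on the columns and AGTGCA on the rows; the function name `fill_matrix` and its `local` flag are hypothetical, and traceback is left out to keep the sketch short.

```python
def fill_matrix(cols, rows, match=2, mismatch=-1, gap=-1, local=False):
    """Fill a dynamic programming matrix.

    local=False: Needleman-Wunsch (global), borders hold cumulative gap scores.
    local=True:  Smith-Waterman (local), borders are 0 and negatives become 0.
    """
    n_rows, n_cols = len(rows) + 1, len(cols) + 1
    m = [[0] * n_cols for _ in range(n_rows)]
    if not local:
        for j in range(1, n_cols):
            m[0][j] = j * gap
        for i in range(1, n_rows):
            m[i][0] = i * gap
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            best = max(
                m[i - 1][j - 1] + s,  # align the two residues
                m[i - 1][j] + gap,    # gap in the column sequence
                m[i][j - 1] + gap,    # gap in the row sequence
            )
            m[i][j] = max(best, 0) if local else best
    return m


nw = fill_matrix("AGTTA", "AGTGCA")
sw = fill_matrix("AGTTA", "AGTGCA", local=True)

for row in nw:
    print(row)
print()
for row in sw:
    print(row)
```

The global score is the bottom-right cell of `nw`; the local score is the largest value anywhere in `sw`, which is where a traceback would start.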
Slide25
Needleman-Wunsch vs Smith-Waterman
Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)
Slide26
Dynamic programming: example
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
Scoring
Match = +2
Mismatch = -2
Gap = -1
Slide27
Dynamic programming exercise
Generate a scoring matrix for nucleotides (A, C, G, and T)
Align two sequences using dynamic programming
Align two sequences using the following tools
EMBOSS Needle
http://www.ebi.ac.uk/Tools/psa/emboss_needle/
EMBOSS Water
http://www.ebi.ac.uk/Tools/psa/emboss_water/
Slide28
Multiple sequence alignment
A multiple sequence alignment (MSA) is an alignment of three or more sequences
Why MSA?
To identify patterns of conservation across more than 2 sequences
To characterize protein families and generate profiles of protein families
To infer relationships within and among gene families
To predict secondary and tertiary structures of new sequences
To perform phylogenetic studies
Slide29
Recall: dynamic programming
2 sequences
3 sequences
http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf
Slide30
MSA methods
Dynamic programming
Align each pair of sequences
Sum scores for each pair at each position
Progressive sequence alignment
Hierarchical or tree based method
E.g. ClustalW, T-Coffee
Iterative sequence alignment
Improved progressive alignment
Realigns the sequences repeatedly
E.g. MUSCLE
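"Sum scores for each pair at each position" is commonly called a sum-of-pairs score. The Python sketch below illustrates it on a toy three-sequence alignment. The scores (match +2, mismatch -1, gap -1) are borrowed from the Needleman-Wunsch slide as an assumption, a gap paired with a gap is assumed to score 0, and the alignment itself is made up for illustration.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-1, gap=-1):
    """Score one pair of characters from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # assumed convention: gap against gap is neutral
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(msa):
    """Add up pair scores over every column and every pair of sequences."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have equal length")
    return sum(
        pair_score(a, b)
        for column in zip(*msa)
        for a, b in combinations(column, 2)
    )


# Hypothetical toy alignment of three sequences.
toy_msa = ["AG-T", "AGCT", "A-CT"]
sp_total = sum_of_pairs(toy_msa)
print("sum-of-pairs score:", sp_total)
```

Exact dynamic programming over all sequences optimizes a score of this kind, but the matrix grows with one dimension per sequence, which is why progressive and iterative methods are used in practice.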
Slide31
Tools for MSA
Slide32
ClustalW
Progressive sequence alignment
Basic steps
Calculate pairwise distances based on pairwise alignments between the sequences
Build a guide tree, which is an inferred phylogeny for the sequences
Align the sequences
Slide33
Progressive MSA
[Figure: progressive alignment of sequences 1 to 5 along a guide tree]
Slide34
MUSCLE
Iterative sequence alignment
Follows 3 steps
Progressive alignment
Second progressive alignment
Refinement
Slide35
Phylogenetic tree
A phylogenetic tree shows evolutionary relationships between the sequences
Types:
Rooted
Nodes represent the most recent common ancestor
Edge lengths represent time estimates
Unrooted
No ancestry and time estimates
Algorithms to generate a phylogenetic tree
Neighbor-joining
Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
Maximum parsimony
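As an illustration of one of the listed algorithms, here is a minimal UPGMA sketch in Python. It repeatedly merges the two closest clusters and averages distances weighted by cluster size. The sequence names and the distance matrix are hypothetical, the tree is returned as nested tuples, and branch lengths are omitted to keep the example short.

```python
def upgma(names, matrix):
    """Build a rooted tree (nested tuples) from a symmetric distance matrix."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of active clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # Size-weighted average distance to the merged cluster.
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical pairwise distances between four sequences.
names = ["s1", "s2", "s3", "s4"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, distances)
print(tree)
```

The closest pair (s1, s2) is joined first, then s3, then s4. A tree built this way from pairwise distances can serve as a guide tree for progressive alignment.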
Slide36
Neighbor joining method
http://en.wikipedia.org/wiki/Neighbor_joining
Slide37
MSA exercise
Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments
Clustalw2
http://www.ebi.ac.uk/Tools/msa/clustalw2/
MUSCLE
http://www.ebi.ac.uk/Tools/msa/muscle/
Slide38
What to align: DNA or protein sequence?
Many mismatches in DNA sequences are synonymous
DNA sequences contain non-coding regions, which should be avoided in homology searching
Matches are more reliable in protein sequences
Probability of a match occurring randomly at any position in a sequence
Amino acids: 1/20 = 0.05
Nucleotides: 1/4 = 0.25
Searching at protein level: in case of frameshifts, the alignment score for the protein sequences may be very low even though the DNA sequences are similar
ACT TTT CAT GGG ...
Thr Phe His Gly ...
ACT TTT TCA TGG G..
Thr Phe Ser Trp
If an ORF exists, then always align at protein level
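The frameshift example can be reproduced with a short Python sketch. The codon table below contains only the six codons that appear on the slide, which is a deliberate simplification; a real translation would use the full genetic code. The function name `translate` is hypothetical.

```python
# Partial codon table: only the codons used in the slide's example.
CODON_TABLE = {
    "ACT": "Thr", "TTT": "Phe", "CAT": "His",
    "GGG": "Gly", "TCA": "Ser", "TGG": "Trp",
}


def translate(dna):
    """Translate complete codons in frame 1; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


original = translate("ACTTTTCATGGG")   # ACT TTT CAT GGG
shifted = translate("ACTTTTTCATGGG")   # one T inserted after the second codon

shared = sum(a == b for a, b in zip(original, shifted))
print(original)
print(shifted)
print("identical amino acid positions:", shared)
```

A single inserted base leaves the DNA sequences almost identical, yet every amino acid after the insertion changes.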
Slide39
Searching bioinformatics databases using:
keywords and
sequences
Slide40
Search strategy
Keyword search
Find information related to specific keywords
Each bioinformatics database has its own search tool
Some search tools have a wide spectrum: they access multiple databases and gather the results together
Gquery, EBI search
Sequence search
Use a sequence of interest to find more information about the sequence
BLAST, FASTA
Slide41
Keyword search
Find information related to specific keywords
Gquery
A central search tool to find information in NCBI databases
Searches a large number of NCBI databases and shows the results on one page
http://www.ncbi.nlm.nih.gov/gquery
EBI search
Search tool to find information from databases developed, managed and hosted by EMBL-EBI
http://www.ebi.ac.uk/services
Slide42
Gquery
Slide43
EBI search
Slide44
Limitations
Synonyms
Misspellings
Old and new names/terms
[Figure: numbers of search results in PubMed and ClinVar for ELA2 versus ELANE and for HIV 1 versus HIV-1]
NOTES:
Use different synonyms and read literature to find more appropriate keywords
Use boolean operators to combine different keywords
Do not expect to find all the information using keyword search alone
Note the database version or the version of entries in the databases you used
Slide45
Gene nomenclature
HUGO Gene Nomenclature Committee (HGNC)
Assigns standardized nomenclature to human genes
Each symbol is unique and each gene is given only one name
Species specific nomenclature committees
Mouse Genome Informatics Database
http://www.informatics.jax.org/mgihome/nomen/
Rat Genome Database
http://rgd.mcw.edu/nomen/nomen.shtml
Slide46
HGNC symbol report
Approved symbol
Approved name
Synonyms
Terms used in literature to indicate the gene
HGNC, Ensembl, Entrez Gene, OMIM
Previous symbols and names
Previous HGNC approved symbol
NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name and gene names are written in italics.
Slide47
HGNC search
Slide48
Keyword search
Exercise