
Sequence Alignment Abhishek - PowerPoint Presentation

Uploaded On 2022-05-31

Presentation Transcript

Slide1

Sequence Alignment

Abhishek Niroula
Department of Experimental Medical Science
Lund University

2015-12-09

Slide2

Sequence alignment

A way of arranging two or more sequences to identify regions of similarity
Shows locations of similarities and differences between the sequences
An 'optimal' alignment exhibits the most similarities and the least differences
The aligned residues correspond to the original residue in their common ancestor
Insertions and deletions are represented by gaps in the alignment

Examples

Protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

Nucleotide sequence alignment

attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    ******
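The marker line under the protein example can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `match_line` and `percent_identity` are hypothetical, and percent identity is computed here over the full alignment length, which is only one of several conventions in use.

```python
def match_line(a: str, b: str, gap: str = "-") -> str:
    """Return a marker line with '*' under identical, non-gap columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("*" if x == y and x != gap else " " for x, y in zip(a, b))


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Identical columns as a percentage of the alignment length (assumed convention)."""
    return 100.0 * match_line(a, b, gap).count("*") / len(a)


# The protein alignment from the slide, split into its two rows.
seq1 = "MSTGAVLIY--TSILIKECHAMPAGNE-----"
seq2 = "---GGILLFHRTHELIKESHAMANDEGGSNNS"

print(seq1)
print(seq2)
print(match_line(seq1, seq2))
print(f"identity: {percent_identity(seq1, seq2):.2f}%")
```

Other definitions divide by the length of the shorter sequence or by the number of ungapped columns, so the percentage depends on the convention chosen.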

Slide3

Sequence alignment: Purpose

Reveal structural, functional and evolutionary relationships between biological sequences
    Similar sequences may have similar structure and function
    Similar sequences are likely to have a common ancestral sequence
Annotation of new sequences
Modelling of protein structures
Design and analysis of gene expression experiments

Slide4

Sequence alignment: Types

Global alignment
    Aligns each residue in each sequence by introducing gaps
    Example: Needleman-Wunsch algorithm

LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLGDA

Slide5

Sequence alignment: Types

Local alignment
    Finds regions with the highest density of matches locally
    Example: Smith-Waterman algorithm

-------TGKG--------
-------AGKG--------

Slide6

Sequence alignment: Scoring

Scoring matrices are used to assign scores to each comparison of a pair of characters
Identities and substitutions by similar amino acids are assigned positive scores
Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores

 A   C   D   E   F   G   H   I   K
 A   C   Y   E   F   G   R   I   K
+5  +5  -5  +5  +5  +5  -5  +5  +5

Option 1
TACGGGCAG
-AC-GGC-G

Option 2
TACGGGCAG
-ACGG-C-G

Option 3
TACGGGCAG
-ACG-GC-G
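The column-by-column scoring in the example above can be sketched in Python. This is an illustration only: `alignment_score` is a hypothetical helper, the +5/-5 match and mismatch values are taken from the slide, and the gap score of -5 is an assumption because the slide does not state one.

```python
def alignment_score(a: str, b: str, match: int = 5, mismatch: int = -5, gap: int = -5) -> int:
    """Sum per-column scores of two aligned sequences (gap = -5 is an assumed value)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap          # a residue aligned to a gap
        elif x == y:
            score += match        # identity
        else:
            score += mismatch     # substitution
    return score


# Protein example from the slide: seven identities and two mismatches.
print(alignment_score("ACDEFGHIK", "ACYEFGRIK"))

# The three gap placements for the nucleotide example.
options = ["-AC-GGC-G", "-ACGG-C-G", "-ACG-GC-G"]
for number, option in enumerate(options, start=1):
    print(f"Option {number}:", alignment_score("TACGGGCAG", option))
```

Under this simple scheme all three options score the same, since each has six matches and three single-residue gaps. A simple scoring scheme alone cannot always pick a unique best alignment.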

Slide7

Sequence alignment: Scoring

PAM matrices
PAM - Percent Accepted Mutations (also expanded as Point Accepted Mutation)
A PAM matrix gives the probability that a given amino acid will be replaced by any other amino acid
An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection
Derived from global alignments of closely related sequences
The number attached to the matrix (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)
The 1-PAM matrix refers to the amount of evolution that would change 1% of the residues/bases (on average)
The 2-PAM matrix does NOT refer to a change in 2% of the residues
    It refers to 1-PAM applied twice
    Some variations may change back to the original residue
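The point that 2-PAM means "1-PAM applied twice" can be illustrated with a matrix product. The sketch below uses a made-up three-letter alphabet and a made-up mutation matrix in which each residue changes with probability 1% per step. It is not the real PAM1 matrix, and `matmul` and `toy_pam1` are hypothetical names.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Toy 1-PAM-like matrix (assumption): each residue stays the same with
# probability 0.99 and changes to either other residue with probability 0.005.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

# Two steps of evolution correspond to multiplying the matrix by itself.
toy_pam2 = matmul(toy_pam1, toy_pam1)

changed = 1.0 - toy_pam2[0][0]
print(f"fraction changed after two steps: {changed:.5f}")
```

In this toy model the fraction changed after two steps is slightly below 2%, because some residues mutate and then mutate back.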

Slide8

PAM-1


Slide9

Sequence alignment: Scoring

BLOSUM matrices
BLOSUM - Blocks Substitution Matrix
The score for each pair of residues is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff]
For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity
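How observed substitution frequencies turn into integer scores can be sketched as a log-odds calculation. The snippet below follows the half-bit log-odds form usually attributed to Henikoff and Henikoff. The function name `log_odds_score` and the input frequencies are illustrative assumptions, not values from BLOSUM62.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, same_residue: bool) -> int:
    """Half-bit log-odds score for a residue pair.

    q_ij: observed frequency of the (unordered) pair in the blocks
    p_i, p_j: background frequencies of the two residues
    """
    # Expected frequency of the pair if residues paired at random.
    expected = p_i * p_j if same_residue else 2.0 * p_i * p_j
    return round(2.0 * math.log2(q_ij / expected))


# Made-up frequencies: a pair seen four times more often than expected
# scores positive, a pair seen four times less often scores negative.
print(log_odds_score(0.04, 0.1, 0.1, same_residue=True))
print(log_odds_score(0.005, 0.1, 0.1, same_residue=False))
```

Pairs observed more often than chance get positive scores and pairs observed less often get negative scores, which matches the sign convention described for scoring matrices earlier.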

Slide10

BLOSUM62


Slide11

Which scoring matrix to use?

For global alignments use PAM matrices
    Lower PAM matrices tend to find short alignments of highly similar regions
    Higher PAM matrices will find weaker, longer alignments
For local alignments use BLOSUM matrices
    BLOSUM matrices with a HIGH number are better for similar sequences
    BLOSUM matrices with a LOW number are better for distant sequences

Slide12

Sequence alignment: Methods

Pairwise alignment
    Finding the best alignment of two sequences
    Often used for searching for the most similar sequences in sequence databases
    Dot matrix analysis
    Dynamic programming (DP)
    Short word matching
Multiple Sequence Alignment (MSA)
    Alignment of more than two sequences
    Often used to find conserved domains, regions or sites among many sequences
    Dynamic programming
    Progressive methods
    Iterative methods
Structural alignments
    Alignments based on structure

Slide13

Dot matrix

Method for comparing two amino acid or nucleotide sequences
Let's align two sequences using a dot matrix

A: A G C T A G G A
B: G A C T A G G C

Sequence A is organized along the X-axis and sequence B along the Y-axis

[Figure: empty dot matrix grid with Sequence A (A G C T A G G A) across the top and Sequence B (G A C T A G G C) down the side]

Slide14

Dot matrix

Starting from the first nucleotide in B, move along the first row placing a dot in columns with a matching nucleotide
Repeat the procedure for all the nucleotides in B
A region of similarity is revealed by a diagonal row of dots
Other isolated dots represent random matches

[Figure: dot matrix of Sequence A (A G C T A G G A) against Sequence B (G A C T A G G C), filled in row by row]
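The row-by-row procedure described on this slide can be sketched in Python using the two example sequences. The function names `dot_matrix` and `render` are hypothetical, and no window or threshold filtering is applied, so every single-residue match produces a dot.

```python
def dot_matrix(seq_x: str, seq_y: str):
    """Boolean grid: one row per residue of seq_y, one column per residue of seq_x."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x: str, seq_y: str) -> str:
    """Text rendering with seq_x along the top and seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(residue + " " + " ".join("●" if hit else "." for hit in row))
    return "\n".join(lines)


seq_a = "AGCTAGGA"  # X-axis
seq_b = "GACTAGGC"  # Y-axis
print(render(seq_a, seq_b))
```

In the printed grid the shared segment CTAGG shows up as a run of dots on the main diagonal, while the remaining dots are isolated chance matches.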


Slide18

Dot matrix

[Figures: dot plot patterns]
Two similar, but not identical, sequences
An insertion or deletion
A tandem duplication

Slide19

Dot matrix

[Figures: dot plot patterns]
An inversion
Joining sequences

Slide20

Limitations of dot matrix

Sequences with low-complexity regions give false diagonals
    Sequence regions with little diversity
Noisy and space inefficient
Limited to 2 sequences

Slide21

Dotplot exercise

Use the following three tools to generate dot plots for the given two sequences
YASS: genomic similarity search tool
    http://bioinfo.lifl.fr/yass/yass.php
Lalign/Palign
    http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
multi-zPicture
    http://zpicture.dcode.org/

Slide22

Dynamic programming

Breaks down the alignment problem into smaller problems
Examples
    Needleman-Wunsch algorithm: global alignment
    Smith-Waterman algorithm: local alignment
Three steps
    Initialization
    Scoring
    Traceback

Slide23

Gap penalties

Insertion of gaps in the alignment
Gaps should be penalized
Gap opening should be penalized higher than gap extension (or at least equal)
Typical values used with BLOSUM62
    Gap opening score = -11
    Gap extension score = -1

Gap extension
AAAGAGAAA
AAA--AAAA

Gap initiation
AAAGAGAAA
AAA-A-AAA
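The difference between the two gapped examples can be made concrete with an affine gap penalty. This is a sketch under a stated assumption: a gap of length k costs `gap_open + k * gap_extend` (the convention where the opening cost is charged in addition to the per-residue extension cost). Other tools charge `gap_open + (k - 1) * gap_extend`, so the absolute numbers depend on the convention. The function name `gap_penalty` is hypothetical.

```python
from itertools import groupby


def gap_penalty(aligned: str, gap_open: int = -11, gap_extend: int = -1) -> int:
    """Total affine gap penalty for one aligned sequence (assumed convention)."""
    total = 0
    for char, run in groupby(aligned):
        if char == "-":
            length = len(list(run))
            # One opening cost per gap, plus an extension cost per gapped position.
            total += gap_open + gap_extend * length
    return total


print("one gap of length 2: ", gap_penalty("AAA--AAAA"))
print("two gaps of length 1:", gap_penalty("AAA-A-AAA"))
```

Because opening a gap is much more expensive than extending one, the alignment with a single longer gap is penalized less than the one with two separate gaps.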

Slide24

Needleman-Wunsch vs Smith-Waterman

Needleman-Wunsch
Match = +2
Mismatch = -1
Gap = -1

       -    A    G    T    T    A
  -    0   -1   -2   -3   -4   -5
  A   -1    2
  G   -2
  T   -3
  G   -4
  C   -5
  A   -6

Smith-Waterman
Match = +2
Mismatch = -1
Gap = -1
All negative values are replaced by 0
Traceback starts at the highest value and ends at 0

       -    A    G    T    T    A
  -    0    0    0    0    0    0
  A    0    2
  G    0
  T    0
  G    0
  C    0
  A    0
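The initialization and scoring steps for both matrices can be sketched with one function, using the slide's sequences (AGTTA across the top, AGTGCA down the side) and scores (match +2, mismatch -1, gap -1). The function name `fill_matrix` and its `local` flag are hypothetical, a linear gap penalty is assumed, and traceback is omitted to keep the sketch short.

```python
def fill_matrix(cols: str, rows: str, match: int = 2, mismatch: int = -1,
                gap: int = -1, local: bool = False):
    """Fill a Needleman-Wunsch (local=False) or Smith-Waterman (local=True) matrix."""
    n_rows, n_cols = len(rows) + 1, len(cols) + 1
    m = [[0] * n_cols for _ in range(n_rows)]

    # Initialization: global alignment pays for leading gaps, local does not.
    if not local:
        for j in range(1, n_cols):
            m[0][j] = j * gap
        for i in range(1, n_rows):
            m[i][0] = i * gap

    # Scoring: each cell is the best of diagonal, up and left moves.
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            best = max(m[i - 1][j - 1] + s, m[i - 1][j] + gap, m[i][j - 1] + gap)
            m[i][j] = max(best, 0) if local else best
    return m


nw = fill_matrix("AGTTA", "AGTGCA")
sw = fill_matrix("AGTTA", "AGTGCA", local=True)

for name, matrix in (("Needleman-Wunsch", nw), ("Smith-Waterman", sw)):
    print(name)
    for row in matrix:
        print(" ".join(f"{value:3d}" for value in row))
```

The global score is read from the bottom-right cell of the first matrix, while the local score is the largest value anywhere in the second matrix, which is also where its traceback would start.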

Slide25

Needleman-Wunsch vs Smith-Waterman

Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)


Slide26

Dynamic programming: example

http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
Scoring
    Match = +2
    Mismatch = -2
    Gap = -1

Slide27

Dynamic programming exercise

Generate a scoring matrix for nucleotides (A, C, G, and T)
Align two sequences using dynamic programming
Align two sequences using the following tools
    EMBOSS Needle
        http://www.ebi.ac.uk/Tools/psa/emboss_needle/
    EMBOSS Water
        http://www.ebi.ac.uk/Tools/psa/emboss_water/

Slide28

Multiple sequence alignment

A multiple sequence alignment (MSA) is an alignment of three or more sequences
Why MSA?
    To identify patterns of conservation across more than 2 sequences
    To characterize protein families and generate profiles of protein families
    To infer relationships within and among gene families
    To predict secondary and tertiary structures of new sequences
    To perform phylogenetic studies

Slide29

Recall: dynamic programming

[Figures: dynamic programming with 2 sequences and with 3 sequences]
http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf

Slide30

MSA methods

Dynamic programming
    Align each pair of sequences
    Sum scores for each pair at each position
Progressive sequence alignment
    Hierarchical or tree-based method
    E.g. ClustalW, T-Coffee
Iterative sequence alignment
    Improved progressive alignment
    Realigns the sequences repeatedly
    E.g. MUSCLE
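The "sum scores for each pair at each position" idea listed under dynamic programming above is commonly called a sum-of-pairs score. The sketch below scores an already aligned set of sequences this way. The function name `sum_of_pairs`, the +2/-1/-1 scores, and the choice to score a gap aligned with a gap as 0 are all assumptions made for illustration.

```python
from itertools import combinations


def sum_of_pairs(msa, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Sum pairwise column scores over every pair of rows in an alignment."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have equal length")
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # gap against gap: scored 0 (assumption)
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


toy_msa = ["AGT", "AGT", "A-T"]
print(sum_of_pairs(toy_msa))
```

The number of pairs grows quadratically with the number of sequences, and an exact dynamic programming search over all sequences at once grows much faster still, which is part of why progressive and iterative methods are used in practice.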

Slide31

Tools for MSA


Slide32

ClustalW

Progressive sequence alignment
Basic steps
    Calculate pairwise distances based on pairwise alignments between the sequences
    Build a guide tree, which is an inferred phylogeny for the sequences
    Align the sequences
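The first basic step, computing pairwise distances, can be illustrated with a deliberately simplified distance. The sketch below uses the fraction of differing positions between equal-length sequences (a p-distance) and picks the closest pair, which is the pair a guide tree would typically join first. This is not ClustalW's actual procedure, which derives distances from pairwise alignments. The names `p_distance` and `closest_pair` and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Indices of the two most similar sequences."""
    return min(combinations(range(len(seqs)), 2),
               key=lambda pair: p_distance(seqs[pair[0]], seqs[pair[1]]))


toy_seqs = ["ACGT", "ACGA", "TTGA"]
for i, j in combinations(range(len(toy_seqs)), 2):
    print(i, j, p_distance(toy_seqs[i], toy_seqs[j]))
print("closest pair:", closest_pair(toy_seqs))
```

A progressive aligner would align the closest pair first and then add the remaining sequences in the order given by the guide tree.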

Slide33

Progressive MSA

[Figure: progressive MSA diagram; only the node labels survive in the transcript]

Slide34

MUSCLE

Iterative sequence alignment
Follows 3 steps
    Progressive alignment
    Second progressive alignment
    Refinement

Slide35

Phylogenetic tree

A phylogenetic tree shows evolutionary relationships between the sequences
Types:
    Rooted
        Nodes represent the most recent common ancestor
        Edge lengths represent time estimates
    Unrooted
        No ancestry and time estimates
Algorithms to generate a phylogenetic tree
    Neighbor-joining
    Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
    Maximum parsimony

Slide36

Neighbor joining method

http://en.wikipedia.org/wiki/Neighbor_joining


Slide37

MSA exercise

Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments
ClustalW2
    http://www.ebi.ac.uk/Tools/msa/clustalw2/
MUSCLE
    http://www.ebi.ac.uk/Tools/msa/muscle/

Slide38

What to align: DNA or protein sequence?

Many mismatches in DNA sequences are synonymous
DNA sequences contain non-coding regions, which should be avoided in homology searching
Matches are more reliable in protein sequences
    Probability of a given residue occurring randomly at any position in a sequence
        Amino acids: 1/20 = 0.05
        Nucleotides: 1/4 = 0.25
Searching at the protein level: in case of frameshifts, the alignment score for the protein sequences may be very low even though the DNA sequences are similar

ACT TTT CAT GGG ...
Thr Phe His Gly ...

ACT TTT TCA TGG G..
Thr Phe Ser Trp

If an ORF exists, then always align at the protein level
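The frameshift example above can be reproduced with a small translation sketch. Only the six codons that appear on the slide are included in the table, so this is not a full genetic code. The names `CODONS` and `translate` are hypothetical.

```python
# Partial standard codon table: only the codons used in the slide's example.
CODONS = {
    "ACT": "Thr", "TTT": "Phe", "CAT": "His",
    "GGG": "Gly", "TCA": "Ser", "TGG": "Trp",
}


def translate(dna: str):
    """Translate complete codons from the first base; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


original = "ACTTTTCATGGG"
shifted = "ACTTTTTCATGGG"   # one extra T inserted after the second codon

print(translate(original))
print(translate(shifted))
```

The two DNA strings differ by a single inserted base, yet every codon after the insertion is read differently. This is the case in which a protein-level alignment can score poorly even though the DNA sequences are nearly identical.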

Slide39

Searching bioinformatics databases using:

keywords, and
sequences

Slide40

Search strategy

Keyword search
    Find information related to specific keywords
    Each bioinformatics database has its own search tool
    Some search tools have a wide spectrum, accessing multiple databases and gathering the results together
        Gquery, EBI search
Sequence search
    Use a sequence of interest to find more information about the sequence
        BLAST, FASTA

Slide41

Keyword search

Find information related to specific keywords
Gquery
    A central search tool to find information in NCBI databases
    Searches a large number of NCBI databases and shows the results on one page
    http://www.ncbi.nlm.nih.gov/gquery
EBI search
    Search tool to find information in databases developed, managed and hosted by EMBL-EBI
    http://www.ebi.ac.uk/services

Slide42

Gquery


Slide43

EBI search


Slide44

Limitations

Synonyms
Misspellings
Old and new names/terms

[Figure: comparison of the terms ELA2 vs. ELANE and HIV 1 vs. HIV-1 in PubMed and ClinVar; the numbers in the transcript (8, 64, 110, 59, 20) cannot be matched to the terms]

NOTES:
    Use different synonyms and read the literature to find more appropriate keywords
    Use Boolean operators to combine different keywords
    Do not expect to find all the information using keyword search alone
    Note the database version or the version of entries in the databases you used

Slide45

Gene nomenclature

HUGO Gene Nomenclature Committee (HGNC)
    Assigns standardized nomenclature to human genes
    Each symbol is unique and each gene is given only one name
Species-specific nomenclature committees
    Mouse Genome Informatics Database
        http://www.informatics.jax.org/mgihome/nomen/
    Rat Genome Database
        http://rgd.mcw.edu/nomen/nomen.shtml

Slide46

HGNC symbol report

Approved symbol
Approved name
Synonyms
    Terms used in literature to indicate the gene
    HGNC, Ensembl, Entrez Gene, OMIM
Previous symbols and names
    Previous HGNC approved symbol
NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name, and gene names are written in italics.

Slide47

HGNC search


Slide48

Keyword search

Exercise