Andrew Cowley External Services, EMBL-EBI - PowerPoint Presentation

Uploaded On 2022-05-31

Presentation Transcript

Slide1

Andrew Cowley

External Services, EMBL-EBI

Sequence Searching and Alignments

Slide2

External Services

Sequence searching and alignments - Andrew Cowley

08/05/2012

Andrew Cowley: Bioinformatics Trainer

Hamish McWilliam: Software engineer

Rodrigo Lopez: Head of External Services

+ many others!

Slide3

Contents

Sequence databases

Database browsing tools

Similarity searching and alignments

Alignment basics

Similarity searching tools

More advanced tools

Alignment tools

Guidelines

Problem sequences

Slide4

Materials

www.ebi.ac.uk/~apc/Courses/Rotterdam

Slide5

Data

Simplistically, much of the data at the EBI can be thought of as a container:

One part is the raw data (e.g. sequence)

Another part is the annotation on this data

Slide6

Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
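The "container" idea can be made concrete with a few lines of code. The sketch below is only an illustration, not an official parser: the function name `split_embl_record` is hypothetical, and it assumes the usual flat-file layout in which each annotation line starts with a two-letter code, `XX` lines are spacers, the sequence follows the `SQ` line, and `//` ends the record.

```python
def split_embl_record(text):
    """Split an EMBL flat-file record into annotation lines and raw sequence."""
    annotation = {}        # two-letter line code -> list of line contents
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_sequence:
            # Sequence lines hold blocks of bases followed by a running count;
            # keep only the alphabetic blocks.
            sequence_parts.extend(tok for tok in line.split() if tok.isalpha())
            continue
        code, content = line[:2], line[2:].strip()
        if code == "XX":                   # spacer line, carries no data
            continue
        if code == "SQ":                   # header of the sequence section
            in_sequence = True
        annotation.setdefault(code, []).append(content)
    return annotation, "".join(sequence_parts)


record = """ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
//"""

annotation, sequence = split_embl_record(record)
print(annotation["DE"])    # the annotation part
print(sequence)            # the raw data part
```

For real work an established parser (for example the one in Biopython) is a safer choice than a hand-rolled one like this.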

Slide7

Data - Nucleotide

ENA/EMBL-Bank:

Release and updates

Divided into classes and divisions

Supplementary sets: EMBL-CDS, EMBL-MGA

Specialist data sets, e.g.:

Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.

Alternative splicing: ASD, ASTD, etc.

Completed genomes: Ensembl, Integr8, etc.

Variation: HGVBase, dbSNP, etc.

Slide8

Nucleotides: European Nucleotide Archive (ENA)

Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).

The ENA has a three-tiered data architecture. It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).

Slide9


Keyword and sequence searching

Map-based search of environmental samples

Downloads

Nucleotides: EMBL-Bank

EMBL-Bank

DDBJ

GenBank

www.insdc.org

Direct submissions

Patents

Genome-sequencing projects

Updates

Third-party annotation


Slide10

EMBL-Bank: Classes

CON – Constructed Sequence

EST – Expressed Sequence Tag

GSS – Genome Survey Sequence

HTC – High Throughput cDNA

HTG – High Throughput Genome

PAT – Patent

STS – Sequence Tagged Site

STD – Standard

TPA – Third Party Annotation

TSA – Transcriptome Shotgun Assembly (from r95)

WGS – Whole Genome Shotgun


Slide11

EMBL-Bank: Taxonomic Divisions

ENV - Environmental

FUN - Fungi

INV - Invertebrate

HUM - Human

MAM – Mammal (excluding human, mouse and rodent)

MUS - Mouse

PHG - Phage

PLN - Plant

PRO - Prokaryote

ROD – Rodent (excluding mouse)

SYN - Synthetic

TGN – Transgenic

UNC - Unclassified

VRL - Viral

VRT – Vertebrate (excluding human, mammal, mouse and rodent)


Slide12

Data – Protein Sequence

UniProt databases:

UniProtKB: human-curated and automatic translation sections

UniRef: non-redundant sequence clusters

UniParc: non-identical sequence archive

Sequence from structures: PDB, SGT

Specialist data sets, e.g.:

Immunoglobulins: IMGT/HLA

Alternative splicing: ASD, ASTD

Completed proteomes: Ensembl, Integr8

Protein Interactions: IntAct

Patent Proteins: EPO, JPO, KIPO and USPTO

Slide13

Protein sequence: UniProt

UniProt

Manual curation

Literature-based annotation

Sequence analysis

Automated annotation

Some data sources for annotation:

GO: functional info

PRIDE: protein identification data

InterPro: protein families and domains

IntAct: molecular interactions

IntEnz: enzymes

HAMAP: microbial protein families

RESID: post-translational modifications

Transmembrane prediction

InterPro classification

Signal prediction

Other predictions

Protein classification

Slide14

Databases

Many databases and they are getting bigger

Efficient searching involves knowledge of what is stored in these

Not everything in the databases is correct

Changes can happen...

Deletions, sequence modifications

Daily updates, identifier changes, etc.


Slide15

Searching databases


Slide16

Searching

Many ways of searching databases

Annotation/title

Know something about your sequence

Gene name

Function

Accession

Slide17

EBI Search


Slide18

EBI Search

Slide19

Database webpages

Slide20

Database searching

Slide21

Searching

Many ways of searching databases

Annotation/title

Know something about your sequence

Gene name

Function

Accession

Raw data

Don’t know!

Or want to check...

Infer extra information

Homology?

Annotation?

Function?

Slide22

Sequence alignment

Relatively easy if we have an exact match

...But sequence is variable

Between individuals, species, location etc.

That variability is useful data too!

Need a search method that allows for some variability

And even better, one that helps us assess that variability

Slide23

Sequence alignment

Query: ACATAGGT

1: TCATAGAT

2: AAATTCTG

Slide24

Sequence alignment

Query vs 1:

ACATAGGT

TCATAGAT

Query vs 2:

ACATAGGT

AAATTCTG

Slide25

Sequence alignment

Query vs 1 (score: 6/8):

ACATAGGT

TCATAGAT

Query vs 2 (score: 3/8):

ACATAGGT

AAATTCTG
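The 6/8 and 3/8 scores are simply counts of identical positions. A minimal sketch of that count, assuming ungapped sequences of equal length (the function name `identity_score` is made up for this example):

```python
def identity_score(query, subject):
    """Count identical positions in an ungapped, same-length comparison."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    return sum(q == s for q, s in zip(query, subject))


query = "ACATAGGT"
for name, subject in [("1", "TCATAGAT"), ("2", "AAATTCTG")]:
    print(f"Query vs {name}: {identity_score(query, subject)}/{len(query)}")
```

This works only because the sequences are short and already lined up; the following slides deal with what happens when they are not.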

Slide26

Sequence alignment

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaag

atgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttct

ttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaagg

cacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatct

caagggcacctttgcccagcttgagt

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggc

catggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttcctt

gggagatgccataaagcacctggatgatctcaagggca

Query:

1

2

Slide27

Dot plot

Maybe a dot plot will help

Query: ACATAG

Sequence 1: GATACT
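A dot plot is easy to draw in text form: put one sequence along the top, the other down the side, and mark every cell where the two residues are the same. The sketch below assumes the query runs along the top and uses `*` for a match; `dot_plot` is a name chosen for this illustration.

```python
def dot_plot(query, subject):
    """Return rows of a dot matrix: '*' where residues match, '.' otherwise."""
    return ["".join("*" if q == s else "." for q in query) for s in subject]


query, subject = "ACATAG", "GATACT"
print("  " + query)
for residue, row in zip(subject, dot_plot(query, subject)):
    print(residue + " " + row)
```

Runs of `*` along a diagonal show stretches where the two sequences agree without gaps.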

Slide28

Dot plot

[Figure: two dot plots, Query vs Sequence 1 and Query vs Sequence 2]

Slide29

We can see the difference, but how to turn that into something a computer can evaluate?

Computers rely on algorithms which give them a score

They can then compare scores


Slide30

Simple algorithm – penalise movement away from diagonal – gap penalty

[Figure: matrix in which moves along the diagonal score 0 and moves away from the diagonal score -10]

Slide31

Gap extend penalty?

Single block of insertions/deletions is more likely than multiple in/del events

NVELKAET

NVDEATNFELKAET

NV-ELKAET

NVDE--A-TNFELKAET

NV------ELKAET

NVDEATNFELKAET

Slide32

Gap extend

To encourage this we apply a high penalty just to open a gap, and a low penalty for each position the gap extends.

[Figure: scoring matrix showing a cost of -10.5 for a gap of one position (-10 to open, -0.5 to extend) and -11 for a gap of two positions]

Gap open = 10

Gap extend = 0.5
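The effect of separate open and extend penalties can be checked with a little arithmetic. The sketch below follows the numbers on the slide, where a gap of one position costs 10.5 and a gap of two costs 11, so a gap of length n costs 10 + 0.5 × n. Some tools count the first position differently, so treat the formula as an assumption. The fragmented alignment string is hypothetical and is there only for comparison.

```python
import re

GAP_OPEN = 10.0     # charged once per gap (value from the slide)
GAP_EXTEND = 0.5    # charged for every position in the gap (value from the slide)


def gap_penalty(length):
    """Penalty for a single gap of the given length (0 if there is no gap)."""
    if length <= 0:
        return 0.0
    return -(GAP_OPEN + GAP_EXTEND * length)


def total_gap_penalty(aligned):
    """Sum the penalties of every run of '-' in one aligned sequence."""
    return sum(gap_penalty(len(run)) for run in re.findall(r"-+", aligned))


print(total_gap_penalty("NV------ELKAET"))   # one block of six positions
print(total_gap_penalty("NV--E--L--KAET"))   # hypothetical: three gaps of two
```

Both strings contain six gap positions, but the single block is penalised far less than the three separate gaps, which is exactly the behaviour the slide asks for.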

Slide33

Match/mismatch

Of course, we need to tell the algorithm that matching letters are better than mismatches too

This is done via a scoring matrix

     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5

Slide34

Putting the two together gives us a scoring mechanism

[Figure: scoring matrix for two short sequences, filled in with scores that combine the match/mismatch values and the gap penalties]

Slide35

To pick the optimal alignment, start at the end and trace back the highest scoring route.

[Figure: the same scoring matrix, with the highest scoring route traced back from the end]

Slide36

Needleman-Wunsch

Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!

An example of dynamic programming

Comparing the full length of both sequences is called a global-global or just global alignment
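For readers who want to see the whole procedure in one place, here is a compact sketch of a global alignment by dynamic programming. It uses the 5/-4 match/mismatch values from the scoring matrix slide but, to stay short, a single linear gap penalty of -10 per position instead of separate open and extend penalties. That simplification, and the function name, are choices made for this example.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill every cell with the best of: diagonal step, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACATAGGT", "ACATGGT")
print(best)
print(top)
print(bottom)
```

The tools described later (for example Needle) implement the same idea with affine gap penalties and full substitution matrices.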

Slide37

Global vs Local

But global-global might not be suitable for sequences that are very different lengths

A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.

It sets negative scores in the matrix to 0, so an alignment can end and a new one can restart anywhere; the trace back begins at the highest scoring cell and stops when it reaches a 0
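The change from global to local alignment is small in code terms. The sketch below is an illustration under the same simplifying assumptions as before (5/-4 match/mismatch, linear gap penalty of -10, hypothetical function name): scores are floored at 0, the best cell anywhere in the matrix is remembered, and the trace back stops at the first 0.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart instead of going negative.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACATAGTTT", "CCACATAGCC"))
```

Only the shared ACATAG core is reported; the unrelated flanking residues are left out of the alignment rather than being forced into it.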

Slide38

Global vs Local


A T G T A T A C G C

A G T A T A - G C

A - T G T A T A C G C

A G T A T A - - - G C

Slide39

Scoring

Parameters so far:

Match/mismatch

Gap opening

Gap extending

Can we improve it?

Slide40

Substitutions

Some substitutions are more likely than others

DNA:

Purines (A, G) – dual ring

Pyrimidines (C, T) – single ring

Substitutions within the same type (purine for purine, or pyrimidine for pyrimidine) are called transitions, whereas exchanging a purine for a pyrimidine, or vice versa, is called a transversion

Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
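As a sketch of how this could feed into a scoring matrix, the code below classifies each pair of bases and assigns a score per class. The match and transversion values reuse the 5 and -4 seen earlier; the transition score of -1 is an arbitrary illustrative value, not one taken from the slides or from any particular tool.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Illustrative scores only: the transition value is an assumption.
SCORES = {"match": 5, "transition": -1, "transversion": -4}


def substitution_type(x, y):
    """Classify a pair of bases as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(bases="ACGT"):
    """Return a dict mapping (base, base) to its score."""
    return {(x, y): SCORES[substitution_type(x, y)] for x in bases for y in bases}


matrix = build_matrix()
print("    " + "   ".join("ACGT"))
for x in "ACGT":
    print(x + " " + " ".join(f"{matrix[(x, y)]:3d}" for y in "ACGT"))
```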

Slide41


Slide42

Proteins

What about proteins?

Slide43

Protein substitution matrices

Can look at closely related proteins to determine substitution rates

Two most commonly used models:

BLOSUM

PAM

Slide44

BLOSUM

Blocks of Amino Acid Substitution Matrix

Align conserved regions of evolutionarily divergent sequences clustered at a given % identity

Count relative frequencies of amino acids and substitution probability

Turn that into a matrix where the more positive a substitution is, the more likely it is to be found, and the more negative, the less likely.

Higher BLOSUM number = more closely related

Slide45

PAM

Point Accepted Mutation

Observed mutations in a set of closely related proteins

Markov chain model created to describe substitutions

Normalised so that PAM1 = 1 mutation per 100 amino acids

Extrapolate matrices from model

Higher PAM number = less closely related

[Figure: PAM 250 matrix]
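The "extrapolate" step amounts to applying the PAM1 transition probabilities repeatedly, that is, raising the PAM1 matrix to a power. The sketch below shows this on a made-up two-letter alphabet in which each letter has a 1% chance of changing per step. The real PAM1 matrix is 20 × 20 with different rates for each amino acid pair, and the published matrices are log-odds scores derived from the resulting probabilities, which is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": two letters, each with a 1% chance of mutating per step.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 100, 250):
    stay = mat_pow(PAM1_TOY, n)[0][0]
    print(f"PAM{n}: probability a letter is unchanged = {stay:.3f}")
```

As the PAM number grows, the probability that a position still shows its original letter falls towards the background level, which is why higher PAM numbers suit more divergent sequences.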

Slide46

Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence

[Figure: panels for PAM 10, 100, 200, 300, 400 and 500]

Slide47

Matrices paired on the slide, from more divergent to less divergent:

BLOSUM 45, PAM 250 (more divergent)

BLOSUM 62, PAM 160

BLOSUM 90, PAM 100 (less divergent)

Slide48

Scoring

Parameters:

Match/mismatch

Gap opening

Gap extending

Substitution matrix

Slide49

Dynamic programming alignments at the EBI

EMBOSS Pairwise Alignment Algorithms

European Molecular Biology Open Software Suite

Suite of useful tools for molecular biology

Command line based

Designed to be used as part of scripts/chained programs

We implement selected tools to provide web-based access

Slide50

Where to find at the EBI?

http://www.ebi.ac.uk/Tools/psa

Or...

Slide51

Where to find at the EBI?

Slide52

Pairwise alignment tools

Global alignment: Needle, Stretcher

Local alignment: Water, Matcher, LALIGN

Slide53


Submit!

Parameters

Sequence input

Change to nucleotide

Slide54


Slide55

Key

- Gap

: Positive match

. Negative match

| Identity

Slide56

Example sequences

www.ebi.ac.uk/~apc/Courses/Rotterdam

Pairwise_align1.fsa

Pairwise_align2.fsa

Slide57

Dynamic programming sequence search methods at the EBI

Global alignment: GGSEARCH

Local alignment: SSEARCH

Global query vs local database: GLSEARCH

Slide58

Where to find at the EBI?

www.ebi.ac.uk/Tools/sss/

Or...

Slide59

Where to find at the EBI?


Slide60

Similarity search


Database selection

Sequence input

Parameters

Submit!

Slide61

Dynamic programming methods are rigorous and guarantee an optimal result

But take up a lot of memory

And evaluate each position of the matrix

Predictably, this makes them slow and demanding when you are aligning large sequences

Slide62

Heuristics

Therefore we need methods of estimating alignments

Estimation methods are called heuristics

Try and take short cuts in an intelligent manner

Speed up the search

At the possible expense of accuracy

Accuracy in sequence searches is important for:

Aligning the right bits

Scoring the alignment correctly

Identifying similar sequences - sensitivity

Slide63

Going back to our dot plot


Slide64

Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.


Slide65

Of course, we have to identify likely regions – not all alignments will be as nice as that one!

This is the method used by FASTA

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Slide66

FASTA – step 1

Identify runs of identical sequence and pick regions with highest density of runs

Ktup parameter:

The word size: runs of identical sequence shorter than Ktup are ignored

Increase Ktup = faster, but less sensitive
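A rough sketch of this first step: index every word of length ktup in the query, look up each word of the database sequence, and count hits per diagonal (database offset minus query offset). Diagonals with many hits are the "likely regions". The function name and the toy sequences are assumptions for illustration; the real FASTA program does considerably more bookkeeping.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count exact ktup-word matches on each diagonal (subject pos - query pos)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals


query, subject = "ACATAGGT", "TCATAGAT"
print(ktup_diagonals(query, subject, ktup=2))
print(ktup_diagonals(query, subject, ktup=4))
```

With ktup=2 the main diagonal collects four word hits plus one stray hit elsewhere; with ktup=4 only two hits remain, all on the main diagonal. Fewer hits mean less work, but a weakly similar region with no long identical run would be missed entirely.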

Slide67

FASTA – step 2

Weight scoring of runs using matrix, trim back regions to those contributing to highest scores

Parameter: Substitution matrix

Slide68

FASTA – step 3

Discard regions too far from the highest scoring region

Joining threshold: internally determined

Slide69

FASTA – step 4

Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions

Parameters:

Gap open

Gap extend

Substitution matrix

Slide70

FASTA

Repeat against all sequences in the database

Slide71

FASTA – programs available at EBI

FASTA: “a fast approximation to Smith & Waterman”

FASTA – scan a protein or DNA sequence library for similar sequences.

FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.

TFASTX/Y – compare a protein sequence to a translated DNA data bank.

FASTF – compares ordered peptides (Edman degradation) to a protein databank.

FASTS – compares unordered peptides (Mass Spec.) to a protein databank.

SSEARCH – Rigorous scan of protein or DNA sequence library (S&W Algorithm).

Slide72

Where to find at the EBI?

www.ebi.ac.uk/Tools/sss/

Or...

Slide73

Where to find at the EBI?


Slide74


Slide75

Similarity search


Database selection

Sequence input

Parameters

Submit!

Slide76

Example sequence

www.ebi.ac.uk/~apc/Courses/Rotterdam

test_prot.fasta

Slide77

FASTA - results

Slide78

FASTA - results

Slide79

FASTA - results

Slide80

FASTA - results

Key

- Gap

: Identity

. Similarity

X Filtered

Slide81

BLAST – Basic Local Alignment Search Tool

Instead of narrowing the dynamic programming search space, BLAST works a different way

Firstly, it creates a word list both of the exact sequence and high scoring substitutions


Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Slide82

BLAST – step 1

w=3

SEWRFKHIYRGQPRRHLLTTGWSTFVT

SEW

EWR

WRF

Parameter:

Word length (w)

Increase = faster, but less sensitive

Slide83

BLAST – step 1 (cont'd)

w=3

T=13

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Substitutions scored against the word GQP:

GQP 18

GEP 15

GRP 14

GKP 14

GNP 13

GDP 13

AQP 12 (below T)

NQP 12 (below T)

Parameters:

Neighbourhood threshold (T)

Substitution matrix
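The scores on this slide can be reproduced by scoring every possible three-letter word against GQP and keeping those that reach the threshold T. The sketch below assumes the matrix is BLOSUM62 (its values give exactly the scores shown) and, to keep the listing short, includes only the three matrix rows needed for the word GQP. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for G, Q and P only, in the column order of AMINO_ACIDS.
BLOSUM62_ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
}


def word_score(word, candidate):
    """Score a candidate word against the query word, position by position."""
    return sum(BLOSUM62_ROWS[w][AMINO_ACIDS.index(c)]
               for w, c in zip(word, candidate))


def neighbourhood(word, threshold):
    """All same-length words scoring at least `threshold`, best first."""
    hits = []
    for candidate in map("".join, product(AMINO_ACIDS, repeat=len(word))):
        s = word_score(word, candidate)
        if s >= threshold:
            hits.append((candidate, s))
    return sorted(hits, key=lambda hit: -hit[1])


for candidate, s in neighbourhood("GQP", threshold=13):
    print(candidate, s)
```

Raising T shrinks the word list, which speeds up the search but makes it less sensitive; lowering T does the opposite.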

Slide84

BLAST – step 2

Then it scans database sequences for exact matches with these words

Slide85

BLAST – step 3

If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

This results in a High-scoring Segment Pair (HSP)

Parameters:

Drop off

Substitution matrix
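The "extend until the score drops" rule can be sketched as follows. This is a simplified, one-directional, ungapped extension using the DNA match/mismatch scores from earlier (5 and -4) and an arbitrary drop-off of 10. BLAST itself extends in both directions from the word hits and uses the chosen substitution matrix, so the function and its defaults are illustrative assumptions only.

```python
def extend_right(query, subject, q_start, s_start,
                 match=5, mismatch=-4, drop_off=10):
    """Extend an ungapped alignment to the right from (q_start, s_start).

    Stops when the running score falls more than `drop_off` below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    length = best_length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > drop_off:
            break
    return best, best_length


print(extend_right("ACATAGGTCCCC", "ACATAGATGGGG", 0, 0))
```

In this example the extension survives a single mismatch, peaks after eight positions, and is abandoned once three further mismatches have pulled the score more than 10 below that peak.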

Slide86

BLAST – step 4

If the total HSP score is above another threshold then a gapped extension is initiated

Parameters:

Extension threshold (Sg)

Substitution matrix

Slide87

BLAST

The steps rule out many database sequences early on

Large increase in speed

Slide88

BLAST – programs available at the EBI

Basic Local Alignment Search Tool

NCBI-BLAST programs:

BLASTP – protein sequence vs. protein sequence library

BLASTN – nucleotide query vs. nucleotide database

BLASTX – translated DNA vs. protein sequence library

WU-BLAST programs:

BLASTP – protein query vs. protein database

BLASTN – nucleotide query vs. nucleotide database

BLASTX – translated nucleotide query vs. protein database

TBLASTN – protein query vs. translated nucleotide database

TBLASTX – translated nucleotide query vs. translated nucleotide database


Combines several parameters into ‘sensitivity’ option

Slide89


Slide90

Example sequence

www.ebi.ac.uk/~apc/Courses/Rotterdam

test_prot.fasta

Slide91

Key

- Gap

[residue] Identity

+ Similarity

X Filtered

Slide92

Differences between BLAST and FASTA

BLAST

Fast

Good with proteins

Produces good local alignments + short global alignments

Produces HSP (reports internal matches in long sequences)

Might miss a potential alignment due to ruling out sequences early on in the process

Good at finding siblings

FASTA

Not as fast as BLAST

Much better with DNA than BLASTN

Produces S&W alignments

Checks each possible alignment with database sequences

Good at finding cousins

Slide93

When to use what?

[Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH plotted against database size and query length]

Slide94

When to use what?

[Figure: time to search with FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH across PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]

Slide95

Homology and Similarity

Slide96

Similarity

Slide97

Homology

Slide98

Unrelated!*


*OK, very distantly related!

Slide99

Homology vs. Similarity

Homology:

Presence of similar features because of common descent

Cannot be observed, since the ancestors no longer exist

Is inferred as a conclusion based on ‘similarity’

Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)

Similarity:

Quantifies a ‘likeness’

Uses statistics to determine ‘significance’ of a similarity

Statistically significant similar sequences are considered ‘homologous’

Slide100

So far, we’ve talked about scoring alignments

Direct function of the algorithm

But what we want is to assign some kind of quality to that score


Slide101

Score vs significance

A A A

A A A

A C A T A A G G C T

A T A C A A G C C T

High score

High significance

Slide102

“Lies, damn lies, and statistics”


Slide103

“Lies, damn lies, and statistics”

Not just interested in score...

...But how likely we are to get that alignment by chance alone

It is from this ‘non-random’ alignment that we infer homology

Statistics are used to estimate this chance

Slide104

E-value

‘Expect’ value

Number of alignments scoring at least this well that you would expect to find by chance in a search of this size

Best measure of how biologically significant an alignment is

Used for ranking results by default
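For BLAST-style statistics the expect value is commonly written as E = K · m · n · e^(−λS), where S is the raw score, m the query length and n the database length. The sketch below shows how E responds to score and database size. The λ and K values are ones often quoted for BLOSUM62 with gap penalties 11/1 and are used here only as illustrative assumptions; real programs also apply length corrections that are left out.

```python
import math

LAMBDA = 0.267   # assumed scale parameter for the scoring system
K = 0.041        # assumed search-space constant for the scoring system


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of seeing at least one such chance alignment."""
    return -math.expm1(-e)


for score in (40, 60, 80):
    e = e_value(score, query_len=250, db_len=50_000_000)
    print(f"score {score}: E = {e:.3g}, P = {p_value(e):.3g}")
```

Two things follow directly from the formula: the same score becomes less significant as the database grows, and for small values E and P are almost identical.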

Slide105

Calculated in slightly different ways for BLAST and FASTA

Short alignments are more likely to be found by chance so have higher E-values

Affected by parameter values like gap penalties and substitution matrices

BLAST and FASTA both optimised for distant relationships


Slide106

FASTA statistics

Compares query sequence with every sequence in database

As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance

Slide107

FASTA - histogram

Key

* Predicted distribution of scores

= Observed distribution of scores

High scoring region

Slide108

BLAST statistics

Main reason for speed is that it doesn’t compare query with lots of other sequences

Therefore it pre-estimates statistical values using a random sequence model

“Appears to yield fairly accurate results”

Slide109

Search Guidelines

Slide110

Search guidelines 1

Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)

Then with translated DNA query sequences (fastx, blastx)

Search with DNA vs. DNA as the next resort

And then against translated DNA database sequences (tfastx, tblastx) as the VERY LAST RESORT!


Slide111

Search guidelines 2

Search the smallest database that is likely to contain the sequence(s) of interest

Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology

Slide112

Search guidelines 3

Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence

Examine the histograms

Use programs such as prss3 to confirm the expectation values.

Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() of ~1.0

Slide113


Slide114

Search guidelines 4


Default parameters are set up for most common queries

Consider searches with different gap penalties and other scoring matrices, especially for short queries/domains

Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences

Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)

Remember to change the gap penalty defaults!

MATRIX      open   ext.
BLOSUM50    -10    -2
BLOSUM62    -11    -1
BLOSUM80    -16    -4
PAM250      -10    -2
PAM120      -16    -4

Slide115

Search guidelines 5

Homology can be reliably inferred from statistically significant similarity

But remember:

Orthologous sequences have similar functions

Paralogous sequences can acquire very different functional roles

So further work might be needed to tease out details

Slide116


Slide117

Search guidelines 6

Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues

However, motif identity in the absence of significant sequence similarity usually occurs by chance

Try to produce multiple sequence alignments in order to examine the relatedness of your sequence data

ClustalW/Omega

MUSCLE

T-Coffee

Kalign

MAFFT

Mview (available from EBI FASTA & BLAST services)

DBCLUSTAL (available from EBI BLAST services)


Slide118

Advanced

Slide119

In general, the more information we can add to an alignment, the better the result


Conserved regions

Structural information

Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
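A motif written this way translates directly into a regular expression, which makes it easy to scan sequences for it. The sketch below assumes the notation used on the slide (positions separated by hyphens, alternatives listed in square brackets); the helper name and the test sequences are made up for the example.

```python
import re


def motif_to_regex(motif):
    """Turn '[R, T or D]-[D, A or Q]-...-A-T-H' into a regular expression."""
    parts = []
    for element in motif.split("-"):
        letters = re.findall(r"[A-Z]", element)   # residue codes are uppercase
        parts.append(letters[0] if len(letters) == 1
                     else "[" + "".join(letters) + "]")
    return "".join(parts)


pattern = motif_to_regex("[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H")
print(pattern)

for seq in ("MKTDFATHLL", "MKGDFATHLL"):   # hypothetical sequences
    hit = re.search(pattern, seq)
    print(seq, "->", (hit.start(), hit.group()) if hit else "no match")
```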

Slide120

Conserved regions

We can add a new ‘position’ parameter to the substitution matrix

We can even modify a normal search to generate a position specific scoring matrix, or PSSM

Slide121

PSI-BLAST

Position Specific Iterative – BLAST:

Takes the result of a normal BLAST

Aligns them and generates profile of conserved positions

Uses this to weight scoring on next iteration
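To give a feel for what a "profile of conserved positions" looks like, the sketch below builds a very simple position specific scoring matrix from a handful of aligned hits: for each column it counts residues, adds a pseudocount, and converts the frequencies to log-odds against a uniform background. The hit sequences are invented, and PSI-BLAST's actual weighting scheme is more sophisticated than this.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Per-column log-odds scores from equal-length, ungapped aligned sequences."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def pssm_score(pssm, sequence):
    """Score a sequence of the same length against the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


hits = ["GQPRR", "GEPRK", "GQPKR", "GKPRR"]   # hypothetical aligned hits
profile = build_pssm(hits)
print(round(pssm_score(profile, "GQPRR"), 2))
print(round(pssm_score(profile, "AAAAA"), 2))
```

Fully conserved columns (the G and the P here) contribute the largest scores, so on the next iteration a candidate is rewarded most for matching exactly those positions.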

Slide122

PSI-BLAST

By adding importance to conserved residues we might be able to find more distant sequences

But iterate too far and we might be assigning importance where there is none

More sensitive

Slide123

PSI-BLAST

Slide124

PSI-BLAST

Slide125

PSI-BLAST

Slide126

PHI-BLAST

Pattern Hit Initiated BLAST

User provides a pattern alongside a protein

Database hits must contain this pattern and also show similarity to the rest of the sequence

Results can initiate a PSI-BLAST search as well

Slide127

Problem Sequences

Slide128

Short sequences

What about short sequences?

Depends on their nature:

Protein

Use shallow matrices

Reduce word length and/or increase the E() value cut off

DNA

Reduce the word length

Ignore gap penalties (force local alignments only)

Use rigorous methods

But ask what you are trying to do!

Slide129

Low complexity regions

Biologically irrelevant, but likely to skew alignment scoring

E.g. CA repeats, poly-A tails and proline-rich regions

Slide130

Good statistics:

The inset shows good correlation between the observed and expected numbers of scores.

This is the region of the histogram to look at first when evaluating results.

Slide131

Bad statistics:

The inset shows poor correlation between the observed and expected scores in this search.

The spaces between the = and * symbols indicate this poor correlation.

One reason for this can be low complexity regions.

Slide132

Low complexity regions

Biologically irrelevant, but likely to skew alignment scoring

E.g. CA repeats, poly-A tails and proline-rich regions

Compensate by filtering sequence so these regions don’t contribute to scoring

Filters: seg, xnu, dust, CENSOR

But check what you are filtering!
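
To get a feel for what such a filter does, here is a rough sketch that masks windows of low Shannon entropy. It is not the seg, xnu, dust or CENSOR algorithm; the window size, the threshold and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(sequence, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that falls in a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Made-up sequence with a proline-rich stretch in the middle.
    seq = "MKTAYIAKQRQISFVKSHFSRQ" + "P" * 16 + "LEERLGLIEVQAPILSRVGDGT"
    print(mask_low_complexity(seq))
```

Printing the masked sequence next to the original is a simple way to check what you are filtering before running the search.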

Slide133

Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.

Note that there is now good agreement between the observed and expected high scores in the search and that the distance between = and * has been significantly reduced.

Slide134

Example sequence

www.ebi.ac.uk/~apc/Courses/Rotterdam

Filtertest_seq.fsa

Slide135

HOE!

Homologous Over-Extension

Possible side effect of iteration based methods

Slide136

Slide137

Slide138

Slide139

Slide140

Slide141

Reducing HOE

Look for domains in the results and manually select the sequences that form part of the PSSM

Mask boundaries according to the initial alignment

Reduces false positives (improves selectivity)

Slide142

PSI-SEARCH

Smith-Waterman implementation (SSEARCH)

With iterative position specific scoring

Optional boundary masking to reduce HOE

Slide143

PSI-Search

Slide144

Vector contamination

You think you know what your sequence is...

...but the results are really confusing!

Maybe you have vector contamination

Search against known vectors to check

Slide145

Vector contamination

Slide146

Example sequences

www.ebi.ac.uk/~apc/Courses/Rotterdam

vectortest_seq1.fsa

vectortest_seq2.fsa

Slide147

Multiple Sequence Alignments

Slide148

Uses of MSA

Functional prediction

Phylogeny

Structural prediction

Protein analysis

To distinguish between orthology and paralogy

Slide149

Ideally, we might think to build up multiple alignments through a weighted sum of pairs (pairwise scores)

But this is too computationally intensive

And doesn’t make much biological sense

Slide150

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                 *: : : * . : .: * : * : .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                 . .:: *. : . : *. * . : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                 : : .: . .. . :

Weighted Sums of Pairs: WSP

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years

Time O(L^N)
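
The growth in the table can be reproduced with a few lines of arithmetic. The sketch below assumes that two sequences take 1 second and that each extra sequence multiplies the time by L = 150, the ratio between the first two rows; both numbers are read off the table rather than measured, and the function name is hypothetical.

```python
def wsp_seconds(n_sequences, length=150, base_seconds=1.0):
    """Estimated WSP run time, scaling as length ** (n_sequences - 2)."""
    return base_seconds * length ** (n_sequences - 2)


if __name__ == "__main__":
    for n in range(2, 8):
        seconds = wsp_seconds(n)
        print(f"{n} sequences: {seconds:.3g} s "
              f"= {seconds / 3600:.3g} h = {seconds / 86400:.3g} days")
```

Each added sequence costs another factor of L, which is why exact WSP alignment becomes impractical after only a handful of sequences.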

Slide151

Ideally, we might think to build up multiple alignments through a weighted sum of pairs (pairwise scores)

But this is too computationally intensive

And doesn’t make much biological sense

So use heuristics and progressive alignment methods

Slide152

ClustalW

>60,000 citations

Clustal1-Clustal4, 1988

Paul Sharp, Dublin

Clustal V, 1992

EMBL Heidelberg: Rainer Fuchs, Alan Bleasby

Clustal W, Clustal X, 1994-2005

Toby Gibson, EMBL, Heidelberg

Julie Thompson, ICGEB, Strasbourg

Clustal W and Clustal X 2.0, 2006

University College Dublin

www.clustal.org

Slide153

CLUSTAL

Quick, pairwise alignment of all sequences

Line up pairs, with the most similar first

Slide154

CLUSTAL

Fix the alignment between pairs and treat as one sequence

Slide155

CLUSTAL

Align your fixed pairs with each other
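
The first step of this progressive approach, scoring all pairs and taking the most similar first, can be sketched as follows. The similarity measure here is Python's `difflib.SequenceMatcher` ratio, used purely as a stand-in for a quick pairwise alignment score; it is not the measure Clustal uses, and the function names and sequence labels are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(a, b):
    """Stand-in similarity score in [0, 1]; not Clustal's own measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def join_order(sequences):
    """Return (score, name1, name2) for every pair, most similar pair first."""
    pairs = [
        (pairwise_similarity(sequences[x], sequences[y]), x, y)
        for x, y in combinations(sorted(sequences), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Short N-terminal fragments of three globins, for illustration only.
    fragments = {
        "human_beta": "VHLTPEEKSAVTALWGKV",
        "horse_beta": "VQLSGEEKAAVLALWDKV",
        "lupin": "GALTESQAALVKSSWEEF",
    }
    for score, first, second in join_order(fragments):
        print(f"{first} + {second}: {score:.2f}")
```

A progressive aligner would align the top pair first, fix that alignment, and then add the remaining sequences or groups in order.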

Slide156

Note, this is not a phylogram!

Only a guide tree for the alignment

Slide157

ClustalW at the EBI

Slide158

Slide159

MSA

Sequence input

Parameters

Submit!

Slide160

ClustalW

Slide161

ClustalW

Slide162

Jalview

Slide163

ClustalW

Advantages

Fast

Not too demanding

Widely used

Fine for most uses

Disadvantages

Fixing of early alignments

Propagate errors

Doesn’t search far

Local minima

Compresses gaps

Slide164

Example sequences

www.ebi.ac.uk/~apc/Courses/Rotterdam

Prot_MSA.fsa

Slide165

Example sequences

www.ebi.ac.uk/~apc/Courses/Rotterdam

Problem_MSA.fsa

Slide166

NEW!: Clustal Omega

Completely different way of doing things from ClustalW

Two major areas of improvement:

1) Guide tree generation

2) Profile-profile alignments

Slide167

Clustal Omega – Guide Tree improvements

Guide tree generation is one of the slowest steps

Especially with large numbers of sequences

Clustal Omega uses the embed method to sample a range of sequences and represent all sequences as vectors of distances to these samples

Results in better scaling with more sequences
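
The idea of representing each sequence by its distances to a few samples can be sketched as below. This is an illustration of the principle only, not Clustal Omega's implementation: the k-mer distance, the evenly spaced choice of samples and all function names are assumptions.

```python
def kmer_set(sequence, k=3):
    """All overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_distance(a, b, k=3):
    """1 - Jaccard similarity of k-mer sets (0 = identical k-mer content)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def embed(sequences, n_samples=2):
    """Represent each sequence as a vector of distances to sampled sequences."""
    step = max(1, len(sequences) // n_samples)
    samples = sequences[::step][:n_samples]  # assumption: evenly spaced samples
    return [[kmer_distance(seq, ref) for ref in samples] for seq in sequences]


if __name__ == "__main__":
    seqs = ["VHLTPEEKSAVTALWGKV", "VQLSGEEKAAVLALWDKV",
            "VLSPADKTNVKAAWGKVGAH", "VLSAADKTNVKAAWSKVGGH"]
    for vector in embed(seqs):
        print([round(d, 2) for d in vector])
```

With N sequences and a fixed number of samples, only N times n_samples distances are computed instead of all N squared pairs, and the short vectors can then be clustered to build the guide tree.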

Slide168

Clustal Omega – Profile-profile alignments

Like sequence searching, profiles can be used to increase sensitivity

HMMs are a form of profile

Clustal Omega aligns HMMs to HMMs

Slide169

Clustal Omega

Better scaling for many sequences

Speed

Accuracy

Better scaling for many computers

More accurate alignments

But currently protein only

Slide170

Other Tools

Slide171

COFFEE

Consistency based Objective Function For alignmEnt Evaluation

Maximum Weight Trace (John Kececioglu)

Maximise similarity to a LIBRARY of residue pairs

Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.

Slide172

COFFEE

Library of reference pairwise alignments

For your given set of sequences

Objective Function

Evaluates consistency between multiple alignment and the library of pairwise alignments

Use SAGA to optimise this function

Weigh depending on quality of alignment

SAGA is another alignment method, using genetic algorithms

Slide173

COFFEE

More accurate than ClustalW

Much less prone to problems in early alignment stages

VERY slow!

Slide174

T-Coffee

Tree-based COFFEE

Heuristic approach to COFFEE

Gets rid of genetic algorithm portion

Uses progressive alignments

Changes algorithm based on number of sequences

Slide175

T-Coffee

Much faster than COFFEE

Avoids some of ClustalW’s pitfalls

Can take information from several data sources

Still not that fast

Can be very demanding of memory etc.

Slide176

Others

MUSCLE – Bob Edgar

Iterative/progressive alignment

Fast

Good for big alignments, proteins

MAFFT

Iterative based Fast Fourier Transform

Fast and accurate

Good for huge alignments

Kalign

Very fast, local-regions aligning

Good for very large numbers of alignments!

Slide177

Which tool should I use?

Input data                                  Recommendation
2-100 sequences of typical protein length   MUSCLE, T-Coffee, MAFFT, ClustalW/Omega
100-500 sequences                           Clustal Omega, MUSCLE, MAFFT
>500 sequences                              Clustal Omega, KALIGN
Small number of unusually long sequences    ClustalW, KALIGN
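
The recommendations above can be written as a small helper for use in scripts. This is only a sketch of the table: the function name is hypothetical, and because the ranges overlap at exactly 100 sequences, that case is assigned to the first row here as an arbitrary choice.

```python
def recommend_tools(n_sequences, unusually_long=False):
    """Suggest MSA tools for an input, following the table above."""
    if n_sequences < 2:
        raise ValueError("a multiple alignment needs at least 2 sequences")
    if unusually_long:
        # Small number of unusually long sequences.
        return ["ClustalW", "KALIGN"]
    if n_sequences <= 100:
        return ["MUSCLE", "T-Coffee", "MAFFT", "ClustalW/Omega"]
    if n_sequences <= 500:
        return ["Clustal Omega", "MUSCLE", "MAFFT"]
    return ["Clustal Omega", "KALIGN"]


if __name__ == "__main__":
    for n in (10, 250, 5000):
        print(n, recommend_tools(n))
    print("long", recommend_tools(5, unusually_long=True))
```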

Slide178

How to evaluate?

Use a benchmark

BaliBASE

Slide179

BaliBASE

Thompson, JD, Plewniak, F. and Poch, O. (1999)

NAR and Bioinformatics

ICGEB Strasbourg

141 manual alignments using structures

5 sections

core alignment regions marked

1. Equidistant (82)

2. Orphan (23)

3. Two groups (12)

4. Long internal gaps (13)

5. Long terminal gaps (11)

Slide180

Benchmark pitfalls

Benchmark dataset may not be representative

Danger of over-training towards benchmark

Goldman: Most MSAs have unrealistic gaps

Tend towards multiple, independent deletions

Insertions are rare

Sequences shrink in length over evolution

No supporting evidence that this is the case

Slide181

Solutions

Use phylogenetic data to guide alignment

Keep track of changes to ancestor sequences

Don’t change them again so easily in descendants

Slide182

PRANK

Probabilistic Alignment Kit

webPRANK

Better suited for closely related sequences

Tied solutions are resolved by choosing one at random

Avoids incorrect confidence in result

Means alignments might not be reproducible

Alignments look quite different

Might look worse!

But gap patterns make sense

Gaps are good!

Slide183

Slide184

Slide185

Common problems with MSA

Input format

Try using FASTA format

Unique sequence identifiers

Include sequence!

Usually limit of 500 sequences/1MB

Job can’t be found/other error

Results deleted after 7 days

Some sequence/program combinations run out of memory

Use a different program
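
Most of the input problems listed above can be caught before submitting a job. The following is a minimal sketch of such a check for FASTA input; the function names are hypothetical, and the default limit of 500 sequences is taken from the list above rather than from any particular service's documentation.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", ""])
        elif records:
            records[-1][1] += line
        else:
            raise ValueError("sequence data found before the first '>' header")
    return [(name, seq) for name, seq in records]


def check_fasta(text, max_sequences=500):
    """Return a list of problems: missing or duplicate IDs, empty records, size."""
    records = parse_fasta(text)
    problems = []
    seen = set()
    for name, seq in records:
        if not name:
            problems.append("record with an empty identifier")
        elif name in seen:
            problems.append(f"duplicate identifier: {name}")
        seen.add(name)
        if not seq:
            problems.append(f"no sequence for: {name}")
    if len(records) > max_sequences:
        problems.append(f"too many sequences: {len(records)} > {max_sequences}")
    return problems


if __name__ == "__main__":
    print(check_fasta(">a\nMKT\n>a\n"))
```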

Slide186

Common mis-uses of MSA

Performing a sequence assembly

Specialist type of MSA

Use other tools (Staden etc.)

Aligning ESTs to a reference genome

Use EST2Genome

Designing primers

Use primer tools (primer3 etc.)

Aligning two sequences

Use a pairwise alignment tool!

Slide187

Putting it all together

EBI Search

Sequence retrieval

Sequence search

Sequences retrieval

Multiple sequence alignment

Analysis

Slide188

Final remarks

Don’t assume a single tool will cater for all your needs

Change the parameters of the tools

Remember where the tool excels and what its limitations are

A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

Crazy input will always give crazy results!

Slide189

Getting Help

Slide190

Getting Help

Database documentation

Frequently Asked Questions

http://www.ebi.ac.uk/help/faq.html

2can Support Portal

http://www.ebi.ac.uk/2can/

EBI Support

http://www.ebi.ac.uk/support/

Hands-on training programme

http://www.ebi.ac.uk/training/handson/

Slide191

Thanks!

Hamish McWilliam

Vicky Schneider

Rodrigo Lopez

EMBL-EBI

SLING

Slide192

Survey!

https://www.surveymonkey.com/s/SequenceSearching