Last lecture summary


sherrill-nordquist
Uploaded On 2016-04-09



Presentation Transcript

Slide1

Last lecture summary

Slide2

Next generation sequencing (NGS)

The completion of the human genome was just the start of the modern DNA sequencing era: “high-throughput next generation sequencing” (NGS).

New approaches reduce time and cost.

Holy Grail of sequencing: a complete human genome for under $1000.

1st generation – Sanger dideoxy method

2nd generation – sequencing by synthesis (pyrosequencing)

3rd generation – single molecule sequencing

Slide3

Sequence alignment

What is sequence alignment

Three flavors of sequence alignment

Point mutations, indels

Slide4

Homology

'Central dogma of bioinformatics'

Sequences diverge

Conserved residues

Sequences are homologous, orthologous, paralogous

The variation between sequences reflects changes that occurred during evolution in the form of substitutions (mutations) and/or indels.

Slide5

New stuff

Slide6

Identity matrix

Scoring systems I

DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized.

Counting the number of matches gives us a score (3 in this case).

A higher score means a better alignment.

This procedure can be formalized using a substitution matrix.
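As a minimal sketch of the match-counting idea, the function below scores two already-aligned sequences with an identity matrix (1 for identical letters, 0 otherwise). The function name `identity_score` is an illustrative choice, and the sketch assumes gaps are written as "-" and never count as matches. The two strings are the aligned sequences from this slide.

```python
def identity_score(seq_a: str, seq_b: str) -> int:
    """Count identical, non-gap pairs in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        # Identity matrix: 1 on the diagonal (same letter), 0 elsewhere.
        # A gap symbol is not a residue, so it never scores.
        if a == b and a != "-":
            score += 1
    return score


print(identity_score("ATTG---T", "A--GACAT"))  # 3
```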

A T T G - - - T
A - - G A C A T

    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1

Slide7

Scoring systems II

Identity matrix: adequate for nucleic acids (NAs), not sufficient for proteins.

Amino acids (AAs) are not exchanged with equal probability, as one might assume in theory.

For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.

[Figure: amino acids D, E and W]

Slide8

Scoring systems II

Why is that?

Triplet-based genetic code: GAT (D) → GAA (E) requires one nucleotide change, whereas GAT (D) → TGG (W) requires three.
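One way to see the difference between the two codon changes is to count the positions at which the codons differ. The helper name `nucleotide_changes` is illustrative. This is only a sketch of the counting, not a model of mutation rates.

```python
def nucleotide_changes(codon_a: str, codon_b: str) -> int:
    """Count the positions at which two codons differ."""
    if len(codon_a) != 3 or len(codon_b) != 3:
        raise ValueError("codons must be three nucleotides long")
    return sum(1 for a, b in zip(codon_a, codon_b) if a != b)


print(nucleotide_changes("GAT", "GAA"))  # D -> E: 1
print(nucleotide_changes("GAT", "TGG"))  # D -> W: 3
```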

D and E have similar properties, but D and W differ considerably. D is hydrophilic and W is hydrophobic, so a D → W mutation can greatly alter the 3D structure and consequently the function.

Slide9

Genetic code

http://www.doctortee.com/dsu/tiftickjian/bio100/gene-expression.html

Slide10

Gaps or no gaps

Slide11

Scoring DNA sequence alignment (1)

Match score: +1

Mismatch score: +0

Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Gaps: 7 × (–1)
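The tally can be reproduced with a short sketch. The function name `linear_gap_score` is illustrative. The default parameters are the match, mismatch and gap values from this slide, and every gapped column is charged the same penalty.

```python
def linear_gap_score(top, bottom, match=1, mismatch=0, gap=-1):
    """Return (matches, mismatches, gaps, total) for an aligned pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # every gapped column costs the same
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


TOP = "ACGTCTGATACGCCGTATAGTCTATCT"
BOTTOM = "----CTGATTCGC---ATCGTCTATCT"
print(linear_gap_score(TOP, BOTTOM))  # (18, 2, 7, 11)
```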

Score = +11

Slide12

Length penalties

We want to find alignments that are evolutionarily likely.

Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

We can favor the more likely alignment by penalizing the opening of a new gap more heavily than the extension of an existing gap.
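A small sketch shows why a separate gap-opening penalty separates the two alignments. The function name `affine_score` is illustrative. The opening penalty of –2 and the per-position length penalty of –1 are example values; setting the opening penalty to 0 gives the plain per-gap scoring. The sketch treats any run of consecutive gapped columns as one gap, which is a simplification.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an aligned pair; each gap run pays gap_open once plus
    gap_extend for every gapped column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open      # a new gap starts here
            score += gap_extend        # length penalty for this column
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


TOP = "ACGTCTGATACGCCGTATAGTCTATCT"
ONE_GAP = "ACGTCTGAT-------ATAGTCTATCT"
MANY_GAPS = "AC-T-TGA--CG-CGT-TA-TCTATCT"

# Without an opening penalty both alignments score the same.
print(affine_score(TOP, ONE_GAP, gap_open=0), affine_score(TOP, MANY_GAPS, gap_open=0))  # 13 13
# With an opening penalty the single long gap is clearly preferred.
print(affine_score(TOP, ONE_GAP), affine_score(TOP, MANY_GAPS))  # 11 1
```

Both alignments contain 20 matches and 7 gapped positions, so only the number of gap openings (1 versus 6) distinguishes them.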

Slide13

Scoring DNA sequence alignment (2)

Match/mismatch score: +1/+0

Origination/length penalty: –2/–1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Origination: 2 × (–2)

Length: 7 × (–1)

Score = +7

Slide14

Substitution matrices

Substitution (score) matrices give scores for amino acid substitutions. A higher score means a higher probability of that substitution.

Conservative substitutions conserve the physical and chemical properties of the amino acids and limit structural/functional disruption.

Substitution matrices should reflect:

Physicochemical properties of amino acids.

Different frequencies of individual amino acids occurring in proteins.

Interchangeability of the genetic code.

Slide15

PAM matrices I

How to assign scores? Let’s get nature (evolution) involved!

If you choose a set of proteins with very similar sequences, you can do the alignment manually.

Also, if the sequences in your set are similar, there is a high probability that each amino acid difference is due to a single mutation.

From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived.

This is exactly the approach taken by Margaret Dayhoff in 1978 to construct PAM (accepted point mutation) matrices.

Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of evolutionary change in proteins". Atlas of Protein Sequence and Structure, volume 5, supplement 3. Nat. Biomed. Res. Found., pp. 345–358.

Slide16

PAM matrices II

Alignments of 71 groups of very similar (at least 85% identity) protein sequences were used. 1572 substitutions were found.

These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection).

The probabilities that any one amino acid would mutate into any other were calculated.

If I know the probabilities for the individual amino acids, what is the probability for a given sequence? The product.

But to calculate a score, we would like to sum, not multiply. How to achieve this? Take the logarithm.
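The step from product to sum can be checked numerically. The three probabilities below are hypothetical values chosen only for illustration; they are not taken from Dayhoff's data. The base-10 logarithm is likewise an arbitrary choice for this sketch.

```python
import math

# Hypothetical per-position probabilities (illustration only).
probs = [0.5, 0.25, 0.1]

# Probability of the whole sequence: multiply the individual probabilities.
product = math.prod(probs)

# Score of the whole sequence: add the logarithms instead.
log_sum = sum(math.log10(p) for p in probs)

# log(a * b * c) == log(a) + log(b) + log(c)
print(math.isclose(math.log10(product), log_sum))  # True
```

Working with logarithms turns a product of many small numbers into a sum, which is why the entries of a scoring matrix can simply be added along an alignment.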

Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990;183:333-51. PMID: 2314281.

Slide17

PAM matrices III

Dayhoff’s definition of an accepted mutation was thus based on empirically observed amino acid substitutions.

The unit used is the PAM.

Two sequences are 1 PAM apart if they have 99% identical residues.

The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids.

The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time:

in Drosophila, 1 PAM corresponds to ~2.62 million years

in human, 1 PAM corresponds to ~4.58 million years

Slide18

PAM1 matrix

Numbers are multiplied by 10 000.

Slide19

Higher PAM matrices

What if we want probabilities over a much longer evolutionary time?

Dayhoff proposed a model of evolution that is a Markov process.

A Markov process of this kind is a case of a linear dynamical system.

Slide20

Linear dynamical system I

A new species of frog has been introduced into an area where it has too few natural predators. In an attempt to restore the ecological balance, a team of scientists is considering introducing a species of bird which feeds on this frog. Experimental data suggests that the population of frogs and birds from one year to the next can be modeled by linear relationships. Specifically, it has been found that if the quantities $F_k$ and $B_k$ represent the populations of the frogs and birds in the $k$-th year, then

[Equations relating $F_{k+1}$ and $B_{k+1}$ to $F_k$ and $B_k$; not preserved in the transcript]

The question is this: in the long run, will the introduction of the birds reduce or eliminate the frog population growth?

 Slide21

Linear dynamical system

II

So this system evolves in time according to $x^{(k+1)} = A x^{(k)}$.

Such a system is called a discrete linear dynamical system, and the matrix $A$ is called the transition matrix.

If we need to know the state of the system at time $k = 50$, we have to compute $x^{(50)} = A^{50} x^{(0)}$.

And the same is true for Dayhoff’s model of evolution.

If we need to obtain probability matrices for a higher percentage of accepted mutations (i.e. covering a longer evolutionary time), we take matrix powers.

Let’s say we want PAM120, i.e. 120 mutations fixed on average per 100 residues. We compute $\mathrm{PAM1}^{120}$.
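The matrix-power step can be sketched in plain Python. The 2×2 matrix `TOY_PAM1` is a hypothetical transition matrix for a two-letter alphabet, invented for illustration; the real PAM1 matrix is 20×20 and its values are not reproduced here. The helper names `mat_mul` and `mat_pow` are also illustrative. Columns sum to 1, matching the column-vector convention used in the update rule for the state vector.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power
    by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical column-stochastic matrix (illustration only):
# entry [i][j] is the probability that letter j becomes letter i in one step.
TOY_PAM1 = [[0.99, 0.02],
            [0.01, 0.98]]

toy_pam120 = mat_pow(TOY_PAM1, 120)
for row in toy_pam120:
    print([round(value, 4) for value in row])
```

Each column of the result still sums to 1, so the powered matrix can again be read as a table of substitution probabilities, now over 120 steps.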

 Slide22

Higher PAM matrices

Biologically, the PAM120 matrix means that in 100 amino acids there have been 120 substitutions, while in PAM250 there have been 2.5 amino acid mutations at each site.

This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine.

These are called silent substitutions.

Slide23

PAM 120

[Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic. Source: Zvelebil, Baum, Understanding bioinformatics.]

Positive score – the frequency of substitutions is greater than would have occurred by random chance.

Zero score – the frequency is equal to that expected by chance.

Negative score – the frequency is less than would have occurred by random chance.

Slide24

PAM matrices assumptions

Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement).

Only PAM1 was “measured”; all others are extrapolations (i.e. predictions based on a model).

Each amino acid position is equally mutable.

Mutations are assumed to be independent of surrounding residues.

The forces responsible for sequence evolution over short times are the same as those over longer times.

PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins).

Newer Dayhoff-type matrices exist, e.g. PET91.

Slide25

How to calculate score?

[Figure: calculating an alignment score with a substitution matrix. Source: Selzer, Applied bioinformatics.]

Slide26

Protein vs. DNA sequences

Given the choice of aligning DNA or protein, it is often more informative to compare protein sequences.

There are several reasons for this:

Many changes in DNA do not change the amino acid that is specified.

Many amino acids share related biophysical properties. Though these amino acids are not identical, they can be more easily substituted for each other. These relationships can be accounted for using scoring systems.

When is it appropriate to compare nucleic acid sequences?

Confirming the identity of a DNA sequence in a database search, searching for polymorphisms, confirming the identity of a cloned cDNA.

Slide27

Similarity vs. identity

Similarity refers to the percentage of aligned residues that can be more readily substituted for each other:

they have similar physicochemical characteristics, and

selective pressure results in some mutations being accepted and others being eliminated.

S = [(Ls × 2)/(La + Lb)] × 100

Ls: number of aligned residues with similar characteristics

La, Lb: total lengths of each sequence

Slide28

Homology vs. similarity

Two sequences are homologous when they descended from a common ancestor sequence.

Similarity can be quantified: “two sequences share 40% similarity”.

But NOT “two sequences share 40% homology”. Just “two sequences are homologous”.

Homology is a qualitative statement.

It is a conclusion about a common ancestral relationship drawn from sequence similarity comparison.

Slide29

Gaps

How will I score this alignment?

The gaps can’t be inserted freely.

Indels are relatively slow evolutionary processes.

And alignments with large gaps do not make biological sense.

Each gap is penalized with a gap penalty.

The gap penalty is an adjustable parameter.

Let’s use a gap penalty equal to -11.

V  D  S   -  C  Y
V  E  S   L  C  Y
4  2  4 -11  9  7
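The per-column scoring can be sketched as a lookup plus a gap penalty. The dictionary `PAIR_SCORES` holds only the five pair scores shown on this slide, so it is a fragment for illustration and not a full substitution matrix. The names `PAIR_SCORES`, `GAP_PENALTY` and `column_scores` are illustrative.

```python
GAP_PENALTY = -11

# Only the pair scores that appear on the slide (not a complete matrix).
PAIR_SCORES = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}


def column_scores(top, bottom):
    """Return the score of every aligned column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(GAP_PENALTY)
        else:
            # Substitution scores are symmetric, so try both orders.
            key = (a, b) if (a, b) in PAIR_SCORES else (b, a)
            scores.append(PAIR_SCORES[key])
    return scores


scores = column_scores("VDS-CY", "VESLCY")
print(scores)       # [4, 2, 4, -11, 9, 7]
print(sum(scores))  # 15
```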

S = 4 + 2 + 4 – 11 + 9 + 7 = 15

Slide30

Gap penalty

Affine gap penalty: different penalties for opening and extending a gap; the extension penalty is constant.

The gap penalty is high: fewer gaps will be inserted.

If you’re searching for sequences that are a strict match for your query sequence, the gap penalty should be set high.

This will retrieve regions with very closely related sequences.

The gap penalty is low: more and larger gaps will be inserted.

If you are searching for similarity between distantly related sequences, the gap penalty should be set low.

Slide31

(A) High gap penalty. Gaps have been inserted only at the beginning and end. Percentage identity = 10%.

(B) Low gap penalty. More gaps. Percentage identity = 18%.

Zvelebil, Baum, Understanding bioinformatics.