Slide1
Last lecture summary
Slide2
New generation sequencing (NGS)
The completion of the human genome was just the start of the modern DNA sequencing era: “high-throughput next generation sequencing” (NGS).
New approaches reduce time and cost.
Holy Grail of sequencing: a complete human genome for under $1000.
1st generation – Sanger dideoxy method
2nd generation – sequencing by synthesis (pyrosequencing)
3rd generation – single molecule sequencing
Slide3
Sequence alignment
What is sequence alignment
Three flavors of sequence alignment
Point mutations, indels
Slide4
Homology
'Central dogma of bioinformatics'
Sequences diverge
Conserved residues
Sequences are homologous, orthologous, paralogous
The variation between sequences – changes that occurred during evolution in the form of substitutions (mutations) and/or indels.
Slide5
New stuff
Slide6
Identity matrix
Scoring systems I
DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized.
Counting the number of matches gives us a score (3 in this case).
A higher score means a better alignment.
This procedure can be formalized using a substitution matrix.
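A minimal sketch of this formalization, in Python. The function names `identity_matrix` and `identity_score` and the use of `-` as the gap symbol are illustrative assumptions, not part of the lecture. The identity matrix is written as a lookup that returns 1 for identical residues and 0 otherwise, and the score is the sum over aligned columns. Gap columns contribute nothing here because no gap penalty has been introduced at this point.

```python
def identity_matrix(alphabet="ATCG"):
    """Build an identity substitution matrix: 1 for a match, 0 otherwise."""
    return {(a, b): int(a == b) for a in alphabet for b in alphabet}


def identity_score(row1, row2, matrix):
    """Sum substitution scores over aligned columns; gap columns score 0."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue  # no gap penalty at this stage
        score += matrix[(a, b)]
    return score


matrix = identity_matrix()
# The example alignment from this slide: three identical pairs (A, G, T).
print(identity_score("ATTG---T", "A--GACAT", matrix))  # 3
```

Swapping in a different lookup table is all it takes to move from the identity matrix to a more realistic substitution matrix.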
ATTG---T
A--GACAT

    A  T  C  G
A   1
T   0  1
C   0  0  1
G   0  0  0  1
Slide7
Scoring systems II
Identity matrix: adequate for nucleic acids, not sufficient for proteins.
Amino acids are not exchanged with the same probability, as might be assumed theoretically.
For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.
[Figure: D, E, W]
Slide8
Scoring systems II
Why is that?
Triplet-based genetic code: GAT (D) → GAA (E) needs one nucleotide change, GAT (D) → TGG (W) needs three.
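The codon argument can be checked by counting the positions at which two codons differ. The sketch below is only an illustration; the helper name `codon_distance` is an assumption, and it compares just the codons quoted on this slide rather than all synonymous codons of each amino acid.

```python
def codon_distance(codon_a, codon_b):
    """Count the nucleotide positions at which two codons differ."""
    if len(codon_a) != 3 or len(codon_b) != 3:
        raise ValueError("codons must be three nucleotides long")
    return sum(x != y for x, y in zip(codon_a, codon_b))


print(codon_distance("GAT", "GAA"))  # D -> E: 1
print(codon_distance("GAT", "TGG"))  # D -> W: 3
```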
Both D and E have similar properties, but D and W differ considerably. D is hydrophilic, W is hydrophobic; a D → W mutation can greatly alter the 3D structure and consequently the function.
Slide9
Genetic code
http://www.doctortee.com/dsu/tiftickjian/bio100/gene-expression.html
Slide10
Gaps or no gaps
Slide11
Scoring DNA sequence alignment (1)
Match score: +1
Mismatch score: +0
Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (-1)
Score = +11
Slide12
Length penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

We can achieve this by penalizing more for a new gap than for extending an existing gap.
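A minimal sketch of the two penalty schemes, applied to the two candidate alignments above. The function name `score_alignment` and its parameter names are illustrative assumptions; the values -2 (new gap) and -1 (per gap position) are the origination/length penalties used on the following slide. For simplicity, a gap in one row that is directly followed by a gap in the other row is treated as a single run.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap_open=0, gap_extend=-1):
    """Score a pairwise alignment given as two equal-length gapped strings.

    Every gap column costs gap_extend; the first column of each gap run
    additionally costs gap_open (gap_open=0 gives a plain linear penalty).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open  # a new gap starts here
            score += gap_extend
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
one_gap = "ACGTCTGAT-------ATAGTCTATCT"
many_gaps = "AC-T-TGA--CG-CGT-TA-TCTATCT"

# Linear penalty: both alignments have 20 matches and 7 gap positions.
print(score_alignment(reference, one_gap))    # 13
print(score_alignment(reference, many_gaps))  # 13

# Penalizing each new gap separates them: 1 gap run versus 6 gap runs.
print(score_alignment(reference, one_gap, gap_open=-2))    # 11
print(score_alignment(reference, many_gaps, gap_open=-2))  # 1
```

With a purely per-position penalty the two alignments are indistinguishable; only the extra cost for opening a gap favors the single long gap.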
Slide13
Scoring DNA sequence alignment (2)
Match/mismatch score: +1/+0
Origination/length penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (-2)
Length: 7 × (-1)
Score = +7
Slide14
Substitution matrices
Substitution (score) matrices show scores for amino acid substitutions. A higher score means a higher probability of mutation.
Conservative substitutions conserve the physical and chemical properties of the amino acids and limit structural/functional disruption.
Substitution matrices should reflect:
Physicochemical properties of amino acids.
Different frequencies of individual amino acids occurring in proteins.
Interchangeability of the genetic code.
Slide15
PAM matrices I
How to assign scores? Let’s get nature (evolution) involved!
If you choose a set of proteins with very similar sequences, you can do the alignment manually.
Also, if the sequences in your set are similar, then there is a high probability that amino acid differences are due to a single mutation.
From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived.
This is exactly the approach taken by Margaret Dayhoff in 1978 to construct PAM (Accepted Point Mutation) matrices.
Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.
Slide16
PAM matrices II
Alignments of 71 groups of very similar (at least 85% identity) protein sequences. 1572 substitutions were found.
These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection).
Probabilities that any one amino acid would mutate into any other were calculated.
If I know the probabilities of individual amino acids, what is the probability for the given sequence? Product.
But to calculate the score, we would like to add terms rather than multiply them. How to achieve this? Logarithm.
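A small numeric illustration of why the logarithm is used. The per-position probabilities below are made-up values, not taken from any PAM matrix, and the choice of base 10 is arbitrary. The logarithm of a product equals the sum of the logarithms, so per-position log values can simply be added.

```python
import math

# Hypothetical per-position probabilities for a short sequence
# (illustrative values only, not real PAM data).
probabilities = [0.9, 0.05, 0.8, 0.6]

# Probability of the whole sequence: product of the independent terms.
product = math.prod(probabilities)

# Score: sum of the logarithms of the same terms.
log_sum = sum(math.log10(p) for p in probabilities)

print(product)
print(log_sum)
# The two views agree: log10 of the product equals the sum of the log10 terms.
print(math.isclose(math.log10(product), log_sum))  # True
```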
Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990, 183:333-51. PMID: 2314281.
Slide17
PAM matrices III
Dayhoff’s definition of an accepted mutation was thus based on empirically observed amino acid substitutions.
The unit used is a PAM. Two sequences are 1 PAM apart if they have 99% identical residues.
The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids.
The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time:
in Drosophila, 1 PAM corresponds to ~2.62 MYA
in human, 1 PAM corresponds to ~4.58 MYA
Slide18
PAM1 matrix
[Figure: PAM1 matrix; numbers are multiplied by 10 000]
Slide19
Higher PAM matrices
What to do if I want to get probabilities over a much longer evolutionary time?
Dayhoff proposed a model of evolution that is a Markov process.
A Markov process is a case of a linear dynamical system.
Slide20
Linear dynamical system I
A new species of frog has been introduced into an area where it has too few natural predators. In an attempt to restore the ecological balance, a team of scientists is considering introducing a species of bird which feeds on this frog. Experimental data suggest that the populations of frogs and birds from one year to the next can be modeled by linear relationships. Specifically, it has been found that if the quantities F_k and B_k represent the populations of the frogs and birds in the k-th year, then
[Equations: linear relations giving F_(k+1) and B_(k+1) in terms of F_k and B_k]
The question is this: in the long run, will the introduction of the birds reduce or eliminate the frog population growth?
Slide21
Linear dynamical system II
So this system evolves in time according to x(k+1) = A x(k).
Such a system is called a discrete linear dynamical system; the matrix A is called the transition matrix.
If we need to know the state of the system at time k = 50, we have to compute x(50) = A^50 x(0).
And the same is true for Dayhoff’s model of evolution.
If we need to obtain probability matrices for a higher percentage of accepted mutations (i.e. covering a longer evolutionary time), we take matrix powers.
Let’s say we want PAM120: 120 mutations fixed on average per 100 residues. We compute PAM1^120, i.e. PAM1 raised to the 120th power.
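The matrix-power step can be sketched in plain Python as follows. Everything concrete here is an assumption made for illustration: the 3-state matrix `toy_pam1` uses invented numbers (it is not Dayhoff's 20 × 20 PAM1), it is written with rows summing to 1, and the helper names `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy 3-state "PAM1-like" matrix (hypothetical numbers, NOT Dayhoff's data):
# each row holds the probabilities that one residue type stays the same
# or changes to another in one step, so every row sums to 1.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam120 = mat_pow(toy_pam1, 120)
for row in toy_pam120:
    print([round(p, 3) for p in row], "row sum:", round(sum(row), 6))
```

Each row of the powered matrix is still a probability distribution, while the diagonal entries (the chance of seeing the same residue) shrink as the number of steps grows.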
Slide22
Higher PAM matrices
Biologically, the PAM120 matrix means that in 100 amino acids there have been 120 substitutions, while in PAM250 there have been 2.5 amino acid mutations at each site.
This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine.
These are called silent substitutions.
Slide23
PAM 120
[Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic. Source: Zvelebil, Baum, Understanding bioinformatics.]
Positive score – frequency of substitutions is greater than would have occurred by random chance.
Zero score – frequency is equal to that expected by chance.
Negative score – frequency is less than would have occurred by random chance.
Slide24
PAM matrices assumptions
Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement).
Only PAM1 was “measured”; all others are extrapolations (i.e. predictions based on some model).
Each amino acid position is equally mutable.
Mutations are assumed to be independent of surrounding residues.
Forces responsible for sequence evolution over short times are the same as those over longer times.
PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins).
New generation of Dayhoff-type matrices, e.g. PET91
Slide25
How to calculate score?
[Figure: calculating an alignment score using a substitution matrix. Source: Selzer, Applied bioinformatics.]
Slide26
Protein vs. DNA sequences
Given the choice of aligning DNA or protein, it is often more informative to compare protein sequences.
There are several reasons for this:
Many changes in DNA do not change the amino acid that is specified.
Many amino acids share related biophysical properties. Though these amino acids are not identical, they can be more easily substituted for each other. These relationships can be accounted for using scoring systems.
When is it appropriate to compare nucleic acid sequences?
Confirming the identity of a DNA sequence in a database search, searching for polymorphisms, confirming the identity of a cloned cDNA.
Slide27
Similarity vs. identity
Similarity refers to the percentage of aligned residues that can be more readily substituted for each other:
they have similar physicochemical characteristics, and
selective pressure results in some mutations being accepted and others being eliminated.
S = [(Ls × 2)/(La + Lb)] × 100
Ls: number of aligned residues with similar characteristics
La, Lb: total lengths of each sequence
Slide28
Homology vs. similarity
Two sequences are homologous when they descended from a common ancestor sequence.
Similarity can be quantified: “two sequences share 40% similarity”.
But NOT “two sequences share 40% homology”. Just “two sequences are homologous”.
Homology is a qualitative statement: a conclusion about a common ancestral relationship drawn from sequence similarity comparison.
Slide29
Gaps
How will I score this alignment?
The gaps can’t be inserted freely.
Indels are relatively slow evolutionary processes.
And alignments with large gaps do not make biological sense.
Each gap is penalized by a gap penalty.
The gap penalty is an adjustable parameter.
Let’s use a gap penalty of -11.

V   D   S   -   C   Y
V   E   S   L   C   Y
4   2   4  -11  9   7

S = 4 + 2 + 4 - 11 + 9 + 7 = 15
Slide30
Gap penalty
Affine gap penalty: different for opening and extending a gap, constant for extending.
The gap penalty is high: fewer gaps will be inserted.
If you’re searching for sequences that are a strict match for your query sequence, the gap penalty should be set high.
This will retrieve regions with very closely related sequences.
The gap penalty is low: more and larger gaps will be inserted.
If you are searching for similarity between distantly related sequences, the gap penalty should be set low.
Slide31
(A) High gap penalty. Gaps have been inserted only at the beginning and end. Percentage identity = 10%
(B) Low gap penalty. More gaps. Percentage identity = 18%
Zvelebil, Baum, Understanding bioinformatics.