Slide1
Bioinformatics
Lecture 1
Slide2
DNA - the basics
Slide3
Drew Berry – DNA animations
http://www.youtube.com/watch?v=WFCvkkDSfIU&index=4&list=PL9CBBEA5A85DBCDEF
Slide4
Organisation of DNA
DNA is packed into chromosomes
Karyotype: the chromosome set of a species
Chromosomes are dynamic structures
The human karyotype:
23 pairs of chromosomes
46 DNA molecules
Slide5
DNA replication
The ability of DNA to replicate itself is a fundamental driver of life
DNA copying is catalysed by enzymes (DNA polymerases)
The complementary strand is synthesised from a template strand, using deoxynucleotides and a primer
Synthesis is directional (5’->3’)
Figure labels: primer, template DNA strand, reverse complement copy, deoxyribonucleotides (dNTPs), DNA polymerase, 5’ and 3’ strand ends, bases A, C, G, T
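The template/copy relationship on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` and the example input are assumptions, not part of the lecture. The new strand pairs A with T and C with G, and it is reversed so that it reads 5’->3’.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(template: str) -> str:
    """Return the strand complementary to `template`, written 5'->3'."""
    # Complement each base, then reverse so the result reads 5'->3'.
    return template.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("TCAG"))  # CTGA
```

Applying the function twice returns the original sequence, which is a quick sanity check on any implementation.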
Slide6
The polymerase chain reaction
Replication requires a DNA polymerase
Thermostable DNA polymerase (e.g. Taq polymerase)
Efficient DNA amplification
No error correction (Taq polymerase lacks proofreading activity)
Kary Mullis, Nobel prize in chemistry: 1993
Melt DNA (94-98 °C)
Anneal primers (50-65 °C)
Elongation (72 °C)
Exponential replication
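To make "exponential replication" concrete, here is a minimal sketch of the idealised arithmetic. The function name `pcr_copies` and the `efficiency` parameter are assumptions introduced for illustration; real reactions fall short of perfect doubling and plateau in later cycles.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised copy number after `cycles` rounds of melt/anneal/elongate.

    efficiency=1.0 means every template is copied in every cycle
    (perfect doubling); lower values model incomplete copying.
    """
    return initial_copies * (1.0 + efficiency) ** cycles

print(pcr_copies(1, 30))  # 1073741824.0
```

Under perfect doubling a single template molecule yields about a billion copies after 30 cycles, which is why a very small sample is enough as starting material.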
Slide7
DNA Sequencing (Sanger)
The PCR reaction is terminated using randomly incorporated dideoxynucleotides (ddNTPs)
Older methods use radiolabelled phosphate
Newer methods use dye-labelled ddNTPs
Truncated DNA strands are separated on a gel or by capillary electrophoresis
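The read-out logic of chain termination can be sketched as follows. This is a toy model under stated assumptions: both function names are hypothetical, every position is assumed to be represented by at least one terminated fragment, and each fragment is described only by its length and its labelled final base.

```python
def terminated_fragments(new_strand: str):
    """One chain-terminated product per position: (length, final labelled base)."""
    return [(length, new_strand[length - 1])
            for length in range(1, len(new_strand) + 1)]

def read_sequence(fragments):
    """Order fragments by length (as electrophoresis does) and read the labels."""
    return "".join(base for _, base in sorted(fragments))

fragments = terminated_fragments("GATTACA")
fragments.reverse()              # the fragments come out of the reaction unordered
print(read_sequence(fragments))  # GATTACA
```

Sorting by length plays the role of the gel or capillary: the shortest fragment reports the first base, the next shortest the second, and so on.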
Slide8
Next Generation Sequencing
Next generation sequencing refers to methods newer than the Sanger approach
A variety of techniques developed by different companies
DNA is generally immobilized on a solid support
Very large numbers of small reads
Multiple reads of each section of genomic DNA (e.g. 30x)
Assembling the genome becomes a significant computational problem
Some ‘single molecule’ methods do not require PCR (reduces errors)
Cost has reduced substantially: the $1000 genome!
Refs: Metzker, M. L. Sequencing Technologies — the Next Generation. Nat. Rev. Genet. 2009, 11, 31–46.
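The "30x" figure on this slide is a coverage depth: on average, each base of the genome is covered by 30 reads. A rough sketch of the resulting read count follows. The genome size (3.1 billion bases) and the read length (150 bases) are assumptions chosen for illustration, and `reads_needed` is a hypothetical helper name.

```python
import math

def reads_needed(genome_size: int, read_length: int, coverage: float) -> int:
    """Number of reads so that (reads * read_length) / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)

# Assumed values: ~3.1e9 bases in a haploid human genome, 150-base reads.
print(reads_needed(3_100_000_000, 150, 30))  # 620000000
```

Hundreds of millions of short reads have to be placed relative to each other, which is why assembly becomes a significant computational problem.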
Slide9
The Human Genome Project
Funded by the US government
A draft of the human genome was published in February 2001
Project completed in 2003
Cost: US$2.7 billion (in 1991 dollars)
Hierarchical shotgun sequencing (genome is broken down into many smaller fragments)
Automated Sanger-type sequencing
Ref: http://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-to-the-human-828
Slide10
Human genome by function
The human genome contains about 21,000 genes (about 100,000 were expected!)
98% of the human genome is noncoding DNA
Noncoding DNA can code for regulatory RNAs or otherwise regulate transcription
Ref: Häggström, Wikiversity Journal of Medicine 1 (2). DOI: 10.15347/wjm/2014.008. ISSN 20018762
Slide11
The druggable genome – Current drug targets
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.
Slide12
The druggable genome – Human genes
Ref: Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat Rev Drug Discov 2002, 1, 727–730.
Slide13
Human genome resources
Three useful sites providing a huge number of resources such as genome browsers
NCBI: National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/genome/guide/human/
UCSC genome browser
http://genome.ucsc.edu/
Ensembl: European site at the Sanger centre
http://www.ensembl.org
Slide14
Next-gen Sequencing Overview
Ref: http://res.illumina.com/documents/products/illumina_sequencing_introduction.pdf
Slide15
Multiple Genomes
Ref: McVean et al. An Integrated Map of Genetic Variation From 1,092 Human Genomes. Nature 2012, 491, 56-65.
Slide16
Bioinformatics
Sequencing technologies produce enormous amounts of sequence data. What do we want to do with this?
Identify genes
Identify functions of gene products (proteins)
Compare genes between species
Identify relationships (similarities) between species
Slide17
The Genetic Code
In general:
Amino acids that share the same biosynthetic pathway tend to have the same first base in their codons
Amino acids with similar physical properties have similar codons, causing conservative substitutions in the case of mutations or mistranslation
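The "similar codons" observation can be checked directly against the standard genetic code. The sketch below builds the standard codon table in DNA letters; the helper names `translate` and `codons_for` are assumptions for illustration. Aspartate (D) and glutamate (E), two acidic amino acids, serve as the example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon (no stop handling)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def codons_for(amino_acid: str):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]

print(codons_for("D"))  # ['GAT', 'GAC']
print(codons_for("E"))  # ['GAA', 'GAG']
```

All four codons begin with GA, so a single change in the third base either leaves the amino acid unchanged or swaps one acidic residue for the other, which is a conservative substitution.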
Slide18
Genetic mutation
The genetic sequence can be changed by a variety of processes
Small scale:
Damage to DNA (radiation or chemical damage)
Replication errors
Large scale:
Duplication of sections of DNA
Deletion of sections of DNA
Transposition of sections of DNA
Slide19
The rate of genetic mutation
The mutation rate (per year or per generation) differs between species and even between different sections of the genome
Different types of mutations occur with different frequencies
The average mutation rate in humans is estimated to be ~2.5 × 10^-8 mutations per nucleotide site, or 175 mutations per diploid genome per generation
Ref: Nachman, M. W.; Crowell, S. L. Estimate of the Mutation Rate Per Nucleotide in Humans. Genetics, 156, 297 (2000).
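The two figures quoted on this slide are linked by a simple multiplication: the per-site rate times the number of sites in a diploid genome. The sketch below shows that arithmetic. The function name is hypothetical, and the diploid genome size is not stated on the slide but inferred from the two quoted numbers.

```python
def mutations_per_genome(rate_per_site: float, diploid_sites: float) -> float:
    """Expected new mutations per diploid genome per generation."""
    return rate_per_site * diploid_sites

rate_per_site = 2.5e-8           # from the slide
quoted_per_generation = 175      # from the slide

# Diploid genome size implied by the two quoted numbers (an inference).
implied_sites = quoted_per_generation / rate_per_site
print(f"{implied_sites:.1e}")    # 7.0e+09
```

The implied value of roughly 7 billion sites is about twice a haploid human genome of roughly 3.5 billion bases, so the two numbers on the slide are consistent with each other.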
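Slide20
Amino acid substitution matrices
Substitution matrices describe the probability that one amino acid (AA) is converted to another and ‘accepted’
The matrix is a ‘log odds’ matrix: each entry is the logarithm of the ratio between the observed substitution frequency and the frequency expected by chance, e.g. the entry for conversion from Ala to Arg, given on the slide as 1/log(30)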
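A log-odds entry compares how often two residues are observed aligned with how often they would pair by chance. The sketch below is a simplified form of that calculation, not the exact procedure used for any published matrix. The frequencies are invented for illustration, the function name is hypothetical, and the scale factor of 2 (half-bit units) is the convention used for BLOSUM matrices.

```python
import math

def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded score = scale * log2(observed pair frequency / chance frequency)."""
    expected = freq_a * freq_b          # chance of pairing if residues were independent
    return round(scale * math.log2(pair_freq / expected))

# Invented background frequencies for two residue types.
freq_a, freq_b = 0.08, 0.05
print(log_odds_score(0.016, freq_a, freq_b))  # pair seen 4x more often than chance
print(log_odds_score(0.002, freq_a, freq_b))  # pair seen half as often as chance
```

Positive scores mark substitutions that are accepted more often than chance would predict, negative scores mark substitutions that are selected against, and a score of zero means the pair occurs at the chance rate.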
Slide21
PAM and BLOSUM matrices
Scoring matrices are used to:
produce sequence alignments and score similarity between two or more proteins
search a database to find sequences similar to a test sequence
Commonly used families of matrices:
PAM (Accepted Point Mutation) matrices (Dayhoff)
Derived from global alignments of entire proteins
Better for closely related proteins
BLOSUM (BLOcks SUbstitution Matrix) matrices (Steven and Jorja Henikoff)
Derived from local alignments of blocks of sequences
Better for evolutionarily divergent sequences
Slide22
BLAST - Searching genomes
BLAST is a rapid method for searching protein or DNA sequences in large databases
Sequences are divided into groups of k AAs or bases
PGFHJIQMQVVS
PGF, GFH, FHJ, HJI, etc. (k=3)
Common or repeated sequences are discarded
Sections of exact sequence match are searched for
The sequence alignment is expanded from sections that are exact matches
BLAST can miss difficult matches
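The word-splitting and exact-match steps on this slide can be sketched as below. This is a toy version under stated assumptions: the function names are hypothetical, only exact word matches are found, and the filtering of common words and the extension of seeds into alignments are left out. It is not how the NCBI BLAST programs are implemented internally.

```python
from collections import defaultdict

def kmers(sequence: str, k: int = 3):
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_index(subject: str, k: int = 3):
    """Map each word in the subject sequence to the positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(kmers(subject, k)):
        index[word].append(pos)
    return index

def seed_hits(query: str, subject: str, k: int = 3):
    """(query_pos, subject_pos) pairs where a query word matches exactly."""
    index = build_index(subject, k)
    return [(q, s)
            for q, word in enumerate(kmers(query, k))
            for s in index.get(word, [])]

print(kmers("PGFHJIQMQVVS")[:4])          # ['PGF', 'GFH', 'FHJ', 'HJI']
print(seed_hits("GFHJ", "PGFHJIQMQVVS"))  # [(0, 1), (1, 2)]
```

Both hits lie on the same diagonal (subject position minus query position equals 1), which is the kind of evidence a seed-and-extend search uses to decide where to grow an alignment.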
Slide23
http://blast.ncbi.nlm.nih.gov/
Slide24
Slide25
Slide26
Sequence alignment
Protein or DNA sequences can be aligned
Differences between sequences are interpreted as mutations, insertions or deletions
Substitution matrices are used to score the likelihood of a match
Alignment scores are calculated between pairs of sequences
Multiple alignments can be performed
Many alignment programs, e.g. Clustal, T-Coffee
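A pairwise alignment score can be computed by dynamic programming. The sketch below uses the Needleman-Wunsch recurrence for global alignment with a deliberately simple scoring scheme (match, mismatch and linear gap penalty, all assumed values) in place of a substitution matrix. It is an illustration of the idea, not the method used by Clustal or T-Coffee.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                      # a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + substitution,   # align a[i-1] with b[j-1]
                            prev[j] + gap,                # gap in b
                            curr[j - 1] + gap))           # gap in a
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("GATTACA", "GATACA"))   # 4
```

In the second call the best alignment has six matches and one gap (6 - 2 = 4). Replacing the match/mismatch values with lookups in a substitution matrix gives the scoring described on the slide.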
Slide27
Clustal
Slide28
Sequence alignments and protein structural similarity
Sequence alignments are based on protein/DNA sequence similarity and not on structural similarity
High sequence similarity implies (but does not guarantee) structural similarity
High sequence similarity implies (but does not guarantee) similar protein function
Figure: Comparison of RMSD when pairs of similar proteins are superimposed using the sequence alignment (X axis) and the protein 3D structures (Y axis)
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
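RMSD (root-mean-square deviation) is the distance measure plotted on this slide. A minimal sketch of the calculation follows. It assumes the two structures have already been superimposed and that atoms are paired one-to-one in the order given; the optimal superposition step that structural comparison tools perform first is omitted. The function name and coordinates are illustrative.

```python
import math

def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Two toy "structures" of two atoms each; one atom is displaced by 2 units.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 2, 0)]))
```

Which atoms are paired depends on the alignment used, so the same two proteins can give different RMSD values when superimposed by sequence alignment and by structure.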
Slide29
Differences between sequence and structural alignment
Chain A versus chain D from PDB ID 1vr4. The two chains are 100% identical in sequence
A: Alignment by sequence
B: Alignment by structure
C: Overlaid structures
Ref: Kosloff, M.; Kolodny, R. Sequence-Similar, Structure-Dissimilar Protein Pairs in the PDB. Proteins 2008, 71, 891
Slide30
Improving sequence alignments
Adding structural information to sequence alignments can improve their quality
Slide31
Summary
This lecture should provide an overview of:
DNA sequencing and the Polymerase Chain Reaction
Genome sequencing
BLAST searching
Sequence alignments and their limitations