Slide1
Zipf’s monkeys
Observations from real and random genomes
Slide2
Environmental genomics
When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments
Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from
Slide3
The data
AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG…
[Slides 3 to 13, animation: the sequence grows step by step, is then drawn as long unbroken lines, and the lines progressively break into fragments of varying length that end up mixed together]
How can we reconstruct the original genomes?
Slide14
Approaches
Jigsaw puzzle
Find common subsequences
Align overlapping regions
Statistics
Compute histograms of oligonucleotides (n-grams)
Match to distributions for known organisms
Use rare polymers to select anchor points (BLAST-like)
Slide15
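A minimal sketch of the statistical approach (histograms of oligonucleotides matched against known profiles). The function names and the choice of L1 distance are assumptions made for illustration; the slides do not say which metric was used.

```python
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count every overlapping n-gram (oligonucleotide of length n) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_profile(seq: str, n: int) -> dict:
    """Relative frequency of each n-gram; empty if seq is shorter than n."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values())
    return {gram: k / total for gram, k in counts.items()} if total else {}


def profile_distance(p: dict, q: dict) -> float:
    """L1 distance between two profiles (an illustrative choice of metric)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))
```

A read would then be assigned to whichever known organism's profile lies closest, for example `min(known, key=lambda name: profile_distance(ngram_profile(read, 4), known[name]))`, where `known` is a hypothetical dictionary of precomputed profiles.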
Compression distance
Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses one of its own subsequences better than would the compressor built for another genome
(normalized) universal compression distance
\mathrm{UCD}(x, y) = \frac{\max[\, C(xy) - C(x),\; C(yx) - C(y) \,]}{\max[\, C(x),\; C(y) \,]}
Slide16
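The universal compression distance defined above can be sketched as follows. This assumes `zlib` as a stand-in for the compressor C; the slides describe a dictionary-based compressor built per genome, which is not reproduced here. `random_dna` is a hypothetical helper that only produces demo data.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for the compressor C."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """max[C(xy) - C(x), C(yx) - C(y)] / max[C(x), C(y)]."""
    cx, cy = c(x), c(y)
    return max(c(x + y) - cx, c(y + x) - cy) / max(cx, cy)


def random_dna(n: int, seed: int) -> bytes:
    """Demo data only: n uniformly random bases."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(n))
```

With two unrelated random sequences `a` and `b`, `ucd(a, a)` should come out much smaller than `ucd(a, b)`, because the second copy of `a` is almost free to encode. Note that zlib only looks back 32 KB, so this stand-in is unsuitable for genome-scale inputs.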
CM clustering
Compression Maximization
Adapt compression into a kind of EM clustering
Partition reads randomly into (say) two groups
For each read, compute the compression distance to each group (à la leave-one-out)
Reassign the read to the closest group
Iterate until some stopping criterion is met
Apply recursively to each group
Slide17
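The CM clustering loop above might look roughly like this. It is a sketch under several assumptions: `zlib` stands in for the compressor, the distance is simplified to the extra compressed bytes a read costs when appended to a group, and the stopping criterion is "no read changed group" or an iteration cap. None of these choices is confirmed by the slides, and the recursive step is left out.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for the real compressor."""
    return len(zlib.compress(data, 9))


def added_cost(read: bytes, group: list) -> int:
    """Extra compressed bytes needed to append read to the concatenated group."""
    ref = b"".join(group)
    return c(ref + read) - c(ref)


def cm_cluster(reads: list, k: int = 2, max_iter: int = 10, seed: int = 0) -> list:
    """Return a group label in range(k) for each read."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in reads]  # random initial partition
    for _ in range(max_iter):
        changed = False
        for i, read in enumerate(reads):
            costs = []
            for g in range(k):
                # Leave-one-out: the read never counts as part of its own group.
                members = [r for j, r in enumerate(reads) if labels[j] == g and j != i]
                costs.append(added_cost(read, members))
            best = min(range(k), key=costs.__getitem__)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # assumed stopping criterion
            break
    return labels
```

Reads are reassigned one at a time, so later reads in a pass already see the updated groups; a batch update per pass would be an equally plausible reading of the slide.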
Experiment
groupA  groupB
DG2     AF2
NM1     DE2
MR2     AD4
DE3     CA4
AD5     DE5
AF1     DG1
DE1     AD1
AF3     NM3
DG4     AF4
AF5     DG5
CA1     MR1
MR4     AD3
CA3     CS5
DE4     CA2
CA5     MR5
NM4     CS3
CS2     NM2
AD2     DG3
CS4     CS1
MR3     NM5
Slide18
Experiment: result
groupA  groupB
AD1     DE1
AD2     DE2
AD3     DE3
AD4     DE4
AD5     DE5
AF1     DG1
AF2     DG2
AF3     DG3
AF4     DG4
AF5     DG5
CA1     MR1
CA2     MR2
NM1     MR3
NM2     MR4
NM3     MR5
CS1     CA3
CS2     CA4
CS3     CA5
CS4     NM4
CS4     NM5
stop when µCD > 70
Slide19
Reassembly
Can the LZ trie be used to reassemble reads into genomes?
The LZ trie is a regular grammar of the set of reads
A long phrase is an extension of a shorter phrase
The start of one read is the end of another
The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase
Slide20
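To make the "a long phrase is an extension of a shorter phrase" property of the LZ trie concrete, here is a small sketch of an LZ78-style parse. It assumes the "LZ trie" of the slides is an LZ78-type dictionary, which the slides do not state explicitly; the function name is illustrative.

```python
def lz78_phrases(seq: str):
    """Parse seq LZ78-style; return (phrases in order of creation, trie).

    The trie is a nested dict keyed by symbol. Every new phrase is an
    existing phrase (possibly empty) extended by exactly one symbol.
    A trailing partial phrase at the end of seq is dropped.
    """
    trie = {}
    phrases = []
    node, current = trie, ""
    for ch in seq:
        if ch in node:            # keep walking down a known phrase
            node = node[ch]
            current += ch
        else:                     # extend the known phrase by one symbol
            node[ch] = {}
            phrases.append(current + ch)
            node, current = trie, ""
    return phrases, trie
```

For the input `"ACACAC"` the parse yields the phrases `A`, `C`, `AC`: the third phrase extends the first by one symbol, and the final `AC` of the input matches an existing phrase without creating a new one.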
Along the way ….
While setting up the initial experiments, we started to ponder things that might go wrong
Different genomes might share many subsequences, which would confound the clustering result
SNPs and missing fragments might thwart compression
The compressor might take too long to converge on a useful model (paucity of data)
What is the underlying principle being leveraged?
Slide21
Information theory
A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity
If a sequence is entirely random, it is noise
If a sequence is entirely predictable, it is redundant
Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information)
Compression attempts to minimize redundancy
Slide22
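As a rough illustration of the randomness versus redundancy trade-off, the sketch below computes the order-0 (single-symbol) Shannon entropy of a sequence. The function name is an assumption, and order-0 entropy is only one of several possible measures.

```python
import math
from collections import Counter


def entropy_per_symbol(seq: str) -> float:
    """Order-0 Shannon entropy in bits per symbol, from symbol frequencies only."""
    n = len(seq)
    return -sum((k / n) * math.log2(k / n) for k in Counter(seq).values())
```

A run of a single base scores 0 bits (fully redundant) and a uniform mix of A, C, G, T scores 2 bits. The measure ignores order, though: `"ACGT" * 10` also scores 2 bits even though it is entirely predictable, which is why sequential models such as n-grams or dictionary compressors are needed to see structure.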
Information theory
Human languages exhibit non-uniform distributions over letters, phonemes, words, etc.
Slide23
Brown Corpus word frequencies
Slide24
DNA primary sequences
Four nucleotide symbols: A, C, G, T
Much of a genome codes for nothing, and the rest is genes
A gene is copied (transcription) off the genome, and the copy is used to build a protein (translation)
Three consecutive nucleotides form a codon, which codes for a specific amino acid
A sequence of amino acids (residues) constitutes a protein
Proteins are where structure definitely exists
Slide25
DNA primary sequences
4^3 = 64 possible codons
20 possible amino acids
Many amino acids have more than one codon
Slide26
Slide27
Slide28
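The codon arithmetic above can be checked against the standard genetic code. The sketch builds the 64-entry table from the conventional compact string (bases in TCAG order, `*` marking a stop codon); the variable names are illustrative, and DNA codons (T rather than U) are assumed.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# How many codons map to each amino acid (or to "*", the stop signal).
DEGENERACY = Counter(CODON_TABLE.values())

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
```

The table has 64 entries covering 20 amino acids plus stop; leucine (`L`) has six codons while tryptophan (`W`) and methionine (`M`, the ATG start codon) have one each.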
Genomic regularities
Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent)
TATA-box in regulatory region (for binding)
GC-rich regions (for stability)
But
Frequency of individual nucleotides or residues is not so interesting (no syntax)
Tertiary structure of proteins is The Thing: the interactions of amino acid residues are paramount
Slide29
Genomic regularities
Do genomes have sequential syntactic structures?
Slide30
Codon frequencies in real DNA
Slide31
4-gram frequencies in real DNA
Slide32
5-gram frequencies in real DNA
Slide33
6-gram frequencies in real DNA
Slide34
6-gram probabilities in real DNA
Slide35
Problems from paucity of data
It takes time for an LZ compression trie to become saturated with characteristic phrases
The experimental data set is somewhat small, so interesting sequences may not manifest quickly enough
Prime the trie by prepending some random DNA to the data prior to computing CD
How much? How about a million?
Slide36
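The priming idea above (prepend random DNA before measuring compression) can be sketched as below. This assumes `zlib` as a stand-in compressor and a uniform base distribution for the random DNA; both function names are hypothetical.

```python
import random
import zlib


def random_dna(n: int, seed: int = 0) -> bytes:
    """n bases drawn uniformly at random from A, C, G, T (reproducible via seed)."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(n))


def primed_cost(data: bytes, primer: bytes) -> int:
    """Compressed bytes attributable to data once the compressor has seen primer."""
    return len(zlib.compress(primer + data, 9)) - len(zlib.compress(primer, 9))
```

One caveat with this stand-in: zlib looks back only 32 KB, so most of a million-base primer would be out of reach by the time the data arrives. A compressor with a growing dictionary, as the slides assume, does not have that limit.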
bigram frequency in random DNA
Slide37
codon frequency in random DNA
Slide38
10-gram frequency in random DNA
Slide39
4-gram frequency in random DNA
Slide40
5-gram frequency in random DNA
Slide41
5-gram frequency in random DNA
Slide42
7-gram frequency in random DNA
Slide43
8-gram frequency in random DNA
Slide44
9-gram frequency in random DNA
Slide45
Miller’s monkey
19th century – Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data
1949 – G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon
1957 – G.A. Miller argued that the effect is related to the random placement of spaces, and that a monkey at a typewriter would produce ‘language’ with a Zipfian distribution
1968 – David Howes argued that Miller’s proof is flawed
2004 – Michael Mitzenmacher demonstrated the connection between power-law distributions and log-normal distributions
Slide46
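Miller's monkey argument is easy to simulate. The sketch below assumes a four-letter alphabet and a fixed probability of hitting the space bar; these parameters and the function names are illustrative and are not taken from Miller's paper or from the slides.

```python
import random
from collections import Counter


def monkey_text(n_chars: int, alphabet: str = "ACGT",
                p_space: float = 0.2, seed: int = 0) -> str:
    """Random typing: each keystroke is a space with probability p_space,
    otherwise a letter drawn uniformly from alphabet."""
    rng = random.Random(seed)
    keys = []
    for _ in range(n_chars):
        keys.append(" " if rng.random() < p_space else rng.choice(alphabet))
    return "".join(keys)


def rank_frequency(text: str) -> list:
    """(word, count) pairs sorted from most to least frequent."""
    return Counter(text.split()).most_common()
```

Plotting log(rank) against log(count) for `rank_frequency(monkey_text(200_000))` should give a stepped but roughly straight line, with the single-letter 'words' at the top ranks, which is the Zipf-like behaviour Miller described.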
conclusion
Probably nothing!