
Zipf’s - PowerPoint Presentation

ellena-manuel
Uploaded On 2016-03-17

Presentation Transcript

Slide1

Zipf’s monkeys

Observations from real and random genomes

Slide2

Environmental genomics

When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments

Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from

Slide3

The data

AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG…

[Slides 4 to 13, all titled "The data": the sequence is extended step by step, then drawn as rows of long lines, some of which are broken into fragments of varying length and finally run together in one jumbled block.]

How can we reconstruct the original genomes?

Slide14

Approaches

Jigsaw puzzle

Find common subsequences

Align overlapping regions

Statistics

Compute histograms of oligonucleotides (n-grams)

Match to distributions for known organisms

Use rare polymers to select anchor points (BLAST-like)

Slide15
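
The statistical approach listed under Approaches can be illustrated with a small Python sketch. It counts overlapping n-grams in a read and converts the counts to relative frequencies. The function names `ngram_histogram` and `relative_frequencies` are hypothetical and do not come from the presentation; matching against distributions for known organisms is left out.

```python
from collections import Counter

def ngram_histogram(seq: str, n: int) -> Counter:
    """Count every overlapping n-gram (oligonucleotide of length n) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def relative_frequencies(counts: Counter) -> dict:
    """Turn raw counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

# The first fragment shown on the "The data" slide.
read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
hist = ngram_histogram(read, 2)
freqs = relative_frequencies(hist)
print(hist.most_common(3))
```

Comparing `freqs` for a read against the same table computed for a reference genome would be one way to do the matching step, though the slides do not say which comparison was used.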

Compression distance

Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses a subsequence of that genome better than a compressor built for another genome would

(normalized) universal compression distance

\[ \mathrm{UCD}(x, y) = \frac{\max[\, C(xy) - C(x),\; C(yx) - C(y) \,]}{\max[\, C(x),\; C(y) \,]} \]

Slide16
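
As a rough illustration of the UCD formula, the sketch below uses Python's `zlib` as the compressor C. This is an assumption: the slides call for a dictionary-based compressor built for a genome, and zlib (DEFLATE) is only a convenient stand-in. The helper names and the random test sequences are hypothetical.

```python
import random
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))

def ucd(x: bytes, y: bytes) -> float:
    """max[C(xy) - C(x), C(yx) - C(y)] / max[C(x), C(y)]"""
    numerator = max(C(x + y) - C(x), C(y + x) - C(y))
    return numerator / max(C(x), C(y))

def random_dna(length: int, seed: int) -> bytes:
    """Uniformly random nucleotides, reproducible for a given seed."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length)).encode("ascii")

a = random_dna(2000, seed=1)
b = random_dna(2000, seed=2)
d_same = ucd(a, a)   # a sequence compared with itself
d_diff = ucd(a, b)   # two unrelated random sequences
print(d_same, d_diff)
```

A sequence appended to itself costs the compressor very little extra, so `d_same` comes out much smaller than `d_diff`.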

CM clustering

Compression Maximization

Adapt compression into a kind of EM clustering

Partition reads randomly into [say] two groups

For each read, compute compression distance to each group (à la leave-one-out)

Reassign read to closest group

Iterate until some stopping criterion

Apply recursively to each group

Slide17
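
The CM clustering steps can be sketched in Python under several assumptions that are not in the slides. Here `zlib` stands in for the compressor, the distance from a read to a group is taken as the extra compressed bytes the read costs after the group's other reads, and the loop stops when no read changes group or after `max_iter` passes. All names are hypothetical, and the recursive split of each group is omitted.

```python
import random
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))

def distance_to_group(read: bytes, group: list) -> float:
    """Extra bytes needed to compress `read` after the group's reads."""
    if not group:
        return float("inf")
    context = b"".join(group)
    return C(context + read) - C(context)

def cm_cluster(reads, k=2, max_iter=20, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in reads]        # random initial partition
    for _ in range(max_iter):
        changed = False
        for i, read in enumerate(reads):
            dists = []
            for g in range(k):
                # Leave-one-out: the read is excluded from its own group.
                members = [r for j, r in enumerate(reads)
                           if labels[j] == g and j != i]
                dists.append(distance_to_group(read, members))
            best = min(range(k), key=dists.__getitem__)
            if best != labels[i]:
                labels[i] = best                      # reassign to closest group
                changed = True
        if not changed:                               # assumed stopping criterion
            break
    return labels

def random_dna(length: int, seed: int) -> bytes:
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length)).encode("ascii")

# Overlapping reads cut from two synthetic "genomes" (illustrative data only).
genomes = (random_dna(3000, 1), random_dna(3000, 2))
reads = [g[i:i + 300] for g in genomes for i in range(0, 2701, 150)]
labels = cm_cluster(reads)
print(labels)
```

Note that zlib only looks back 32 KB, so this stand-in would stop seeing older reads once a group grows beyond that size.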

Experiment

groupA  groupB
DG2     AF2
NM1     DE2
MR2     AD4
DE3     CA4
AD5     DE5
AF1     DG1
DE1     AD1
AF3     NM3
DG4     AF4
AF5     DG5
CA1     MR1
MR4     AD3
CA3     CS5
DE4     CA2
CA5     MR5
NM4     CS3
CS2     NM2
AD2     DG3
CS4     CS1
MR3     NM5

Slide18

Experiment: result

groupA  groupB
AD1     DE1
AD2     DE2
AD3     DE3
AD4     DE4
AD5     DE5
AF1     DG1
AF2     DG2
AF3     DG3
AF4     DG4
AF5     DG5
CA1     MR1
CA2     MR2
NM1     MR3
NM2     MR4
NM3     MR5
CS1     CA3
CS2     CA4
CS3     CA5
CS4     NM4
CS4     NM5

stop when µCD > 70

Slide19

Reassembly

Can the LZ trie be used to reassemble reads into genomes?

The LZ trie is a regular grammar of the set of reads

A long phrase is an extension of a shorter phrase

The start of one read is the end of another

The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase

Slide20
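
The remark that a long phrase is an extension of a shorter phrase can be seen in a minimal phrase parser. The sketch below assumes the "LZ trie" is an LZ78-style dictionary, which the slides do not state, and the function name `lz78_phrases` is hypothetical.

```python
def lz78_phrases(seq: str) -> list:
    """Split seq into LZ78 phrases: each is a known phrase plus one new symbol."""
    trie = {}                      # nested dicts: symbol -> child node
    phrases = []
    node, current = trie, ""
    for symbol in seq:
        current += symbol
        if symbol in node:
            node = node[symbol]    # still inside a known phrase; keep extending
        else:
            node[symbol] = {}      # new leaf: a new phrase enters the trie
            phrases.append(current)
            node, current = trie, ""
    if current:
        phrases.append(current)    # trailing phrase that was already in the trie
    return phrases

read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
phrases = lz78_phrases(read)
print(phrases)
```

Every phrase of length two or more has its one-symbol-shorter prefix as an earlier phrase, which is the parent-child relation in the trie.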

Along the way…

While setting up the initial experiments, we started to ponder things that might go wrong

Different genomes might have a lot of common subsequences that will conflate the clustering result

SNPs and missing fragments might thwart compression

Compression model might take too long to converge on a useful model (paucity of data)

What is the underlying principle being leveraged?

Slide21

Information theory

A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity

If a sequence is entirely random, it is noise

If a sequence is entirely predictable, it is redundant

Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information)

Compression attempts to minimize redundancy

Slide22
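
One way to put numbers on the two extremes described under Information theory is the empirical per-symbol Shannon entropy. This is a sketch only: it looks at single-symbol frequencies and ignores order, so it is a crude proxy for the randomness and regularity the slide describes. The function name `entropy_bits` is hypothetical.

```python
import math
from collections import Counter

def entropy_bits(seq: str) -> float:
    """Empirical Shannon entropy of the symbol distribution, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

predictable = "AAAAAAAAAAAAAAAA"   # one symbol only: fully redundant
uniform = "ACGTACGTACGTACGT"       # all four nucleotides equally frequent
h_predictable = entropy_bits(predictable)
h_uniform = entropy_bits(uniform)
print(h_predictable, h_uniform)
```

With four nucleotides the value lies between 0 and 2 bits per symbol. Note that `uniform` is perfectly periodic yet still scores 2 bits, which shows why single-symbol statistics alone cannot detect the kind of redundancy a compressor exploits.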

Information theory

Human languages exhibit non-uniform distributions over letters, phonemes, words, etc.

Slide23

Brown Corpus word frequencies

Slide24

DNA primary sequences

Four nucleotide symbols: A, C, G, T

Much of a genome codes for nothing, and the rest is genes

A gene is copied (transcription) off the genome, and the copy is used to build a protein (translation)

Three consecutive nucleotides form a codon, which codes for a specific amino acid

A sequence of amino acids (residues) constitutes a protein

Proteins are where structure definitely exists

Slide25

DNA primary sequences

4^3 = 64 possible codons

20 possible amino acids

Many amino acids have more than one codon

Slide26
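
The codon count can be checked by enumeration. The sketch below lists all three-letter words over A, C, G, T and adds a few entries from the standard genetic code as examples of amino acids with one codon or several. These entries are standard biology rather than content from the slides, and the variable names are illustrative.

```python
from itertools import product

# Every ordered triple over the four nucleotides: 4 * 4 * 4 words.
codons = ["".join(p) for p in product("ACGT", repeat=3)]
print(len(codons))

# A few entries from the standard genetic code (DNA alphabet).
examples = {
    "leucine": {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"},  # six codons
    "methionine": {"ATG"},                                  # one codon (the usual start)
    "tryptophan": {"TGG"},                                  # one codon
    "stop": {"TAA", "TAG", "TGA"},                          # no amino acid
}
for name, group in examples.items():
    print(name, len(group))
```

With 61 non-stop codons shared among 20 amino acids, most amino acids must have more than one codon.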
Slide27
Slide28

Genomic regularities

Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent)

TATA-box in regulatory region (for binding)

GC-rich regions (for stability)

But

Frequency of individual nucleotides or residues is not so interesting (no syntax)

Tertiary structure of proteins is The Thing: the interactions of amino acid residues are paramount

Slide29

Genomic regularities

Do genomes have sequential syntactic structures?

Slide30

Codon frequencies in real DNA

Slide31

4-gram frequencies in real DNA

Slide32

5-gram frequencies in real DNA

Slide33

6-gram frequencies in real DNA

Slide34

6-gram probabilities in real DNA

Slide35

Problems from paucity of data

It takes time for an LZ compression trie to become saturated with characteristic phrases

Experimental data somewhat small, thus interesting sequences may not manifest quickly enough

Prime the trie by prepending some random DNA to the data prior to computing CD

How much? How about a million?

Slide36
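
The priming step can be sketched as follows, assuming that "a million" means one million uniformly random nucleotides, which the slide does not spell out. The function names `random_dna` and `prime` are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def random_dna(length: int, seed: int = 0) -> str:
    """Uniformly random nucleotides, reproducible for a given seed."""
    rng = random.Random(seed)
    return "".join(rng.choices(NUCLEOTIDES, k=length))

def prime(data: str, primer_length: int = 1_000_000, seed: int = 0) -> str:
    """Prepend random DNA so a compressor's dictionary is filled before it sees data."""
    return random_dna(primer_length, seed) + data

read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
primed = prime(read)
print(len(primed))
```

Using the same seed for every read keeps the primer identical across comparisons, so differences in compressed size come from the data rather than from the primer.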

bigram frequency in random DNA

Slide37

codon frequency in random DNA

Slide38

10-gram frequency in random DNA

Slide39

4-gram frequency in random DNA

Slide40

5-gram frequency in random DNA

Slide41

5-gram frequency in random DNA

Slide42

7-gram frequency in random DNA

Slide43

8-gram frequency in random DNA

Slide44

9-gram frequency in random DNA

Slide45

Miller’s monkey

19th century – Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data

1949 – G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon

1957 – G.A. Miller argued that the effect is related to the random placement of spaces, and that a monkey at a typewriter would produce ‘language’ with a Zipfian distribution

1968 – David Howes argued that Miller’s proof is flawed

2004 – Michael Mitzenmacher demonstrated the connection between power-law distributions and log-normal distributions

Slide46

conclusion

Probably nothing!