Information Theory of DNA Sequencing

David Tse, Dept. of EECS, U.C. Berkeley. LIDS Student Conference, MIT, Feb. 2, 2012. Research supported by NSF Center for Science of Information.

Presentation Transcript

Slide1

Information Theory of DNA Sequencing

David Tse

Dept. of EECS, U.C. Berkeley

LIDS Student Conference, MIT, Feb. 2, 2012

Research supported by NSF Center for Science of Information.

Guy Bresler, Abolfazl Motahari

Slide2

DNA sequencing

DNA: the blueprint of life

Problem: to obtain the sequence of nucleotides.

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAG

ACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

courtesy: Batzoglou

Slide3

Impetus: Human Genome Project

1990: Start

2001: Draft

2003: Finished

3 billion basepairs

courtesy: Batzoglou

Slide4

Sequencing Gets Cheaper and Faster

Cost of one human genome

HGP: $3 billion

2004: $30,000,000

2008: $100,000

2010: $10,000

2011: $4,000

2012-13: $1,000

???: $300

courtesy: Batzoglou

Slide5

But many genomes to sequence

100 million species (e.g. phylogeny)

7 billion individuals (SNP, personal genomics)

10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)

courtesy: Batzoglou

Slide6

Whole Genome Shotgun Sequencing

Reads are assembled to reconstruct the original DNA sequence.

Slide7

Sequencing Technologies

HGP era: single technology (Sanger)

Current: multiple "next generation" technologies (e.g. Illumina, SOLiD, PacBio, Ion Torrent, etc.)

All provide massively parallel sequencing.

Each technology has different read lengths, noise profiles, etc.

Slide8

Assembly Algorithms

Many proposed algorithms.

Different algorithms tailored to different technologies.

Each algorithm deals with the full complexity of the problem while trying to scale well with the massive amount of data.

Lots of heuristics used in the design.

Slide9

A Basic Question

What is the minimum number of reads needed to reconstruct with a given reliability?

A benchmark for comparing different algorithms.

An algorithm-independent basis for comparing different technologies and designing new ones.

Slide10

Coverage Analysis

Pioneered by Lander-Waterman

What is the minimum number of reads to ensure there is no gap between the reads with a desired probability?

Only provides a lower bound on the number of reads needed.

Can one get a tight lower bound?
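The coverage question can be explored numerically. The sketch below is not the Lander-Waterman calculation itself. It is a small Monte Carlo estimate under stated assumptions: the genome is treated as circular to avoid edge effects, read starts are i.i.d. uniform, and "coverage" means every position lies in at least one read, which is slightly weaker than requiring neighboring reads to overlap. The function names and the parameter values in the example run are illustrative only.

```python
import random


def covers_genome(G, L, starts):
    """Return True if reads of length L beginning at `starts` cover every
    position of a circular genome of length G."""
    covered = [False] * G
    for s in starts:
        for i in range(L):
            covered[(s + i) % G] = True
    return all(covered)


def estimate_coverage_probability(G, L, N, trials=100, seed=0):
    """Monte Carlo estimate of P(all positions covered) when N read starts
    are drawn i.i.d. uniformly from the G positions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        starts = [rng.randrange(G) for _ in range(N)]
        hits += covers_genome(G, L, starts)
    return hits / trials


if __name__ == "__main__":
    # Illustrative values only: a toy genome, far smaller than a real one.
    for N in (50, 100, 200, 400):
        print(N, estimate_coverage_probability(G=1000, L=50, N=N))
```

Running the script prints the estimated coverage probability for several values of N, which gives a feel for how many reads are needed before gaps become unlikely.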

Slide11

Communication and Sequencing: An Analogy

Communication:

Sequencing:

Slide12

Communication: Fundamental Limits

Shannon 48

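Given statistical models for source and channel:

Asymptotically reliable communication at rate R source symbols per channel output symbol is possible if and only if R < C_channel / H_source, where C_channel is the channel capacity and H_source is the entropy rate of the source.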

Slide13

DNA Sequencing: Fundamental Limits?

Define: sequencing rate R = G/N basepairs per read

Question: can one define a sequencing capacity C such that asymptotically reliable reconstruction is possible if and only if R < C?

Slide14

A Simple Model

DNA sequence: i.i.d. with distribution p.

Starting positions of reads are i.i.d. uniform on the DNA sequence.

Read process is noiseless.

Will extend to more complex source model and noisy read process later.

Slide15

The read channel

Capacity depends on

read length: L

DNA length: G

Normalized read length:

E.g. L = 100, G = 3 × 10^9:

read channel

AGCTTATAGGTCCGCATTACCAGGTCC
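The formula for the normalized read length did not survive in this transcript. The sketch below assumes the normalization L / ln G (natural logarithm), which is an assumption on our part and should be checked against the original slides. The sequencing rate R = G/N is taken from the earlier slide. The function names and the read count in the usage line are hypothetical.

```python
import math


def normalized_read_length(L, G):
    """Read length divided by ln(G).

    Assumption: the normalization uses the natural log of the DNA length;
    the transcript does not preserve the exact formula.
    """
    return L / math.log(G)


def sequencing_rate(G, N):
    """Sequencing rate R = G/N, in basepairs per read."""
    return G / N


if __name__ == "__main__":
    # Values from the slide: L = 100, G = 3 x 10^9.
    print(normalized_read_length(100, 3e9))
    # Hypothetical read count, for illustration only.
    print(sequencing_rate(3e9, 1e8))
```

Under this assumed normalization, the slide's example (L = 100, G = 3 × 10^9) gives a normalized read length of roughly 4.6.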

Slide16

Result: Sequencing Capacity

Renyi entropy of order 2
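For reference, the Rényi entropy of order 2 of a distribution p is H2(p) = -log(sum_i p_i^2). The sum inside the log is the probability that two independent symbols drawn from p coincide, so two independent i.i.d. strings of length n are equal with probability exp(-n H2(p)). The sketch below computes both quantities in nats; the function names are ours, not from the talk.

```python
import math


def renyi_entropy_order2(p):
    """Renyi entropy of order 2, H2(p) = -ln(sum_i p_i^2), in nats."""
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must sum to 1")
    return -math.log(sum(q * q for q in p))


def prob_independent_strings_equal(p, n):
    """Probability that two independent i.i.d.(p) strings of length n agree
    in every position: (sum_i p_i^2)^n = exp(-n * H2(p))."""
    return math.exp(-n * renyi_entropy_order2(p))


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.4, 0.1, 0.1, 0.4]
    print(renyi_entropy_order2(uniform))  # equals ln 4 for the uniform case
    print(renyi_entropy_order2(skewed))   # smaller than ln 4
    print(prob_independent_strings_equal(uniform, 20))
```

A less uniform base distribution lowers H2(p), which makes accidental matches between unrelated reads more likely.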

Slide17

Coverage Constraint

[Figure: N reads of length L laid along a DNA sequence of length G.]

Starting positions of reads ~ Poisson(1/R)

Slide18

No-Duplication Constraint

[Figure: segments labeled L.]

The two possibilities have the same set of length L subsequences.
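A small constructed example makes the ambiguity concrete. The sequences below are our own illustration, not taken from the slide: two repeats X and Y of length L - 1 are interleaved, and swapping the segments between them yields a different sequence with exactly the same multiset of length-L subsequences. No algorithm that sees only those subsequences can tell the two apart, while reads one base longer can.

```python
from collections import Counter


def l_spectrum(seq, L):
    """Multiset of all length-L substrings of seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


L = 4
X, Y = "ACG", "TTC"  # two interleaved repeats, each of length L - 1

# Same repeats, but the single bases between X and Y are swapped.
s1 = "G" + X + "A" + Y + "C" + X + "T" + Y + "G"
s2 = "G" + X + "T" + Y + "C" + X + "A" + Y + "G"

if __name__ == "__main__":
    print(s1 != s2)                                  # different sequences
    print(l_spectrum(s1, L) == l_spectrum(s2, L))    # same length-4 reads
    print(l_spectrum(s1, L + 1) == l_spectrum(s2, L + 1))  # length 5 differs
```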

Slide19

Achievability

coverage constraint

no-duplication constraint

achievable?

Slide20

Greedy Algorithm

Input: the set of N reads of length L

1. Set the initial set of contigs as the reads.

2. Find two contigs with largest overlap and merge them into a new contig.

3. Repeat step 2 until only one contig remains or no more merging can be done.

Algorithm progresses in stages: at each stage, reads are merged at one overlap length, starting from the largest.
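The steps above can be sketched directly in code. This is a minimal, quadratic-time illustration for noiseless reads, not an assembler meant for real data. The toy genome is our own choice and was picked so that all of its length-4 substrings are distinct, which means no two reads of length 5 share a misleading overlap of length 4.

```python
import random


def overlap(a, b):
    """Length of the longest proper suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest overlap.

    Stops when one contig remains or no pair overlaps at all.
    Returns the list of remaining contigs.
    """
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no more merging can be done
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy example: every length-5 read of a short genome, in shuffled order.
genome = "ACGTTGCAAGGCTA"
read_length = 5
reads = [genome[i:i + read_length]
         for i in range(len(genome) - read_length + 1)]
random.Random(1).shuffle(reads)

if __name__ == "__main__":
    print(greedy_assemble(reads))
```

With a repeat of length 4 or more in the genome, the same procedure could merge the wrong pair, which is the failure mode the no-duplication constraint rules out.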

Slide21

Greedy algorithm: the beginning

gap

Most reads have large overlap with neighbors

Expected # of errors in stage L-1:

probability two disjoint reads are equal

Very small since no-duplication constraint is satisfied.

Slide22

Greedy algorithm: stage

probability two disjoint reads appear to overlap

Expected # of errors at stage

This may get larger, but no larger than when

Very small since coverage constraint is satisfied.

Slide23

Summary: Two Regimes

coverage-limited regime

duplication-limited regime

Slide24

Relation to Earlier Works

Coverage constraint: Lander-Waterman 88

No-duplication constraint: Arratia et al 96

Arratia et al focused on a model where all length L subsequences are given (sequencing by hybridization).

Our result: the two constraints together are necessary and sufficient for shotgun sequencing.

Slide25

Rest of Talk

Impact of read noise.

Impact of repeats in DNA sequence.

Slide26

Read Noise

Model: discrete memoryless channel defined by transition probabilities

ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT

TATACTTA

Slide27

Modified Greedy algorithm

Do we merge the two reads, X and Y, at a given overlap?

This is a hypothesis testing problem!

We observe two strings: X and Y.

Are they noisy versions of the same DNA subsequence? (merge)

Or from two different locations? (do not merge)

Slide28

Impact on Sequencing Rate

Hypothesis test:

H0: noisy versions of the same DNA subsequence (merge)

H1: from disjoint DNA subsequences (do not merge)

MAP rule: declare H0 if the likelihood ratio of the observed pair (X, Y) under H0 versus H1 exceeds a threshold.

Two types of error:

missed detection (new type of error)

false positive (same as before)
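The merge decision can be sketched as a log-likelihood ratio test. The model below is an assumption chosen for illustration, not the talk's general setting: bases are i.i.d. uniform on {A, C, G, T}, and the read channel is a symmetric substitution channel that flips a base with probability eps to one of the other three bases. The function names and the default threshold of 0 are hypothetical; in the talk the threshold is a quantity to be optimized.

```python
import math

ALPHABET = "ACGT"


def pair_prob_same_source(x, y, eps):
    """P(x, y | H0): x and y are independent noisy copies of one uniformly
    distributed base, each corrupted by a symmetric substitution channel."""
    total = 0.0
    for s in ALPHABET:
        px = 1 - eps if x == s else eps / 3
        py = 1 - eps if y == s else eps / 3
        total += 0.25 * px * py
    return total


def log_likelihood_ratio(X, Y, eps):
    """Sum over positions of log P(x, y | H0) / P(x, y | H1).

    Under H1 the two observed bases are independent and uniform, so
    P(x, y | H1) = 1/16. Requires 0 < eps < 1.
    """
    if len(X) != len(Y):
        raise ValueError("overlapping segments must have equal length")
    return sum(math.log(pair_prob_same_source(x, y, eps) * 16)
               for x, y in zip(X, Y))


def should_merge(X, Y, eps, threshold=0.0):
    """Declare H0 (merge) when the log-likelihood ratio exceeds threshold."""
    return log_likelihood_ratio(X, Y, eps) > threshold


if __name__ == "__main__":
    print(should_merge("ACGTACGT", "ACGTACGT", eps=0.05))  # identical overlap
    print(should_merge("ACGTACGT", "ACGAACGT", eps=0.05))  # one mismatch
    print(should_merge("ACGTACGT", "CATGTGCA", eps=0.05))  # all mismatches
```

Raising the threshold trades missed detections for fewer false positives, which is the trade-off behind the two error types listed above.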

Slide29

Impact on Sequencing Rate

coverage constraint

no-duplication constraint

obtained by optimizing MAP threshold

Slide30

More Complex DNA Statistics

i.i.d. is not a very good model for the DNA sequence.

More generally, we may want to model it as a correlated random process.

For short-scale correlation, H2(p) can be replaced by the Rényi entropy rate of the process.

But for higher mammals, DNA contains long repeats, with repeat length comparable to or longer than the reads.

This is handled by paired-end reads in practice.

Slide31

A Simple Model for Repeats

Model: M repeats of length K placed uniformly into DNA sequence

If repeat length K >> read length L, how to reconstruct sequence?

Use paired-end reads: reads come in pairs with known separation J.

These reads can bridge the repeats.

Slide32

Impact on Sequencing Rate

coverage constraint

no-duplication constraint

coverage of repeats constraint

If J > 2d + K then capacity is the same as without repeats

constant indep. of K

K = repeat length

J = paired-end separation

Slide33

Conclusion

DNA sequencing is an important problem.

Many new technologies and new applications.

An analogy between sequencing and communication is drawn.

A notion of sequencing capacity is formulated.

A principled design framework?