David Tse, Dept. of EECS, U.C. Berkeley. LIDS Student Conference, MIT, Feb. 2, 2012. Research supported by NSF Center for Science of Information.
Slide1
Information Theory of DNA Sequencing
David Tse
Dept. of EECS, U.C. Berkeley
LIDS Student Conference, MIT
Feb. 2, 2012
Research supported by NSF Center for Science of Information.
Guy Bresler
Abolfazl Motahari
Slide2
DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAG
ACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
courtesy: Batzoglou
Slide3
Impetus: Human Genome Project
1990: Start
2001: Draft
2003: Finished
3 billion basepairs
courtesy: Batzoglou
Slide4
Sequencing Gets Cheaper and Faster
Cost of one human genome
HGP: $3 billion
2004: $30,000,000
2008: $100,000
2010: $10,000
2011: $4,000
2012-13: $1,000
???: $300
courtesy: Batzoglou
Slide5
But many genomes to sequence
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
courtesy: Batzoglou
Slide6
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
Slide7
Sequencing Technologies
HGP era: single technology (Sanger).
Current: multiple “next generation” technologies (e.g. Illumina, SOLiD, PacBio, Ion Torrent).
All provide massively parallel sequencing.
Each technology has different read lengths, noise profiles, etc.
Slide8
Assembly Algorithms
Many proposed algorithms.
Different algorithms tailored to different technologies.
Each algorithm deals with the full complexity of the problem while trying to scale well with the massive amount of data.
Lots of heuristics used in the design.
Slide9
A Basic Question
What is the minimum number of reads needed to reconstruct with a given reliability?
A benchmark for comparing different algorithms.
An algorithm-independent basis for comparing different technologies and designing new ones.
Slide10
Coverage Analysis
Pioneered by Lander-Waterman
What is the minimum number of reads to ensure there is no gap between the reads with a desired prob.?
Only provides a lower bound.
Can one get a tight lower bound?
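The coverage question can be made concrete with a short calculation. The sketch below assumes the usual Poisson approximation from coverage analysis: with N reads of length L on a sequence of length G, the expected number of gaps is roughly N * exp(-N L / G). The function names, the tolerance `eps`, and the use of the expected gap count as a stand-in for the gap probability are assumptions made for illustration, not taken from the slides.

```python
import math

def expected_gaps(n_reads, read_len, genome_len):
    """Poisson approximation to the expected number of gaps between reads."""
    coverage_depth = n_reads * read_len / genome_len
    return n_reads * math.exp(-coverage_depth)

def min_reads_for_coverage(read_len, genome_len, eps):
    """Smallest N >= G/L whose expected gap count is at most eps.

    The search starts at coverage depth 1 (N = G/L) because the
    approximation is decreasing in N only beyond that point.
    """
    lo = math.ceil(genome_len / read_len)
    hi = lo
    # Doubling phase: find some N that meets the target.
    while expected_gaps(hi, read_len, genome_len) > eps:
        hi *= 2
    # Bisection phase: shrink to the smallest such N.
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_gaps(mid, read_len, genome_len) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return lo

n_min = min_reads_for_coverage(read_len=100, genome_len=10**6, eps=0.01)
```

Since the expected number of gaps upper-bounds the probability of at least one gap, this gives a read count that is sufficient for coverage under the approximation; it says nothing yet about whether the reads can be assembled.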
Slide11
Communication and Sequencing: An Analogy
Communication:
Sequencing:
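Slide12
Communication: Fundamental Limits
Shannon 48
Given statistical models for source and channel:
Asymptotically reliable communication at rate R source symbols per channel output symbol is possible if and only if R < C_channel / H_source, where C_channel is the channel capacity and H_source is the source entropy rate.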
Slide13
DNA Sequencing: Fundamental Limits?
Define: sequencing rate R = G/N basepairs per read
Question: can one define a sequencing capacity C such that asymptotically reliable reconstruction is possible if and only if R < C?
Slide14
A Simple Model
DNA sequence: i.i.d. with distribution p.
Starting positions of reads are i.i.d. uniform on the DNA sequence.
Read process is noiseless.
Will extend to more complex source model and noisy read process later.
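The simple model is easy to simulate. The sketch below draws an i.i.d. sequence and then samples noiseless reads with uniform starting positions. Treating the sequence as circular (so that reads starting near the end wrap around) is an assumption added here to keep every read the same length; the function names and the parameter values are likewise illustrative.

```python
import random

def random_dna(length, probs, rng):
    """Draw an i.i.d. sequence over A, C, G, T with distribution probs."""
    return "".join(rng.choices("ACGT", weights=probs, k=length))

def sample_reads(dna, n_reads, read_len, rng):
    """Noiseless reads with i.i.d. uniform starting positions.

    Assumption: the sequence is circular, so a read that starts near
    the end wraps around to the beginning.
    """
    g = len(dna)
    extended = dna + dna[:read_len - 1]
    starts = [rng.randrange(g) for _ in range(n_reads)]
    return [extended[s:s + read_len] for s in starts]

rng = random.Random(0)
dna = random_dna(1000, [0.25, 0.25, 0.25, 0.25], rng)
reads = sample_reads(dna, n_reads=200, read_len=20, rng=rng)
```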
Slide15
The read channel
Capacity depends on read length: L
DNA length: G
Normalized read length:
E.g. L = 100, G = 3 × 10^9:
read channel
AGCTTATAGGTCCGCATTACCAGGTCC
Slide16
Result: Sequencing Capacity
Rényi entropy of order 2
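The Rényi entropy of order 2 has the standard definition H2(p) = -log(sum_i p_i^2), which is simple to compute. The capacity formula itself is not present in the extracted slide text, so the sketch below rests on two labelled assumptions drawn from the published version of this work (Motahari, Bresler, Tse, "Information Theory of DNA Shotgun Sequencing"): the normalized read length is L / ln G, and the capacity equals the normalized read length when it exceeds 2 / H2(p) and is zero otherwise. Treat `sequencing_capacity` as a hypothetical helper illustrating that assumed form, not as a statement of the slide's content.

```python
import math

def renyi_entropy_2(probs):
    """Renyi entropy of order 2 in nats: H2(p) = -ln(sum_i p_i^2)."""
    return -math.log(sum(p * p for p in probs))

def normalized_read_length(read_len, genome_len):
    """Assumed normalization: L / ln(G)."""
    return read_len / math.log(genome_len)

def sequencing_capacity(read_len, genome_len, probs):
    """Assumed form of the result: L_bar if L_bar > 2 / H2(p), else 0."""
    h2 = renyi_entropy_2(probs)
    if h2 == 0.0:
        return 0.0  # a deterministic source has no usable overlaps
    l_bar = normalized_read_length(read_len, genome_len)
    return l_bar if l_bar > 2.0 / h2 else 0.0

uniform = [0.25, 0.25, 0.25, 0.25]
capacity = sequencing_capacity(100, 3e9, uniform)
```

Under these assumptions, reads of length 100 on a genome of 3 × 10^9 bases with uniform base frequencies sit above the threshold, while much shorter reads (for example length 20) fall below it and give zero capacity.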
Slide17
Coverage Constraint
Starting positions of reads ~ Poisson(1/R)
[Figure: N reads laid along the DNA sequence of length G; labels T, L]
Slide18
No-Duplication Constraint
[Figure: two candidate sequences with four segments labelled L]
The two possibilities have the same set of length L subsequences.
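A toy instance shows how two different sequences can share the same length-L subsequences. The construction below is an assumption chosen for illustration: two repeats X and Y of length L - 1 appear interleaved, and swapping the segments between them changes the sequence without changing any length-L window. The specific strings and L = 3 are hypothetical.

```python
from collections import Counter

def length_l_subsequences(seq, l):
    """Multiset of all contiguous length-l substrings of seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

L = 3
X, Y = "AC", "GT"  # two repeats of length L - 1, interleaved as X..Y..X..Y

s1 = "T" + X + "A" + Y + "C" + X + "G" + Y + "A"
s2 = "T" + X + "G" + Y + "C" + X + "A" + Y + "A"  # segments "A" and "G" swapped

same_reads = length_l_subsequences(s1, L) == length_l_subsequences(s2, L)
distinct_sequences = s1 != s2
```

Both flags come out true, so no algorithm given only the length-3 reads could tell `s1` from `s2`.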
Slide19
Achievability
[Figure labels: coverage constraint, no-duplication constraint, achievable?]
Slide20
Greedy Algorithm
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads.
2. Find two contigs with largest overlap and merge them into a new contig.
3. Repeat step 2 until only one contig remains or no more merging can be done.
Algorithm progresses in stages: at stage ℓ, merge reads at overlap ℓ.
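The three steps above can be sketched directly. This is a minimal, quadratic-time illustration for noiseless reads, not an efficient assembler; the function names are hypothetical, and dropping exact duplicate reads before merging is a simplification added here.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # step 1, minus exact duplicates
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no more merging can be done
        merged = contigs[i] + contigs[j][k:]  # step 2
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs

seq = "ACGTTGCATCAGGCTA"
reads = [seq[s:s + 8] for s in (4, 0, 8, 2, 6)]
contigs = greedy_assemble(reads)
```

Because the largest overlap is always taken first, merges happen in order of decreasing overlap, which is the stage structure described on the slide. In this toy input every 3-base window of `seq` is distinct, so no spurious overlap can compete with the true ones.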
Slide21
Greedy algorithm: the beginning
[Figure: reads laid along the sequence, with a gap marked]
Most reads have large overlap with neighbors.
Expected # of errors in stage L-1:
probability two disjoint reads are equal
Very small since no-duplication constraint is satisfied.
Slide22
Greedy algorithm: stage ℓ
probability two disjoint reads appear to overlap
Expected # of errors at stage ℓ
This may get larger, but no larger than when ℓ = 0.
Very small since coverage constraint is satisfied.
Slide23
Summary: Two Regimes
coverage-limited regime
duplication-limited regime
Slide24
Relation to Earlier Works
Coverage constraint: Lander-Waterman 88
No-duplication constraint: Arratia et al 96
Arratia et al focused on a model where all length L subsequences are given (sequencing by hybridization).
Our result: the two constraints together are necessary and sufficient for shotgun sequencing.
Slide25
Rest of Talk
Impact of read noise.
Impact of repeats in DNA sequence.
Slide26
Read Noise
Model: discrete memoryless channel defined by transition probabilities
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT
TATACTTA
Slide27
Modified Greedy algorithm
[Figure: two reads X and Y overlapping]
This is a hypothesis testing problem!
We observe two strings: X and Y.
Are they noisy versions of the same DNA subsequence? (merge)
Or from two different locations? (do not merge)
Do we merge the two reads at overlap ℓ?
Slide28
Impact on Sequencing Rate
Hypothesis test:
H0: noisy versions of the same DNA subsequence (merge)
H1: from disjoint DNA subsequences (do not merge)
MAP rule: declare H0 if the likelihood ratio of H0 to H1 for the observed pair (X, Y) exceeds a threshold.
Two types of error:
missed detection (new type of error)
false positive (same as before)
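The merge decision can be sketched as a log-likelihood ratio test. The code below assumes an i.i.d. base distribution `p`, a memoryless channel `q`, and two aligned strings of equal length taken from the candidate overlap. The symmetric channel with error rate 0.05, the zero threshold, and all function names are hypothetical choices for illustration; the threshold actually used would depend on the prior odds of the two hypotheses.

```python
import math

ALPHABET = "ACGT"

def log_likelihood_ratio(x, y, p, q):
    """log P(x, y | H0) - log P(x, y | H1) for two aligned strings.

    p[s]    : assumed i.i.d. base distribution of the DNA sequence
    q[s][o] : assumed probability of reading o when the true base is s
    Requires a channel with nonzero error probabilities, otherwise a
    mismatch has probability zero under H0.
    """
    # Marginal distribution of a single noisy base.
    marginal = {o: sum(p[s] * q[s][o] for s in ALPHABET) for o in ALPHABET}
    llr = 0.0
    for a, b in zip(x, y):
        same_source = sum(p[s] * q[s][a] * q[s][b] for s in ALPHABET)  # H0
        independent = marginal[a] * marginal[b]                        # H1
        llr += math.log(same_source / independent)
    return llr

def should_merge(x, y, p, q, threshold=0.0):
    """Declare H0 (merge) when the log-likelihood ratio exceeds threshold."""
    return log_likelihood_ratio(x, y, p, q) > threshold

def symmetric_channel(eps):
    """Hypothetical noise model: a base is misread with total probability eps."""
    return {s: {o: (1 - eps if o == s else eps / 3) for o in ALPHABET}
            for s in ALPHABET}

p = {s: 0.25 for s in ALPHABET}
q = symmetric_channel(0.05)
merge_identical = should_merge("ACGTACGT", "ACGTACGT", p, q)
merge_unrelated = should_merge("ACGTACGT", "CATGCATG", p, q)
```

Raising the threshold trades false positives for missed detections, which is the tension between the two error types listed on the slide.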
Slide29
Impact on Sequencing Rate
[Figure labels: coverage constraint, no-duplication constraint]
obtained by optimizing MAP threshold
Slide30
More Complex DNA Statistics
i.i.d. is not a very good model for the DNA sequence.
More generally, we may want to model it as a correlated random process.
For short-scale correlation, H2(p) can be replaced by the Rényi entropy rate of the process.
But for higher mammals, DNA contains long repeats, with repeat lengths comparable to or longer than the reads.
This is handled by paired-end reads in practice.
Slide31
A Simple Model for Repeats
Model: M repeats of length K placed uniformly into DNA sequence
If repeat length K >> read length L, how to reconstruct sequence?
Use paired-end reads: reads come in pairs with known separation J.
These reads can bridge the repeats.
Slide32
Impact on Sequencing Rate
[Figure labels: coverage constraint, no-duplication constraint, coverage of repeats constraint]
If J > 2d + K then capacity is the same as without repeats.
constant independent of K
K = repeat length
J = paired-end separation
Slide33
Conclusion
DNA sequencing is an important problem.
Many new technologies and new applications.
An analogy between sequencing and communication is drawn.
A notion of sequencing capacity is formulated.
A principled design framework?