Slide1
CS 4233&5263 Bioinformatics
Next-generation sequencing technology
Slide2
Outline
First generation sequencing
Next generation sequencing (current)
AKA:
Second generation sequencing
Massively parallel sequencing
Ultra high-throughput sequencing
Future generation sequencing
Analysis challenges
Slide3
Sanger sequencing (1st generation)
DNA is fragmented
Cloned to a plasmid vector
Cyclic sequencing reaction
Separation by electrophoresis
Readout with fluorescent tags
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Slide4
Cyclic-array methods (next-generation)
DNA is fragmented
Adaptors ligated to fragments
Several possible protocols yield an array of PCR colonies.
Enzymatic extension with fluorescently tagged nucleotides.
Cyclic readout by imaging the array.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Slide5
Available next-generation sequencing platforms
Illumina/Solexa
ABI SOLiD
Roche 454
Polonator
HeliScope
…
Slide6
Emulsion PCR
Rothberg and Leamon Nat Biotechnol. 2008
Shendure and Ji Nat Biotechnol. 2008
Fragments, with adaptors, are PCR amplified within a water drop in oil.
One primer is attached to the surface of a bead.
Used by 454, Polonator and SOLiD.
Slide7
454 Sequencing
Stats (2009 data):
read lengths 200-300 bp
accuracy problems in homopolymer runs
400,000 reads per run
costs $60 per megabase
Rothberg and Leamon Nat Biotechnol. 2008
Slide8
Bridge PCR
DNA fragments are flanked with adaptors.
A flat surface is coated with two types of primers, corresponding to the adaptors.
Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
Used by Illumina/Solexa.
Slide9
http://www.illumina.com/pages.ilmn?ID=203
Slide10
First Round
All 4 labeled nucleotides
Primers
Polymerase
Slide11
1. Take image of first cycle
2. Remove fluorophore
3. Remove block on 3’ terminus
Slide12
Slide13
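A toy sketch of the imaging cycle outlined on the preceding slides (offer all four labeled nucleotides, image, remove the fluorophore, remove the 3' block). It assumes an error-free template and ignores strand orientation; the function name and the COMPLEMENT table are illustrative and not taken from any sequencing software.

```python
# Watson-Crick pairing used to pick which labeled nucleotide is incorporated.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, cycles):
    """Return the bases called over `cycles` rounds of a toy cyclic readout.

    One reversibly blocked nucleotide is incorporated per cycle, so each
    cycle yields exactly one base call.
    """
    calls = []
    for position in range(min(cycles, len(template))):
        # All four labeled nucleotides are offered; only the complement of
        # the template base pairs and is incorporated by the polymerase.
        incorporated = COMPLEMENT[template[position]]
        # Step 1: image the array; the fluorophore identifies the base.
        calls.append(incorporated)
        # Steps 2 and 3: removing the fluorophore and the 3' block lets the
        # next cycle extend by one more base (modeled by advancing the loop).
    return "".join(calls)


print(sequence_by_synthesis("ACGTTGCA", cycles=4))
```

Because extension stops after one base per cycle, the read length is bounded by the number of cycles run.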
http://seq.molbiol.ru/
Stats (2009 data):
read lengths up to 36 bp
error rates 1-1.5%
several million “spots” per lane (8 lanes)
cost $2 per megabase
Slide14
Conventional sequencing
Sanger sequencing can read up to 1,000 bp per fragment, with per-base 'raw' accuracies as high as 99.999%. In the context of high-throughput shotgun genomic sequencing, it costs on the order of $0.50 per kilobase.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Slide15
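A rough worked comparison of the 2008-2009 cost and throughput figures quoted on the preceding slides. The 3,000 Mb genome size is an assumption added here for illustration (roughly one human genome), and the function names are hypothetical.

```python
# Cost per megabase in USD, from the figures quoted in these slides.
COST_PER_MB = {
    "Sanger": 0.50 * 1000,    # $0.50 per kilobase -> $500 per megabase
    "Roche 454": 60.0,
    "Illumina/Solexa": 2.0,
}

# Assumption (not from the slides): a genome of about 3,000 Mb.
GENOME_MB = 3000


def cost_for_coverage(cost_per_mb, genome_mb=GENOME_MB, coverage=1):
    """Raw sequencing cost in USD to cover the genome `coverage` times."""
    return cost_per_mb * genome_mb * coverage


def run_yield_mb(reads, read_length_bp):
    """Megabases produced by one run with the given read count and length."""
    return reads * read_length_bp / 1_000_000


for platform, cost in COST_PER_MB.items():
    print(f"{platform}: ${cost_for_coverage(cost):,.0f} for 1x coverage")

# 454: 400,000 reads per run at 200-300 bp each.
print(run_yield_mb(400_000, 200), "to", run_yield_mb(400_000, 300), "Mb per run")
```

Real projects need many-fold coverage, so these 1x numbers are lower bounds on raw sequencing cost.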
Sequencing cost (2014)
http://www.genome.gov/sequencingcosts/
Slide16
Sequence qualities
In most cases, the quality is poorest toward the ends, with a region of high quality in the middle.
Uses of sequence qualities
‘Trimming’ of reads
Removal of low quality ends
Consensus calling in sequence assembly
Confidence metric for variant discovery
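A minimal sketch of the trimming use listed above: bases are removed from both ends of a read until a base meeting a quality threshold is reached. The threshold of 20 and the function name are assumptions chosen for illustration; real trimmers typically use sliding windows or running sums rather than this single-base rule.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases with quality below `min_q` from both ends of a read.

    `seq` is the base string and `quals` a list of integer quality scores
    of the same length. Returns the trimmed (sequence, qualities) pair.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    start, end = 0, len(seq)
    # Advance past low-quality bases at the 5' end.
    while start < end and quals[start] < min_q:
        start += 1
    # Retreat past low-quality bases at the 3' end.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


read = "ACGTACGT"
qualities = [5, 10, 30, 35, 40, 30, 12, 3]
print(trim_low_quality_ends(read, qualities))
```

Low-quality bases inside the retained region are kept; only the ends are removed.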
In general, newer approaches produce larger amounts of sequence, with reads that are shorter and of lower per-base quality
Next-generation sequencing has an error rate of around 1% or higher
Slide17
Phred Quality Score
q = -10 log10(p), where p = error probability for the base
if p = 0.01 (1% chance of error), then q = 20
if p = 0.00001 (99.999% accuracy), then q = 50
Phred quality values are rounded to the nearest integer
Slide18
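The Phred examples on the preceding slide can be checked with a short sketch using the standard Phred definition; the function names are illustrative.

```python
import math


def phred_from_error(p):
    """Phred quality q = -10 * log10(p), rounded to the nearest integer."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return round(-10 * math.log10(p))


def error_from_phred(q):
    """Inverse relation: error probability p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


print(phred_from_error(0.01))      # 1% chance of error
print(phred_from_error(0.00001))   # 99.999% accuracy
print(error_from_phred(30))
```

Each step of 10 in q corresponds to a tenfold drop in error probability.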
Main Illumina noise factors
Schematic representation of main Illumina noise factors.
(a–d) A DNA cluster comprises identical DNA templates (colored boxes) that are attached to the flow cell. Nascent strands (black boxes) and DNA polymerase (black ovals) are depicted.
(a) In the ideal situation, after several cycles the signal (green arrows) is strong, coherent and corresponds to the interrogated position.
(b) Phasing noise introduces lagging (blue arrows) and leading (red arrow) nascent strands, which transmit a mixture of signals.
(c) Fading is attributed to loss of material that reduces the signal intensity.
(d) Changes in the fluorophore cross-talk cause misinterpretation of the received signal (blue arrows). For simplicity, the noise factors are presented separately from each other.
Erlich et al. Nature Methods 5: 679-682 (2008)
Slide19
Comparison of existing methods
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Slide20
Read length and pairing
Short reads are problematic because short sequences often do not map uniquely to the genome.
Solution #1: Get longer reads.
Solution #2: Get paired reads.
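A toy illustration of both solutions, using exact string matching on a made-up 18-base "genome". The genome string, the reads, and the function names are invented for this sketch; real mappers use indexed, mismatch-tolerant alignment and also account for strand and a range of fragment sizes.

```python
def find_all(genome, read):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions


def place_pair(genome, left, right, distance):
    """Keep placements where `right` starts `distance` bases after `left`."""
    right_hits = set(find_all(genome, right))
    return [(i, i + distance) for i in find_all(genome, left)
            if i + distance in right_hits]


genome = "ACGTACGTTTACGTACGA"  # hypothetical toy genome with a repeat

print(find_all(genome, "ACGT"))                # short read: several placements
print(find_all(genome, "ACGTTT"))              # longer read: one placement
print(place_pair(genome, "ACGT", "TTAC", 4))   # pairing resolves the short read
```

The short read hits three positions, while either extending it or requiring its mate at a known distance leaves a single consistent placement.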
ACTTAAGGCTGACTAGC
TCGTACCGATATGCTG
Slide21
Third generation
Single-molecule sequencing
no DNA amplification is involved
Helicos HeliScope
Pacific Biosciences SMRT
…
Longer reads
Roche/454 > 400 bp
Illumina/Solexa > 100 bp
Pacific Biosciences > 1000 bp and single molecule
Slide22
Applications of next-generation sequencing
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Slide23
Analysis tasks
Base calling
Mapping to a reference genome
De novo or assisted genome assembly
Slide24
References
Next-generation DNA sequencing, Jay Shendure and Hanlee Ji, Nat Biotechnol. (2008) 26:1135–1145
Next-Generation DNA Sequencing Methods, Elaine R. Mardis, Annu. Rev. Genomics Hum. Genet. (2008) 9:387–402