Slide1
Lesson: Sequence processing
Goals:
Introduce DNA Assembly and Alignment
Practice rebuilding full sequences from reads
Slide2
Sequencing by Synthesis Review
Modified PCR “builds” sequence over multiple cycles
Each strand of DNA is amplified into a cluster of identical DNA before sequencing
Slide3
Sequencing by Synthesis Review
Multiple clusters are sequenced at once
Clusters can be:
From different samples OR from the same sample
Short regions OR long regions that have been broken into shorter pieces
Unique tags (indices) identify the source of each cluster
The sequence from each cluster is referred to as a “read”
Slide4
Before analysis can begin:
Sequence information needs to be stored
FASTA files store sequence information in a text format
Long regions that were broken up for sequencing need to be rebuilt
Assembly rebuilds long regions using overlapping sequences
Alignment rebuilds long regions by matching reads to a reference
“References” are the results from the previous times a genome or region was sequenced. This can also be called the “consensus” sequence since it is the agreed-upon complete version of the sequence.
Slide5
Storing Sequencing Information
FASTA files
Used for nucleotide (DNA, RNA) or peptide (protein) sequences.
Contains a header row, marked by “>”, with sample information and then a new row with sequence information.
One FASTA file can contain multiple sequences.
Can be opened with any text editor.
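The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name `parse_fasta` and the sample headers are made up for illustration, and it assumes every header line starts with ">" and that a sequence may be wrapped across several lines.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines are joined until the next header.
            records[header] += line
    return records


# Hypothetical two-record file; the headers are invented for this example.
example = """>sample_1 hypothetical header
ATCCGCATTGAC
CTAGCGCA
>sample_2
TGACCTAGCGCA
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

Because the file is plain text, the same records could equally be inspected in a text editor.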
Slide6
Rebuilding Long Sequences: 1
Assembly
Sequencing works best with short regions, so long regions of DNA are randomly fragmented before sequencing
Overlaps in the regions are used to reconstruct the full sequence
Slide7
Assembly Details
DNA is amplified before fragmentation. Lots of copies being randomly fragmented means a lot of overlap.
The more short fragments overlap with one another, the more certain we can be that the long region has been correctly assembled.
Read 1: ATCCGCATTGAC
Read 2: TGACCTAGCGCA
Read 3: GCAATACGTGAC
Read 2: TGACCTAGCGCA
OR
Read 4: CATTGACCTAG
?
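One way to see why more overlap gives more certainty is to measure the overlaps directly. The sketch below is an illustration only, not how any particular assembler works: `overlap_length` and `merge` are hypothetical helper names, the minimum overlap of 3 is an arbitrary choice, and only exact suffix-to-prefix matches are considered.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


read_1 = "ATCCGCATTGAC"
read_2 = "TGACCTAGCGCA"
read_3 = "GCAATACGTGAC"
read_4 = "CATTGACCTAG"

print(overlap_length(read_1, read_2))  # Read 1 then Read 2
print(overlap_length(read_3, read_2))  # Read 3 then Read 2
print(overlap_length(read_1, read_4))  # Read 1 then Read 4
print(overlap_length(read_3, read_4))  # Read 3 then Read 4
print(merge(read_1, read_2))
```

Read 2 overlaps both Read 1 and Read 3 by the same four bases (TGAC), so those three reads alone cannot say which one Read 2 follows. Read 4 shares a longer overlap with Read 1 and none with Read 3, which is the kind of extra evidence additional reads provide.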
Slide8
Practice Assembly
Sequence Processing OR Read Assembly Activity
All groups get only the reads
Think about the following:
How many “reads” were necessary to cover the entire “genome”?
How sure are you of the final sequence?
Are there any regions of ambiguity?
What information would you want to help resolve that ambiguity?
Slide9
Rebuilding Long Sequences: 2
Alignment
Long regions are randomly fragmented into shorter regions for sequencing
Short regions are lined up against previous sequencing results to reconstruct the full sequence
Slide10
Alignment Details
Points of variation between the read and reference are noted and stored in a “Variant Call Format” (VCF) file
The more short fragments that include a variation, the more certain we can be that the variation isn’t just a sequencing error.
Reads can vary from a reference in different ways
Changes in a nucleotide
Insertions
Deletions
Reference: ATCCCGGA-TCGTTA
           ||| |||| ||| ||
Read:      ATC-CGGAATCGATA
The | indicates a perfect match
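The three kinds of variation can be picked out of an aligned pair by walking it column by column. The sketch below assumes both strings are already aligned to the same length with "-" marking a gap; the function name `list_differences` and the use of 1-based alignment columns (rather than reference coordinates) are choices made for this example.

```python
def list_differences(ref_aligned, read_aligned):
    """Return (column, kind, reference_base, read_base) for each non-matching column."""
    if len(ref_aligned) != len(read_aligned):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    for column, (ref_base, read_base) in enumerate(zip(ref_aligned, read_aligned), start=1):
        if ref_base == read_base:
            continue  # perfect match, shown as | on the slide
        if read_base == "-":
            kind = "deletion"      # base present in the reference, missing from the read
        elif ref_base == "-":
            kind = "insertion"     # extra base in the read
        else:
            kind = "substitution"  # change in a nucleotide
        differences.append((column, kind, ref_base, read_base))
    return differences


differences = list_differences("ATCCCGGA-TCGTTA", "ATC-CGGAATCGATA")
for entry in differences:
    print(entry)
```

For the pair on this slide, the columns without a | are exactly the ones reported: one deletion, one insertion, and one changed nucleotide.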
Slide11
Storing Variation Information
Variant Call Format (VCF) file
Indicates differences compared to a reference.
Contains header rows, marked by “##”, and a table of variants
Can be opened in text or spreadsheet editors
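Since a VCF file is tab-separated text, its table can be read without special tools. The sketch below is a simplified illustration: the variant row is invented for the example, `read_vcf_table` is a hypothetical helper name, and real files carry more header lines and richer INFO fields than shown here.

```python
def read_vcf_table(text):
    """Skip '##' header rows and return the variant table as a list of dicts."""
    columns = []
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # header rows describing the file
        if line.startswith("#"):
            # Single '#' line names the table columns.
            columns = line.lstrip("#").split("\t")
            continue
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


# Made-up content: one substitution (T to A) at position 13 of "chr1".
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t13\t.\tT\tA\t.\t.\t.",
])

variants = read_vcf_table(example_vcf)
for variant in variants:
    print(variant["CHROM"], variant["POS"], variant["REF"], "->", variant["ALT"])
```

The same tab-separated layout is why a spreadsheet editor can open the table directly.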
Slide12
Practice Aligning
Sequence Processing OR Read Assembly Activity
All groups get reads and a reference copy of the original text
For more practice with alignment: Aligning Short Texts Activity
Think about the following:
How are you deciding on the “best” alignment?
What benefit is there to having multiple “reads” for each text?
Multiple Alignment: When more than two sequences are being aligned
Slide13
Evaluating Alignments
Goal: maximize overlap between sequences
Scoring
Way of quantifying overlap so different alignments can be compared
Different scoring systems exist, but a simple one would be
Matches: +1
Mismatches: -1
Gaps: -2
To use this system:
Score = (number of matches) – (number of mismatches) – 2*(number of gaps)
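The simple scoring system above translates directly into code. This sketch assumes the two sequences are already aligned to equal length with "-" for gaps, and that any column containing a gap counts as one gap; `score_alignment` is a name chosen for this example.

```python
def score_alignment(ref_aligned, read_aligned, match=1, mismatch=-1, gap=-2):
    """Score = matches - mismatches - 2 * gaps, using the slide's default values."""
    if len(ref_aligned) != len(read_aligned):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for ref_base, read_base in zip(ref_aligned, read_aligned):
        if ref_base == "-" or read_base == "-":
            score += gap
        elif ref_base == read_base:
            score += match
        else:
            score += mismatch
    return score


# The aligned pair from the Alignment Details slide:
# 12 matches, 1 mismatch, 2 gaps.
example_score = score_alignment("ATCCCGGA-TCGTTA", "ATC-CGGAATCGATA")
print(example_score)
```

Keeping the three values as parameters makes it easy to try a different scoring system and see whether the preferred alignment changes.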
Slide14
Comparing Alignments
Score = (number of matches) – (number of mismatches) – 2*(number of gaps)
Alignment 1
Reference: GTCGAATGAAACGATTAA
            |||| | | |||||||
Read:       TCGATTTA-ACGATTA
Alignment 2
Reference: GTCGAATGAAACGATTAA
            |||| | ||     |
Read:       TCGATTTAACGATTA
Alignment 3
Reference: GTCGAATGAAACGATTAA
                ||  ||||||||
Read:        TCGATTTAACGATTA
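Comparing alignments by score can be automated by sliding the read along the reference and scoring each position. The sketch below handles only gap-free placements and assumes that reference bases left uncovered at either end are not penalized; both are simplifications made for this example, and the function names are hypothetical.

```python
REFERENCE = "GTCGAATGAAACGATTAA"
READ = "TCGATTTAACGATTA"


def ungapped_score(reference, read, offset):
    """Matches minus mismatches for the read placed at `offset` with no gaps."""
    window = reference[offset:offset + len(read)]
    matches = sum(1 for ref_base, read_base in zip(window, read) if ref_base == read_base)
    mismatches = len(read) - matches
    return matches - mismatches


def score_all_offsets(reference, read):
    """Score every gap-free placement of the read that fits inside the reference."""
    last_offset = len(reference) - len(read)
    return {offset: ungapped_score(reference, read, offset) for offset in range(last_offset + 1)}


scores = score_all_offsets(REFERENCE, READ)
best = max(scores, key=scores.get)
for offset, score in scores.items():
    print(offset, score)
print("best gap-free offset:", best)
```

Under these assumptions, starting the read one base in scores 1 and starting it two bases in scores 5. Writing the read with one gap as TCGATTTA-ACGATTA gives 13 matches, 2 mismatches, and 1 gap, for a score of 9, so a well-placed gap can beat every gap-free placement despite its penalty.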
Slide15
Coverage
The number of times each nucleotide is “seen” during sequencing
Higher coverage makes it easier to distinguish errors from true sequence variations
What is being sequenced helps determine how common a variation has to be before it’s considered a “real” variation
Low Coverage vs High Coverage
Low Coverage:
Read 1: ATCCGCATTGAC
Read 2: CGCCTTGACCTAG
Read 3: CCGCCCTGACCTAG
High Coverage:
Read 1: ATCCGCATTGAC
Read 2: CGCCTTGACCTAG
Read 3: CCGCCCTGACCTAG
Read 4: TCCGCATTGACCT
Read 5: CGCATTGACCTAGCG
Read 6: CGCATTGACCTA
Read 7: ATCCGCATTGACC
Read 8: TCCGCATTGAC
Read 9: GCATTGACCTACCGC
Read 10: ATTCCGCATTG
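Coverage can be counted by stacking the reads at their positions and tallying the bases in each column. In the sketch below the start offsets are assumptions made for this example (the slide's original spacing is not preserved in the text), Read 10 is left out because it does not line up without a gap, and `pileup` is a hypothetical helper name.

```python
from collections import Counter


def pileup(placed_reads):
    """Map each position to a Counter of the bases seen there."""
    columns = {}
    for offset, read in placed_reads:
        for index, base in enumerate(read):
            columns.setdefault(offset + index, Counter())[base] += 1
    return columns


# (assumed start offset, read) pairs; offsets are guesses, not from the slide.
low_coverage = [
    (0, "ATCCGCATTGAC"),     # Read 1
    (3, "CGCCTTGACCTAG"),    # Read 2
    (2, "CCGCCCTGACCTAG"),   # Read 3
]
high_coverage = low_coverage + [
    (1, "TCCGCATTGACCT"),    # Read 4
    (3, "CGCATTGACCTAGCG"),  # Read 5
    (3, "CGCATTGACCTA"),     # Read 6
    (0, "ATCCGCATTGACC"),    # Read 7
    (1, "TCCGCATTGAC"),      # Read 8
    (4, "GCATTGACCTACCGC"),  # Read 9
]

position = 6  # 0-based column where Reads 2 and 3 disagree with Read 1
low_counts = pileup(low_coverage)[position]
high_counts = pileup(high_coverage)[position]
print("low coverage:", dict(low_counts))
print("high coverage:", dict(high_counts))
```

With only three reads, the C at this position outnumbers the A two to one. With nine reads placed this way, A is seen seven times and C twice, which makes the C look much more like a sequencing error than a true variation.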
Slide16
Types of Sequencing Analysis
De Novo Sequencing
Used the first time a gene or genome is ever sequenced
Uses assembly to stitch short regions into a longer whole
Resequencing
Used subsequent times a genome is sequenced
Uses alignment to identify short sequences using a reference
Slide17
Compare methods
Sequence Processing OR Read Assembly Activity
Use a different text, provide half the groups a “Reference” sheet
Think about the following:
How long are the “reads”?
How long is the “genome”?
How easy was this task with vs without a “reference” text?
How fast was this task with vs without a “reference” text?
How long are sequencing reads?
How long are genomes?
How easy/fast would using real sequencing data be?
Slide18
Role of computers in analysis
Computers can:
Automate tasks
Work faster than humans
Process long sequences just as easily as short sequences
Bioinformatics: use of computers for analyzing complex biological data.
Lots of bioinformatics tools exist for you to use in analyzing your sequence