Lesson: Sequence processing

ella . @ella
Uploaded On 2022-06-07

Presentation Transcript

Slide1

Lesson: Sequence processing

Goals:

Introduce DNA Assembly and Alignment

Practice rebuilding full sequences from reads

Slide2

Sequencing by Synthesis Review

Modified PCR “builds” sequence over multiple cycles

Each strand of DNA is amplified into a cluster of identical DNA before sequencing

Slide3

Sequencing by Synthesis Review

Multiple clusters are sequenced at once

Clusters can be:

From different samples OR from the same sample

Short regions OR long regions that have been broken into shorter pieces

Unique tags (indices) identify the source of each cluster

The sequence from each cluster is referred to as a “read”

Slide4

Before analysis can begin:

Sequence information needs to be stored

FASTA files store sequence information in a text format

Long regions that were broken up for sequencing need to be rebuilt

Assembly rebuilds long regions using overlapping sequences

Alignment rebuilds long regions by matching reads to a reference

“References” are the results from the previous times a genome or region was sequenced. This can also be called the “consensus” sequence since it is the agreed-upon complete version of the sequence.

Slide5

Storing Sequencing Information

FASTA files

Used for nucleotide (DNA, RNA) or peptide (protein) sequences.

Contains a header row, marked by “>”, with sample information and then a new row with sequence information.

One FASTA file can contain multiple sequences.

Can be opened with any text editor
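The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name `parse_fasta` and the example headers are made up for illustration, and it assumes that a sequence may be wrapped over several rows, which is common in real FASTA files.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank rows
        if line.startswith(">"):
            # A new header row closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence rows are joined, so wrapped sequences are handled too.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example file with two sequences; the second one is wrapped.
example = """>sample1 made-up header text
ATCCGCATTGAC
>sample2 another made-up header
TGACCTAG
CGCA
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence), sequence)
```

Each record keeps the header text (without the “>”) next to its sequence, which is all the later assembly and alignment steps need.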

Slide6

Rebuilding Long Sequences: 1

Assembly

Sequencing works best with short regions, so long regions of DNA are randomly fragmented before sequencing

Overlaps in the regions are used to reconstruct the full sequence

Slide7

Assembly Details

DNA is amplified before fragmentation. Lots of copies being randomly fragmented means a lot of overlap.

The more short fragments that overlap with one another, the more certain we can be that the long region has been correctly assembled.

Read 1: ATCCGCATTGAC

Read 2: TGACCTAGCGCA

Read 3: GCAATACGTGAC

Read 2: TGACCTAGCGCA

OR

Read 4: CATTGACCTAG

?
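The overlap idea can be tried on the four reads from this slide. The sketch below is a toy illustration, not how real assemblers work: the helper names `overlap` and `merge` are made up, and it assumes error-free reads that all come from the same strand.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right):
    """Join two reads, writing their shared overlap only once."""
    return left + right[overlap(left, right):]


READ_1 = "ATCCGCATTGAC"
READ_2 = "TGACCTAGCGCA"
READ_3 = "GCAATACGTGAC"
READ_4 = "CATTGACCTAG"

# Read 2 overlaps Read 1 and Read 3 equally well, so its placement is ambiguous.
print(overlap(READ_1, READ_2), overlap(READ_3, READ_2))

# Read 4 overlaps Read 1 much better than Read 3, and also runs into Read 2.
print(overlap(READ_1, READ_4), overlap(READ_3, READ_4), overlap(READ_4, READ_2))

assembled = merge(merge(READ_1, READ_4), READ_2)
print(assembled)
```

With only Reads 1 to 3, the four-base overlap cannot say whether Read 2 follows Read 1 or Read 3. Read 4 spans the junction, which is why extra overlapping reads increase confidence in the assembly.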

Slide8

Practice Assembly

Sequence Processing OR Read Assembly Activity

All groups get only the reads

Think about the following:

How many “reads” were necessary to cover the entire “genome”?

How sure are you of the final sequence?

Are there any regions of ambiguity?

What information would you want to help resolve that ambiguity?

Slide9

Rebuilding Long Sequences: 2

Alignment

Long regions are randomly fragmented into shorter regions for sequencing

Short regions are lined up against previous sequencing results to reconstruct the full sequence

Slide10

Alignment Details

Points of variation between the read and reference are noted and stored in a “Variant Call Format” (VCF) file

The more short fragments that include a variation, the more certain we can be that the variation isn’t just a sequencing error.

Reads can vary from a reference in different ways:

Changes in a nucleotide

Insertions

Deletions

Reference: ATCCCGGA-TCGTTA
           ||| |||| ||| ||
Read:      ATC-CGGAATCGATA

The | indicates a perfect match
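The three kinds of variation can be picked out of the aligned pair above column by column. This is a small sketch under two assumptions: both strings are already aligned to the same length, and “-” marks a gap. The function name `list_variations` is made up for illustration, and the column numbers count from 0 along the alignment, not along the reference.

```python
def list_variations(reference, read):
    """List (column, kind, reference_base, read_base) for every non-matching column."""
    if len(reference) != len(read):
        raise ValueError("aligned sequences must have the same length")
    variations = []
    for column, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == read_base:
            continue  # perfect match, drawn as "|" on the slide
        if ref_base == "-":
            kind = "insertion"     # the read has a base the reference lacks
        elif read_base == "-":
            kind = "deletion"      # the read is missing a reference base
        else:
            kind = "substitution"  # change in a nucleotide
        variations.append((column, kind, ref_base, read_base))
    return variations


variations = list_variations("ATCCCGGA-TCGTTA", "ATC-CGGAATCGATA")
for variation in variations:
    print(variation)
```

For this pair the sketch reports one deletion, one insertion, and one changed nucleotide, matching the three blank positions in the row of | marks.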

Slide11

Storing Variation Information

Variant Call Format (VCF) file

Indicates differences compared to a reference.

Contains header rows, marked by “##”, and a table of variants

Can be opened in text or spreadsheet editors
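Because a VCF is plain tab-separated text, its header rows and variant table can be separated with basic string handling. The sketch below is illustrative only: the function name `read_vcf`, the region name, and the single variant row are made up, and real files usually carry many more header rows and richer INFO fields.

```python
def read_vcf(text):
    """Split VCF text into header rows, column names, and variant rows."""
    meta, columns, variants = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            meta.append(line[2:])            # header rows
        elif line.startswith("#"):
            columns = line[1:].split("\t")   # column names of the variant table
        else:
            variants.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, variants


# Made-up example: one header row, the column row, and one variant.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["region1", "12", ".", "T", "A", ".", ".", "."]),
])

meta, columns, variants = read_vcf(example)
print(meta)
print(variants[0]["POS"], variants[0]["REF"], variants[0]["ALT"])
```

Each variant row becomes a dictionary keyed by column name, so the position, reference base, and alternate base can be looked up directly.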

Slide12

Practice Aligning

Sequence Processing OR Read Assembly Activity

All groups get reads and a reference copy of the original text

For more practice with alignment: Aligning Short Texts Activity

Think about the following:

How are you deciding on the “best” alignment?

What benefit is there to having multiple “reads” for each text?

Multiple Alignment: When more than two sequences are being aligned

Slide13

Evaluating Alignments

Goal: maximize overlap between sequences

Scoring

Way of quantifying overlap so different alignments can be compared

Different scoring systems exist, but a simple one would be:

Matches: +1

Mismatches: -1

Gaps: -2

To use this system:

Score = (number of matches) – (number of mismatches) – 2*(number of gaps)
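The scoring formula can be applied mechanically once two sequences are written as an aligned pair. The sketch below uses the aligned pair from the Alignment Details slide as input. The function names are made up for illustration, and it assumes “-” marks a gap and that both strings have the same length.

```python
def count_columns(reference, read):
    """Count (matches, mismatches, gaps) over two aligned, equal-length strings."""
    if len(reference) != len(read):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for ref_base, read_base in zip(reference, read):
        if ref_base == "-" or read_base == "-":
            gaps += 1
        elif ref_base == read_base:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def alignment_score(reference, read):
    """Score = matches - mismatches - 2 * gaps (the simple system on this slide)."""
    matches, mismatches, gaps = count_columns(reference, read)
    return matches - mismatches - 2 * gaps


counts = count_columns("ATCCCGGA-TCGTTA", "ATC-CGGAATCGATA")
score = alignment_score("ATCCCGGA-TCGTTA", "ATC-CGGAATCGATA")
print(counts, score)  # 12 matches, 1 mismatch, 2 gaps: 12 - 1 - 2*2 = 7
```

Other scoring systems only change the three weights, so the counting step stays the same.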

Slide14

Comparing Alignments

Score = (number of matches) – (number of mismatches) – 2*(number of gaps)

Alignment 1

Reference: GTCGAATGAAACGATTAA
            |||| | | |||||||
Read:       TCGATTTA-ACGATTA

Alignment 2

Reference: GTCGAATGAAACGATTAA
            |||| | ||    |
Read:       TCGATTTAACGATTA

Alignment 3

Reference: GTCGAATGAAACGATTAA
                ||  ||||||||
Read:        TCGATTTAACGATTA
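The three alignments can be compared by scoring each placement of the read against the reference. The sketch below rests on two assumptions that the slide does not state: reference bases beyond the ends of the read are not penalized, and Alignments 2 and 3 are the ungapped read started one and two bases into the reference (this placement is inferred from the pattern of | marks). The function name `score` is made up for illustration.

```python
REFERENCE = "GTCGAATGAAACGATTAA"
READ = "TCGATTTAACGATTA"              # the read without gaps
GAPPED_READ = "TCGATTTA-ACGATTA"      # the read as written in Alignment 1


def score(reference_window, read):
    """Score = matches - mismatches - 2 * gaps over two equal-length strings."""
    total = 0
    for ref_base, read_base in zip(reference_window, read):
        if ref_base == "-" or read_base == "-":
            total -= 2
        elif ref_base == read_base:
            total += 1
        else:
            total -= 1
    return total


# Slide the ungapped read along the reference and score every start position.
ungapped_scores = {
    offset: score(REFERENCE[offset:offset + len(READ)], READ)
    for offset in range(len(REFERENCE) - len(READ) + 1)
}

# Alignment 1: the gapped read starts one base into the reference.
gapped_score = score(REFERENCE[1:1 + len(GAPPED_READ)], GAPPED_READ)

print(ungapped_scores)
print(gapped_score)
```

Under these assumptions the gapped placement scores highest: a single gap costs 2 points but lets the rest of the read line up with the reference.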

Slide15

Coverage

The number of times each nucleotide is “seen” during sequencing

Higher coverage makes it easier to distinguish errors from true sequence variations

What is being sequenced helps determine how common a variation has to be before it’s considered a “real” variation

Low Coverage vs High Coverage

Low Coverage

Read 1: ATCCGCATTGAC

Read 2: CGCCTTGACCTAG

Read 3: CCGCCCTGACCTAG

High Coverage

Read 1: ATCCGCATTGAC

Read 2: CGCCTTGACCTAG

Read 3: CCGCCCTGACCTAG

Read 4: TCCGCATTGACCT

Read 5: CGCATTGACCTAGCG

Read 6: CGCATTGACCTA

Read 7: ATCCGCATTGACC

Read 8: TCCGCATTGAC

Read 9: GCATTGACCTACCGC

Read 10: ATTCCGCATTG
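Coverage can be counted by stacking reads at their start positions and tallying the bases seen at each position. The sketch below uses four of the reads from this slide. The start positions are assumptions worked out by lining the reads up by eye (the slide does not give them), and the function name `pileup` is made up for illustration.

```python
from collections import Counter

# (assumed start position, read) pairs; the start positions are NOT from the slide.
PLACED_READS = [
    (0, "ATCCGCATTGAC"),     # Read 1
    (3, "CGCCTTGACCTAG"),    # Read 2
    (1, "TCCGCATTGACCT"),    # Read 4
    (3, "CGCATTGACCTAGCG"),  # Read 5
]


def pileup(placed_reads):
    """Return one Counter of observed bases per position."""
    length = max(start + len(read) for start, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for start, read in placed_reads:
        for index, base in enumerate(read):
            columns[start + index][base] += 1
    return columns


columns = pileup(PLACED_READS)
depth = [sum(column.values()) for column in columns]               # coverage per position
consensus = "".join(column.most_common(1)[0][0] for column in columns)

print(depth)
print(columns[6])   # position 6: three reads show A, one shows C
print(consensus)
```

At position 6 the single C from Read 2 is outvoted by three reads showing A, which illustrates how higher coverage helps separate a likely sequencing error from a true variation.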

Slide16

Types of Sequencing Analysis

De Novo Sequencing

Used the first time a gene or genome is ever sequenced

Uses assembly to stitch short regions into a longer whole

Resequencing

Used subsequent times a genome is sequenced

Uses alignment to identify short sequences using a reference


Slide17

Compare methods

Sequence Processing OR Read Assembly Activity

Use a different text, provide half the groups a “Reference” sheet

Think about the following:

How long are the “reads”?

How long is the “genome”?

How easy was this task with vs without a “reference” text?

How fast was this task with vs without a “reference” text?

How long are sequencing reads?

How long are genomes?

How easy/fast would using real sequencing data be?

Slide18

Role of computers in analysis

Computers can:

Automate tasks

Work faster than humans

Process long sequences just as easily as short sequences

Bioinformatics: use of computers for analyzing complex biological data.

Lots of bioinformatics tools exist for you to use in analyzing your sequence
