Lectures on Informatics

Uploaded On 2024-02-09

An Introduction to Computers and Informatics in the Health Sciences. Sequencing Intro: Common File Formats and Visualization Software. Christopher Taylor, Associate Professor, Department of Microbiology

Presentation Transcript

1. Lectures on Informatics: An Introduction to Computers and Informatics in the Health Sciences
Sequencing Intro: Common File Formats and Visualization Software
Christopher Taylor, Associate Professor
Department of Microbiology, Immunology & Parasitology

2. DNA Sequencing
- The process of determining the order of nucleotides (A, G, C, T) within a DNA molecule
- Applications in:
  - Gene sequencing
  - Forensics
  - Personalized medicine
  - Metagenomics
  - Evolutionary biology
  - etc.

3. Applications of Sequencing
- Analysis of sequencing data is just one area within bioinformatics
- There are also many possible applications of sequencing technology
- We will touch on some of the broad areas where sequencing is being applied
- Then we will narrow the focus to delve into one particular application of sequencing

4. de novo Sequencing
- The goal of de novo sequencing is to determine the genetic sequence of an organism whose genome has not yet been sequenced
- The Human Genome Project: > 3 billion bp (1990-2003)
- DNA sequencers only produce short reads
- Each read may be hundreds of bp long
- Genome assembly is required

5. Genome Assembly
- Several copies of the genome are fragmented into pieces (shotgun approach)
- These individual pieces are sequenced, producing the short reads
- Since several copies are involved, each portion of the genome appears in several overlapping reads
- Genome assembly is the process of stitching these pieces back together to infer the original genomic sequence that produced the reads
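
The stitching idea can be illustrated with a toy sketch. It repeatedly merges the two reads with the longest suffix/prefix overlap. This greedy strategy is an assumption chosen for brevity, not the method used by any particular assembler, and the function names, the `min_len` parameter, and the example reads are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by largest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads taken from the made-up sequence ATGCGTACGTTAGC
contigs = greedy_assemble(["GTACGTTA", "ATGCGTAC", "GTTAGC"])
print(contigs)
```

A greedy merge like this can be misled by repeated sequence, since a repeat makes several different read pairs overlap equally well.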

6. de novo Sequencing Requirements
- Longer reads are better and make it easier to determine how the reads fit together
- In the limit, if reads are too short, assembly becomes impossible
- Areas of the genome containing repetitive sequences pose a difficult problem and can lead to misassembly
- Again, long reads help to alleviate this problem
- Sanger sequencing, with its long reads, was well suited to this problem
- However, its low throughput and high cost were not

7. Resequencing Applications
- Once the genome of an organism is known, the sequence of an individual can be compared to the reference
- In this case, we may be looking for genetic variation
- Or determining which parts of the genomic sequence are present under certain conditions
  - RNA sequencing: transcripts mapped back to the genome
  - ChIP-Seq: identify binding sites of DNA-associated proteins
- Resequencing applications can use much shorter reads (depth is typically the greater need)

8. Sequencing Breadth
- In any application of sequencing, depth and breadth are important considerations
- In de novo sequencing, you want your template to cover the entire genome
- Otherwise you will not have sequence information for some areas
- In transcriptomics, the breadth of your sequencing will be restricted to the expressed portions of the genome in question
- In amplicon sequencing, your primers will determine the breadth of coverage

9. Sequencing Depth
- The depth of sequencing is how many reads, on average, cover each base pair of your target
- Presuming uniform coverage and a 3 billion bp target genome, 600 Gbp of sequence split across ten samples would provide a depth of 20x per sample
- This would be for whole human genome resequencing
- The human exome is roughly 33 Mbp
- Whole human exome sequencing would provide 30x coverage with 1 Gbp of sequence
- Insufficient sequencing depth for a given application leads to incomplete results
- Oversequencing (beyond the necessary depth) raises cost
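
The depth figures above follow from dividing the sequenced bases by the target size times the number of samples. Here is a minimal sketch of that arithmetic, assuming uniform coverage as the slide does; the function name and parameters are hypothetical.

```python
def mean_depth(total_bases, target_size, n_samples=1):
    """Average per-sample depth, assuming coverage is uniform."""
    return total_bases / (target_size * n_samples)


# Whole human genome: 600 billion bases over ten 3-billion-bp genomes
genome_depth = mean_depth(600_000_000_000, 3_000_000_000, n_samples=10)

# Human exome: 1 billion bases over a roughly 33-million-bp target
exome_depth = mean_depth(1_000_000_000, 33_000_000)

print(genome_depth)           # 20.0
print(round(exome_depth, 1))  # 30.3
```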

10. The FASTA File Format
- Has become a de facto standard for storing nucleotide or peptide sequences
- Textual format with letters (e.g. A, C, T, G) representing individual bases
- Highly compressible
- Each sequence begins with a single-line descriptor starting with a > symbol
- The "Name" of the sequence follows the > symbol
  - There must be no space between > and "Name"
- Anything following a space is not part of the name
  - It is considered simply commentary or a description of the sequence

11. The FASTA File Format (cont)
- The line starting with >Name is the header line
- This line can be as long as desired; a newline ends it
- The following line begins the nucleotide or peptide sequence
- The sequence can all appear on one long line, or can be broken up by newlines
- The sequence is considered to continue until the end of the file is reached or until the next line that begins with a > character, which indicates a new sequence header
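
The rules above are enough for a small reader. This is a minimal sketch, not a reference parser: the function name is hypothetical, and the position of the line break inside the example sequence is an arbitrary choice made for illustration.

```python
def parse_fasta(lines):
    """Yield (name, description, sequence) for each FASTA record."""
    name, desc, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, desc, "".join(chunks)
            # The name ends at the first space; the rest is description
            name, _, desc = line[1:].partition(" ")
            chunks = []
        elif name is not None and line:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, desc, "".join(chunks)


example = """>Seq1 A very interesting sequence
ACGCTAGGCTTATTGCCCTAGCCAA
AACCGATCCCGTTATTCGGATCGCA
>Seq2 A very boring sequence
AAAAAAAAAAAAAAAA
"""

records = list(parse_fasta(example.splitlines()))
for name, desc, seq in records:
    print(name, "|", desc, "|", len(seq))
```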

12. FASTA Example with Newlines
>Seq1 A very interesting sequence
ACGCTAGGCTTATTGCCCTAGCCAAAACCGATCCCGTTATTCGGATCGCA
>Seq2 A very boring sequence
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA…

13. FASTA Example on Single Lines
>ABRIDINACGGTACGGACCACGCAGTCCCTGATGACTGAG…
>AGRKGRLGAACTCTCCAGTACGTTCAACAGAGACAGATAC…
>ACACAGACGATAGCATCGATCATGCGTCTGACG…
>CTTAGATTACCAGATCGGAGGGTAGCTAGTCAG…

14. IUPAC Codes for DNA
- The dash "-" symbol is used to represent a gap in an aligned sequence file

15. FASTA Example with IUPAC and -
>EXo123
ACGGTACW-GACC-ACGCANTYTGATGRCTGAG…
>EXo124
NAACTCVCCAGTACDTTCAACWWAGACWGATAC…
>EXo125
KCACAGM-CGAAGCATYGATCATMCGTCTGACG…
>EXo126
CATGGATGATTACATAWACTANMWCGTCCCATA…
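
The code table from the IUPAC slide did not survive in this transcript, so the mapping below is supplied from the standard IUPAC nucleotide nomenclature rather than from the slide itself. The sketch counts gaps and ambiguous positions in the first example sequence (with the trailing ellipsis dropped); the function names are hypothetical.

```python
# Standard IUPAC nucleotide codes: symbol -> bases it may stand for
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
GAP = "-"


def expand(symbol):
    """Return the set of bases one IUPAC symbol may represent."""
    return set(IUPAC_DNA[symbol.upper()])


def summarize(seq):
    """Count gap characters and ambiguous (multi-base) symbols."""
    gaps = seq.count(GAP)
    ambiguous = sum(
        1 for ch in seq.upper() if ch != GAP and len(IUPAC_DNA[ch]) > 1
    )
    return {"gaps": gaps, "ambiguous": ambiguous}


print(sorted(expand("W")))                              # ['A', 'T']
print(summarize("ACGGTACW-GACC-ACGCANTYTGATGRCTGAG"))
```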

16. The FASTQ File Format
- This format is widely used to augment the FASTA format with a quality score for each base call
- Typically used for sequencing data coming off a sequencer
- Like FASTA, each record begins with a single header line (starting with @ instead of >), followed by the sequence
- The sequence is followed not by a new header line starting with @ but by a separator line starting with a different symbol, +, which introduces the quality line
- The 4th part of each record is a string of characters, each representing the quality score of the corresponding base in the 2nd part of the record
- There must be exactly as many quality score characters as there are base call characters in the 2nd part of the record

17. FASTQ Example on Single Lines
@M02095:4:0-A7UYU:1:1101:20033:1065
NTTTCCTACTATTACCGCGGCTGCTGGCACGT…
+
#8ACCFFFGGGGGGGGDGGGGGGGGGGGFGGG…
@M02095:4:0-A7UYU:1:1101:20408:1066
NTTAAACGTGCGATTACCGCGGCTGCTGGCAC…
+
#8ACCGGDFGGGGGGGGGGGGGGGCFGFGGGG…
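
The four-part record structure can be read with a few lines of code. This is a minimal sketch that assumes each record occupies exactly four lines, with the sequence and quality each on a single line. The function name is hypothetical, and the example records are shortened to the first eight bases of the reads shown above.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("incomplete FASTQ record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("header line must start with @")
        if not plus.startswith("+"):
            raise ValueError("third line must start with +")
        if len(seq) != len(qual):
            raise ValueError("need one quality character per base")
        yield header[1:], seq, qual


example = """@M02095:4:0-A7UYU:1:1101:20033:1065
NTTTCCTA
+
#8ACCFFF
@M02095:4:0-A7UYU:1:1101:20408:1066
NTTAAACG
+
#8ACCGGD
"""

reads = list(parse_fastq(example.splitlines()))
for name, seq, qual in reads:
    print(name, seq, qual)
```

Checking the record by position rather than by the leading character matters because @ and + are also valid quality characters, so a quality line may itself begin with either one.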

18. Illumina Quality Score Encoding Table

19. Interpreting Illumina Quality Scores
- The formula used to interpret a quality score is $Q = -10 \log_{10}(P_{error})$, where $Q$ is the Q-Score
- We want to know what $P_{error}$ is
- Solving:
  - $Q / (-10) = \log_{10}(P_{error})$
  - $10^{Q / (-10)} = P_{error}$
  - $P_{error} = 10^{-Q/10}$
- For example, the symbol ? represents Q-Score 30: $P_{error} = 10^{-30/10} = 10^{-3} = 1/1000 = 0.001$
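
The conversion can be sketched in code. The ASCII offset of 33 is an assumption, chosen because it agrees with the slide's example that ? represents Q-Score 30; the function names are hypothetical.

```python
def symbol_to_q(symbol, offset=33):
    """Decode one quality character to its Q-Score (offset 33 assumed)."""
    return ord(symbol) - offset


def q_to_perror(q):
    """Error probability for a Q-Score: P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quality_string):
    """Sum of per-base error probabilities over a quality string."""
    return sum(q_to_perror(symbol_to_q(ch)) for ch in quality_string)


q = symbol_to_q("?")
print(q, q_to_perror(q))
print(expected_errors("#8ACCFFF"))
```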

20. Table of Q-Score and Perror

Symbol  Q-Score  Perror     Symbol  Q-Score  Perror     Symbol  Q-Score  Perror
!       0        1.00000    /       14       0.03981    =       28       0.00158
"       1        0.79433    0       15       0.03162    >       29       0.00126
#       2        0.63096    1       16       0.02512    ?       30       0.00100
$       3        0.50119    2       17       0.01995    @       31       0.00079
%       4        0.39811    3       18       0.01585    A       32       0.00063
&       5        0.31623    4       19       0.01259    B       33       0.00050
'       6        0.25119    5       20       0.01000    C       34       0.00040
(       7        0.19953    6       21       0.00794    D       35       0.00032
)       8        0.15849    7       22       0.00631    E       36       0.00025
*       9        0.12589    8       23       0.00501    F       37       0.00020
+       10       0.10000    9       24       0.00398    G       38       0.00016
,       11       0.07943    :       25       0.00316    H       39       0.00013
-       12       0.06310    ;       26       0.00251    I       40       0.00010
.       13       0.05012    <       27       0.00200

21. FASTQ Example with Newlines
@M02095:4:000000000-A7UYU:1:1101:20033:1065 2:N:0:172
NTTTCCTACTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGACTTTCTGGTTGATTACCGTCAAATAAAGGCCAGTTACTACCTCTATCCTTCTTCACCAACAACAGAGCTTTACGATCCGAAACCCTTCTTCACTCACGCGGCGTTGCTCCATCAGACTTGCGTCCATTGTGGAAGATTCCCTACTGCTGCCTCCCGTAGGAGTTTGGGCCGTTTCTCAGTCCCAATGTGGCCGATCAGTCTCTCAACCCGGCTATGCATCATCCCCTTGGTAGGCCGTTACCCTTCCAACTAGCCAA
+
#8ACCFFFGGGGGGGGDGGGGGGGGGGGFGGGGGGGGGGGGGDEFGGGGGGFGGGFFGGCGGGGGGFGGFFFGGGGFGFF9FGGF9FGGGFGGGGGGGGEGEGGGGGGGGGGGGGGGGGGCFFGGGGGCFGGGFGEFFGFEFFGGGGGDCFFFFGGGGGFF7FDGGEEFGGDCCAEG@>8EEEGFFGGGGGFFGFG8FEGFEGGGGGGF?C=CFGE*?@C;99FG6AC<6+=CCCFC*)/*>GC7C<FFE<*/9*9C*22*+2?F4D*85*<5?9*65204><@FFDF)*2:>659**.*8