Bioinformatics Assignment #1

reagan
Uploaded On 2023-07-22


Presentation Transcript

1. Bioinformatics Assignment #1: Instructions

2. Purpose
- Gene annotation of a prokaryote.
- Annotating a sequence means identifying the gene, the ORF, the reading frame and the probable function of the gene if it is previously unknown.
- We have the full chromosome sequence of a pathogenic spirochete bacterium.
- I’ve divided the genomic sequence into segments.
- Each pair of students will use some bioinformatic tools to determine:
  - Which organism it is
  - What gene it is
  - The ORF
  - The reading frame

3. The organism
- There are many types of spiral bacteria. They live in diverse environments, and some of them are pathogenic.
- They have a double membrane and are motile, moving by means of flagella that are found between the inner and outer membranes.
- One disease caused by a spiral bacterium (spirochete) is syphilis.
- A friend of mine works on the organism we are going to annotate, so I was able to get the genomic sequence from her.
- This strain has a main chromosome about 950,000 base pairs long, and up to 11 plasmids.
- The main chromosome and at least some of the plasmids are linear, which is very unusual in prokaryotes.
- [Image: the bacterium that causes syphilis, magnified 300X. Source: Wikipedia.]

4. The tools you will use
- BLAST: Basic Local Alignment Search Tool
  - This website compares your sequence with reference sequences in the database, identifies regions of similarity and returns a list of positive matches.
  - You will use the gene sequence I’ve provided and paste it into the search window.
  - The program will then look for sequences in the database that are the same as, or similar to, your sequence.
  - It will return possible matches to you with a score that indicates how good each match is.
- Expasy Translate tool
  - This website includes a tool that will allow you to translate your DNA sequence in all possible reading frames.
  - Genes can be transcribed in either direction on a chromosome, so the strand I’ve provided could be the template or the coding strand.
  - You will find the open reading frame in your sequence based on the translation results.
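
To make "all possible reading frames" concrete, here is a minimal Python sketch of a six-frame translation. It is only an illustration of the idea, not the Translate tool's own code. The function names and the frame labels are made up for this example, and it assumes the standard codon assignments, with stop codons shown as "-" and anything unrecognised shown as "X".

```python
from itertools import product

# Standard codon table, built in T, C, A, G order; "-" stands for a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    frames = {}
    for strand, seq in (("5'3'", dna.upper()), ("3'5'", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand} Frame {offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Because either strand could be the coding strand, the correct ORF may turn up in any of the six outputs.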

5. To begin:
- Keep your sequence in a Word file.
- Copy it.
- Go to BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
- Choose nucleotide BLAST: blastn.
- Paste your sequence into the search window.
- Leave all the default settings.
- Click on BLAST (the blue button on the left side; you may have to scroll down to it).
- It may return results quickly or slowly, depending on how busy the server is.

6. Leave the default settings
- Click BLAST.

7. Results:
- Look at the score and the E-value, as well as the percent identity.

8. Click on graphic summary to see this:
- It is colour coded: red marks the best matches, pink marks matches with some mismatches, and other colours show up when the match is worse.
- Each line is a sequence. You can click on the line to see how the sequences line up.
- Notice the gaps in some alignments. These are cDNA sequences and I searched with a genomic sequence.
- Ignoring the 5’ and 3’ ends (because some cDNA sequences are not complete), I can see splice variants in the cDNAs.

9.

10. BLAST alignment scores
- The BLAST scoring system uses matrices to identify matches and give points for them, and also to identify mismatches and gaps and give penalties for those. The sum provides a value of sequence similarity.
- [Figure labels on slide: "+", "-", "Larger, depending on size"]
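
As a rough illustration of points for matches and penalties for mismatches and gaps, the sketch below scores two sequences that have already been lined up. The numbers (match +1, mismatch -2, first gap position -5, each further gap position -2) are hypothetical values chosen for the example and are not claimed to be the BLAST defaults. The function name is also made up.

```python
def alignment_score(aligned_query, aligned_subject,
                    match=1, mismatch=-2, gap_open=-5, gap_extend=-2):
    """Sum the points for two aligned strings of equal length; '-' marks a gap.

    The first position of a gap costs gap_open and every further position
    of the same gap costs gap_extend, so longer gaps are penalised more.
    All four values are illustrative assumptions.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if q == s else mismatch
            in_gap = False
    return score


if __name__ == "__main__":
    print(alignment_score("ACGT", "ACGT"))      # all four positions match
    print(alignment_score("ACGT", "ACCT"))      # one mismatch
    print(alignment_score("AC--GT", "ACTTGT"))  # one gap of length two
```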

11. Some detail about BLAST scores
- Results are arranged with the best ones on top.
- The most important score is the Expect value, or E-value, defined as the number of hits scoring at least this well that you would expect to get by chance when searching the database with a random sequence of the same length as yours.
- E-values for good hits are usually written something like 3e-42, which is the same as 3 x 10^-42, a very small number.
- Bad hits are very common, and they have E-values in a more familiar form: for example, 0.4 or 1.2.
- A really good E-value is less than 1e-180, which is smaller than the program can report, so it is written as 0.0.
- E-values are affected by the length of the query sequence as well as the size of the database, so even perfect matches with short sequences give poor E-values.
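
The effect of query length and database size can be sketched with the simplified relationship E = m x n x 2^(-S), where m is the query length, n is the total database length and S is the bit score. This is only an approximation for illustration: the real program applies length corrections that are left out here, and the bit scores and database size below are hypothetical numbers.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Simplified E-value: search space (m x n) scaled down by 2**bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    db_len = 1e11  # hypothetical total number of bases in the database

    # A short query, even with a perfect match, only reaches a modest bit score.
    print(expect_value(bit_score=40, query_len=20, db_len=db_len))

    # A long query with a strong match gives a tiny E-value.
    print(expect_value(bit_score=400, query_len=1000, db_len=db_len))

    # "3e-42" is just scientific notation for 3 x 10^-42.
    print(math.isclose(float("3e-42"), 3 * 10 ** -42))
```

Doubling the database size doubles the E-value for the same bit score, which is why the same hit can look slightly worse as the database grows.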

12. Some detail about BLAST scores
- In this case we see many hits with good E-values, and the top E-values are all quite similar.
- Percent identity and lengths of query (the sequence you sent) and subject (the sequence it matches):
  - The lengths of the query and subject sequences should be within 20% of each other. The results are most interpretable when the sequences are similar in length.
  - There should be a high degree of sequence identity. For a genomic sequence from the same organism it should be a perfect match.
  - If we have a genomic sequence and the match is a cDNA sequence, the identity will still be 100% but the percent coverage will be lower because of the introns.
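
The three checks above can be sketched in a few lines of Python. The function names are made up for this example. Percent identity is taken as identical positions divided by alignment length, coverage as the share of query positions covered by at least one aligned range, and "within 20%" is interpreted (as an assumption) relative to the longer of the two sequences.

```python
def percent_identity(aligned_query, aligned_subject):
    """Identical, non-gap positions as a percentage of the alignment length."""
    pairs = list(zip(aligned_query, aligned_subject))
    matches = sum(1 for q, s in pairs if q == s and q != "-")
    return 100.0 * matches / len(pairs)


def query_coverage(ranges, query_len):
    """Percentage of the query covered by aligned ranges (1-based, inclusive).

    Overlapping ranges are only counted once.
    """
    covered, last_end = 0, 0
    for start, end in sorted(ranges):
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_len


def lengths_comparable(query_len, subject_len, tolerance=0.20):
    """True if the two lengths differ by no more than 20% of the longer one."""
    return abs(query_len - subject_len) <= tolerance * max(query_len, subject_len)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACCT"))
    # A genomic query whose two exons match a cDNA, with an intron in between:
    print(query_coverage([(1, 100), (201, 300)], query_len=400))
    print(lengths_comparable(1000, 850))
```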

13. Choose the best sequence (usually the top one) and click on the description
- It takes you to the “Alignments” view.
- The top line shows the description, the second line is the sequence ID and additional information, and the third line shows you range 1, the nucleotide range and the link to “GenBank”.
- Click on “GenBank”.

14.

15. Choose the best protein product to work on
- Scroll down, and you will see under “Features” that there is “source,” “gene,” and “CDS”.
- Under “CDS,” you will see “/product” and the name of the protein product.
- “/locus_tag” gives you the gene name.
- Start with a protein that is not “hypothetical” or “putative”.
- If you have more than one confirmed protein product, just pick one.
- Right click on the link beside “/protein_id” to bring up options.
- Open the link in a new tab.

16.

17. Stay on the same page for now
- While your new tab is loading, scroll back up on the same page.
- Click on FASTA.

18. This is your gDNA sequence (properly formatted)
- Once the page loads, copy the whole thing from “>” into your report.
- Highlight your sequence and set it to “courier” font.
- TIP: Use Courier font for DNA sequence. All the letters are the same width, which is useful for alignment and for visualizing the sequence.
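
For reference, a FASTA record is a header line starting with ">" followed by the sequence, usually wrapped over several lines. The sketch below shows one way to split a single record into its header and one continuous sequence. The function name and the example record are invented for illustration and are not real data.

```python
def parse_fasta(text):
    """Split a single FASTA record into (header, sequence).

    The header is the text after '>' and the sequence lines are joined
    into one uppercase string with no line breaks.
    """
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FASTA record")
            header = line[1:]
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no header line starting with '>' found")
    return header, "".join(chunks).upper()


if __name__ == "__main__":
    # Made-up record, for illustration only.
    record = ">example_id hypothetical record\nATGGCC\ntaa\n"
    header, sequence = parse_fasta(record)
    print(header)
    print(sequence, len(sequence))
```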

19. Go to your newly opened tab
- This is where you access the amino acid sequence of the particular protein product you have chosen.
- Again, at the top of the page, click on FASTA.

20. This is your amino acid sequence (properly formatted)
- Once the page loads, copy the whole thing from “>” into your report.
- Highlight your sequence and set it to “courier” font.

21. Translate tool:
- Go to ExPASy Translate: https://web.expasy.org/translate/
- Copy/paste the cDNA/gDNA sequence into the search window.
- Use the defaults and hit translate (green button, bottom right).
- The results come up below the window, so scroll down.
- All six reading frames are shown, and the open reading frames (ORFs) are in red.

22. The program finds all the ATGs in each frame and shows them as a RED and BOLD M. All the stop codons in the frame are shown as dashes. The program shades everything between an M and the next “-” in red so you can see all the ORFs. This one is clearly not the correct reading frame.

23. This is the correct one, as you can tell by comparing it with the published protein sequence from the NCBI site. This is usually the longest ORF, but not always. If you cannot match any of the ORFs with the amino acid sequence you have copied from the NCBI site, repeat the instructions from slide 16 and choose a different protein ID. Then move on to slides 19 and 20 and repeat the process of searching for the AA sequence of an ExPASy ORF (start with the longest) in the NCBI AA sequence.
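
The comparison described here can be sketched as a simple text search. The code below is only an illustration with made-up function names and made-up sequences. It takes translated frames in the same form the Translate tool displays them (one-letter amino acids with "-" for stops), pulls out every stretch from an M up to the next "-", and reports the ORFs that contain, or are contained in, the published protein. The minimum length of 5 is an assumption to stop very short ORFs from matching by chance.

```python
import re


def orfs_in_frame(translated_frame):
    """Return each stretch that starts at an M and runs up to the next '-'."""
    return re.findall(r"M[^-]*", translated_frame)


def matching_orfs(frames, published_protein, min_len=5):
    """List (frame name, ORF) pairs that agree with the published protein.

    An ORF counts as a match if it is found inside the published sequence
    or the published sequence is found inside it. Longest ORFs come first.
    """
    hits = []
    for name, translated in frames.items():
        for orf in orfs_in_frame(translated):
            if len(orf) < min_len:
                continue
            if orf in published_protein or published_protein in orf:
                hits.append((name, orf))
    return sorted(hits, key=lambda hit: len(hit[1]), reverse=True)


if __name__ == "__main__":
    # Made-up translations and protein, for illustration only.
    frames = {
        "5'3' Frame 1": "KLMAGTWY-RR",
        "5'3' Frame 2": "MKKLLL-M",
    }
    print(matching_orfs(frames, published_protein="MAGTWY"))
```

An ORF that is found inside the published sequence but not at its start corresponds to the "starts midway" situation, so a match of that kind is worth checking by eye.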

24. Special notes: Please let your instructor know if:
- you end up matching an ORF (from ExPASy) with the AA sequence of a hypothetical or putative protein (NCBI)
- your ORF does not end with a stop codon, i.e. it ends abruptly at the end of an AA sequence without a “-”
- your protein is disconnected and is a combination of two ORFs
- you notice anything else unusual
The ORF from ExPASy may start midway in the AA sequence obtained from NCBI: you can find the first 5 AA or so (starting with “M”) within the NCBI AA sequence, but not from the start. In this case, go with the ExPASy sequence for locating your start codon (next slide).

25. To identify start and stop codons in the cDNA/gDNA sequence:
- Once you have determined which frame is the appropriate one (forward or reverse; frame 1, 2 or 3), repeat the translation, but don’t use the default settings.
- Instead choose “includes nucleotide sequence”.
- Choose forward if your reading frame was 5’ to 3’ and reverse if it was 3’ to 5’.
- Translate.
- Then copy/paste the entire correct translated reading frame into your document.
- Go back to your cDNA sequence and bold the start and stop codons.
- Note that the amino acid “M” goes underneath the corresponding codon sequence, though the top line of triplets is not highlighted in red.
- Use the nucleotides around the start and stop codons in the translated sequence to be sure you highlight the correct ATG and stop codons.
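
One way to double-check the codons you have bolded is to compute their positions. The sketch below is an illustration only, with a made-up function name and a made-up sequence. It walks along one reading frame of the strand it is given and reports the 1-based position of each ATG that opens an ORF and of the stop codon that closes it. It assumes ATG is the only start codon, as in the steps above, and for a reverse frame you would pass in the reverse complement of your sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_coordinates(dna, frame):
    """Find (start, stop) positions of ORFs in one reading frame.

    frame is 1, 2 or 3. Positions are 1-based and point at the first base
    of the ATG and of the stop codon. stop is None if the sequence ends
    before a stop codon is reached.
    """
    dna = dna.upper()
    orfs = []
    start = None
    for i in range(frame - 1, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start + 1, i + 1))
            start = None
    if start is not None:
        orfs.append((start + 1, None))
    return orfs


if __name__ == "__main__":
    # Made-up sequence: CC ATG GCC TAA ATG AAA, read in frame 3.
    print(orf_coordinates("CCATGGCCTAAATGAAA", frame=3))
```

An ORF reported with no stop position is the "ends abruptly" case listed in the special notes.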

26.

27. Finally
- Find out what disease this pathogen causes.
- Find out something about the gene: what does the gene product do? In your own words.
- Speculate about whether this gene might have anything to do with the disease.
- It probably won’t. The genes have been chosen somewhat at random from the first 100 kb or so of the main chromosome, and not all the genes associated with its pathogenicity are known.
- This is just a thinking exercise for you.

28. What goes into your report
- The sequence you searched with and the BLAST results below it.
- Paste in just the top 5 results (a screenshot works best).
- The name of the organism, the disease it causes, and the gene that matches perfectly to your sequence.
- The amino acid and cDNA sequences of your gene. The cDNA should have the start and stop codons in bold.
- Your translated sequence showing the correct reading frame.
- A short description of what the gene probably does, in regular English and in your own words. NOT copied from a website.
- A speculation on whether a gene like this could have anything to do with causing the disease.