ID X03006; SV 1; linear; mRNA; STD; MAM; 620 BP.


celsa-spraggs
Uploaded On 2016-04-04



Presentation Transcript

Slide1
Slide2

ID   X03006; SV 1; linear; mRNA; STD; MAM; 620 BP.
XX
AC   X03006;
XX
SV   X03006.1
XX
DT   28-JAN-1986 (Rel. 08, Created)
DT   12-SEP-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Bovine mRNA for lens beta-s-crystallin
XX
KW   beta-crystallin; beta-gamma-crystallin; crystallin.
XX
OS   Bos taurus (cow)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
OC   Bovinae; Bos.
XX
RN   [1]
RP   1-620
RX   PUBMED; 4054100.
RA   Quax-Jeuken Y.E.F.M., Driessen H., Leunissen J., Quax W.J., de Jong W.,
RA   Bloemendal H.;
RT   "Beta-s-crystallin: structure and evolution of a distinct member of the
RT   beta-gamma-superfamily";
RL   EMBO J. 4(10):2597-2602(1985).
XX
CC   Data kindly reviewed (06-MAR-1986) by Y. Quax-Jeuken
XX
...

EMBL

Slide3

Index

[Diagram labels: parser, index, flatfile]

Slide4

Retrieve

[Diagram labels: index, parser, display entries]

Slide5

SRS

Sequence Retrieval System

An indexing and retrieval system for flat-file databases

Slide6

http://srs.bioinformatics.nl

Slide7

http://srs.ebi.ac.uk

Slide8
Slide9
Slide10

Q: Which sequences in EMBL [do not] encode a protein for which the 3D structure is known?

Slide11
Slide12
Slide13
Slide14

Command line SRS

Using getz

Slide15

Retrieve the UniProt entry for the protein with accession number P19558:

getz "[uniprot-acc:P19558]" -e

Count the human proteins in the UniProt database:

getz "[uniprot-org:human]" -c

Print sequence of the rice proteins in the UniProt database that have a length between 10 and 50 aa:

getz "[uniprot-org:rice]&[uniprot-sl#10:50]" -f sl

Slide16

Give the id and description for all Arabidopsis thaliana proteins that have at least 8 transmembrane domains:

getz '[swissprot-org:arabidopsis thaliana]<([swissprot-CountedItem:transmem]&[swissprot-CountedN#8:])' -f "id des"

Slide17

Count the human protein sequences in the NCBI RefSeq database:

getz "[refseqp-org:human]" -c

Count the human mRNA sequences in the NCBI RefSeq database:

getz "[refseq-org:human]&[refseq-mol:mrna]" -c

Retrieve the mRNA sequences for all human proteins in the NCBI RefSeq database in fasta format:

getz "[refseqp-org:human]>[refseq-mol:mrna]" -d -sf fasta

Slide18

MRS: A fast and compact retrieval system for biological data. Hekkelman M.L., Vriend G.

http://mrs.cmbi.ru.nl/

Slide19

European Molecular Biology Open Software Suite

Slide20

EMBOSS

"European Molecular Biology Open Software Suite"

http://emboss.sourceforge.net/

Toolbox with bioinformatics applications

Slide21

http://emboss.bioinformatics.nl/

Slide22

http://main.g2.bx.psu.edu/

Slide23

command line / shell

Slide24

Useful EMBOSS commands

command       description
showdb        Displays information on the currently available databases
wossname      Finds programs by keywords in their one-line documentation
tfm           Displays the manual entry for a program in EMBOSS
seealso       Finds programs related to a given program
seqret        Reads and writes (returns) sequences
entret        Reads and writes (returns) flatfile entries
extractfeat   Extracts features from a sequence
extractseq    Extracts regions from a sequence
transeq       Translates nucleic acid sequences

Slide25

Get help from EMBOSS itself

# showdb
Shows the currently available databases

# tfm wossname
How do you use an EMBOSS command? Just (r)tfm it

# wossname alignment
Which commands can handle alignments?

# seealso seqret
Are there any other commands that do something similar?

Slide26

Command line options

All EMBOSS programs react to a number of command line options. The most important ones are:

-help             Get help
-help -verbose    Get elaborate help
-auto             "No questions asked"
-stdout           Write to standard output
-filter           Read stdin, write stdout

Slide27

SEQRET parameters

zonnebloem > seqret -help
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop

   General qualifiers:
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose

Slide28

SEQRET parameters

zonnebloem > seqret -help -verbose
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
///

Slide29

SEQRET parameters

///
   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name
///

Slide30

SEQRET parameters

///
   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory
///

Slide31

SEQRET parameters

///
   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Slide32

Universal Sequence Address

Type:        filename
Example:     xxx.seq
Description: A sequence file "xxx.seq" in any format

Type:        format::filename
Example:     fasta::xxx.seq
Description: A sequence file "xxx.seq" in fasta format

Type:        db:IDname
Example:     embl:paamir
Description: EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database

Type:        db:AccessionNumber
Example:     embl:X13776
Description: EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number and entry name (X13776 is the accession number in this case)

Type:        db-acc:AccessionNumber
Example:     embl-acc:X13776
Description: EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number only

Type:        db-id:IDname
Example:     embl-id:paamir
Description: EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database, and searching by ID only

Type:        db-searchfield:word
Example:     embl-des:lectin
Description: EMBL entries containing the word 'lectin' in the Description line

Type:        db-searchfield:wildcard-word
Example:     embl-org:*human*
Description: EMBL entries containing the wildcarded word 'human' in the Organism fields

Type:        db:wildcard-ID
Example:     embl:paami*
Description: EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database

Slide33

Universal Sequence Address

Type:        db or db:*
Example:     embl or EMBL:*
Description: All sequences in the EMBL database

Type:        @listfile
Example:     @mylist
Description: Reads file mylist and uses each line as a separate USA. List files can contain references to other list files or any other standard USA.

Type:        list:listfile
Example:     list:mylist
Description: Same as "@mylist" above

Type:        'program parameters |'
Example:     'getz -e [embl-id:paamir] |'
Description: The pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way.

Type:        asis::sequence
Example:     asis::atacgcagttatctgaccat
Description: So far the shortest USA we could invent. In 'asis' format the name is the sequence so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.

Each of the above can have '[start : end]' or '[start : end : r]' appended to them. The 'file' and 'dbname' forms of USA can have 'format::' in front of them (although a database knows which format it is and so this is redundant and error-prone).

Slide34

Walk through exercise

For a protein with UniProt Accession number: Q5ZKN6

find the nucleotide sequence that encodes this (repeated) amino acid fragment: VAEEVAEE

Slide35

Getting the sequence

seqret -auto uniprot:Q5ZKN6 -stdout

>Q5ZKN6_CHICK Q5ZKN6 SubName: Full=Putative uncharacterized protein;
MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK
ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR
LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES
SHDSASGESTLESGKTEAALSEISAQEPKKITYSQIVREGRRFNIDLVSKLLYSRGLLIE
LLIKSNVSRYAEFKNATRILAFREGKVEQVPCSRADVFNSRQLAMVEKRMLMKFLTFCLE
YEQHPDEYQDYKNSTFAQFLKTRKLTPSLQHFILHSIAMVSEKDCNTLEGLQATRKFLQC
LGRYGNTPFLFPLYGQGEIPQCFCRMCAVFGGIYCLRHSVQCLVVDKESGRCKAVVDHFG
QRISANYFIVEDSYLSESVCENVCYRQLSRAVLITDQSVLKTDSEQQVSILMVPPVDLGQ
PAVCVIELCSSTMTCMKDTYLVHLTCPSTKTAREDLEPVVQKLFSLNAETEKETEDEVLE
KPRVLWALYFNMRDSSGIDRNSYSGLPSNVYVCSGPDSALGNDCAVKQAETIFQEMFPTE
EFCPAPPNPEDIIYDEDEIASEETGFNNSPETKPESSLQESSSRGSSTAVKEHIEE

Slide36

Slide37

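Run a program within Perl: 3 ways

# 1. backticks: capture the program's output in a variable
$seq = `seqret -auto uniprot:Q5ZKN6 stdout`;

# 2. system: run the program without capturing its output
system("seqret -auto uniprot:Q5ZKN6 stdout");

# 3. open with a pipe: read the program's output line by line
open SEQRET, "seqret -auto uniprot:Q5ZKN6 stdout|";
while (my $line = <SEQRET>) {
    if ($line !~ /^>/) {
        chomp($line);
        $seq .= $line;
    }
}
close SEQRET;

Slide38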

my $lsOutput = `ls -l`;

Put shell commands or programs in backticks to run them from Perl. The output can be stored in a variable.

Slide39

open LS,"ls -l|";

The open function can run a program and read its output. The pipe symbol "|" links the output to a filehandle.

Slide40

Find the fragment's position

my $seq = "";
open SEQRET, "seqret -auto uniprot:Q5ZKN6 stdout|";
while (my $line = <SEQRET>) {
    if ($line !~ /^>/) {
        chomp($line);
        $seq .= $line;
    }
}
close SEQRET;

# look for location of the repeat
my $position = index($seq, "VAEEVAEE") + 1;

# print the offset
print "Position = ", $position, "\n";

Slide41

!~

The opposite of "=~": gives true if the search found no hits.

Slide42
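The position lookup from slide 40 can also be sketched without Perl. The version below is a minimal shell sketch, assuming a POSIX shell and awk; the function name fragment_position is made up for illustration and is not an EMBOSS command. It reads FASTA text on standard input, skips the header line, joins the sequence lines, and prints the 1-based position of the fragment (0 if the fragment is absent, which matches what the Perl index(...) + 1 expression gives).

```shell
# Hypothetical helper (not part of EMBOSS): print the 1-based position of a
# fragment ($1) in a FASTA record read from standard input.
fragment_position() {
    awk -v frag="$1" '
        /^>/ { next }                          # skip the FASTA header line
        { gsub(/[ \t\r]/, ""); seq = seq $0 }  # join the sequence lines
        END { print index(seq, frag) }         # 1-based; 0 means "not found"
    '
}
```

With a local EMBOSS installation and a configured uniprot database, something like `seqret -auto uniprot:Q5ZKN6 stdout | fragment_position VAEEVAEE` should report the same position as the Perl code.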

Get a cross-reference to EMBL

entret uniprot:Q5ZKN6 -auto stdout | grep "DR "

Get the feature table of this protein entry

Slide43

Understand the cross-reference

DR   EMBL; AJ720048; CAG31707.1; -; mRNA.

Read the detailed documentation of UniProt cross reference

http://www.expasy.org/sprot/userman.html#DR_line

DR            Database cross reference
EMBL          Link to EMBL
AJ720048      EMBL accession number
CAG31707.1    Protein ID
-             Status identifier
mRNA          Molecule type

The corresponding cross reference in EMBL

Slide44

Get a cross-reference to EMBL

entret uniprot:Q5ZKN6 -auto stdout | grep "DR " | grep "EMBL;"

In Perl, use a regular expression to locate the EMBL reference line, and extract the EMBL accession number and the protein ID.

Slide45
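Before writing the Perl regular expression, the pattern can be tried out on the command line. The sketch below uses sed rather than Perl and assumes the DR line layout shown above (line code, "EMBL;", accession, protein ID, each field ending in a semicolon). The function name embl_xrefs is hypothetical.

```shell
# Hypothetical helper: read a UniProt flat-file entry on standard input and
# print "<EMBL accession> <protein ID>" for every EMBL cross-reference line.
embl_xrefs() {
    # \1 = accession number, \2 = protein ID; other DR lines are ignored
    sed -n 's/^DR  *EMBL;  *\([^;]*\);  *\([^;]*\);.*$/\1 \2/p'
}
```

Piping the entret output from above into embl_xrefs should give one line per EMBL cross-reference; the two captured groups are the pieces the Perl version needs to extract as well.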

Link protein to coding DNA

extractfeat embl:AJ720048 -value CAG31707.1 stdout

Returns the DNA coding for protein CAG31707.1 (=Q5ZKN6)

Slide46

Figure out the offset in DNA

Offset in amino acid sequence: 128

Offset in corresponding nucleotide sequence: ((128-1) x 3) + 1 OR (128 x 3) - 2 = 382

Position is from 382 to (382 + 8x3) = 406

Figure out the position of its corresponding coding DNA sequence (is there anything wrong here?)

Slide47
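The start-offset arithmetic above can be wrapped in a one-line shell function, which is handy when the walkthrough is repeated for other positions. This is a sketch only: the name aa_to_nt_start is invented here, and it assumes that the extracted coding sequence starts at the first base of the first codon, so that amino acid 1 maps to bases 1-3.

```shell
# Hypothetical helper: 1-based position of the first base of the codon that
# encodes the amino acid at 1-based position $1.
aa_to_nt_start() {
    # (position - 1) full codons come first, then step onto the next base
    echo $(( ($1 - 1) * 3 + 1 ))
}
```

For the fragment in this walkthrough, `aa_to_nt_start 128` prints 382, in agreement with the value on the slide.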

Extract the DNA sequence

extractfeat embl:AJ720048 -value CAG31707.1 stdout | extractseq -filter -reg "382-406"

Now we have the corresponding DNA sequence for the protein fragment.

It should be: "gttgctgaggaggttgctgaagaac"

But is that correct? Let's translate it for verification...

Slide48

Verify the result

extractfeat embl:AJ720048 -value CAG31707.1 stdout | extractseq -filter -reg "382-406" | transeq -filter

Result is "VAEEVAEEX" but not "VAEEVAEE"

What's wrong here?

Always try to verify your results: computers make very few errors, but that is not true for people...

Slide49
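One quick check on the verification step of the previous slide is to count the bases in the requested region before translating it: a fragment of 8 amino acids needs 8 x 3 = 24 bases. The sketch below assumes region strings of the form "start-end" with both ends included, which is consistent with the 25-base sequence returned for "382-406". The function names are hypothetical.

```shell
# Hypothetical helpers for checking an extractseq-style region "start-end".
region_length() {
    # ${1#*-} is the end, ${1%-*} is the start; both ends are inclusive
    echo $(( ${1#*-} - ${1%-*} + 1 ))
}

leftover_bases() {
    # number of bases left over after grouping the region into codons
    echo $(( $(region_length "$1") % 3 ))
}
```

Here `region_length 382-406` prints 25 and `leftover_bases 382-406` prints 1, so the region carries one base more than eight complete codons. That extra base is a likely explanation for the trailing X in the translation, and `leftover_bases 382-405` prints 0.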

Exercise

Build a pipeline in Perl to perform the previous steps of the walkthrough (from slide 34)

Test it with the UniProt protein A0L7N9

Find the fragment at offset 305 that is 8 aa long

Find out the coding DNA of this amino acid fragment and verify it