Local alignments & BLAST

Uploaded On 2024-01-13



Presentation Transcript

1. Local alignments & BLAST: online and offline
Wen-Dar Lin
Bioinformatics core, IPMB
wdlin@gate.sinica.edu.tw

2. Preface
When we talk about sequence similarity, we are usually talking about local alignments.
NCBI BLAST is one of the best-known local alignment programs, and almost everyone has used it.

3. Preface
This presentation is intended to provide information on
the theoretical background of local alignments,
the underlying algorithm of BLAST, and
the usage of BLAST programs.
Files: PowerPoints, walk-through logs, scripts, and example data
https://maccu.project.sinica.edu.tw/20221006/
(some updates are expected by noon of 20221006)

4. Disclaimer
This presentation is not intended to describe every detail of BLAST.
NCBI provides detailed documentation on BLAST:
https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
The BLAST programs described in this presentation are recent BLAST+ programs.
The interface of the online BLAST services may be improved at any time, so
the pages might look different from what is described in this presentation.

5. Topics
1. Theoretical alignment algorithm
2. BLAST -- Basic Local Alignment Search Tool
3. Understanding BLAST statistics
4. Major variants of BLAST programs
5. Online BLAST services: NCBI & Ensembl
6. Standalone BLAST programs

6. Theoretical alignment algorithm
In this section, we will go through
edit distances,
global alignments,
dynamic programming, and
local alignments.

7. Edit distance
The very first question:
“a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other”
The operations are insertion, deletion, and substitution.
Source: Wikipedia: Edit distance

8. Edit distance
Edit distance, an example:
kitten → sitten, one substitution
sitten → sittin, another substitution
sittin → sitting, one insertion
From “kitten” to “sitting”, we need three operations.
The distance between the two words is 3.
Source: Wikipedia: Edit distance
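The kitten/sitting example can be checked with a short program. The following is a minimal Python sketch of the classic dynamic-programming edit distance (insertion, deletion, and substitution each cost 1); the function name `edit_distance` is our own choice and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept in memory, which is enough when the distance itself is all we need.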

9. Global alignment
“In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.”
Source: Wikipedia: Edit distance
subject: AGCATTC-ACAGA
query:   AGC-TTCGACCGA
(the figure marks one insertion, one deletion, and one mismatch)
Alignment length: 13
Edit distance: 3
=> alignment identity = (13-3)/13 = 77%

10. Global alignment
Assuming
each matched base gives a score of +1, and
each mismatch/insertion/deletion gives a penalty of -1,
the two sequences can be aligned without InDels for a score of 4.
AGCATTCACAGA
AGCTTCGACCGA
score: 4 (no InDels)

11. Global alignment
With the same two sequences, we can have another alignment of score 7.
AGCATTC-ACAGA
AGC-TTCGACCGA
score: 7
Question: given two sequences, how can we be sure that an alignment has the best score?

12. Global alignment
The dynamic programming algorithm (Ø stands for the null string)
    Ø   A   G   C   A   T   T   C   A   C   A   G   A
Ø   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12
A  -1   1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
G  -2   0   2   1   0  -1  -2  -3  -4  -5  -6  -7  -8
C  -3  -1   1   3   2   1   0  -1  -2  -3  -4  -5  -6
T  -4  -2   0   2   2   3   2   1   0  -1  -2  -3  -4
T  -5  -3  -1   1   1   3   4   3   2   1   0  -1  -2
C  -6  -4  -2   0   0   2   3   5   4   3   2   1   0
G  -7  -5  -3  -1  -1   1   2   4   4   3   2   3   2
A  -8  -6  -4  -2   0   0   1   3   5   4   4   3   4
C  -9  -7  -5  -3  -1  -1   0   2   4   6   5   4   3
C -10  -8  -6  -4  -2  -2  -1   1   3   5   5   4   3
G -11  -9  -7  -5  -3  -3  -2   0   2   4   4   6   5
A -12 -10  -8  -6  -4  -4  -3  -1   1   3   5   5   7

13. Dynamic programming
The key is the incremental computation, based on previous results, for every cell of the matrix.
“A” to Ø, a deletion, score: -1
    Ø   A
Ø   0  -1
Ø to “A”, an insertion, score: -1
    Ø
Ø   0
A  -1
“A” to “A”, three possibilities
    Ø   A
Ø   0  -1
A  -1   1
1. delete A and insert A: score -2 (green)
2. insert A and delete A: score -2 (blue)
3. match of A: score 1 (red, the best)

14. Dynamic programming
The incremental computation for the entire matrix
(The completed matrix is the one shown on slide 12; Ø stands for the null string.)
The computational time is proportional to the size of the matrix.
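The incremental fill of the matrix can be written in a few lines. This is a minimal Python sketch under the scoring stated in the slides (match +1, mismatch/insertion/deletion -1); the function name and parameter names are our own. It returns only the score in the lower-right cell, not the traceback.

```python
def global_alignment_score(a: str, b: str, match: int = 1, penalty: int = -1) -> int:
    """Best global alignment score of a and b; every mismatch or gap costs `penalty`."""
    # First row: aligning the null string with b[:j] needs j gaps.
    prev = [j * penalty for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * penalty]  # first column: a[:i] against the null string
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else penalty)
            up = prev[j] + penalty        # gap in b
            left = curr[j - 1] + penalty  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("AGCTTCGACCGA", "AGCATTCACAGA"))
```

For the two example sequences the result agrees with the lower-right cell of the matrix (7). The two nested loops visit every cell once, which is why the running time grows with the size of the matrix.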

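15. Dynamic programming
Trace back and get the global alignment (Ø stands for the null string)
M: match, S: substitution, I: insertion, D: deletion
  Ø A G C A T T C A C A G A
Ø - D D D D D D D D D D D D
A I M D D M D D D M D M D M
G I I M D D D D D D D D M D
C I I I M D D D M D M D D D
T I I I I S M M D D D D D D
T I I I I S M M D D D D D D
C I I I M S I I M D M D D D
G I I M I S I I I S S S M D
A I M I I M I I I M D M D M
C I I I M I S I M I M D D D
C I I I M I S I M I M S S S
G I I M I I S I I I I S M D
A I M I I M S I I M I M I M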

16. Dynamic programming
See sheet DynamicProgramming of the supplementary Excel file LocalAlignmentExample_20221006.xlsx for Excel formulas that implement dynamic programming for string alignment.

17. Global alignment
Dot plots, another way of viewing the matrix
(Figure: dot plot of AGCATTCACAGA against AGCTTCGACCGA; dots mark matches.)
A path from the upper-left corner to the lower-right corner with the highest score (#dots - #spaces) is the best alignment.

18. Local alignment
Dot plots: local alignments are diagonals of dots
(Figure: dot plot of AGCATTCACAGA against AGCTTCGACCGA, highlighting local alignments with scores >= 2.)
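The idea of reading local alignments off the diagonals of a dot plot can be sketched as follows. This Python example only reports ungapped runs of exact matches (each match counted as +1), which is our simplifying assumption; the function name `diagonal_runs` is hypothetical.

```python
def diagonal_runs(query: str, subject: str, min_len: int = 2):
    """Return (query_start, subject_start, length) for each maximal diagonal run of dots.

    Positions are 1-based. A "dot" is an exact match between one query base
    and one subject base; a run is a stretch of consecutive dots on a diagonal.
    """
    runs = []
    for i in range(len(query)):
        for j in range(len(subject)):
            if query[i] != subject[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and query[i - 1] == subject[j - 1]:
                continue
            length = 0
            while (i + length < len(query) and j + length < len(subject)
                   and query[i + length] == subject[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i + 1, j + 1, length))
    return runs


if __name__ == "__main__":
    for run in diagonal_runs("AGCTTCGACCGA", "AGCATTCACAGA"):
        print(run)
```

Like the dot plot itself, this visits every (query, subject) pair of positions, so it is an exact but slow way to find local similarities.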

19. Local alignment
Physical meaning of
Global alignments:
Determine if two sequences are entirely similar to each other.
Local alignments:
Find subsequences in one sequence that are very similar to subsequences in the other sequence.
To find any similarity from one sequence (query) in some other sequences (database).

20. Local alignment
Exact local alignment search algorithms
Step 1: build a table of size mn (m: length of query, n: length of database), like a dot plot
Step 2: search for diagonals of dots (local alignments)
This kind of approach costs at least mn units of time.
Imagine that the database is something like nr/nt, RefSeq, or UniProt: the search would be time-consuming.

21. Short summaries
In computer science, edit distances are used for measuring the distance between two strings.
In bioinformatics, scores for matches plus penalties for insertions/deletions/substitutions are used to measure the similarity between two sequences.

22. Short summaries
Dynamic programming helps to find
global alignments and
local alignments
for entire sequences and subsequences, respectively.
The time cost of dynamic programming is proportional to the size of the query times the size of the target database.
It could be time-consuming for large databases.

23. BLAST
Basic Local Alignment Search Tool, a heuristic algorithm
BLAST assumes that the local alignments to be found contain a run of at least W exact matches.
Consider an alignment as a series of coin tosses with outcomes Head (match) or Tail (mismatch/InDel).
High similarity means a long run of Heads (matches).
HHHTHHHTHHTHH
AGCATTC-ACAGA
AGC-TTCGACCGA
alignment identity = 77%
longest run of matches = 3
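The coin-tossing view is easy to reproduce for the example alignment. The following Python sketch (function names are our own) turns two gapped alignment rows into the H/T pattern and reports the identity and the longest run of matches.

```python
def match_pattern(row_a: str, row_b: str) -> str:
    """'H' for a matching column, 'T' for a mismatch or a gap ('-') column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return "".join(
        "H" if x == y and x != "-" else "T" for x, y in zip(row_a, row_b)
    )


def longest_run(pattern: str) -> int:
    """Length of the longest stretch of consecutive 'H' characters."""
    return max(len(run) for run in pattern.split("T"))


def identity(pattern: str) -> float:
    """Fraction of alignment columns that are matches."""
    return pattern.count("H") / len(pattern)


if __name__ == "__main__":
    p = match_pattern("AGCATTC-ACAGA", "AGC-TTCGACCGA")
    print(p, longest_run(p), round(100 * identity(p)))
```

With W larger than the longest run (3 here), a seed-based search would miss this alignment, which is the price of the heuristic.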

24. BLAST
The algorithm greatly reduces the search time by
building a look-up table that stores all positions of W-mers of the query sequence (takes m units of time),
looking for these W-mers in the database (takes n units of time), and
forming local alignments based on the W-mer seeds.

25. BLAST
Build a look-up table of all 2-mers
Query sequence: AGCTTCGACCGA, W=2
AG: 1
GC: 2
CT: 3
TT: 4
TC: 5
CG: 6, 10
GA: 7, 11
AC: 8
CC: 9
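The look-up table above can be built in one pass over the query. This is a minimal Python sketch; the function name `build_lookup` is our own, and positions are 1-based to match the slide.

```python
from collections import defaultdict


def build_lookup(query: str, w: int = 2) -> dict:
    """Map every W-mer of the query to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i + 1)
    return dict(table)


if __name__ == "__main__":
    for word, positions in build_lookup("AGCTTCGACCGA", w=2).items():
        print(word, positions)
```

The loop touches each query position once, which is the "m units of time" mentioned for this step.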

26. BLAST
Look for these W-mers in the database
Look-up table of the query AGCTTCGACCGA: AG 1; GC 2; CT 3; TT 4; TC 5; CG 6, 10; GA 7, 11; AC 8; CC 9
Database sequence: AGCATTCACAGA

27. BLAST
Form local alignments based on the W-mer hits
(Figure: the W-mer hits between AGCATTCACAGA and AGCTTCGACCGA joined into local alignments.)
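The scan and the seed-joining step can be sketched together. The Python example below is a simplification under our own assumptions: it only merges adjacent exact seeds that lie on the same diagonal, whereas the real BLAST programs extend seeds with a scoring scheme. The function names are hypothetical.

```python
from collections import defaultdict


def find_hits(query: str, database: str, w: int = 2):
    """Return (query_pos, db_pos) pairs, 1-based, where a W-mer of both sequences is identical."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i + 1)
    hits = []
    for j in range(len(database) - w + 1):  # one pass over the database
        for q_pos in table.get(database[j:j + w], []):
            hits.append((q_pos, j + 1))
    return hits


def merge_seeds(hits, w: int = 2):
    """Join seeds that are adjacent on one diagonal into (query_pos, db_pos, length) segments."""
    by_diagonal = defaultdict(list)
    for q_pos, d_pos in hits:
        by_diagonal[d_pos - q_pos].append(q_pos)
    segments = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        start = previous = q_positions[0]
        for q_pos in q_positions[1:]:
            if q_pos != previous + 1:  # gap on the diagonal: close the segment
                segments.append((start, start + diagonal, previous - start + w))
                start = q_pos
            previous = q_pos
        segments.append((start, start + diagonal, previous - start + w))
    return sorted(segments)


if __name__ == "__main__":
    hits = find_hits("AGCTTCGACCGA", "AGCATTCACAGA")
    print(merge_seeds(hits))
```

For the example sequences, the seeds AG and GC merge into the 3-base segment AGC, and TT and TC merge into TTC.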

28. Short summaries
The time cost of theoretical dynamic programming is proportional to the size of the query times the size of the target database.
The BLAST heuristic algorithm assumes W consecutive matches in the resulting alignments.
The time cost of BLAST is proportional to the sum of
the size of the query (for building the look-up table),
the size of the database (scanning against the look-up table), and
the volume of total alignments (which is usually very small compared to the size of the database).

29. BLAST statistics
In addition to the fast heuristic, BLAST computes an E-value for each alignment based on the similarity score:
$E = k\,m\,n\,e^{-\lambda S}$
E: the E-value
k: a minor constant
m: length of query
n: length of database
mn: the search space
S: raw score
$\lambda$: scaling factor

30. BLAST statistics
The E-value is the expected number of local alignments that have alignment scores greater than or equal to S in this BLAST search.
An E-value close to zero means that an alignment with score S or greater is unlikely to appear under a random sequence model.
The alignment should not be “random.” This follows the reasoning of statistical hypothesis testing.

31. BLAST statistics
For each local alignment, there will be a summary like this
(Figure labels: hit sequence in database, raw score, bit score, E value.)
Bit scores are normalized scores.

32. BLAST statistics
Within one BLAST search
It is feasible to compare alignments based on E-values or bit scores.
A smaller (more significant) E-value means an alignment that is less likely to be “random.”
A larger bit score means an alignment of two “closer” subsequences.

33. BLAST statistics
Across different runs of BLAST searches, you may compare alignments based on
the bit score, the normalized score.
E-values can no longer be used for comparing alignments.
Recall that $E = k\,m\,n\,e^{-\lambda S}$.
If the same query sequence (length m) is used to search against different databases (lengths $n_1 \neq n_2$, respectively),
alignments with the same score S result in different E-values: $k\,m\,n_1\,e^{-\lambda S} \neq k\,m\,n_2\,e^{-\lambda S}$.
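The dependence of the E-value on the search space can be seen numerically. The sketch below uses the E-value formula from the slides together with the standard bit-score normalization S' = (λS - ln k) / ln 2, for which E = mn · 2^(-S'). The values chosen for k and λ are illustrative assumptions only; real values depend on the scoring matrix and gap penalties and are reported by BLAST itself.

```python
import math


def evalue(raw_score: float, m: int, n: int, k: float, lam: float) -> float:
    """E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score: float, k: float, lam: float) -> float:
    """Normalized score S' = (lambda * S - ln k) / ln 2; independent of m and n."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    K, LAM = 0.041, 0.267            # assumed, illustrative parameters
    m, n1, n2 = 372, 10**6, 10**9    # one query, two database sizes (made up)
    s = 200                          # the same raw score in both searches
    print("bit score:", round(bit_score(s, K, LAM), 1))
    print("E-value in small database:", evalue(s, m, n1, K, LAM))
    print("E-value in large database:", evalue(s, m, n2, K, LAM))
```

The bit score is identical in both searches, while the E-value scales linearly with the database length.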

34. Short summaries
Knowing BLAST statistics better helps you to interpret BLAST outputs better.
A way to fix the “same score but different E-values in different BLAST searches” problem when running standalone BLAST programs:
specify a fixed effective search space (i.e. mn in the E-value formula) with
option “-searchsp” for BLAST+ programs.

35. Major variants of BLAST programs
BLASTN
Searching nucleotide databases using a nucleotide query.
The underlying algorithm should be close to what we described in the BLAST algorithm section:
Build a W-mer look-up table of the query.
Scan the database against the look-up table; a hit is identified on an exact match.
Form alignments based on W-mer hits in the database.

36. Major variants of BLAST programs
BLASTP
Searching protein databases using a protein query.
The underlying algorithm should be similar to what we described in the BLAST algorithm section:
Build a W-mer look-up table of the query.
Scan the database against the look-up table; a seed is identified if the database W-mer is close enough to the query W-mer, given the AA-to-AA scoring matrix.
Form alignments based on W-mer hits in the database. Alignment scores are computed according to the AA-to-AA scoring matrix.

37. Major variants of BLAST programs
BLASTX
Searching protein databases using a nucleotide query, by translating 1 query into 6 protein queries using the six reading frames.
The actual sequence search is done as in BLASTP.
Think of this as running BLASTP 6 times.

38. Major variants of BLAST programs
TBLASTN
Searching nucleotide databases using a protein query, by translating the database using the six reading frames.
The actual sequence search is done as in BLASTP.
Think of this as running BLASTP 6 times.

39. Major variants of BLAST programs
TBLASTX
Searching nucleotide databases using a nucleotide query, by translating both the query and the database using the six reading frames.
The actual sequence search is done as in BLASTP.
Think of this as running BLASTP 36 (=6x6) times.
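The six reading frames used by BLASTX, TBLASTN, and TBLASTX can be illustrated with a small translation routine. This Python sketch assumes the standard genetic code and a plain A/C/G/T sequence; codons containing other characters are translated as X. The function names and the frame numbering (+1..+3 forward, -1..-3 reverse complement) are our own conventions for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Reverse-complement an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq: str) -> dict:
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frames("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translations can then be treated as a protein sequence, which is why these programs behave like several BLASTP runs.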

40. Major variants of BLAST programs
If the actual alignments are done on protein sequences (blastp, blastx, tblastn, and tblastx), there will be a matrix parameter for scoring.
Matrix | Best use | Similarity (%)
BLOSUM90 | Short alignments that are highly similar | 70-90
BLOSUM80 | Detecting members of a protein family | 50-60
BLOSUM62 | Most effective in finding all potential similarities | 30-40
BLOSUM30 | Longer alignments of more divergent sequences | <30

41. Short summaries
The five major BLAST variants
query \ database | nucleotide | protein
nucleotide | BLASTN; TBLASTX (query and database translated) | BLASTX (query translated)
protein | TBLASTN (database translated) | BLASTP
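The summary of the five variants amounts to a small look-up by query type and database type. The Python sketch below encodes it; the function name and the type labels "nucleotide" and "protein" are our own choices for illustration.

```python
# (query type, database type) -> candidate BLAST programs
PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],  # tblastx translates both sides
    ("nucleotide", "protein"): ["blastx"],                # query is translated
    ("protein", "nucleotide"): ["tblastn"],               # database is translated
    ("protein", "protein"): ["blastp"],
}


def candidate_programs(query_type: str, database_type: str) -> list:
    """Return the BLAST variants that accept this query/database combination."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'nucleotide' or 'protein'") from None


if __name__ == "__main__":
    print(candidate_programs("protein", "nucleotide"))
```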

42. Online BLAST services
In this section, we will demonstrate the usage of online BLAST services provided by NCBI and Ensembl.
NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi
Ensembl (plants): https://plants.ensembl.org/Multi/Tools/Blast
These service sites update their functionality periodically, so
the descriptions here might differ a little from the actual pages later on.

43. NCBI BLAST services
https://blast.ncbi.nlm.nih.gov/Blast.cgi
or google “NCBI BLAST”
(Figure labels: Help docs; Various BLAST programs.)

44. NCBI BLAST services
Entering any BLAST program from the last slide leads you to pages of similar organization.
Take Nucleotide BLAST as an example:
Enter query sequence(s)
Pick target database, and optionally set organism constraint
Finer program selection
Run!

45. NCBI BLAST services
Query sequence(s)
Can be in multi-FASTA format or a bare sequence.
All three formats shown are acceptable.

46. NCBI BLAST services
Target database
Usually we pick nr/nt or RefSeq.
nt: GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA. Identical sequences have been merged into one entry.
refseq_rna: RNA parts of RefSeq. RefSeq is a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.
Make good use of those question icons!

47. NCBI BLAST services
Set organism constraint
Entering the name of your target organism brings up a pull-down menu; simply pick the desired organism.
A more precise way is to find the taxid of the target organism in the NCBI Taxonomy database
(google “NCBI Taxonomy”).

48. NCBI BLAST services
Finer program selection
For Nucleotide BLAST, there are actually a few program variations doing similar tasks:
megablast
discontiguous megablast
blastn
The key differences between them are the parameters.
(Figure annotation: “Can be down to 7”.)

49. NCBI BLAST services
Before you run BLAST
Check this box to keep the input page available and create a new page for BLAST progress and results.
Expand this to check or adjust parameters.

50. NCBI BLAST services
A BLASTN example
In this example, we would like to search Arabidopsis thaliana bZIP60 against the NCBI RefSeq database.
bZIP60 transcript sequence:
https://www.arabidopsis.org/servlets/TairObject?type=sequence&id=1002431237

51. NCBI BLAST services
In the Nucleotide BLAST page

52. NCBI BLAST services
In the Nucleotide BLAST page

53. NCBI BLAST services
In the result page
NCBI will keep the search result for a few days. Retrieve the result according to the RID,
or simply download the result.
(Figure label: Search info.)

54. NCBI BLAST services
In the result page
The Descriptions tab gives descriptions of hit sequences.
Pick columns to show.
Click to get alignments.
Click to get hit sequence info.

55. NCBI BLAST services
In the result page
The Graphic Summary tab gives alignment locations on the query side.
Every line segment represents one alignment of a certain score in the corresponding color.
Click to get info and the alignment.

56. NCBI BLAST services
In the result page
The Taxonomy tab gives the taxonomy distribution of hits.
Click to get the hit sequence list.
Click to get the alignment list.

57. NCBI BLAST services
If there are two or more query sequences,
there will be a pull-down menu for selecting results of different queries.

58. NCBI BLAST services
A BLASTP example
In this example, we would like to search a putative rice protein sequence, BAA36183.1 (an NCBI GenBank accession).
>BAA36183.1 dihydroflavonol 4-reductase [Oryza sativa Japonica Group]
MGEAVKGPVVVTGASGFVGSWLVMKLLQAGYTVRATVRDPSNVGKTKPLLELAGSKERLTLWKADLGEEGSFDAAIRGCTGVFHVATPMDFESEDPENEVVKPTVEGMLSIMRACRDAGTVKRIVFTSSAGTVNIEERQRPSYDHDDWSDIDFCRRVKMTGWMYFVSKSLAEKAAMEYAREHGLDLISVIPTLVVGPFISNGMPPSHVTALALLTGNEAHYSILKQVQFVHLDDLCDAEIFLFESPEARGRYVCSSHDATIHGLATMLADMFPEYDVPRSFPGIDADHLQPVHFSSWKLLAHGFRFRYTLEDMFEAAVRTCREKGLLPPLPPPPTTAVAGGDGSAGVAGEKEPILGRGTGTAVGAETEALVK

59. NCBI BLAST services
In the BLASTP page
Pick Standard database, refseq_prot, and Japanese rice.

60. NCBI BLAST services
In this example, we do not get a 100% identity alignment by querying this rice protein against rice proteins in the NCBI refseq_prot database.
The best alignment has only 42% identity.

61. NCBI BLAST services
WHY?
NCBI refseq_prot is a collection of translated nucleotide sequences of coding genes in NCBI’s current annotation.
The source nucleotide sequence of BAA36183.1 may not be considered as coding in NCBI’s current annotation.
And it is likely that the nucleotide sequence does exist in the rice genome.

62. NCBI BLAST services
How to find the source location of protein sequence BAA36183.1?
We have a protein query.
We want to search it against a genome (a set of nucleotide sequences).
We should apply TBLASTN.

63. NCBI BLAST services
In the TBLASTN page
Pick refseq_genomes, and Oryza sativa japonica.

64. NCBI BLAST services
In this example, we still do not get a 100% identity alignment by querying this rice protein against the rice genome in the NCBI refseq_genomes database.
The best hit sequence has 90% identity.

65. NCBI BLAST services
By examining the alignments, we find three pieces of near-100% identity alignments from protein to genome.
These should correspond to three exon/CDS regions.
The “90% identity” in the last slide should be an average over many alignments.

66. NCBI BLAST services
The gene locus at this position is annotated as a pseudogene => no BLASTP results.

67. NCBI BLAST services
A short note on searching against genomes
It is possible to search against draft genomes that were submitted to NCBI, if available,
by picking “wgs” for Database and entering the corresponding organism.
NCBI Taxonomy should help you to find the correct organism name.

68. Ensembl BLAST services
For the purpose of locating the source of a protein in the genome, Ensembl provides a better-integrated interface.
In this example, we use the same BAA36183.1 protein sequence as the query to search the rice genome in the Ensembl Plants database.
https://plants.ensembl.org/Multi/Tools/Blast

69. Ensembl BLAST services
The operation should be simple.

70. Ensembl BLAST services
The first few alignments with %ID close to 100 are all on chromosome 1.
Click any of them to get an integrated view of the alignments and the genome annotation.
(Figure label: The alignment.)

71. Ensembl BLAST services
Appropriate zooming out and in brings us to a view of all three alignments and the overlapping genome annotations.
(Figure labels: The three TBLASTN alignments; Genome annotation.)

72. A short note on TBLASTN/TBLASTX
TBLASTN/TBLASTX searches are very sensitive:
the actual search is done at the protein level, so it is
less affected by mutations at the nucleotide level.
TBLASTN/TBLASTX are good at
finding footprints of a protein/nucleotide sequence in a nucleotide database,
including a genome database.

73. General guidelines for using BLAST
What to do if there are no desired search results?
Set an organism constraint.
Check the numbers of Nucleotide/Protein records in the corresponding Taxonomy page (for NCBI).
Loosen the E-value threshold (see BLAST statistics).
Adjust Maximum target sequences.
BLAST reports the top sequences, not the first sequences found during the search.

74. General guidelines for using BLAST
What to do if there are no desired search results? (cont.)
Shorten the word size (see BLAST algorithm).
Change the scoring matrix if the target is divergent from your source (for non-BLASTN programs).
Choose a more sensitive BLAST program:
megablast -> blastn -> blastp -> blastx/tblastn -> tblastx

75. Short summaries
NCBI BLAST services provide rich options for controlling the program and many choices of databases.
The interface is strongly integrated with the NCBI databases.
Ensembl BLAST services provide a good integration for visualizing alignments together with genome annotation.

76. Standalone BLAST programs
How to download BLAST executables
Start from the same entry page as the BLAST pages.
Follow the download link inside it.

77. Standalone BLAST programs
How to download BLAST executables (cont.)
Executables for many platforms are available:
Windows, MacOS, and Linux.

78. Standalone BLAST programs
The next slides are a walkthrough on a fresh Ubuntu20 server (AS ITS OpenStack).
This walkthrough will present how to
install ncbi-blast+ from the Ubuntu distribution,
install the most recent ncbi-blast+ from downloaded executables, and
perform TBLASTN/TBLASTX by querying a protein/nucleotide sequence against the rice genome for footprint searches,
with a few small scripts for post-processing.

79. Standalone BLAST programs
Related files, including the walkthrough log (process.txt), can be found at
https://maccu.project.sinica.edu.tw/20221006/
CAUTION: Better not to copy commands from this PowerPoint file. Office might alter symbols like - ‘ “.
Please refer to the walkthrough log process.txt at the above URL.

80. Standalone BLAST programs
1. Install ncbi blast+ from Ubuntu distribution packages
ubuntu@blast:~$ sudo apt update
ubuntu@blast:~$ sudo apt install ncbi-blast+
(...)
Setting up ncbi-blast+ (2.9.0-2) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
ubuntu@blast:~$ blastn -version
blastn: 2.9.0+
 Package: blast 2.9.0, build Sep 30 2019 01:57:31

81. Standalone BLAST programs
2. (optional) Install the most recent ncbi blast+ programs
ubuntu@blast:~$ curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-linux.tar.gz
ubuntu@blast:~$ tar -zxvf ncbi-blast-2.13.0+-x64-linux.tar.gz
ubuntu@blast:~$ tail -n 2 ~/.bashrc
PATH="/home/ubuntu/ncbi-blast-2.13.0+/bin:$PATH"
export PATH
ubuntu@blast:~$ source ~/.bashrc
ubuntu@blast:~$ blastn -version
blastn: 2.13.0+
 Package: blast 2.13.0, build Feb 2 2022 15:38:31

82. Standalone BLAST programs
3. Download genome files
(download genome FASTA file)
ubuntu@blast:~$ wget http://ftp.ensemblgenomes.org/pub/plants/release-54/fasta/oryza_sativa/dna_index/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz
(download genome annotation GFF3 files)
ubuntu@blast:~$ wget http://ftp.ensemblgenomes.org/pub/plants/release-54/gff3/oryza_sativa/Oryza_sativa.IRGSP-1.0.54.gff3.gz
(unzip files)
ubuntu@blast:~$ gzip -d Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz
ubuntu@blast:~$ gzip -d Oryza_sativa.IRGSP-1.0.54.gff3.gz

83. Standalone BLAST programs
4. Install necessary programs
ubuntu@blast:~$ wget https://maccu.project.sinica.edu.tw/20221006/ExampleData.tar.gz
ubuntu@blast:~$ tar -zxvf ExampleData.tar.gz
ubuntu@blast:~$ wget https://downloads.sourceforge.net/project/rackj/0.99a/rackJ.tar.gz
ubuntu@blast:~$ tar -zxvf rackJ.tar.gz
ubuntu@blast:~$ sudo apt install default-jre
ubuntu@blast:~$ sudo apt install bioperl
ubuntu@blast:~$ tail -n 3 ~/.bashrc
PATH="/home/ubuntu/ncbi-blast-2.13.0+/bin:$PATH"
PATH="/home/ubuntu/ExampleData:$PATH"
export PATH
ubuntu@blast:~$ source ~/.bashrc

84. Standalone BLAST programs
5. TBLASTN BAA36183.1 against the rice genome and corresponding post-processing
ubuntu@blast:~$ tblastn -subject Oryza_sativa.IRGSP-1.0.dna.toplevel.fa -query ExampleData/BAA36183.1.fasta -outfmt "7 qaccver qlen sallacc slen pident length nident positive gapopen qstart qend sstart send sframe evalue bitscore" -out BAA36183.1.tblastn.txt -word_size 3 -window_size 0 -evalue 10
(create tmp file for consistentIterator input)
ubuntu@blast:~$ cat BAA36183.1.tblastn.txt | perl -ne 'chomp; next if /^#/; chomp; @t=split; if($t[11]>$t[12]){ print join("\t",@t[2,0,12,11,10,9])."\t$_\n" }else{ print join("\t",@t[2,0,11,12,9,10])."\t$_\n" }' > BAA36183.1.tblastn.tmp
(maximum gene size in chromosomes)
ubuntu@blast:~$ cat Oryza_sativa.IRGSP-1.0.54.gff3 | perl -ne 'if(/^#/){}else{ @t=split(/\t/); $len=$t[4]-$t[3]+1; $max=$len if $len>$max && $t[2] eq "gene"} print "$max\n" if eof'
57648
(consistentIterator)
ubuntu@blast:~$ consistentIterator.pl -parapass "-min 20 -max 20 -score 22 -limitRef 57648 -refKeep -queryKeep -strandKeep -order" BAA36183.1.tblastn.tmp /home/ubuntu/rackJ/rackj.jar | cut -f 1,8- > BAA36183.1.tblastn.grp.xls
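For readers who prefer Python over Perl one-liners, the "create tmp file" step can be sketched as below. This is our reading of the one-liner above, not a replacement shipped with the walkthrough: it skips comment lines, and when sstart > send it swaps the subject coordinates and the query coordinates together, then prefixes the six reordered fields to the original line. The column names are copied from the -outfmt string; the function names are hypothetical, and the sketch assumes tab-separated fields.

```python
# Column names copied from the -outfmt string of the tblastn command.
COLUMNS = ("qaccver qlen sallacc slen pident length nident positive gapopen "
           "qstart qend sstart send sframe evalue bitscore").split()


def parse_tabular(lines):
    """Yield (fields_dict, original_line) for each alignment row, skipping '#' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield dict(zip(COLUMNS, line.split("\t"))), line


def reoriented_row(hit: dict, original_line: str) -> str:
    """Prefix subject, query, subject start/end, query start/end to the original row."""
    qstart, qend = int(hit["qstart"]), int(hit["qend"])
    sstart, send = int(hit["sstart"]), int(hit["send"])
    if sstart > send:  # reverse-strand hit: flip both coordinate pairs
        sstart, send, qstart, qend = send, sstart, qend, qstart
    prefix = [hit["sallacc"], hit["qaccver"], sstart, send, qstart, qend]
    return "\t".join(str(x) for x in prefix) + "\t" + original_line


if __name__ == "__main__":
    # File name taken from the tblastn command above.
    with open("BAA36183.1.tblastn.txt") as handle:
        for hit, line in parse_tabular(handle):
            print(reoriented_row(hit, line))
```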

85. Important parameters of BLAST
-outfmt <format_string>
specifies the output format
be sure to apply “-help” to get detailed information
In our tblastn example, we applied
-outfmt "7 qaccver qlen sallacc slen pident length nident positive gapopen qstart qend sstart send sframe evalue bitscore"
which means text tabular output with the specified information as columns (query accession, query length, subject accession, subject length, …)

86. Important parameters of BLAST
-evalue <real_number>: E-value cutoff
filters out alignments with E-values larger than the cutoff
-dust (BLASTN) OR -seg (non-BLASTN)
filters low-complexity regions
the default setting might differ from one BLAST program to another; apply -help to check it
-num_threads <integer>
number of processors to use
speeds up BLAST searches on multi-core CPUs

87. Important parameters of BLAST
-matrix <matrix string> (default BLOSUM62)
specifies the scoring matrix for protein alignments
Matrix | Best use | Similarity (%)
BLOSUM90 | Short alignments that are highly similar | 70-90
BLOSUM80 | Detecting members of a protein family | 50-60
BLOSUM62 | Most effective in finding all potential similarities | 30-40
BLOSUM30 | Longer alignments of more divergent sequences | <30

88. Important parameters of BLAST
-word_size <integer>: index word size (for the look-up table)
Increasing this parameter increases search speed at the price of sensitivity.
-window_size <integer>:
Set this to 0 to apply the 1-hit algorithm, which increases sensitivity at the cost of search speed.
1-hit may be needed for short query sequences.

89. Standalone BLAST programs
6. (optional) TBLASTX AB003496.1 (source nucleotide of BAA36183.1) against the rice genome and corresponding post-processing
ubuntu@blast:~$ tblastx -subject Oryza_sativa.IRGSP-1.0.dna.toplevel.fa -query ExampleData/AB003496.1.fasta -out AB003496.1.tblastx.out -evalue 10
ubuntu@blast:~$ parseTBLASTX.pl AB003496.1.tblastx.out > AB003496.1.tblastx.txt
(create tmp file for consistentIterator input)
ubuntu@blast:~$ cat AB003496.1.tblastx.txt | perl -ne 'chomp; next if /^#/; chomp; @t=split; if($t[11]>$t[12]){ print join("\t",@t[2,0,12,11,10,9])."\t$_\n" }else{ print join("\t",@t[2,0,11,12,9,10])."\t$_\n" }' > AB003496.1.tblastx.tmp
(consistentIterator)
ubuntu@blast:~$ consistentIterator.pl -parapass "-min 20 -max 20 -score 22 -limitRef 57648 -refKeep -queryKeep -strandKeep -order" AB003496.1.tblastx.tmp /home/ubuntu/rackJ/rackj.jar | cut -f 1,8- > AB003496.1.tblastx.grp.xls

90. Standalone BLAST programs
The consistentIterator.pl script partitions fragmented BLAST alignments into co-linear and non-overlapping groups.
We saved the grouped TBLASTN and TBLASTX results in BAA36183.1.footprint.xlsx.
1st group of alignments in sheet BAA36183.1.tblastn.grp of Excel file BAA36183.1.footprint.xlsx:
alnGroup | chr | identity | aln length | qstart | qend | sstart | send | frame | evalue | bitscore
1 | 1 | 100 | 40 | 1 | 40 | 25382818 | 25382937 | 1 | 1.74E-84 | 82.4
1 | 1 | 97.561 | 123 | 41 | 163 | 25383050 | 25383418 | 2 | 1.74E-84 | 253
1 | 1 | 100 | 210 | 163 | 372 | 25383803 | 25384432 | 2 | 2.58E-119 | 383

91. Standalone BLAST programs
This 1st co-linear and non-overlapping alignment group, visualized in the Ensembl website.
Our query is 372 AAs long.
(Figure labels: Query 1-40; Query 41-163; Query 163-372.)

92. Short summaries
Standalone BLAST programs give us the ability to
run BLAST programs programmatically,
designate the BLAST output information, and
post-process the outputs.
The consistentIterator.pl script can be used to
group low-similarity and fragmented BLAST alignments into co-linear and non-overlapping groups.

93. Short summaries
Grouped alignments with high coverage of the query usually mean a confident footprint.
In practice, we have seen 30 alignments with bitscore<100 (low similarity) grouped together for a 2500 bp TBLASTX query:
a highly mutated footprint.
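Query coverage of a group of alignments can be computed from the qstart/qend columns. The following Python sketch is our own illustration (it is not part of consistentIterator.pl): it counts each query position once even when alignments overlap, then divides by the query length.

```python
def query_coverage(intervals, query_length: int) -> float:
    """Fraction of query positions covered by at least one (qstart, qend) interval.

    Intervals are 1-based and inclusive, as in BLAST tabular output.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        start = max(start, last_end + 1)  # do not count overlapping positions twice
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_length


if __name__ == "__main__":
    # The 1st TBLASTN alignment group of BAA36183.1 (372 AAs).
    group = [(1, 40), (41, 163), (163, 372)]
    print(query_coverage(group, 372))
```

For the first TBLASTN group the three alignments together cover the whole 372-AA query.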

94. Finally
Thank you for your attention.
I am willing to answer and/or discuss questions via email or in some other interactive form.
Please don’t hesitate to let me know if you have any questions.