YAPS: Yet Another Protein Similarity

Tomas Novosad, Vaclav Snasel
Department of Computer Science, VSB - Technical University of Ostrava, Ostrava, The Czech Republic
Email: {tomas.novosad, vaclav.snasel}@vsb.cz

Ajith Abraham
Machine Intelligence Research Labs (MIR Labs), http://www.mirlabs.org
Email: ajith.abraham@ieee.org

Jack Y. Yang
Harvard University, Box 400888, Cambridge, Massachusetts 02140-0888, USA
Email: dr.jack.yang@gmail.com

Abstract

In this article we present a novel method for measuring protein similarity based on tertiary structure. Our method combines suffix trees with classical information retrieval techniques such as the vector space model, the tf-idf term weighting scheme, and various similarity measures. Our goal is to use the entire PDB database of known proteins, not just the selections of it that have been studied in other works. To verify our algorithm we compare its results against the SCOP database, which is maintained primarily by humans. A further goal is to categorize proteins not included in the latest version of the SCOP database with nearly 100% accuracy.

1 Introduction

Analyzing three-dimensional protein structures is a very important task in molecular biology. Protein structures are increasingly solved with state-of-the-art technologies such as nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography, as reflected in the growing number of PDB [16] entries: 56,366 as of March 10, 2009. It has been proved that structurally similar proteins tend to have similar functions even when their amino acid sequences are not similar to each other. It is therefore very important to find proteins with similar structures (even in part) in the growing database in order to analyze protein functions. Yang et al.
[24] exploited machine learning techniques, including variants of self-organizing global ranking, decision trees, and support vector machine algorithms, to predict the tertiary structure of transmembrane proteins. Hecker et al. [10] developed a state-of-the-art protein disorder predictor and tested it on a large protein disorder dataset created from the Protein Data Bank; the relationship between sensitivity and specificity was also evaluated. Habib et al. [8] presented a new SVM-based approach to predicting subcellular locations based on amino acid and amino acid pair composition; taking more protein features into consideration improves the accuracy significantly. Wang et al. [22] discussed an empirical approach to specifying the localization of protein binding regions, utilizing information including the distribution pattern of the detected RNA fragments and the sequence specificity of RNase digestion.

In this paper we present a novel method for analyzing three-dimensional protein structures using suffix trees and classical information retrieval methods and schemes. Several studies have addressed indexing of protein tertiary structure [5, ]; these studies are targeted mainly at some selection of the PDB database. The goal of this work is to take the whole current PDB database into account and calculate the similarity of each protein to every other protein. The suffix tree is a very useful data structure which can discover common substructures of proteins in reasonable (linear or logarithmic) time, depending on the implementation of the construction algorithm.

2009 International Conference on Soft Computing and Pattern Recognition, 978-0-7695-3879-2/09 © 2009 IEEE, DOI 10.1109/SoCPaR.2009.101
When the generalized suffix tree is constructed for all proteins appearing in the entire PDB database, we apply methods similar to those previously studied [26, 9, 3, 13] for measuring the similarity of proteins based on their three-dimensional structure. Our work builds on the relations between amino acid residues defined by their dihedral angles, rather than the relations between the alpha carbon atoms alone; DALI, for example, uses the relations between alpha carbons when computing the distance matrix between the alpha carbon atoms of a given protein. In the final stage we build a vector space model, which is well suited to various information retrieval tasks and can be used for future studies of protein relations.

2 Background

2.1 Protein and Its Structure

Proteins are large molecules. In many cases only a small part of the structure, an active site, is directly functional; the rest exists only to create and fix the spatial relationship among the active site residues [11]. Chemically, protein molecules are long polymers typically containing several thousand atoms, composed of a uniform repetitive backbone (or main chain) with a particular side chain attached to each residue. The amino acid sequence of a protein records the succession of side chains. The polypeptide chain folds into a curve in space; the course of the chain defines a folding pattern. Proteins show a great variety of folding patterns. Underlying these are a number of common structural features, including the recurrence of explicit structural paradigms, for example α-helices and β-sheets, and common principles or features such as the dense packing of the atoms in protein interiors. Folding may be thought of as a kind of intramolecular condensation or crystallization [11].
2.1.1 Protein Data Bank (PDB)

The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies. As a member of the wwPDB, the RCSB PDB curates and annotates PDB data according to agreed-upon standards [16].

2.1.2 Dihedral Angles

Any plane can be defined by two non-collinear vectors lying in that plane; taking their cross product and normalizing yields the unit normal vector to the plane. Thus a dihedral angle can be defined by four pairwise non-collinear vectors. The backbone dihedral angles of proteins are called φ (phi, involving the backbone atoms C-N-Cα-C), ψ (psi, involving the backbone atoms N-Cα-C-N) and ω (omega, involving the backbone atoms Cα-C-N-Cα). Thus φ controls the C-C distance, ψ controls the N-N distance and ω controls the Cα-Cα distance.

2.2 Vector Space Model

The vector model [1] of documents dates back to the 1970s. In the vector model, documents and user queries are represented by vectors. We use m different terms t_1, ..., t_m for indexing N documents. Each document d_i is then represented by a vector

    d_i = (w_{i1}, w_{i2}, ..., w_{im}),

where w_{ij} is the weight of term t_j in document d_i. The index file of the vector model is represented by the matrix

    D = ( w_{11}  w_{12}  ...  w_{1m}
          w_{21}  w_{22}  ...  w_{2m}
          ...
          w_{N1}  w_{N2}  ...  w_{Nm} ),

where the i-th row corresponds to the i-th document and the j-th column to the j-th term. The similarity of two documents is given by the formula

    sim(d_i, d_j) = ( Σ_{k=1}^{m} w_{ik} w_{jk} ) / sqrt( Σ_{k=1}^{m} w_{ik}^2 · Σ_{k=1}^{m} w_{jk}^2 )

For more information see [12, 15, 1].

2.3 Suffix Trees

A suffix tree is a data structure that admits efficient string matching and querying.
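The cosine similarity formula of section 2.2 above can be illustrated with a short sketch; the term-weight vectors here are made-up toy values, not real protein data:

```python
import math

def cosine_similarity(d_i, d_j):
    """Cosine similarity of two term-weight vectors (section 2.2)."""
    dot = sum(wi * wj for wi, wj in zip(d_i, d_j))
    norm_i = math.sqrt(sum(wi * wi for wi in d_i))
    norm_j = math.sqrt(sum(wj * wj for wj in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Two toy documents over m = 3 terms.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1 -> similarity 1.0
d3 = [0.0, 0.0, 5.0]   # orthogonal to d1   -> similarity 0.0

print(cosine_similarity(d1, d2))
print(cosine_similarity(d1, d3))
```

Note that cosine similarity depends only on the direction of the weight vectors, not their length, which is why sim(d_i, d_i) is always 1.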
Suffix trees have been studied and used extensively, and have been applied to fundamental string problems such as finding the longest repeated substring [23], string comparison [4], and text compression [17]. The following description of the suffix tree is taken from Dan Gusfield's book Algorithms on Strings, Trees and Sequences [7]. One major difference is that we treat documents as sequences of words, not characters. A suffix tree of a string is simply a compact trie of all the suffixes of that string. In more precise terms [25]:

Definition 2.1. A suffix tree T for an m-word string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of words of S. No two edges out of a node can have edge labels beginning with the same word. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, that is, it spells out S[i..m].

In a similar manner, a suffix tree of a set of strings, called a generalized suffix tree [7], is a compact trie of all the suffixes of all the strings in the set [25]:

Definition 2.2. A generalized suffix tree T for a set S of n strings S_n, each of length m_n, is a rooted directed tree with exactly Σ m_n leaves, each marked by a two-number tuple (k, l) where k ranges from 1 to n and l ranges from 1 to m_k. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of words of a string in S. No two edges out of a node can have edge labels beginning with the same word.
For any leaf (i, j), the concatenation of the edge labels on the path from the root to leaf (i, j) exactly spells out the suffix of S_i that starts at position j, that is, it spells out S_i[j..m_i].

Several linear-time algorithms for constructing suffix trees exist [14, 21, 23]. In this work we have made some implementation improvements to the naive suffix tree construction algorithm to achieve better than the O(L^2) worst-case time bound. With these improvements we have achieved constant access time when finding the appropriate child of the root, and logarithmic time to find an existing child or to insert a new child node at any other internal node of the tree [13].

3 Preparing the Data

In this section we describe the process of retrieving the data for protein indexing. We use the whole PDB database, which consists of approximately 49,000 known proteins.

3.1 Creating the Protein Collection

The current PDB database contains proteins, nucleic acids, and complex assemblies. Our study focuses only on relations between proteins, so we filtered out all nucleic acids and complex assemblies from the entire PDB database. Next we filtered out proteins with incomplete N-Cα-C-O backbones (e.g., files with Cα atoms missing from the protein backbone). After this cleaning step we obtained a collection of 44,351 files. Each file contains a description of one protein and its three-dimensional structure, and contains only amino acid residues with a complete N-Cα-C-O atom sequence. From each file we retrieved at least one main chain (some proteins have more than one main chain) of at least one model (in some cases PDB files contain several models of the three-dimensional protein structure).
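The backbone-completeness filter of section 3.1 can be sketched as follows. The per-residue atom sets are hypothetical stand-ins for parsed PDB records, not the paper's actual parsing code:

```python
REQUIRED_BACKBONE = {"N", "CA", "C", "O"}  # N-Calpha-C-O, as in section 3.1

def has_complete_backbone(residues):
    """residues: one set of atom names per residue of a chain.
    Returns True only if every residue carries the full backbone."""
    return all(REQUIRED_BACKBONE <= set(atoms) for atoms in residues)

chain_ok = [{"N", "CA", "C", "O", "CB"}, {"N", "CA", "C", "O"}]
chain_bad = [{"N", "CA", "C", "O"}, {"N", "C", "O"}]  # CA missing

print(has_complete_backbone(chain_ok))   # True
print(has_complete_backbone(chain_bad))  # False
```

Chains failing this check are dropped from the collection before encoding.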
When a PDB file contains several main chains or several models, we take all main chains of all models into account.

3.2 Encoding the 3D Protein Main Chain Structure for Indexing

To index proteins with IR techniques we need to encode the 3D structure of the protein backbone into a sequence of characters, words, or integers (as in our case). Since the protein backbone is a sequence of amino acid residues (in 3D space), we can encode the backbone into a sequence of integers in the following manner. As a simple example, suppose the protein backbone consists of four amino acid residues M V L S (abbreviations for methionine, valine, leucine and serine). The relationship between two consecutive residues can be described by the dihedral angles φ, ψ and ω. Since φ and ψ take values from the interval ⟨-180°, 180°⟩, some normalization is needed. From this interval we obtained 36 discrete values (the interval was divided into 10° steps), i.e. -180°, -170°, ..., 0°, 10°, ..., 180°. Each of these values was labeled with a number (00, 01, 02, ..., 35). Now suppose φ is -21°; the closest discrete value is -20°, which has the label 02, so we encode this dihedral angle with the string '02'. The same holds for ψ. The angle ω was encoded as one of the two characters A or B, since ω tends in almost every case to be close to 0° or 180°. Concatenating these three parts yields a string such as 'A0102', which means ω ≈ 180°, φ ≈ -10°, ψ ≈ -20°.
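The encoding step can be sketched as below. Caution: the paper does not fully specify how bin labels are assigned to angles (its example maps -20° to label '02'); this sketch assumes the simpler mapping label k ↔ -180° + 10k (labels 00-35, with -180° and 180° identified), so it is illustrative only:

```python
def encode_residue_pair(phi, psi, omega):
    """Encode one (omega, phi, psi) triple into a token such as 'A0102'.

    Assumed labeling: label k corresponds to angle -180 + 10*k degrees
    (k = 00..35); this may differ from the paper's actual assignment.
    """
    def bin_label(angle):
        k = round((angle + 180.0) / 10.0) % 36  # nearest 10-degree bin
        return f"{k:02d}"

    # omega is almost always near 0 or 180 degrees: 'A' ~ 180, 'B' ~ 0.
    om = "A" if abs(abs(omega) - 180.0) < 90.0 else "B"
    return om + bin_label(phi) + bin_label(psi)

print(encode_residue_pair(-21.0, 137.0, 179.5))  # 'A1632' under this labeling
```

A backbone of r residues thus yields r-1 such tokens, one per consecutive residue pair.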
M V L S E G, with its three-dimensional properties. The resulting encoded sequence could be, for example:

    {A3202, A2401, A2603, A2401, A2422}

After obtaining this sequence of 5 words, we create a dictionary of these words (each unique word receives its own unique integer identifier). The translated sequence then looks like this:

    {0, 1, 2, 1, 3}

In this way we encode each main chain of each model contained in a PDB file. This is done for every protein in our filtered PDB collection. Now we are ready to index the proteins using suffix trees.

4 Protein Similarity Algorithm

In this section we describe the algorithm for measuring protein similarity based on tertiary structure. A brief description of the algorithm follows:

1. Prepare the data as described in section 3.
2. Insert all encoded main chains of all proteins in the collection into the generalized suffix tree data structure.
3. Find all maximal substructure clusters in the suffix tree.
4. Construct a vector model of all proteins in the collection.
5. Build the protein similarity matrix.
6. For each protein, find the top N most similar proteins.

4.1 Inserting All Main Chains into the Suffix Tree

In this stage of the algorithm we construct a generalized suffix tree of all encoded main chains. As mentioned in section 3, we obtain the encoded forms of the three-dimensional protein main chains as sequences of non-negative integers. All of these sequences are inserted into the generalized suffix tree data structure (section 2.3).

4.2 Finding All Maximal Substructure Clusters

To build a vector model of the proteins we have to find all maximal phrase clusters. The definition of the maximal phrase cluster (the longest common substructure) follows [26]:

Definition 4.1. A phrase cluster is a phrase that is shared by at least two documents, together with the group of documents that contain the phrase.
A maximal phrase cluster is a phrase cluster whose phrase cannot be extended by any word in the language without changing (reducing) the group of documents that contain it. Maximal phrase clusters are those we are interested in. A phrase in our context is an encoded protein main chain or any of its parts; a document in our context is the set of encoded main chains of a protein. We simply traverse the generalized suffix tree and identify all maximal phrase clusters (i.e. all of the longest common substructures).

4.3 Building a Vector Model

In this section we describe the procedure for building the matrix representing the vector model index file (section 2.2). In a classical vector space model a document is represented by its terms, or rather by the weights of those terms. In our model a document is represented not by terms but by the common phrases (maximal phrase clusters) it belongs to. In the previous stage of the algorithm we identified all maximal phrase clusters, i.e. all of the longest common substructures. From the definition of the phrase cluster we know that a phrase cluster is a group of documents sharing the same phrase (a group of proteins sharing the same substructure). We can therefore obtain the matrix representing the vector model index file directly from the generalized suffix tree. Each document (protein) is represented by the maximal phrase clusters in which it is contained. For computing the weights of the phrase clusters we use the tf-idf weighting scheme:

    w_ij = tf_ij × idf_j = tf_ij × log(n / df_j)    (1)

where tf_ij is the frequency of term t_j in document d_i, df_j is the number of documents in which term t_j appears, and n is the total number of documents in the collection. A simple example: suppose we have a phrase cluster containing documents d_i; these documents share the same phrase t_j.
We compute the w_ij values for all documents appearing in a phrase cluster sharing the phrase t_j. This is done for all phrase clusters identified in the previous stage of the algorithm. We then have a complete matrix representing the index file in a vector space model (section 2.2).

4.4 Building a Similarity Matrix

In the previous stage of the algorithm we constructed a vector model index file. To build a protein similarity matrix we use standard information retrieval techniques for measuring similarity in a vector space model. As mentioned in section 2.2, we use cosine similarity, which is quite suitable for our purpose. The document (protein) similarity matrix is:

    S = ( 0             sim(d_1,d_2)  ...  sim(d_1,d_n)
          sim(d_2,d_1)  0             ...  sim(d_2,d_n)
          ...
          sim(d_n,d_1)  sim(d_n,d_2)  ...  0            ),

where the i-th row corresponds to the i-th document (protein) and the j-th column to the j-th document (protein). The similarity matrix is symmetric. Note that we have put zeros on the diagonal to eliminate sim(d_i, d_i), which is always equal to 1, and to simplify the last step of the algorithm. As this task is the most time consuming, we developed a multi-threaded variant of computing the similarity matrix: we simply divided the matrix into n equal parts and let each of the n threads compute its own part. With this small enhancement we achieved a very good reduction of the time needed to compute the similarity matrix (a multi-processor or multi-core computer is required).

4.5 Finding Similar Proteins

This step is quite simple. Once we have computed the similarity matrix S, we sort the documents (proteins) in each row according to their scores.
The higher the score, the more similar the protein. This is done for each protein in our protein collection.

5 Evaluation and Testing

5.1 Structural Classification of Proteins

To evaluate the accuracy and effectiveness of our algorithm we compare its results with the SCOP database [19]. SCOP is maintained primarily by humans, in contrast with, for example, CATH, which uses some automated methods. The current version of the SCOP database classifies about 33,000 proteins. We chose SCOP because we wanted to evaluate our algorithm against manually classified proteins rather than automated methods. The other structural classification system, CATH, is a hierarchical classification of protein domain structures which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, including computational techniques, empirical and statistical evidence, literature review and expert analysis [2]. CATH uses the DALI algorithm to find similarities between proteins.

5.2 Evaluation

For each protein P in our collection C we did the following:

1. Determine the class, folding pattern group, superfamily, family and domain of protein P.
2. Based on the similarity matrix, find the N proteins P_S most similar to P according to their similarity scores.
3. For each protein P_S, determine its class, folding pattern group, superfamily, family and domain.
4. Over all proteins in the collection, compute the percentage of proteins P_S correctly classified with respect to protein P.

We did this for each protein in our collection and computed the overall percentage accuracy over our filtered collection. Approximately 10,000 proteins remain unclassified because they do not appear in the SCOP database.
In more precise terms: given a protein P, based on the calculated similarity matrix we sort all other proteins P_S in our collection in descending order of their scores. The greater the score, the more similar the protein is to protein P. We take only the top N highest-scoring proteins (the N proteins most similar to the given protein), with N set to 20. For every protein in our collection we then determine the SCOP classification of these similar proteins.

5.3 Experiments

Here we present our first results with this new method of measuring protein similarity based on tertiary structure, in comparison with the SCOP database. All experiments were run on a computer with 32 GB of RAM and four 64-bit dual-core AMD Opteron CPUs. Indexing the whole PDB database with our version of the suffix tree construction algorithm takes about 2.5 GB of RAM and about 40 minutes (section 4.1). The calculation of the similarity matrix (section 4.4) takes about 45 hours and 10 GB of RAM, since the similarity matrix is computed in memory.

First we computed the percentage accuracy over all proteins in the entire SCOP database (32,509 classified proteins). Next we computed the accuracy only for proteins for which our algorithm found proteins with at least some given similarity score (e.g., protein A is kept only if at least one protein has a similarity score with A of at least 0.2; we cut off all proteins which do not satisfy this assumption). This acts as a threshold or cutoff.

The description of Table 1 is as follows (Figures 1-4 show these results graphically). Column No. gives the rank of the similar protein (e.g., No. 1 means the most similar protein to a given protein, No.
10 means the 10th most similar protein to a given protein). The sim columns were described above. The row Count gives how many proteins were found in our collection at the given cutoff. In more precise terms, row 1 of Table 1 (not counting the header) means that the proteins placed in first place (i.e. the most similar protein to a given protein) yield an 89.36909% accuracy in the classification of class with no cutoff, an 89.62752% accuracy with a cutoff of 0.1, and so on.

Table 1. Class classification percentage accuracy.

    No.   | sim 0.00 | sim 0.10 | sim 0.15 | sim 0.20 | sim 0.25
    1     | 89.36    | 89.62    | 96.24    | 99.18    | 99.39
    2     | 84.42    | 84.65    | 91.18    | 94.52    | 95.18
    3     | 81.84    | 82.05    | 88.09    | 91.80    | 93.02
    4     | 79.86    | 80.04    | 85.68    | 89.28    | 90.39
    5     | 78.05    | 78.27    | 83.74    | 87.22    | 88.38
    6     | 76.92    | 77.11    | 82.13    | 85.98    | 87.01
    7     | 75.73    | 75.92    | 80.94    | 84.38    | 85.45
    8     | 74.73    | 74.89    | 79.70    | 82.91    | 84.06
    9     | 74.02    | 74.16    | 78.70    | 81.94    | 83.15
    10    | 73.37    | 73.54    | 77.72    | 80.99    | 82.14
    Count | 32509    | 32297    | 23780    | 16481    | 11630

We have also identified the class, fold, superfamily, family and domain of proteins which are not classified by SCOP, with almost 100% membership accuracy; Table 2 shows these results.

Table 2. Proteins unclassified by SCOP found by our algorithm and their membership percentage accuracy (mpa) to a given Class, Fold, Superfamily, Family and Domain.

    sim  | mpa C | mpa F | mpa SF | mpa Fam | mpa D | UPC  | TPC
    0.00 | 89.36 | 83.24 | 82.98  | 82.51   | 80.39 | 3352 | 32509
    0.10 | 89.62 | 83.60 | 83.34  | 82.87   | 80.75 | 3303 | 32297
    0.15 | 96.24 | 94.11 | 94.01  | 93.88   | 92.84 | 1395 | 23780
    0.20 | 99.18 | 98.87 | 98.84  | 98.79   | 98.33 | 636  | 16481
    0.25 | 99.39 | 99.19 | 99.19  | 99.15   | 98.85 | 384  | 11630
    0.30 | 99.38 | 99.19 | 99.19  | 99.14   | 98.88 | 247  | 8083

Consider row 4 of Table 2. Column sim = 0.2 means that we have chosen only proteins which have at least one structurally similar protein with a similarity score of at least 0.2.
Column mpa C means minimal membership percentage accuracy with respect to the SCOP protein class (likewise for Fold, Superfamily, Family and Domain). Column UPC, the unclassified protein count, is the number of proteins which are not classified by SCOP and which appear in first place in the list of proteins similar to a given protein. Column TPC, the total protein count, is the total number of proteins which have at least one structurally similar protein with a similarity score of at least 0.2. In summary, this means that we have found 636 proteins unclassified by SCOP out of 16,481, and that these proteins have a 99.18694% class membership accuracy, a 98.87143% fold membership accuracy, and so on.

Figure 1. Protein Class Membership Percentage Accuracy.

Figure 2. Protein Folding Pattern Membership Percentage Accuracy.

6 Conclusion

In this article we have presented a novel method for measuring protein similarity using the suffix tree data structure and information retrieval techniques. The method is fully automated and, in comparison with the human-maintained SCOP database, has achieved very good results. We have also shown that common information retrieval models and methods can be used for measuring the similarity of proteins, again with very good results. We can also identify the classes, folds, superfamilies, families and domains of many unclassified proteins in the current SCOP database with almost 100% membership accuracy.
Figure 3. Protein Super-Family Membership Percentage Accuracy.

Figure 4. Protein Family Membership Percentage Accuracy.

This follows from the simple observation that when an unclassified protein is most similar to a classified protein with at least some given score, then in 99% of cases the unclassified protein has the same SCOP categories as the known protein. We now have a similarity matrix computed for all proteins in our PDB collection. In future work we want to use the similarity matrix for other information retrieval tasks such as clustering or the application of statistical methods. Clustering of proteins is one of the first steps in homology modeling, which we want to develop in the future. We also want to try other methods of encoding the dihedral angles, such as clustering the angles, which we believe should give better results.

References

[1] Baeza-Yates R., Ribeiro-Neto B.: Modern Information Retrieval. Addison-Wesley, 1999.
[2] CATH: Protein Structure Classification. http://www.cathdb.info/ (last accessed July 10, 2009)
[3] Chim H., Deng X.: A new suffix tree similarity measure for document clustering. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), ACM, New York, NY, pages 121-130.
[4] Ehrenfeucht A., Haussler D.: A new distance metric on strings computable in linear time. Discrete Applied Mathematics, 20(3):191-203, 1988.
[5] Gao F., Zaki M.J.: PSIST: Indexing protein structures using suffix trees. Proc. IEEE Computational Systems Bioinformatics Conference (CSB), pages 212-222, 2005.
[6] Google Search Engine: http://www.google.com.
[7] Gusfield D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[8] Habib T., Zhang C., Yang J.Y., Yang M.Q., Deng Y.: Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition. BMC Genomics 2008, 9(Suppl 1):S16.
[9] Hammouda K.M., Kamel M.S.: Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 16(10):1279-1296, 2004.
[10] Hecker J., Yang J.Y., Cheng J.: Protein disorder prediction at multiple levels of sensitivity and specificity. BMC Genomics 2008, 9(Suppl 1):S9.
[11] Lesk A.M.: Introduction to Bioinformatics. Oxford University Press, USA, 2008.
[12] Manning C.D., Raghavan P., Schütze H.: Introduction to Information Retrieval. Cambridge University Press, 2008.
[13] Martinovič J., Novosad T., Snášel V.: Vector model improvement using suffix trees. IEEE ICDIM 2007, pages 180-187.
[14] McCreight E.: A space-economical suffix tree construction algorithm. Journal of the ACM, 23:262-272, 1976.
[15] van Rijsbergen C.J.: Information Retrieval (second ed.). London, Butterworths, 1979.
[16] RCSB Protein Data Bank - PDB. http://www.rcsb.org (last accessed July 10, 2009)
[17] Rodeh M., Pratt V.R., Even S.: Linear algorithm for data compression via string matching. Journal of the ACM, 28(1):16-24, 1981.
[18] Salton G., Buckley C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[19] SCOP: A structural classification of proteins database for the investigation of sequences and structures.
http://scop.mrc-lmb.cam.ac.uk/scop/ (last accessed July 10, 2009)
[20] Shibuya T.: Geometric suffix tree: A new index structure for protein 3D structures. In Combinatorial Pattern Matching, LNCS 4009, pages 84-93, 2006.
[21] Ukkonen E.: On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.
[22] Wang X., Wang G., Shen C., Li L., Wang X., Mooney S.D., Edenberg H.J., Sanford J.R., Liu Y.: Using RNase sequence specificity to refine the identification of RNA-protein binding regions. BMC Genomics 2008, 9(Suppl 1):S17.
[23] Weiner P.: Linear pattern matching algorithms. In The 14th Annual Symposium on Foundations of Computer Science, pages 1-11, 1973.
[24] Yang J.Y., Yang M.Q., Dunker A.K., Deng Y., Huang X.: Investigation of transmembrane proteins using a computational approach. BMC Genomics 2008, 9(Suppl 1):S7.
[25] Zamir O.: Clustering web documents: A phrase-based method for grouping search engine results. Doctoral dissertation, University of Washington, 1999.
[26] Zamir O., Etzioni O.: Web document clustering: A feasibility demonstration. In SIGIR'98, pages 46-54, 1998.
