Computer Physics Communications    www
119K - views

Computer Physics Communications www

elseviercomlocatecpc A specialpurpose computer for exploring similar protein sequences by the dynamic programming method Takashige Sugie Keiji Horita Satoru Hosono Tomoyoshi Ito Toshikazu Ebisuzaki Laboratory of Tomoyoshi Ito Department of Electro

Download Pdf

Computer Physics Communications www

Download Pdf - The PPT/PDF document "Computer Physics Communications www" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

Presentation on theme: "Computer Physics Communications www"— Presentation transcript:

Page 1
Computer Physics Communications 157 (2004) 1–8 A special-purpose computer for exploring similar protein sequences by the dynamic programming method Takashige Sugie , Keiji Horita , Satoru Hosono Tomoyoshi Ito , Toshikazu Ebisuzaki Laboratory of Tomoyoshi Ito, Department of Electronics and Mechanical Engineering, Chiba University, 1-33, Yayoi, Inage, Chiba 263-8522, Japan The Institute of Physical and Chemical Research, Japan Received 5 February 2003; received in revised form 1 October 2003 Abstract We built a special-purpose computer for exploring

similar protein sequences by the dynamic programming method, BIOLER- 1 (BIOLogical sequence explorER). It can compute a complete similarity between two protein sequences which have less than 1024 amino acids. We integrated the system on a PLD (Programmable Logic Device) chip, APEX EP20K300EQC240-1 (300,000 gates) by Altera corporation. It is mounted on the 32 bit PCI (Peripheral Component Interconnect) bus board which is connected to a personal computer. The performance is 3564 MIPS (Million Instructions Per Second) which is three times faster than a personal computer with Pentium4 at 1.8 GHz.

BIOLER-1 showed us the effectiveness in the field of biological sequence analysis. 2003 Elsevier B.V. All rights reserved. PACS: 87.80.+s Keywords: Special-purpose computer; PLD; Dynamic programming; Pairwise alignment; Protein sequence; Bioinformatics; Homology search 1. Introduction The human genome project [1] has started since 1988. At last, mapping and sequencing has finished in April 2003 [2–7]. Full length human DNA (Deoxyri- boNucleic Acid) is over 3 billion bases. Biological se- quence analysts are tasked by a large amount of text processing. They analyze DNA and protein

sequences to identify a hereditary function. In recent years, com- Corresponding author. E-mail address: (T. Sugie). puter technology has shown remarkable progress, but is not enough to compute protein analysis tasks yet. For example, if they analyze one hundred thousands proteins which have 1000 characters of amino acids and the human DNA which has 3 10 characters of bases, the computational complexity is O 10 17 The dynamic programming method [8] is well known as solving the global matching problem of pair- wise alignment in the biological sequence analysis. The computational

complexity is O (n where is the number of amino acids in a protein sequence. It can be reduced to O (n using the affine gap penalty func- tion [9]. (Hereafter, we call the dynamic programming 0010-4655/$ – see front matter 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0010-4655(03)00487-9
Page 2
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 method using the affine gap penalty function “DP”.) The DP can completely get a similarity. Some meth- ods to reduce the computational complexity of the DP have been developed. FASTA [10,11] and BLAST [12]

methods are well known as the biological sequence analysis. Specially, the calculation time was reduced by BLAST. However, the both algorithms have a pos- sibility of missing a similarity because of the nature of each algorithm. In another, it is attempted to improve the perfor- mance of computer itself. One is the PC (Personal Computer) cluster system or the grid computing sys- tem. Such a system obtains us for high performance, but it is difficult that each PC gives full performance. For some problems, a special-purpose computer system is suitable. It has very high performance for no

waste. Some special-purpose computers have already designed, for example, Refs. [13,14]. The DP can completely get a similarity by computing only simple arithmetic. This feature is suitable for a special- purpose computer. So we built a special-purpose computer for exploring similar protein sequences by the dynamic programming method, which we named BIOLER-1. It is a prototype device of biological sequence explorer. BIOLER-1 was designed with the pipeline ar- chitecture in one PLD (Programmable Logic De- vice) chip. BIOLER-1 can be activated at 33 MHz. BIOLER-1 has performance of 3564 MIPS

(108 in- structions 33 MHz 1 pipeline 1 chip). BIOLER- 1 is mounted on the 32 bit PCI (Peripheral Compo- nent Interconnect) bus board, and it is driven by the host computer. It costs us about 100,000 Japanese yen (750 US dollars) to build BIOLER-1. BIOLER-1 can calculate a similarity between the two proteins which have the number of amino acids less than 1024. The number is not yet enough, but BIOLER-1 shows that a special-purpose computer system has the potential performance for biological sequence analysis. We explain a pairwise alignment with the DP in Section 2. We describe the hardware

design of BIOLER-1 in Section 3. We show the performance of BIOLER-1 in Section 4. We discuss about a special- purpose computer for the biological sequence analysis in Section 5. 2. Pairwise alignment Fig. 1(a) shows the pairwise alignment between the two protein sequences, GKFD and GFSD. ‘G’, ‘K’, ‘F’, ‘D’ and ‘S’ are amino acid symbols. We calculate the score from the upper left to the lower right at the two-dimensional network, and get the similarity between the two proteins. Fig. 1(b) shows the state of calculation at one node in the network. The calculation for getting a similarity is

simple operations [15] as follows: i,j Max i,j ,(D ,j a),(D ,j a) (1) b, (2) i,j Max i,j ,D ,j ,D ,j ]+ i,j i,j Max (D i,j a),(D ,j a),D ,j (3) b. The i,j is a similarity at a node and .The superscripts of i,j are the way to calculate. They correspond to Fig. 1(b). The Max operation returns the maximum score in three values in the bracket. The i,j is the element with and at the probability transition matrix. The is the opening gap cost. The is the extending gap cost. The probability transition matrix gives us the similarity between two amino acids. We used PAM250 (Point Accepted Mutation)

matrix, called Dayhoff matrix [16]. The two gap costs, and , are decided by the probability transition matrix and the problem we solve [17]. In the Dayhoff matrix, the opening gap cost is 7 and the extending gap cost is 1. We calculate for all nodes using Eqs. (1), (2) and (3). We can get the similarity between two proteins at the best score of the lower right node. Here, we trace the best path in the network of Fig. 1(a). The upper left node has 0 score. The scores at the first column and row nodes are linear, because the way to a node is only one. The first nodes and have the

score which is the opening gap cost and the extending gap cost. The following nodes have the score which is the score of previous node and the extending gap cost. Therefore, we can start the calculation at the node . In the case that the way goes to the right node from the , we can get 3 score using Eq. (1). Each
Page 3
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 (a) (b) Fig. 1. Pairwise alignment. value in the Max operator is the score as follows: = 16 = = 23 The Max operator in Eq. (1) selects the way from the node . Therefore, the is decided 3. In

thesameway,the is 0 and the is 3. If we continue the calculation to the lower right node, we can get the best score, 2. The best score is the similarity between the two proteins. If we keep a past path, we can get a pairwise alignment. The bold line is the alignment between the two proteins in Fig. 1(a). The alignment using PAM250 is given as the following sequences. GKF-D G-FSD A hyphen expresses a gap which means that an amino acid is lacking. The alignment indicates that regarding the amino acids, ‘K’ and ‘S’ as lack or insertion is the best. If we use a probability transition matrix which

is different from PAM250, we may get the below alignment: GKFD GFSD In this case, the alignment means that the amino acids, ‘K’ and ‘F’ mutate and become ‘F’ and ‘S’. The above information is important in the biology. However, computing an alignment demands a huge memory space. Therefore, BIOLER-1 only calculates the score. We obtain the alignment using the host computer. 3. Hardware design BIOLER-1 is designed with the pipeline architec- ture in a PLD chip. It is driven on the host computer through the PCI bus. We operate it as follows. (1) The host computer sets the number of amino acids for

query and known protein sequences. (2) The host sets amino acid symbols of them. (3) The host sends a signal to start the pipeline. (4) The host gets the best score after the calculation has done. We repeat the above operations until no query se- quence exists. 3.1. Pipeline Fig. 2 is the block diagram of BIOLER-1. The comparators in the figure compare two input scores to output the larger value. The gap cost is a constant number which is decided by the probability transition matrix. BIOLER-1 uses the Dayhoff matrix. In this matrix, the opening gap cost is 7 and the extending gap cost is

1. i,j is an element of the Dayhoff matrix of amino acid and . All data format is integer. All calculations are fixed-point operations.
Page 4
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 Fig. 2. Block diagram of the BIOLER-1 pipeline. We follow the pipeline. The first step is to fetch each score from RAM (Random Access Memory) to register. The second step has two processes. A pattern to go from a left node to the right does not add an opening gap cost. A pattern from an upper to the lower does not either. We change Eqs. (1) and (3) to the next

expressions for efficient pipeline, i,j Max (D i,j a),D ,j ,D ,j (4) (a b), i,j Max i,j ,D ,j ,(D ,j a) (5) (a b). We could make an efficient pipeline as Fig. 2. There- fore, the scores are subtracted by an opening gap cost. The other process is a pattern to go to the lower right node. This pattern can compare two scores without using a gap cost. The third step is comparison with the score from a left path. The fourth step is addition for the gap cost or the element of the Dayhoff matrix. The fifth step is to store each high score from register to RAM. We explain how to get a

score toward the right node. It fetches scores of the left, the upper left and the upper nodes from RAM1, RAM2 and RAM3 in Fig. 2. It calculates scores and stores them to RAM4, RAM5 and RAM6. For example, to calculate a score for the right node is as follows. It gets a larger score (Value1) between the score from the upper left node and from Fig. 3. Direction of calculation for the BIOLER-1 pipeline. the upper node at Comparator2. On the other hand, it gets a score (Value2) subtracted the opening gap cost from the score from the left node at Adder1. It gets a lager score (Value3) between

Value1 and Value2 at Comparator3. It can add the opening gap cost and the extending gap cost to Value3 at Adder3 because it subtracted the opening gap cost from a score from the left node at Adder1. Finally, it gets the best scores toward the right node. The pipeline calculates each high score to the right, the lower right and the lower nodes at all amino acids and . In an arbitrary node, it can be calculated on condition that the left, the upper left and the upper nodes have already had each best score. For example, as a general-purpose computer, if we calculate along the horizontal or the

vertical line in regular order, we always waste pipeline delay
Page 5
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 time for each node. It is because the DP requires that the previous neighboring nodes have had the best score. Fig. 3 is two-dimensional networks for the pairwise alignment as Fig. 1(a). The arrows express the direction of calculation. The numbers express the order of calculation. If we calculate them by this order, we do not waste calculation time. Because a node has already had the score for calculating. Therefore, the node can start the calculation

immediately. We can calculate them with the pipeline method to slant direction. We need not calculate the scores at the nodes of the white (not filled) circles in Fig. 3. We can know the scores about the first column and row without calculating since the both lines have only one node as the previous neighboring node. So we need calculate for the filled nodes. Therefore, the computational complexity of BIOLER-1 is given by the next expression (6) bioler seq Here, seq is the number of amino acids of a protein sequence. We mention the memory space. The true maximum score is

obtained when we compare between two protein sequences which have the only amino acid which is given the maximum score in the probability matrix. The true minimum score is obtained in the same way. However, generally, sequences such as the above are out of a target. According to them, we decide the dynamic range in BIOLER-1. The maximum score is obtained when we compare an amino acid sequence with itself. At present, it is 8291 score and has 1733 amino acids in the PDB (Protein Data Bank) [18] database. We can get the maximum score as follow: (7) max 8291 seq 1733 (8) 78 seq And the minimum is

obtained when all amino acids are decided lacks or insertions. It depends on the length of sequence. We can get the minimum score, (9) min seq Therefore, we need data width as follow: (10) =| max |+| min (11) 78 seq seq (12) 78 seq Since, BIOLER-1 pipeline is designed simply. It needs a little memory space. We understand that we need 6 RAMs from Fig. 2. But we must prepare one more RAM for which we can get a score from ,j We remember that the direction of calculation is slant. ,j requires the scores of previous two calculation lines. For example, in Fig. 3, the node on the third calculation

line needs ,j existing on the first calculation line. We need two RAMs for ,j . The total needed memory space is given the next expression (13) total seq bit seq 20 20 Here, bit is the bit width of score. The second term is an amount of data for the amino acid symbols of two proteins. The third term is an amount of data for the Dayhoff matrix. The numbers of amino acid are 20 patterns, which can be expressed 5 bit. bit is expressed using Eq. (12) as follow: (14) bit log 78 seq Eq. (13) is expressed by Eq. (14) as follow: (15) total seq log 78 seq 10 seq 2000 The BIOLER-1 chip has 147,456

bit (18.4 KByte) memory as the hardware device. In this limitation, seq becomes 1438 by Eq. (15). When seq is 1438, Eq. (14) is changed as follow: (16) bit log 78 1438 (17) 13 02 So we designed BIOLER-1 in considering the above. We decided that seq is 1024 and bit is 13 bit. In this case, total is 105,424 bit (13.2 KByte). Therefore, BIOLER-1 can compute a similarity between two proteins with less than 1024 amino acids. 3.2. Interface We adopted the PCI bus for the communication between the host computer and BIOLER-1. We made the 32 bit PCI bus controller as the target device on the PCB

(Printed Circuit Board) implemented the PLD chip, FLEX EPF10K100ARC240-1 (100,000 gates) made by Altera corporation. We used Max+Plus2 version 10.2 by Altera as CAD (Computer Aided
Page 6
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 Fig. 4. Top and bottom view of the BIOLER-1. Design) tool. We connected it with BIOLER-1 using a wide SCSI (Small Computer System Interface) cable which has 68 bit data width. Its communication speed is about 22 MByte sec on Linux kernel version 2.4.19 with GCC (Gnu Compiler Collection) version 3.2. 3.3. Package We packaged BIOLER-1 on

a universal board whose size is 23 cm by 16 cm. Fig. 4 is the top and bottom views of it. It is composed of only one PLD chip, APEX EP20K300EQC240-1 (300,000 gates) made by Altera. It has a wide SCSI connector for the external communication. This is able to communicate with 68 bit data width. No external memory was used. The DP must access to memory which have a score at all time. If we use external memory, the calculation performance limits by the access speed of it. So we only used internal memory in a PLD chip. We used Quartus2 version 2.0 with the service pack 2 by Altera as CAD tool to

design a BIOLER-1 chip. The cost for the electronics parts used in PCI bus controller and BIOLER-1 is about 200,000 Japanese yen (1500 US dollars). But a PCB for the PCI bus controller is an independent board from BIOLER- 1. Only BIOLER-1 is about 100,000 Japanese yen (750 US dollars). We had started designing PCI bus controller and BIOLER-1 in January 2002 and have finished in October 2002. 4. Performance The computational complexity for Eq. (1) is equal to about 108 instructions on a general-purpose com- puter. We implemented one pipeline in a BIOLER-1 chip. BIOLER-1 is driven at 33

MHz clock frequency. Therefore, BIOLER-1 has about 3564 MIPS (Million Instructions Per Second) at the peak performance. The effective performance is decided on the ratio of the communication cost to the calculation cost. We send amino acid symbols of query and known sequences to compute. The number of amino acids is 20 patterns, which can be expressed 5 bit. We can send 6 amino acid symbols at one time via 32 bit PCI bus.
Page 7
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 Therefore, the communication time is (18) comm seq comm where comm is the time to transfer

32 bit data between the host and BIOLER-1. The calculation time of BIOLER-1 is (19) bioler seq bioler where bioler is the clock period of BIOLER-1. Since comm 180 ns (22 MByte sec) and bioler 30 ns (33 MHz) in BIOLER-1, the ratio of comm to bioler is (20) comm bioler seq seq comm bioler (21) seq The average number of amino acids is 500. Then comm is less than 0.4% of bioler and the effective performance is over 99% of the peak performance of 3564 MIPS. We got a result that BIOLER-1 could calculate 92 s to compare 10,000 proteins which have 500 amino acids. On a general-purpose computer with

Pentium4 at 1.8 GHz, we took 268 s to calculate the same task. 5. Discussion The pipeline of BIOLER-1 calculates each node for itself. It is easy that we can improve the pipeline system of BIOLER-1 to multi pipelines. Using such architecture, we can build and design a BIOLER-2 board. We have a plan that we design a new pipeline and mount multiple pipelines on a BIOLER-2 chip. Using the new pipeline, we will be able to reduce the memory space which is wasted in a BIOLER-1 chip. If we could reduce it, we would be able to cut down the calculation time and make a protein sequence longer.

Specially, the latter will be possible for us to calculate in the base level. An amino acid is translated from three bases. Simply, a sequence length changes three times longer. So the computational complexity and the number of the calculation will increase. We also need a high performance device. In the near future, we will be able to get a new PLD chip which has 10,000,000 gates. It is 33 times as much as a BIOLER-1 chip which has 300,000 gates. If we use the new multiple pipeline in a BIOLER-2 chip, we can implement about 64 pipelines. Moreover, we speed up the clock frequency to 133 MHz.

The performance of BIOLER- 2 will be (22) bioler 256 bioler where bioler is the calculation time of BIOLER- 2. We can calculate between two protein sequences which have 500 amino acids less than a second. And we can investigate a biological sequence in detail. We expect that the BIOLER special-purpose computer is a powerful explorer to clear the unknown field in the biological sequence analysis. Acknowledgements We would like to thank Dr. Nobuyuki Masuda and Dr. Tomoyoshi Shimobaba for helpful comments. References [1] National Research Council, Mapping and Sequencing the Human Genome,

National Academy Press, Washington DC, 1988. [2] B.R. Jasny, L. Roberts, Introduction, Science 300 (2003) 277. [3] F.S. Collins, M. Morgan, A. Patrinos, The Human Genome Project: Lessons from large-scale biology, Science 300 (2003) 286–290. [4] M.E. Frazier, G.M. Johnson, D.G. Thomassen, C.E. Oliver, A. Patrinos, Realizing the potential of the genome revolution: The genomes to life program, Science 300 (2003) 290–293. [5] F.S. Collins, E.D. Green, A.E. Guttmacher, M.S. Guyer (on behalf of the US National Human Genome Research Institute), A vision for the future of genomics research, Nature 422

(2003) 835–847. [6] S.B. Carroll, Genetics and the making of homo sapiens, Nature 422 (2003) 849–857. [7] J. Arnold, N. Hilton, Revelations from a bread mould, Na- ture 422 (2003) 821–822. [8] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Molecular Biology 48 (1970) 443–453. [9] T.F. Smith, M.F. Waterman, Identification of common molecu- lar subsequences, J. Molecular Biology 147 (1981) 195–197. [10] W.J. Wilbur, D.J. Lipman, Rapid similarity searches of nucleic acid and protein data banks, Proc.

Natl. Acad. Sci. USA 80 (3) (1983) 726–730. [11] D.J. Lipman, W.R. Pearson, Rapid and sensitive protein simi- larity search, Science 227 (1985) 1435–1441.
Page 8
T. Sugie et al. / Computer Physics Communications 157 (2004) 1–8 [12] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, A basic local alignment search tool, J. Molecular Biology 215 (1990) 403–410. [13] T. Narumi, R. Susukita, T. Ebisuzaki, G. McNiven, B. El- megreen, Molecular dynamics machine: Special-purpose com- puter for molecular dynamics simulations, Molecular Simula- tion 21 (1999) 401–415. [14] T.

Shimobaba, S. Hishinuma, T. Ito, Special-purpose computer for holography HORN-4 with recurrence algorithm, Comput. Phys. Comm. 148 (2002) 160–170. [15] O. Gotoh, An improved algorithm for matching biological sequences, J. Molecular Biology 162 (1982) 705–708. [16] M. Dayhoff, R.M. Schwartz, B.C. Orcutt, A model of evolu- tionary change in proteins, Atlas Protein Sequence Struct. 5 (3) (1978) 345–352. [17] W.M. Fitch, T.F. Smith, Optimal sequence alignments, Proc. Natl. Acad. Sci. 80 (1983) 1382–1386. [18] RCSB (Research Collaboratory for Structural Bioinformatics), PDB (Protein Data Bank),