IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 32, NO. 6, JUNE 1997, p. 797

A 1-Gb/s, Four-State, Sliding Block Viterbi Decoder

Peter J. Black, Member, IEEE, and Teresa H.-Y. Meng

Abstract: To achieve unlimited concurrency and hence throughput in an area-efficient manner, a sliding block Viterbi decoder (SBVD) is implemented that combines the filtering characteristics of a sliding block decoder with the computational efficiency of the Viterbi algorithm. The SBVD approach reduces decoding of a continuous input stream to decoding of independent overlapping blocks, without constraining the encoding process. A systolic SBVD architecture is presented that combines forward and backward processing of the block interval. The architecture is demonstrated in a four-state, R = 1/2, eight-level soft decision Viterbi decoder that has been designed and fabricated in double-metal CMOS. The 9.21 mm × 8.77 mm chip containing 150 k transistors is fully functional at a clock rate of 83 MHz and dissipates 3.0 W under typical operating conditions (V_DD = 5 V, 27 °C). This corresponds to a block decode rate of 83 MHz, equivalent to a decode rate of 1 Gb/s. For low-power operation, typical parts are fully functional at a clock rate of greater than 12 MHz, equivalent to a decode rate of 144 Mb/s, and dissipate 24 mW at V_DD = 1 V, demonstrating extremely low power consumption at such high rates.

Index Terms: Forward error correction, trellis codes, Viterbi decoding, Viterbi detection, Viterbi estimation.

I. INTRODUCTION

In recent years there has been great interest in the implementation of high-speed Viterbi decoders. The application of Viterbi decoders to magnetic storage channels for decoding intersymbol interference has pushed required decode rates to over 100 Mb/s for high-end drives [1]. A potential application at even higher rates is in convolutionally coded optical M-ary pulse position modulation (PPM) systems, for which decode rates extend into the Gb/s range [2].
Another less obvious application of high-speed architectures is in low-power design through voltage scaling, a technique based on trading excess speed for reduced power dissipation at lower voltages [3].

The goal of high-speed Viterbi decoder design is to achieve unlimited concurrency and hence throughput while obeying an ideal linear scaling rule. Lookahead-based architectures (e.g., the layered method in [4]) have been proposed as a means of achieving unlimited concurrency; however, such architectures are not area-efficient, because the increase in hardware complexity needed for a given increase in throughput is larger than the throughput gain by a factor that depends on the number of states. The radix-4 fully parallel architecture for a radix-2 trellis is the only lookahead scheme which obeys ideal linear scaling for a two-fold increase in throughput [5]. Further throughput increases require an alternative approach to lookahead if the overall area efficiency is to be maintained.

Manuscript received April 29, 1996; revised January 1, 1997. This work was supported in part by ARPA. P. J. Black was with the Electrical Engineering Department, Stanford University, Stanford, CA 94305 USA. He is now with QUALCOMM Inc., San Diego, CA 92121-2779 USA. T. H.-Y. Meng is with the Electrical Engineering Department, Stanford University, Stanford, CA 94305 USA. Publisher Item Identifier S 0018-9200(97)03838-9.

The simplest method to increase throughput is to decompose the input stream into blocks which can be processed in parallel using conventional Viterbi decoders. Such an architecture yields an increase in throughput equal to the increase in complexity. However, independent block processing requires knowledge of the initial state metrics, which are unknown until the previous block is processed. Hence, without additional information, this approach is of no practical use because the blocks must still be processed one at a time. Two practical approaches to block-based decoding are state initialization [6] and interleaving [4].
State initialization achieves block independence by forcing the encoder to a known state at the start of each block. This technique reduces the information rate and adds complexity to the decoder, which must obtain block synchronization from the received data. Interleaving achieves block independence by interleaving independently encoded sequences. By definition, the blocks are independent and hence can be decoded independently; however, the encoded sequences are interleaved prior to the addition of noise. This approach is not applicable to digital sequence detection in the presence of intersymbol interference because the "encoder" is the intrinsic impulse response of the channel and hence cannot be separated from the additive noise component of the channel.

Another block-based approach is the sliding block decoder [7]. This approach is based on modeling the Viterbi algorithm as a time-invariant, nonlinear digital filter with finite memory. Table lookup is the proposed method to implement the nonlinear filter; however, except for the simplest examples, this approach is infeasible. For example, implementation of a four-state convolutional decoder (hard decision) requires 8 Gbytes of ROM to achieve coding gain equivalent to a conventional Viterbi decoder. The infeasibility of this approach explains why it has been all but forgotten in the Viterbi literature.

In this paper, an alternative block-based method is proposed that can be applied to any Viterbi decoding problem without constraining the encoding process. The sliding block Viterbi decoder (SBVD), similar to the "acquisition method 1" proposed in [8], utilizes the Viterbi algorithm to implement the sliding block method. Rather than precompute the nonlinear filter tables, the Viterbi algorithm is used in real time to compute the decoder output. The SBVD combines the filter characteristics of the sliding block decoder with the computational efficiency of a conventional Viterbi decoder.
0018-9200/97$10.00 © 1997 IEEE
Fig. 1. Four-state, R = 1/2, convolutional encoder/decoder.

The systolic SBVD implementation presented in this paper is similar to the systolic minimized architecture [9], but there are also some major differences. The minimized method was proposed as an extremely efficient block-based algorithm and claimed to be minimal with respect to hardware complexity (hence the name minimized). It will be shown that the hardware complexity of our SBVD architecture is similar to, if not less than, that of the minimized method. Furthermore, the SBVD provides a true maximum likelihood algorithm given the constraint of block decoding. As such, the coding performance of the SBVD method always upper bounds that of the minimized method for the same block length.

The organization of this paper is as follows. In Section II the Viterbi algorithm and the notational conventions are reviewed. In Section III the sliding block method is reviewed and extended to the more general and practical SBVD method. A systolic SBVD architecture and its implementation for a four-state Viterbi decoder are presented in Section IV. Finally, in Section V fabrication results for the design are presented.

II. VITERBI ALGORITHM

The Viterbi algorithm is an optimal algorithm for estimating the state sequence of a finite state process observed in the presence of memoryless noise [10]. The classical application of the Viterbi algorithm is decoding convolutional codes, a class of forward error correction (FEC) code. The four-state, R = 1/2 convolutional encoder/decoder is shown in Fig. 1. The encoder consists of a two-tap binary shift register, which can be modeled as a four-state finite state process. For each input (data) symbol, two encoded output symbols are generated as modulo-2 sums of the shift register contents and the input. The encoded symbols are corrupted by noise during transmission to produce the two decoder inputs.
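The encoder described above can be sketched in a few lines of Python. This is a minimal illustration, not the circuit of Fig. 1: the generator polynomials (octal 7 and 5) are an assumption, chosen because they are a common pair for a four-state, rate-1/2 code, and the names `conv_encode`, `G0`, and `G1` are hypothetical.

```python
from typing import List, Tuple

# Assumed generator polynomials (octal 7 and 5); the values used on the
# chip are not legible in this copy of the text.
G0 = 0b111
G1 = 0b101


def conv_encode(bits: List[int]) -> List[Tuple[int, int]]:
    """Encode a bit sequence with a two-tap shift register (four states)."""
    state = 0  # two most recent inputs: bit 1 = u(n-1), bit 0 = u(n-2)
    out = []
    for u in bits:
        reg = (u << 2) | state              # [u(n), u(n-1), u(n-2)]
        c0 = bin(reg & G0).count("1") & 1   # modulo-2 sum of tapped bits
        c1 = bin(reg & G1).count("1") & 1
        out.append((c0, c1))
        state = reg >> 1                    # shift register: drop oldest bit
    return out


if __name__ == "__main__":
    # Impulse response of the assumed encoder.
    print(conv_encode([1, 0, 0]))
```

Each input bit produces one pair of output symbols, which is what makes the code rate 1/2; the two register bits give the four states of the finite state process.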
Given the noisy observations of the encoding finite state process, the Viterbi algorithm is used to estimate the most likely encoding state sequence, from which the decoded sequence can be easily derived. In this section the algorithm is described from an operational point of view using the four-state encoder example.

The evolution of a four-state encoder can be described using the trellis diagram shown in Fig. 2. The trellis is a time-indexed version of the state diagram. Each node corresponds to a state at a given time index, and each branch corresponds to a state transition. Associated with each branch are the input symbol and output symbols corresponding to the state transition. Given a known starting state, every input sequence corresponds to a unique path through the trellis.

Fig. 2. Four-state trellis diagram.

At the decoder, each branch is assigned a weight, referred to as a branch metric, which is a measure of the likelihood of the transition given the noisy observations. Branch metrics are typically calculated using a distance measure, so that more likely transitions are assigned smaller weights. Given the unique mapping between a trellis path and an input sequence, the most likely path (shortest path) through the trellis corresponds to the most likely input sequence. In this context, the Viterbi algorithm is an efficient method for finding the shortest path through a trellis.

The first phase of the Viterbi algorithm is to recursively compute the shortest paths to time n in terms of the shortest paths to time n − 1. At time n each state j is assigned a state metric s_j(n), which is defined as the accumulated metric along the shortest path leading to that state. The state metrics at time n can be recursively calculated in terms of the state metrics of the previous iteration as follows:

$$ s_j(n) = \min_{i} \left[ s_i(n-1) + \lambda_{ij}(n) \right] \qquad (1) $$

where i is a predecessor state of j and λ_ij(n) is the branch metric on the transition from state i to state j. The qualitative interpretation of this expression is as follows. The shortest path into state j must pass through a predecessor state by definition. If the shortest path into j passes through i, then the state metric for the path must be given by the state metric for i plus the branch metric for the state transition from i to j. The final state metric for j is given by the minimum over all candidate paths.

The recursive update given in (1) is the well-known add-compare-select (ACS) operation and is implemented as shown in Fig. 3 for the four-state trellis. The update unit, referred to as a two-way ACS unit, outputs the updated state metric and a 1-b decision which identifies the entering path of minimum metric.

The second phase of the Viterbi algorithm involves tracing back and decoding the shortest path through the trellis, which is recursively defined by the decisions from the ACS update. The shortest path leading to a state is referred to as the survivor path for that state. A property of the trellis which is utilized for survivor path decode is that if the survivor paths from all possible states at time n are traced back, then with high probability all the paths merge at time n − L, where L is the survivor path length. Once the survivor paths have merged,
the traced path is unique, independent of the starting state and of future ACS iterations.

Fig. 3. (a) Predecessor states of state-00. (b) State metric update implemented using two-way ACS unit.

Based on the survivor path merge property, the trace-back method for survivor path decode proceeds as follows [11]. Store all the decisions from time n − L to time n. At time n an arbitrary starting state is chosen and the survivor path is traced back to time n − L, at which point the input corresponding to the transition at time n − L is decoded.

III. SLIDING BLOCK METHODS

A. Sliding Block Decoder

The sliding block decoder was originally proposed for convolutional decoding, although the result applies equally well to any Viterbi-based decoder. The only difference between the two applications is the survivor path length required to achieve the desired performance. In the following description of the sliding block method, an N-state convolutional code is assumed.

It has been shown that the survivor paths from all possible starting states merge with high probability L iterations back into the trellis. The parameter L is the well-known survivor path length and is typically about five times the constraint length of the code [12]. Similarly, when starting with unknown initial state metrics (typically set to zero), the state metrics after M trellis iterations are independent of the initial metrics; or equivalently, the survivor path will merge with the true survivor path that would have resulted had the initial metrics been known. The parameter M is the synchronization length [13].

Based on the observed survivor path behavior, it is proposed in [7] that the state at time n can be decoded using only information from the interval n − M to n + L. Such a decoder is a sliding block decoder, a time-invariant nonlinear digital filter with finite memory and delay. The proposed implementation of the nonlinear filter is table lookup [7].
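The finite-window behaviour described above can be illustrated by computing one decoder output with the ACS recursion and trace-back of Section II instead of a lookup table. The following Python sketch is an illustration under stated assumptions, not the scheme of [7] or the chip's implementation: the generator polynomials (octal 7 and 5) are assumed, the branch metric follows the distance measure described later for the branch metric units (the 3-b symbol itself for a logic-0, its one's complement for a logic-1), predecessor states are stored in place of the 1-b hardware decisions, and the names `step`, `branch_metric`, and `decode_window` are hypothetical.

```python
from typing import List, Tuple

N_STATES = 4


def step(state: int, u: int) -> Tuple[int, Tuple[int, int]]:
    """Next state and output pair for the assumed (7, 5) four-state encoder."""
    reg = (u << 2) | state
    c0 = bin(reg & 0b111).count("1") & 1
    c1 = bin(reg & 0b101).count("1") & 1
    return reg >> 1, (c0, c1)


def branch_metric(rx: Tuple[int, int], c: Tuple[int, int]) -> int:
    """Distance of two 3-b soft symbols (0..7) from the ideal output pair c."""
    return sum(r if b == 0 else 7 - r for r, b in zip(rx, c))


def decode_window(window: List[Tuple[int, int]], L: int) -> int:
    """Decode the input bit of the trellis step that lies L steps before
    the end of the window, using only the symbols inside the window."""
    metrics = [0] * N_STATES      # unknown initial state: all metrics zero
    decisions = []                # per step: chosen predecessor of each state
    for rx in window:
        new = [None] * N_STATES
        dec = [0] * N_STATES
        for s in range(N_STATES):
            for u in (0, 1):
                t, c = step(s, u)
                cand = metrics[s] + branch_metric(rx, c)   # add
                if new[t] is None or cand < new[t]:        # compare
                    new[t], dec[t] = cand, s               # select
        metrics = new
        decisions.append(dec)
    # Start trace-back from the best end state and go back L steps.
    state = min(range(N_STATES), key=lambda s: metrics[s])
    for dec in reversed(decisions[len(decisions) - L:]):
        state = dec[state]
    return state >> 1   # newest register bit = input of the step just entered
```

Sliding the window one iteration at a time over the received stream, for example `decode_window(rx[n - M:n + L + 1], L)` for each n, yields one decoded bit per window. The output depends only on the symbols inside the window, which is the filter view of the decoder.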
Received symbols over the interval n − M to n + L form the address, and the table output is the decoder output. With soft decisions of q bits per symbol, the address width is the number of symbols in this interval multiplied by q (R = 1/2 implies two symbols per iteration). Even the simple (2, 1, 3) code, with 3-b soft decision inputs, requires a 60-b address lookup table ROM! While the sliding block concept is sound, this implementation is infeasible. The following subsection describes how the sliding block approach can be implemented efficiently using the Viterbi algorithm.

Fig. 4. Hybrid Viterbi algorithm for selecting the shortest path through a trellis of finite length (four-state trellis example).

B. Sliding Block Viterbi Decoder

The sliding block decoder output is equal to the decision of the state at time n that lies on the shortest path (survivor path) through the trellis from n − M to n + L, which is equivalent to maximum likelihood decoding given a finite block length. The Viterbi algorithm is normally run in the direction of increasing time (forward) to avoid storing the entire sequence before decoding can start. Given a finite length trellis corresponding to the block interval, the shortest path can be found by running the algorithm forward or backward (by reversing the trellis transitions). Intuitively, finding the shortest path in either direction must yield the same result. Another implication of the reversibility is that M and L should be chosen equal for a given total block length [7]. In the following discussion the block interval is assumed to be n − L to n + L.

As well as forward and backward processing of the interval, a general hybrid Viterbi algorithm can be derived that combines forward and backward processing of the interval as shown in Fig. 4. At some trellis iteration m within the interval, the shortest path must pass through one of the four possible states. Forward processing of the interval n − L to m yields four survivor paths corresponding to the shortest paths from n − L to each state at time m. Similarly, backward processing of the interval n + L to m yields the shortest paths from n + L to each state at time m. For a given state at time m, the shortest path over the interval through this state must be the concatenation of the forward and backward shortest paths for this state (proof by contradiction), and the state metric for a concatenated path is the sum of the forward and backward state metrics [8]. Hence, selecting the state at time m with the smallest concatenated state metric yields the starting state for trace-back of the shortest path. If m = n, then the state at time n can be decoded directly; otherwise trace-back from m to n is required.

The hybrid Viterbi algorithm subsumes the forward only and backward only algorithms, which correspond to the special cases of m = n + L and m = n − L, respectively. The minimized method is based on another special case corresponding
to a merge point at the decoded iteration itself. The sliding block decoder described in this paper is the SBVD.

Fig. 5. Block decoding using the SBVD method: (a) forward processing and (b) equal forward/backward processing.

With a survivor path length L on either side of the decoded iteration n, the SBVD requires 2L trellis iterations per decoded output; hence the relative area efficiency is 1/(2L) compared with a conventional Viterbi decoder. The relative area efficiency can be increased by decoding a block of length M rather than just a single iteration. That is, apply the hybrid Viterbi algorithm to the interval n − L to n + M + L and decode the interval n to n + M. Each block of M decoded outputs requires M + 2L trellis iterations and hence the relative area efficiency is M/(M + 2L). As the decoded block length M increases, the SBVD approaches ideal linear scaling.

This SBVD architecture can be applied to high-speed Viterbi decoding of convolutional codes of any number of states. The hardware overhead, compared to sequential Viterbi decoding, is independent of the number of states, but proportional to the ratio of the survivor path length L and the block length M, which can be made arbitrarily small. This relatively small overhead enables codes of large constraint lengths to be decoded as efficiently as codes of shorter lengths without a theoretical speed limit.

Any choice of merge point for the hybrid Viterbi algorithm can be used for block decoding. Two schemes of practical interest are forward only processing, shown in Fig. 5(a), and equal forward/backward processing, shown in Fig. 5(b). The systolic architecture described in Section IV is based on the forward/backward algorithm because the available algorithm concurrency is doubled with respect to the forward only approach. This translates to reduced decode delay and reduced skew buffer memory required to support the systolic architecture.

C. Minimized Method

The minimized method [9] is an alternative block-based algorithm that utilizes forward/backward processing as shown in Fig. 6.
The algorithm is a two-pass procedure. The first pass is used to estimate the states at either end of the decode block. On the second pass, these estimated states are used to force the initial state metrics for processing of the decode block. Using known initial state metrics, the entire survivor path of the decode block can be decoded, rather than just the state at the merge point of the forward and backward processing.

Fig. 6. Block decoding using the minimized method.

D. Comparison of SBVD and Minimized Method

As formulated, both the SBVD and the minimized method decode a block of M iterations based solely on the input symbols over an interval that extends the decode block by L iterations on either side, where L is the survivor path length. Given this finite observation window, the SBVD method is a true maximum likelihood algorithm (under the constraint of using block decoding), selecting the path (or sequence) from the set of all possible paths that is closest to the observed output over the finite observation interval. In comparison, the minimized method is not a maximum likelihood algorithm because the estimates of the states at either end of the decode block are not based on all of the available data. A true maximum likelihood estimate is based on the entire observation interval and hence the coding gain of the SBVD method always upper bounds that of the minimized method for the same interval parameters.

The basic operation in both the SBVD and the minimized methods is the block state estimate. A state estimate consists of pair-wise additions of the forward and backward state metrics for each state, followed by an N-way compare to select the state of smallest metric, where N is the number of states. This operation is referred to as an N-way add-compare (AC), the complexity of which is lower bounded in terms of two-way ACS units.

The relative performance of the two approaches depends on the value of the survivor path length L. At the survivor path lengths typical of fixed state decoding [12], the relative difference in performance is insignificant. The SBVD method is equivalent to best state survivor path decoding, and hence the survivor path length can be reduced to roughly half the fixed state value [11] without any significant performance degradation. It is in this operating regime that the performance of the minimized method falls off with respect to the SBVD method. An example of the coding loss as a function of survivor path length for the two methods, relative to a conventional four-state Viterbi decoder, is shown in Fig. 7 (BPSK data at a fixed signal-to-noise ratio and bit error rate). The simulations are based on the decode length chosen for the systolic SBVD architecture described in Section IV and the minimized architecture described in [9]. Assuming a coding loss of up to 0.2 dB is acceptable, the
minimized method can only operate down to an L of seven compared to the SBVD method, which can operate at an L of five.

Fig. 7. Simulated coding loss as a function of survivor length for the SBVD and minimized methods with respect to a (2, 1, 3) Viterbi decoder with L = 16 (200 error events).

The SBVD method is not only optimal in terms of performance for a given L, but its hardware complexity may be lower compared to the minimized method for the following reasons.

• In general, one state estimate is required per decoded block for the SBVD method compared with two for the minimized method. However, hardware complexity for the minimized method can be appropriately reduced to counteract this effect [8].
• The forward processing path contains one state estimate per decoded block for the SBVD method compared with two for the minimized method. Assuming a log-depth hierarchical N-way AC unit, the overall latency (and hence skew buffer memory) of a systolic SBVD implementation can be reduced by as much as log2(N) stages compared with a systolic minimized implementation.
• Initialization of state metrics to force the estimated starting states, as in the minimized method, is not required using the SBVD method.

IV. SYSTOLIC SBVD IMPLEMENTATION

The SBVD method reduces the problem of decoding a continuous input stream to decoding independent blocks. For moderate throughput increases (factor of two to four) over conventional architectures, the simplest approach is to run multiple decoders in parallel. For higher throughput increases (factor of ten), systolic processing of the data blocks (or vectors) is very efficient. The following subsections describe the architecture and implementation details of a systolic SBVD designed to achieve a decode rate of 1 Gb/s.

A. Specification

The specifications for the Viterbi decoder are:

• four-state, R = 1/2 or (2, 1, 3) convolutional decoder/encoder;
• generator polynomials [...];
• eight-level soft decision inputs;
• survivor path length L = 6 (0.10 dB loss with respect to a Viterbi decoder with L = 16);
• coding gain of 3.4 dB at a BER of [...]. Simulated performance of the decoder for an additive white Gaussian noise (AWGN) channel is shown in Fig. 8.

Fig. 8. Simulated (2, 1, 3) decoder performance (200 error events).

Fig. 9. Continuous stream processing using the SBVD method.

B. Architecture

For systolic implementation it is efficient to choose the forward and backward intervals to be equal, as shown in Fig. 5(b), and to be an integer multiple of L; hence the decode interval M should be greater than or equal to 2L and a multiple of 2L. The minimum value of M = 2L is chosen in our implementation for the following reasons. First, the systolic architecture presented in this section achieves a decode rate of M·f_clk, where f_clk is the clock rate. Given that clock rates on the order of 100 MHz are achievable in current technology, it is unlikely that a value of M greater than the minimum will be required, since the resulting decode rates are already over 1 Gb/s. Second, the complexity of the ACS hardware scales with M, but the complexity of the pipeline skew buffer memory scales with M^2. As a result, it is desirable to achieve a given throughput by minimizing the block length and maximizing the clock rate. Finally, higher throughput rates are achievable by decoding multiple blocks in parallel on multiple systolic SBVD processors. Such architectures can be implemented efficiently by replicating identical systolic
processors if the forward and backward decode intervals and the decode block interval are equal. These requirements constrain the choice of the decode block length M to M = 2L.

Fig. 10. Systolic SBVD architecture.

Decoding of a continuous input stream using the SBVD method is analogous to overlap-add filtering, as shown in Fig. 9. The input stream is blocked into input symbol vectors of length M, successive pairs of which are decoded using the SBVD method to produce output vectors of length M. Since the vector lengths are finite, the ACS and trace-back recursions can be unfolded and pipelined to yield the systolic SBVD architecture shown in Fig. 10, which achieves a throughput of one decoded vector per clock cycle, equivalent to a decode rate of M·f_clk, where f_clk is the clock rate.

Implementation of the SBVD architecture is relatively straightforward given the high-level architecture shown in Fig. 10. A more detailed view of the pipelined vector processors and the skew buffer design for a simplified case is shown in Fig. 11. The design consists of the following five basic functional units:

• branch metric (BM) units;
• four-state ACS units;
• four-way AC unit;
• trace-back (TB) units;
• skew buffers.

Each BM/ACS/TB stage updates a complete trellis iteration per clock cycle and is equivalent in complexity to a four-state fully parallel decoder. The design goal of 1 Gb/s is achieved using an L of six (vector length 12) and a clock rate of 83 MHz. This clock rate is achieved using only one pipeline stage per ACS, which is the limiting critical path in the design. The following subsections review the implementation details of the complete SBVD design.

C. Branch Metric Units

Each branch metric (BM) unit calculates the branch metrics for a single trellis iteration. Two 3-b soft decision inputs are combined to form four 4-b branch metrics corresponding to the four possible encoder outputs.
The metrics are calculated using a uniform distance measure equal to the symbol itself when compared with logic-0 and equal to its one's complement when compared with logic-1 [14]. The design of the BM unit is described in [5].

D. Four-State ACS and Four-Way AC Units

Normalization is avoided in all add-compare operations by using modulo arithmetic [15]. The maximum potential dynamic range after an add occurs in the four-way AC unit, where forward and backward state metrics are summed, and the state metric precision required to implement modulo arithmetic, equal to twice this range, is given by

$$ \left\lceil \log_2 \left( 4 \Delta_{\max} \right) \right\rceil \ \text{bits} \qquad (2) $$

In (2), Δ_max is the maximum dynamic range of the state metrics and is given by

$$ \Delta_{\max} = \lambda_{\max} \log_2 N \qquad (3) $$

where N is the number of states and λ_max is the maximum branch metric. For the four-state decoder, the required state metric precision is 7 b.

Fig. 11. Systolic SBVD block diagram (simplified case).

The basic building block of the ACS and AC units is a two-way ACS unit. The design of the two-way ACS unit shown in Fig. 12 is similar to the four-way ACS unit described by the authors in [5]. Fast, area-efficient ACS units are designed using ripple carry arithmetic for the add and compare operations. By implementing the compare using subtraction, the adder and subtracter carry chains run in parallel from LSB to MSB, resulting in an add-compare delay that is only one full adder bit delay longer than the ripple carry add delay alone. Only the subtracter carry chains are implemented, to generate the sign of the result and hence the comparison.

Fig. 12. Two-way ACS unit block diagram.

The four-state ACS unit updates the state metrics for a single iteration of the trellis. Each unit consists of four two-way ACS units organized as two radix-2 units, as shown in Fig. 13 for the forward processor. The central position of the BM unit in the schematic reflects the final layout topology of each BM/ACS stage. On each clock cycle, four state metrics from the previous vector processor stage are input and four updated state metrics are output. Each update also generates a vector of four 1-b decisions that are output to the trace-back processor.

Fig. 13. Four-state ACS unit for the forward processor (the subscript is the pipeline stage index).

Fig. 14. Four-way AC unit.

The four-way AC unit adds the final state metrics from the forward and backward processors. The combined metrics are compared and the state with minimum metric is output as the starting state for the trace-back processors.
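The add-compare operation with modulo arithmetic can be sketched as follows. This is an illustration only: it performs the compare sequentially rather than in the two-stage tree of the chip, the 7-b word length is taken from the text, and the names `mod_less` and `four_way_ac` are hypothetical. It relies on the assumption that the true (unwrapped) sums never differ by more than half the modulus, which is the condition the precision rule above is meant to guarantee.

```python
from typing import Sequence

BITS = 7            # state metric precision stated in the text
MOD = 1 << BITS     # metrics are stored modulo 2**BITS, never normalized


def mod_less(a: int, b: int) -> bool:
    """True if metric a is smaller than metric b under modulo arithmetic.

    Valid when the true difference is below MOD / 2: the comparison is the
    sign bit of the BITS-wide difference a - b.
    """
    return ((a - b) % MOD) >= MOD // 2


def four_way_ac(fwd: Sequence[int], bwd: Sequence[int]) -> int:
    """Add forward and backward state metrics and return the best state."""
    sums = [(f + b) % MOD for f, b in zip(fwd, bwd)]   # add (wraps around)
    best = 0
    for s in range(1, len(sums)):                      # compare
        if mod_less(sums[s], sums[best]):
            best = s
    return best


if __name__ == "__main__":
    # True metrics 120, 130, 115, 140 and 130, 130, 125, 130, stored mod 128.
    fwd = [120, 2, 115, 12]
    bwd = [2, 2, 125, 2]
    print(four_way_ac(fwd, bwd))
```

In the example the stored sums are 122, 4, 112, and 14, so a plain minimum would pick state 1, whose sum has wrapped around; the sign-of-difference compare selects state 2, which has the smallest true sum (240).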
The four-way AC unit is implemented in two pipeline stages as shown in Fig. 14.

E. Trace-Back Units

The trace-back (TB) unit implements a single trace-back recursion and consists of a four-way multiplexer. The current state estimate from the previous trace-back processor stage is used to select the decision of the current state from the input decision vector. The selected 1-b decision and the 2-b current state are appropriately combined to form the state estimate for the next stage and the decoded output.

F. Skew Buffers

Unfolding and pipelining the recursive ACS and trace-back calculations requires retiming of the input and output streams via skew buffers, which are implemented directly using shift registers. The skew buffers are sufficiently small that RAM or hybrid delay RAM/shift-register architectures are not area-efficient solutions. The total skew buffer memory for input symbols, decisions, and decoded outputs is only 1188 b, occupies approximately 12% of the chip area, and dissipates 10% of the total power.

G. Testing Issues

The convolutional encoder is implemented on the same chip to facilitate extensive on-chip selftest. Due to pin limitations, the encoder inputs/outputs (32 pins) were not brought off-chip; that is, the sole function of the encoder is for selftest. In selftest mode, the encoder input is switched to an 11-tap pseudorandom sequence. The binary encoder output symbols are converted to 3-b soft decisions (0 → 011 and 1 → 100) for input to the decoder. The mapping is chosen to ensure state metric growth while guaranteeing error free decoding; hence a delayed version of the encoder input can be compared to the decoder output to verify functionality.

H. Clocking and I/O

A single phase clocking strategy was chosen to simplify clock distribution while maximizing system throughput. All register elements are implemented using rising edge true-single-phase circuits [16]. The global clock is distributed up the center of the chip, from which local clocks are derived using two-stage buffers of identical delay.

The throughput of the design requires a minimum of 72 input pins (vector of 12 symbol pairs, each of 6 bits) and 12 output pins (vector of 12 decoded bits), assuming the I/O is clocked at the internal clock rate. Given a 144-pin budget, the input stream is clocked at the internal clock rate and the output stream is parallelized to run at half the internal clock rate. A full rate clock of 83 MHz is supplied to the chip and a half rate clock is output, synchronized with the output stream.

V. FABRICATION AND TEST RESULTS

The design was fabricated using 1.2-µm double-metal CMOS at Hewlett-Packard via MOSIS on the April 1992 run. The 9.21 mm × 8.77 mm (81 mm²) chip contains 150 000 transistors and is packaged in a 144-pin PGA. A micrograph of the complete chip is shown in Fig. 15. The forward and backward processors are positioned on the right and left halves of the die, respectively. The 12-stage ACS vector processors are folded over to match the width of the six-stage trace-back processors, resulting in a very area-efficient layout.

Fig. 15. Micrograph of complete sliding block Viterbi decoder.

Typical parts are fully functional at a clock rate of greater than 83 MHz and dissipate 3.0 W at V_DD = 5 V and 27 °C. This corresponds to a vector iteration rate of 83 MHz, equivalent to a radix-2 iteration rate of 1 GHz or a decode rate of 1 Gb/s. For low power operation, typical parts are fully functional at a clock rate of greater than 12 MHz, equivalent to a decode rate of 144 Mb/s, and dissipate 24 mW at the reduced supply voltage. Since power dissipation scales with clock rate, the relative merit of the design can be quantified in terms of throughput per unit power, which is equal to 6 Mb/s/mW (144 Mb/s divided by 24 mW). In contrast, the 32-state fully parallel decoder [5] achieves a comparable decode rate of 140 Mb/s and dissipates 1.8 W under its reported operating conditions, equal to a throughput per unit power of 0.6 Mb/s/mW (scaled by a factor of eight to account for the relative complexity of the design). Hence, by trading excess throughput for reduced power dissipation at lower supply voltages, the SBVD design achieves decode rates comparable to existing fully parallel decoders while reducing the power dissipation by an order of magnitude.

VI. SUMMARY

An SBVD is implemented that combines the filtering characteristics of the sliding block decoder with the computational efficiency of the Viterbi algorithm.
The finite memory length ) of the Viterbi algorithm allows decoding of the interval to based only on the input symbols over the interval to . A general hybrid Viterbi algorithm is proposed for decode of each block, which combines forward and backward trellis processing. Each decoded block of length requires trellis iterations, hence the relative computational efficiency of the SBVD method is approximately compared to conventional Viterbi decoding. The advantage of the SBVD approach is unlimited concurrency and hence throughput due to independent block decoding, as opposed to conventional decoder architectures, which are limited in throughput by the recursive ACS update. A systolic SBVD architecture is presented which achieves a decode rate of . The architecture is demonstrated in a four-state, Viterbi decoder, which has been designed and fabricated in 1.2-µm CMOS. Typical parts are fully functional at a clock rate of 83 MHz, equivalent to a
trellis iteration rate of 1 GHz or a decode rate of 1 Gb/s. This is the first single-chip CMOS design to reach the 1 Gb/s milestone and is currently the fastest published Viterbi decoder design.

REFERENCES

[1] R. W. Wood and D. A. Petersen, “Viterbi detection of Class IV partial response on a magnetic recording channel,” IEEE Trans. Commun., vol. COM-34, pp. 454–461, May 1986.
[2] G. Cannone, S. P. Majumder, R. Gangopadhyay, and G. Prati, “Performance of convolutionally coded optical M-PPM systems with imperfect slot synchronization,” IEEE Trans. Commun., vol. 39, pp. 1433–1437, Oct. 1991.
[3] A. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473–483, Apr. 1992.
[4] H. F. Lin and D. G. Messerschmitt, “Algorithms and architectures for concurrent Viterbi decoding,” in Proc. ICC’89, June 1989, vol. 2, pp. 836–840.
[5] P. J. Black and T. H.-Y. Meng, “A 140 Mb/s, 32-state, radix-4 Viterbi decoder,” IEEE J. Solid-State Circuits, vol. 27, pp. 1877–1885, Dec. 1992.
[6] J. E. Dunn, “A 50 Mb/s multiplexed coding system for shuttle communications,” IEEE Trans. Commun., vol. COM-26, pp. 1636–1638, Nov. 1978.
[7] K.-H. Tzou and J. G. Dunham, “Sliding block decoding of convolutional codes,” IEEE Trans. Commun., vol. COM-29, pp. 1401–1403, Sept. 1981.
[8] G. Fettweis and H. Meyr, “Feedforward architecture for parallel Viterbi decoding,” J. VLSI Signal Processing, vol. 3, pp. 105–119, 1991.
[9] G. Fettweis, H. Dawid, and H. Meyr, “Minimized method Viterbi decoding: 600 Mb/s per chip,” in Proc. GLOBECOM 90, vol. 3, Dec. 1990, pp. 1712–1716.
[10] G. D. Forney, Jr., “The Viterbi algorithm,” Proc. IEEE, vol. 61, pp. 268–278, Mar. 1973.
[11] G. Feygin and P. G. Gulak, “Survivor sequence memory management in Viterbi decoders,” Tech. Rep. CSRI-252, University of Toronto, Jan. 1991.
[12] G. C. Clark and J. B. Cain, Error-Correction Coding for Digital Communications. New York: Plenum, 1981, pp. 227–264.
[13] A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding. New York: McGraw-Hill, 1979, pp. 258–261.
[14] J. A. Heller and I. M. Jacobs, “Viterbi decoding for satellite and space communication,” IEEE Trans. Commun. Technol., vol. COM-19, pp. 835–848, Oct. 1971.
[15] A. P. Hekstra, “An alternative to metric rescaling in Viterbi decoders,” IEEE Trans. Commun., vol. 37, pp. 1220–1222, Nov. 1989.
[16] J. Yuan and C. Svensson, “High-speed CMOS circuit technique,” IEEE J. Solid-State Circuits, vol. 24, pp. 62–70, Feb. 1989.

Peter J. Black (S’86–M’86) was born in Brisbane, Australia, in 1964. He received the B.E. degree in electronic engineering from the University of Queensland, Australia, in 1985 and the M.S.E.E. and Ph.D. degrees from Stanford University, Stanford, CA, in 1990 and 1993, respectively. From 1986 to 1988, he worked at Austek Microsystems, Adelaide, Australia, on custom VLSI design for digital signal processing. He joined QUALCOMM Inc., San Diego, CA, in 1993 as a system engineer working on CDMA wireless modem design.

Teresa H.-Y. Meng received the B.S. degree from National Taiwan University in 1983 and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1984 and 1988, respectively. She joined the faculty of Stanford University, Stanford, CA, in 1988, where she is an Associate Professor in the Department of Electrical Engineering. Her current research activities include low-power design, wireless communication, and distributed computing. Among the awards Dr. Meng has received are the IEEE Signal Processing Society’s Paper Award in 1989, the 1989 NSF Presidential Young Investigator Award, the 1989 ONR Young Investigator Award, a 1989 IBM Faculty Development Award, and the 1988 Eli Jury Award from the University of California, Berkeley, for recognition of excellence in systems research.
She was Co-Program Chair of the 1992 Application Specific Array Processor Conference and of the 1993 HOTCHIP Symposium, and General Chair of the 1996 IEEE Workshop on VLSI Signal Processing.
