SOC Implementation Wave-Pipelined: Venkataramani

SOC implementation of wave-pipelined circuits
G. Seetharaman#, B. Venkataramani*
# Research Scholar, Department of ECE, National Inst. of Technology, Tiruchirappalli, India. gsraman@nitt.edu
* Professor and Head, Department of ECE, National Inst. of Technology, Tiruchirappalli, India. bvenki@nitt.edu
Abstract
In the literature, wave-pipelining is proposed as one of the techniques for increasing the operating frequency of digital circuits. Higher operating frequencies can be achieved in Wave-Pipelined (WP) circuits by adjusting the clock periods and clock skews so as to latch the outputs of combinational logic circuits during their stable periods. The major contributions of this paper are the proposal to use a soft-core processor for the automation of the above tasks, and a study of the superiority of WP circuits with regard to power dissipation. The proposed scheme is evaluated using two circuits: filters using the Distributed Arithmetic Algorithm (DAA) and a sine wave generator using the COordinate Rotation DIgital Computer (CORDIC) algorithm. Both circuits are studied by adopting three different schemes: wave-pipelining, pipelining and non-pipelining. The System-On-Chip (SOC) approach is adopted for implementation on Altera Field Programmable Gate Array (FPGA) based SOC kits with the Nios II soft-core processor. From the implementation results, it is verified that the WP circuits are faster than non-pipelined circuits. The pipelined circuits are found to be faster than the WP circuits, and this is achieved at the cost of an increase in area and power. When both pipelined and WP circuits are operated at the same frequency, the former dissipates more power for circuits with higher word sizes and for medium tap filters. From the implementation results, it is verified that the superiority of the WP circuits in power dissipation depends not only on the area but also on the logic depth of the circuit. This observation is made for the first time for WP circuits.
Index Terms: CORDIC, DAA, SOC, wave-pipelining, pipelining, FPGA.
1. Introduction
Programmable logic devices such as FPGAs offer an alternative solution for the computationally intensive functions performed traditionally by Programmable Digital Signal Processors (P-DSPs). The ability to design, fabricate and test Application Specific Integrated Circuits (ASICs) as well as FPGAs with gate counts of the order of a few tens of millions has led to the development of complex embedded SOCs. Hardware components in an SOC may include one or more processors, memories, dedicated components for accelerating critical tasks and interfaces to various peripherals. One of the approaches for SOC design is the platform based approach [1], [2]. For example, platform FPGAs such as Xilinx Virtex II Pro and Altera Excalibur include custom designed fixed programmable processor cores together with millions of gates of reconfigurable logic. In addition to this, the development of Intellectual Property (IP) cores for FPGAs for a variety of standard functions, including processors, enables a multi million gate FPGA to be configured to contain all the components of a platform based FPGA. Development tools such as the Altera System-On-Programmable Chip (SOPC) builder enable the integration of IP cores and user designed custom blocks with the Nios II soft-core processor [3]. Soft-core processors are far more flexible than hard-core processors and they can be enhanced with custom hardware to optimize them for a specific application. The increased gate count in a complex SOC results in increased power dissipation, clock routing complexity and clock skews between different parts of a synchronous system. These limitations may be partially overcome by the adoption of circuit design techniques such as wave-pipelining. Wave-pipelining enables a combinational logic circuit to be operated at a higher frequency without the use of registers and may result in lower power dissipation and clock routing complexity compared to a pipelined circuit. However, maximizing the operating speed of a wave-pipelined circuit requires the following three tasks: adjustment of the clock period, adjustment of the clock skew and equalization of path delays. The automation of these three tasks is proposed for the first time in this paper.
The effectiveness of the automation scheme is studied using two circuits: i) a filter using DAA and ii) a sine wave generator using the CORDIC algorithm. The rest of the paper is organized as follows: In Section 2, the previous work related to wave-pipelining and the challenges involved in the design of wave-pipelined circuits are described. In Section 3, automation schemes for wave-pipelined circuits are presented. In Section 4, an overview of the DAA, parallel DAA and pipelined DAA schemes is presented. In Section 5, an overview of the CORDIC algorithm is presented. In Section 6, SOC approaches for the implementation of wave-pipelined circuits are discussed and the implementation results are presented. Section 7 summarizes the conclusions.

1-4244-1472-5/07/$25.00 © 2007 IEEE, FPT 2007
2. Review of previous work

Pipelining achieves high speeds in digital circuits at the cost of increased area, power dissipation and routing complexity [4]. Wave-pipelining is proposed as one of the techniques for achieving high speed without the above limitations. Wave-pipelining has been employed for implementing a number of systems on both ASICs and FPGAs [5]. The concept of wave-pipelining has been described in a number of previous works [6], [7]. To illustrate this concept, a graphical representation of the data flow through a combinational logic circuit is used [7]. Fig. 1 shows a typical combinational logic circuit along with the input and output registers [7]. Fig. 2 depicts the flow of data through the above circuit [7]. The skew between the clocks at the input and output registers is denoted as δ. At the beginning of each clock cycle, data is initiated into the combinational logic block through the input register.
Fig. 1. A combinational logic circuit with input and output registers.

Fig. 2. Temporal/spatial diagram of data flow through the combinational logic circuit.

A number of paths may exist between the inputs and the output of a logic block. A change in the input causes the output to change after a delay of Dmin through the shortest path and Dmax through the longest path. The shaded regions bounded by Dmin and Dmax depict the periods where the logic levels of the logic block vary with time. The non-shaded areas depict the stable duration of the logic block. In the conventional system, the output register is clocked in the non-shaded region and the minimum clock period, Tclk, is chosen to be greater than Dmax. In the wave-pipelined system, the clock period is chosen to be (Dmax - Dmin) plus clocking overheads such as set up time and hold time. To ensure correct operation, δ should be adjusted so that the active clock edge occurs in the stable period. As the shaded region grows with the logic depth, the operating clock frequency has to be reduced as the logic depth increases. Moreover, to maximize the frequency of operation of the wave-pipelined system, the difference (Dmax - Dmin) is minimized by equalizing the path delays.

Hence, adjustment of the clock period, adjustment of the clock skew (δ) and equalization of path delays are the three tasks required for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and, if required, altered. Layout editors such as the FPGA Editor from Xilinx or the Floorplanner from Altera may be used for this purpose. These tasks are carried out manually in [8], [9]. The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor. This is because the layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes important as the logic depth of the circuit increases. Hence, in [9] the design is downloaded to the actual FPGA and its operation is checked using a Personal Computer (PC) based test system. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.
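The clock-period relations described in this section can be illustrated numerically. The following is a minimal sketch, not the authors' tool flow: the function names and the delay values are hypothetical, and the conventional period is assumed here to include the same clocking overhead as the wave-pipelined one.

```python
def conventional_min_period(d_max, overhead):
    """Conventional clocking: the period must cover the longest path (Dmax)."""
    return d_max + overhead


def wave_pipelined_min_period(d_max, d_min, overhead):
    """Wave-pipelining: the period only has to cover the spread (Dmax - Dmin)."""
    return (d_max - d_min) + overhead


# Hypothetical delays in nanoseconds (illustrative, not taken from the paper).
d_min, d_max, overhead = 6.0, 10.0, 1.0

t_conv = conventional_min_period(d_max, overhead)
t_wave = wave_pipelined_min_period(d_max, d_min, overhead)
speedup = t_conv / t_wave

# Equalizing the path delays (raising Dmin towards Dmax) shortens the period further.
t_wave_equalized = wave_pipelined_min_period(d_max, 8.0, overhead)

print(t_conv, t_wave, t_wave_equalized)
```

With these assumed numbers the wave-pipelined period is less than half the conventional one, and it shrinks again once the shortest path is padded from 6 ns to 8 ns, which is the motivation for the path-delay equalization task.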
3. Automation schemes for WP circuits

To maximize the operating speed of the wave-pipelined circuit, the equalization of the path delays is considered first. This cannot be completely automated, as the commercially available synthesis tools do not support the specification of interconnect delays. However, the difference in path delays can be minimized by specifying the physical location of the logic cells used for the implementation, through the User Constraints File (UCF) or the LogicLock feature supported by the FPGA CAD tools [3], [10]. The adjustment of the clock skew and clock period can be automated by using a programmable clock, a skew generator and a processor. The programmable clock may be implemented as shown in Fig. 3 using delay blocks and an inverter. The programmable clock skew generator may be implemented using only delay blocks. The actual clock period and clock skew depend on the interconnect delays and the delay in the logic blocks. The select inputs s(0)-s(18) are connected to one data port and a(0)-a(18) are connected to the other port of the Nios II processor. The select inputs of the 8:1 multiplexer (p1, p2, p3) are varied by the processor to achieve different clock frequencies. The clock and skew generator may be programmed using either an off-chip processor or an on-chip processor. The off-chip processor is used when the FPGA acts as a coprocessor or hardware accelerator for a main processor or microcontroller. Off-chip communication between the FPGA and a processor is bound to be slower than on-chip communication. In Fig. 3, a majority logic circuit with 3 inputs is used to minimize the effect of glitches which may arise due to transients in the data lines. The clocks required for the wave-pipelined circuit may also be derived using the internal system clock generator of Altera. However, the multiplication factor has to be specified at synthesis time, and hence the clock frequency cannot be dynamically altered as in the scheme given in Fig. 3. The circuit using the programmable clock and skew generator is a suboptimal wave-pipelined circuit, but it can operate at a higher frequency than that reported by the commercially available synthesis tools, which use Dmax for fixing the operating frequency. In order to minimize the time required for adjusting the parameters of the wave-pipelined circuit (clock frequency and skew), the Built In Self Test (BIST) approach for design for testability [11] may be used. In the BIST approach, a Finite State Machine (FSM) is assumed to be available off-chip and it is used for adjusting the parameters of the wave-pipelined circuit [12]. In the SOC approach, a processor is assumed to be available on-chip and it is used for adjusting the parameters of the wave-pipelined circuit.

Fig. 3. Clock generation scheme
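The 3-input majority logic mentioned for Fig. 3 can be sketched in software as follows. This is only an illustration of the voting function itself; how the three inputs are derived in the actual circuit is not stated in the text, so the three sampled copies of a signal used below are an assumption.

```python
def majority3(a, b, c):
    """Return 1 when at least two of the three input bits are 1."""
    return (a & b) | (b & c) | (a & c)


# Assumed scenario: three copies of the same select line, one of which
# carries a single-sample glitch (index 2 of copy_b).
copy_a = [0, 0, 1, 1, 1, 0]
copy_b = [0, 0, 0, 1, 1, 0]  # glitch: reads 0 where the others read 1
copy_c = [0, 0, 1, 1, 1, 0]

voted = [majority3(a, b, c) for a, b, c in zip(copy_a, copy_b, copy_c)]
print(voted)
```

The voted output follows the two agreeing copies, so a transient on any single line does not propagate.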
3.1. SOC approach for wave-pipelined circuits

The BIST approach requires a number of overheads such as the FSM, a signature generator and a test vector RAM. Instead of using a dedicated circuit such as BIST, a processor may be used to carry out the above tuning and retuning tasks. The tasks performed in software use the on-chip processor. The hardware block may use wave-pipelining and it may be retuned by the on-chip processor periodically. Hence, the retuning task may be time shared with the other tasks performed by the processor. The block diagram of a wave-pipelined circuit which is tuned using the SOC approach is shown in Fig. 4. It consists of a programmable clock, a clock skew generator and block RAMs for storing the input and output vectors of the hybrid wave-pipelined circuit. The select inputs for the clock and skew blocks and the data inputs to the wave-pipelined circuit may be applied and varied through the on-chip processor. During normal operation, the block RAM contains the array of data to be processed. In the test mode, the block RAM contains the test data. During the test mode, the processor writes the test vectors into the block RAM, systematically applies the select inputs for the clock generator and clock skew blocks, and uploads the results stored in the output block RAM for each combination of select inputs. It keeps varying the select inputs and repeats the above steps until an operating frequency at which the circuit works for three different skew values is found. A variety of choices exist for the implementation of the SOC. The SOC may consist of a hard-core processor such as a PowerPC or ARM processor and an FPGA coprocessor or DSP block. Alternatively, it may consist of a soft-core processor such as Nios II or MicroBlaze and a custom DSP block implemented in the FPGA. In this paper, an Altera FPGA based SOC consisting of the Nios II soft-core processor is used for the implementation.
Fig. 4. SOC approach for wave-pipelined circuit.
3.2. Procedure for adjusting the clock period and skew

In this paper, the DAA/CORDIC block is implemented as custom hardware for the soft-core processor as shown in Fig. 4. The period of the clock signal and the delay introduced by the clock skew blocks depend on the interconnect delays. Hence, the location of the logic elements and the interconnects used for the implementation of these blocks should be fixed, so that when these blocks are integrated with the DAA/CORDIC block or the processor, the interconnect delays are not altered. This is achieved by using the LogicLock feature in Altera. The operating frequency of the wave-pipelined circuit is expected to lie between that of the non-pipelined circuit and that of the pipelined circuit. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the non-pipelined and pipelined circuits respectively. The approximate values of these two frequencies are found from the synthesis report for the circuit to which the clock is to be applied. After determining the range of frequencies to be generated by the clock circuit, the number of delay blocks is adjusted.
4. Design of distributed arithmetic algorithm

Distributed Arithmetic (DA) plays an important role in embedding DSP functions in Look-up Table (LUT) based FPGAs and enables the FPGAs to achieve performance superior to that of programmable DSPs. DA can be optimized for area efficiency, speed efficiency or both. For efficient implementation of DA on FPGAs, a number of algorithms such as the Read Only Memory (ROM) decomposition technique and offset binary coding have been proposed in the literature [4]. Normally, for the computation of a vector dot product using DA, the content of the DA ROM is stored assuming multiplication using 2's complement arithmetic with the sign extension technique.

The computation of the output of an N tap Linear Time Invariant (LTI) filter and the computation of the transform of an N×1 vector can be generalized as the problem of computing the sum of products given by

$$y(n) = \sum_{k=0}^{N-1} a(n,k)\,x(k) \tag{1}$$

In the case of LTI filters and transform computation, a(n,k) is time invariant and only x(k) varies with time. In view of this, y(n) can be computed by using look up tables for the multiplication. This can be achieved as follows: the input samples x(k) may be assumed to be represented in 2's complement form using W bits and can be written as

$$x(k) = -x(W-1,k) + \sum_{m=1}^{W-1} x(W-1-m,k)\,2^{-m} \tag{2}$$

Substituting equation (2) in (1) and interchanging the order of summation w.r.t. m and k, we get

$$y(n) = -S(W-1) + \sum_{m=0}^{W-2} S(m)\,2^{-(W-1-m)} \tag{3}$$

where

$$S(m) = \sum_{k=0}^{N-1} x(m,k)\,a(n,k) \tag{4}$$

It may be noted that x(m,k), for m = 0, 1, ..., W-1, takes the binary values 1 or 0. Hence, S(m) can be computed using a ROM addressed by the bits x(m,0), x(m,1), ..., x(m,N-1). Furthermore, the ROM contents used to compute S(m) are the same for all values of m.
4.1. Full parallel DA algorithm

To compute y(n), W ROMs, ROM 0 to ROM (W-1), can be used. ROM 0 to ROM (W-2) have the same content and correspond to S(0) to S(W-2). ROM (W-1) corresponds to [-S(W-1)] and its content is the 2's complement of the content of the other ROMs. The MSBs of all the samples are fed as the address to the (W-1)th ROM. The next bits of all the samples are fed as the address to the (W-2)th ROM. Similarly, the LSBs of all the samples are fed as the address to the 0th ROM. For W = 8, y(n) can be computed using four stages of adders. y(n) is expressed using S(0) to S(7) in equation (5).

$$y(n) = \left\{ \left[-S(7) + S(6)\,2^{-1}\right] + \left[S(5) + S(4)\,2^{-1}\right]2^{-2} \right\} + \left\{ \left[S(3) + S(2)\,2^{-1}\right] + \left[S(1) + S(0)\,2^{-1}\right]2^{-2} \right\}2^{-4} \tag{5}$$

Equation (5) requires multiplication of numbers by 2^(-i). If 2's complement multiplication with sign extension is used, this requires shifting the number right i times and replicating the MSB i times. For example, multiplication of the number 10100101, represented in 2's complement form, by 2^(-4) results in the number 1111 1010 0101. The full parallel DAA scheme with 2's complement multiplication with sign extension is shown in Fig. 5. The logic depth, or the number of stages of logic elements required for the DA filter, depends on the number of taps. The number of stages required for DA filters with 8, 16 and 32 taps is 4, 5 and 6 respectively.
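The bit-level decomposition of equations (1) to (4) can be checked numerically. The sketch below is an illustrative software model, not the hardware of Fig. 5: the function names are hypothetical, the coefficients are taken as integers, and each W-bit sample s is interpreted as the fraction s / 2^(W-1), which is an assumption consistent with equation (2). Bit index m = W-1 denotes the sign bit.

```python
from fractions import Fraction


def build_da_lut(coeffs):
    """ROM content: entry `addr` holds the sum of coeffs[k] for every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot(coeffs, samples, width):
    """Sum of products via LUT look-ups, following equations (3) and (4)."""
    lut = build_da_lut(coeffs)
    mask = (1 << width) - 1
    words = [s & mask for s in samples]  # W-bit 2's complement bit patterns

    def s_of(m):
        # Address formed from bit m of every sample: x(m,0), x(m,1), ...
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> m) & 1) << k
        return lut[addr]

    y = -Fraction(s_of(width - 1))            # sign-bit term, -S(W-1)
    for m in range(width - 1):                # remaining bits, weighted by 2^-(W-1-m)
        y += Fraction(s_of(m), 1 << (width - 1 - m))
    return y


def direct_dot(coeffs, samples, width):
    """Reference result: ordinary multiply-accumulate on the fractional values."""
    return sum(Fraction(c * s, 1 << (width - 1)) for c, s in zip(coeffs, samples))


coeffs = [3, -1, 4, 2]            # hypothetical 4-tap coefficients
samples = [-128, 127, 5, -37]     # hypothetical 8-bit signed samples
print(da_dot(coeffs, samples, 8), direct_dot(coeffs, samples, 8))
```

Exact rational arithmetic is used so that the two results can be compared for equality; a hardware implementation would instead use the sign-extended shifts described for equation (5).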
4.2. ROM decomposition and Pipelining for DAA

The DA algorithm discussed above can be modified to reduce the size of the ROM required. Fig. 5 shows the ROM decomposition technique for the DA algorithm [4]. It can be verified that an N tap filter requires Distributed Arithmetic Look-up Tables (DALUTs) with 2^N locations. The exponential growth in the ROM size can be avoided by splitting the N address bits of the ROM into blocks of K address bits each. Now, only K-input DALUTs are required and hence the individual ROM size becomes 2^K. In total, N/K such DALUTs are required for computing the output corresponding to a particular bit of the input samples. To get the correct output, the outputs of the K-input DALUTs have to be added. In the scheme shown in Fig. 5, the maximum sampling rate, or equivalently the maximum clock frequency for the input register of the DAA block, is determined by the processing time in the combinational logic block consisting of the ROM and the adders. The clock frequency can be increased by introducing pipeline registers at the output of the ROM and at the output of the adders. This scheme is also referred to as synchronous pipelining. In this case, the maximum operating frequency is determined by the largest critical path delay between any two registers. Pipelining increases the operating frequency at the cost of an increase in the number of registers, routing complexity and power dissipation.
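The saving from ROM decomposition can be illustrated with a short sketch. The helper names and the coefficient values are hypothetical; the code only models the table sizes and checks that adding the outputs of the K-input tables reproduces the single N-input table.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def full_rom_words(n_taps):
    """Locations needed by one undecomposed DALUT: 2^N."""
    return 1 << n_taps


def decomposed_rom_words(n_taps, k):
    """Locations needed by N/K DALUTs with K inputs each: (N/K) * 2^K."""
    if n_taps % k != 0:
        raise ValueError("this sketch assumes K divides N")
    return (n_taps // k) * (1 << k)


def decomposed_lookup(coeffs, addr, k):
    """Add the outputs of the K-input DALUTs, one per block of K address bits."""
    total = 0
    for start in range(0, len(coeffs), k):
        small_lut = build_lut(coeffs[start:start + k])
        total += small_lut[(addr >> start) & ((1 << k) - 1)]
    return total


coeffs = [3, -1, 4, 2, 7, -5, 6, 1]   # hypothetical 8-tap coefficients
full = build_lut(coeffs)
same = all(full[a] == decomposed_lookup(coeffs, a, 4) for a in range(1 << 8))
print(full_rom_words(16), decomposed_rom_words(16, 4), same)
```

For a 16 tap filter with K = 4, the table storage per input bit drops from 65536 locations to 64, at the cost of the extra adders that combine the four DALUT outputs.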
Fig. 5. Distributed Arithmetic using ROM decomposition.
5. Design of CORDIC algorithm

CORDIC is the acronym for COordinate Rotation DIgital Computer. CORDIC is an iterative arithmetic algorithm introduced by Volder [13] and later refined by Walther [14] and others. A CORDIC unit uses only shifts and adds to perform a wide range of functions including certain trigonometric, hyperbolic, linear and logarithmic functions. The CORDIC algorithm is used in diverse applications such as mathematical coprocessor units, calculators, waveform generators, universal modulators, demodulators, digital filters, carrier as well as bit time recovery circuits and digital modems [15]. The operating frequency of the CORDIC unit may be increased if it is implemented using either pipelining or wave-pipelining. Implementation of a self tuned wave-pipelined CORDIC unit is considered next. The CORDIC algorithm provides an iterative method of performing vector rotations by arbitrary angles using shifts and adds. In the rotation mode, CORDIC may be used for converting a vector in polar form to rectangular form. In the vector mode, it converts a vector in rectangular form to polar form [16]. The functionality of the circuit may be verified by taking the cosine value as output.
5.1. Rotation mode of CORDIC

The CORDIC algorithm for this mode is derived from the general rotation transform

$$x_{fin} = x_{in}\cos\theta - y_{in}\sin\theta \tag{6}$$

$$y_{fin} = y_{in}\cos\theta + x_{in}\sin\theta \tag{7}$$

which rotates a vector (x_in, y_in) in a Cartesian plane by an angle θ to another vector with the coordinates (x_fin, y_fin). The rotation may be achieved by performing a series of successively smaller elementary rotations θ_0, θ_1, θ_2, ..., θ_N such that θ = Σ θ_i, with the sum taken over i = 0, ..., N. Rotation of the vector by an angle θ_i can be written as

$$x_{i+1} = x_i\cos\theta_i - y_i\sin\theta_i \tag{8}$$

$$y_{i+1} = y_i\cos\theta_i + x_i\sin\theta_i \tag{9}$$

or equivalently as

$$\frac{x_{i+1}}{\cos\theta_i} = x_i - y_i\tan\theta_i \tag{10}$$

$$\frac{y_{i+1}}{\cos\theta_i} = y_i + x_i\tan\theta_i \tag{11}$$

The computational complexity of (10), (11) can be reduced by rewriting these equations as

$$x_{i+1} = x_i - y_i\tan\theta_i \tag{12}$$

$$y_{i+1} = y_i + x_i\tan\theta_i \tag{13}$$

and compensating for the omitted cos θ_i factors together for all the N+1 iterations by multiplying the value of (x_N, y_N) by the product of cos θ_i over i = 0, ..., N. Further, the value of θ_i for i = 0, 1, 2, ..., N is chosen such that tan θ_i is 2^(-i). This reduces the multiplication by tan θ_i to a simple shift operation. As the iteration count increases, θ_i becomes smaller and smaller. We may terminate the iteration when the difference between θ and Σ θ_i becomes very small for some value of N. The remaining angle by which the vector needs to be rotated after completion of iteration i is indicated by the parameter z_{i+1} and is defined by equation (14).

$$z_{i+1} = \theta - \sum_{j=0}^{i}\theta_j \tag{14}$$

which gives the recursion

$$z_0 = \theta \tag{15a}$$

$$z_{i+1} = z_i - \theta_i \tag{15b}$$

θ_i is considered to be positive when the rotation required is anticlockwise and negative otherwise. To approximate an arbitrary angle using θ_i of the form tan^(-1)(2^(-i)), θ_i may have to be chosen to be negative for some values of i. Since tan θ_i is +2^(-i) when θ_i is positive and -2^(-i) otherwise, the iterative equations may be rewritten as

$$d_i = \mathrm{sgn}(z_i) \tag{16}$$

$$x_{i+1} = x_i - d_i\,y_i\,2^{-i} \tag{17}$$

$$y_{i+1} = y_i + d_i\,x_i\,2^{-i} \tag{18}$$

$$z_{i+1} = z_i - d_i\tan^{-1}\!\left(2^{-i}\right) \tag{19}$$

The shift and add operation required for each iteration is carried out using a single shift and add block in the serial CORDIC scheme (also referred to as the folded CORDIC scheme). Separate hardware blocks are used for each iteration in the case of the parallel or unrolled CORDIC scheme. The block diagram of the unrolled CORDIC unit with 5 stages (corresponding to 5 iterations) is shown in Fig. 6 [16]. The entire CORDIC unit is reduced to an array of interconnected adder-subtractors [16]. The functionality of the circuit is verified by taking the cosine value as output.
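The rotation-mode iterations (16) to (19) can be sketched in floating point as follows. This is an illustrative software model, not the fixed-point hardware of Fig. 6: the function names are hypothetical, sgn(z) is taken as +1 for z >= 0, and the accumulated cos θ_i factors are applied once at the end as a single gain.

```python
import math


def cordic_gain(n_iter):
    """Product of cos(theta_i) for i = 0 .. n_iter-1, with tan(theta_i) = 2^-i."""
    gain = 1.0
    for i in range(n_iter):
        gain *= math.cos(math.atan(2.0 ** -i))
    return gain


def cordic_rotate(theta, n_iter=16):
    """Rotate the unit vector (1, 0) by theta (radians); returns (cos, sin) estimates."""
    x, y, z = 1.0, 0.0, theta
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0          # equation (16)
        shift = 2.0 ** -i                    # a right shift by i bits in hardware
        x, y, z = (x - d * y * shift,        # equation (17)
                   y + d * x * shift,        # equation (18)
                   z - d * math.atan(shift)) # equation (19)
    k = cordic_gain(n_iter)                  # compensate the omitted cos factors
    return x * k, y * k


cos_est, sin_est = cordic_rotate(0.5)
print(cos_est, sin_est, cordic_gain(7))
```

The iteration converges only for angles whose magnitude does not exceed the sum of all θ_i (about 1.74 rad), so larger angles would need a quadrant correction that this sketch omits.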
6. Implementation of self tuned wave-pipelined circuits using SOC approach

For the SOC approach, the soft-core processor, Nios II, is implemented on Altera FPGAs. The DAA and CORDIC units are implemented as custom hardware. The optimal clock period and clock skews are determined using the procedure described in Section 3. The wave-pipelined DAA/CORDIC units (obtained by adding the input and output block RAMs to the non-pipelined circuit along with the programmable clock and clock skew blocks) are tested first using simulation. As mentioned in Section 2, simulation is inadequate for testing the wave-pipelined circuit. Hence, this circuit is implemented along with the Nios II soft-core processor, and it is added as a custom block to the Nios II using SOPC builder. The program to be executed by the Nios II is written in C/C++ and the custom block is invoked as a function in the C/C++ program. A C++ program is written to read from and write to the block RAM in the custom block. The program is compiled, and the executable code along with the configuration bits corresponding to the Nios II integrated with the custom block is downloaded to the FPGA. When the program is run, it systematically varies the select inputs for the clock and clock skew blocks, and uploads the content of the output block RAM. It compares this with the expected results. The clock and skew are adjusted until a match occurs for at least three consecutive clock skews. The operating clock and clock skew of the wave-pipelined circuit are fixed at the middle value, and from then on the custom block works without any intervention from the Nios II processor. The Nios II processor interacts with the custom block only when retuning is required. We have found the custom block to work reliably at the initially tuned frequency for more than 6 hours. (This is checked by running the program at intervals of 1 hour and verifying the results.) The circuit was not tested beyond 6 hours.
The delay values chosen through the automation procedure were found to be satisfactory for 6 hours. If the tests had been continued further, the delay values could have become unsatisfactory. The time after which the circuit fails may be 12 hours, a few days, a few years or the entire life of the chip. The time to failure depends on the design margin. It is suggested that the retuning of the clock skews be done periodically so that delay variation due to temperature does not affect the system performance.
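The tuning procedure described above can be sketched as a search loop. This is a minimal model under stated assumptions rather than the authors' Nios II program: `passes` is a hypothetical callback standing in for "apply the select inputs, run the test vectors and compare the output block RAM with the expected results", and the clock settings are assumed to be ordered from the highest frequency to the lowest.

```python
def find_operating_point(passes, clock_settings, skew_settings, run_len=3):
    """Return (clock, skew) for the first clock setting that passes for
    `run_len` consecutive skew settings, choosing the middle skew of the run.
    Returns None when no clock setting satisfies the condition."""
    for clk in clock_settings:
        run = []
        for skew in skew_settings:
            if passes(clk, skew):
                run.append(skew)
                if len(run) == run_len:
                    return clk, run[run_len // 2]
            else:
                run = []  # the passing skews must be consecutive
    return None


# Hypothetical circuit model: it works only from clock setting 2 onwards
# (slower clocks) and only for skew settings 3 to 6.
def fake_circuit(clk, skew):
    return clk >= 2 and 3 <= skew <= 6


result = find_operating_point(fake_circuit, range(4), range(8))
print(result)
```

Choosing the middle of three passing skews leaves a margin on both sides, which is the reason given in the text for requiring a match at three consecutive skew values before fixing the operating point.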
Fig. 6. Unrolled CORDIC unit.

The computation of the CORDIC scale factor $\prod_{i=0}^{N}\cos\theta_i$ may be simplified as follows: since $\cos\theta_i \approx 1$ for very small values of $\theta_i$, the product may be computed for N = 6 and used for any other value of N > 6. For N = 6, $K = \prod_{i=0}^{6}\cos\theta_i = 0.6073$.
6.1. Implementation results on DAA using Cyclone-II EP2C35F672C6

In order to demonstrate the applicability of the SOC approach, the Altera Cyclone-II EP2C35F672C6 is chosen for the implementation of 8, 16 and 32 tap filters using the DAA approach. For verification of the proposed technique, the three filters are also implemented with and without pipelining. For the wave-pipelined circuit, the number of logic elements (LEs), number of registers, maximum operating frequency and power dissipated are computed and the results are given in Fig. 7. The overheads required for the wave-pipelined circuits are also shown in Fig. 7. From this figure, it may be noted that the wave-pipelined DA filter is faster by a factor of 1.43-1.6 compared to the non-pipelined DA filter. The pipelined DA filter is faster by a factor of 2.19-3.27 compared to the wave-pipelined DA filter. This is achieved with an increase in the number of registers by 436-469% and an increase in the number of LEs by 26.6-30.16% compared to the wave-pipelined DA filter.

In order to assess the superiority of wave-pipelining with regard to power dissipation, both wave-pipelined and pipelined circuits are operated at the same frequency (corresponding to the maximum operating frequency of the wave-pipelined circuit) and the power dissipated for the 8, 16 and 32 tap filters is given in Fig. 7. In all three filters, wave-pipelined circuits dissipate less power than pipelined circuits if the power dissipation due to the overhead circuits is not taken into account. However, the reduction in the power dissipation for the wave-pipelined circuit decreases as the logic depth increases. (As noted in Section 4, the logic depth required for DA filters with 8, 16 and 32 taps is 4, 5 and 6 respectively.) If the power dissipation due to the overhead circuits is also taken into account, the power dissipation of the pipelined filter is higher by 15.1% and 6.1% for the 16 tap and 32 tap filters. For the 8 tap filter, the power dissipation of the pipelined filter is lower by 5.9%. The power dissipation due to overheads decreases with the number of taps, as a filter with a higher number of taps operates at a lower frequency. At lower logic depths, the overheads make wave-pipelining inefficient. At higher logic depths, the overheads along with the increased capacitance make the wave-pipelined DA filter less efficient with regard to power dissipation.

Fig. 7. Implementation results of 8, 16 and 32 tap filters using DAA approach. (Circuit types compared: non-pipelined circuit, pipelined circuit, wave-pipelined circuit, and wave-pipelined circuit with additional overhead.)
6.2. Implementation results on CORDIC using Cyclone-II EP2C35F672C6

The CORDIC units with word sizes of 8, 16 and 32 bits are implemented on the Cyclone-II EP2C35F672C6 with and without pipelining. For the wave-pipelined circuit, the number of logic elements, number of registers, maximum operating frequency and power dissipated are computed and the results are given in Fig. 8. The overheads required for the wave-pipelined circuits are also shown in Fig. 8. From this figure, it may be noted that the wave-pipelined CORDIC unit is faster by a factor of 1.19-1.26 compared to the non-pipelined CORDIC unit. The pipelined CORDIC unit is faster by a factor of 1.83-2 compared to the wave-pipelined CORDIC unit. This is achieved at the cost of an increase in the number of registers by a factor of 4.43-8.43 and an increase in the number of LEs by a factor of 1.14-1.42 compared to the wave-pipelined CORDIC unit. In order to assess the superiority of wave-pipelining with regard to power dissipation, both wave-pipelined and pipelined circuits are operated at the same frequency and the power dissipated for different word sizes is also given in Fig. 8. From this figure, it may be noted that if the overheads are not considered, the pipelined CORDIC unit dissipates 5.5-7.3% more power than the wave-pipelined CORDIC unit. If the overheads required for the wave-pipelined CORDIC unit are also considered, the pipelined CORDIC unit dissipates more power (2.54%) than the wave-pipelined CORDIC unit only for the highest word size (32 bit). This can be explained as follows: the area of the overhead is independent of the word size, whereas the highest operating frequency decreases with the word size. Hence, the power dissipated by the overhead circuit decreases with the word size. On the other hand, the power dissipated by the CORDIC unit increases with the word size. If the word size is small, the power dissipated by the overheads becomes significant and this makes the wave-pipelined CORDIC unit inefficient. As the word size is increased, the wave-pipelined CORDIC unit becomes more and more efficient compared to the pipelined CORDIC unit. It may be noted that the SOC approach is also applicable to Xilinx FPGAs.
Fig. 8. Implementation results of CORDIC unit with word size of 8, 16 and 32 bits. (Circuit types compared: non-pipelined circuit, pipelined circuit, wave-pipelined circuit, and wave-pipelined circuit with additional overhead.)
7. Conclusion
The automation scheme proposed in this paper for the FPGA implementation of wave-pipelined circuits is tested using DAA filters and a CORDIC based sine wave generator. It is observed that wave-pipelined circuits operate faster than non-pipelined circuits. The pipelined circuits are in turn faster than the wave-pipelined circuits, and this is achieved with an increase in the number of registers and LEs or slices. When both pipelined and wave-pipelined circuits are operated at the same frequency, the superiority of one over the other with regard to power dissipation depends on the logic depth of the circuit and the input word size. From the implementation results, it is observed that only for higher word sizes and medium tap filters is the SOC based wave-pipelined circuit more efficient in both area and power dissipation than the pipelined circuit.
References
[1] Flavio R. Wagner, Wander O. Cesario, Luigi Carro and Ahmed A. Jerraya, "Strategies for the integration of hardware and software IP components in SOC," Elsevier Integration, The VLSI Journal, pp. 1-31, Nov. 2003.
[2] G. Martin and H. Chang, "System-on-Chip design," Proc. of Intl. Conf. on ASIC, pp. 12-17, 2001.
[3] Altera documentation library - 2003, Altera Corporation, USA.
[4] K. K. Parhi, "VLSI signal processing systems," John Wiley & Sons, 1999.
[5] J. Nyathi and J. G. Delgado-Frias, "A hybrid wave pipelined network router," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 49, no. 12, pp. 1764-1772, Dec. 2002.
[6] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave-pipelining: a tutorial and research survey," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 3, pp. 464-474, Sep. 1998.
[7] C. Thomas Gray, W. Liu and R. Cavin, "Wave Pipelining: Theory and Implementation," Kluwer Academic Publishers, 1993.
[8] E. I. Boemo, S. Lopez-Buedo and J. M. Meneses, "Wave pipelines via LUTs," IEEE International Symposium on Circuits and Systems ISCAS '96, vol. 4, pp. 185-188, 1996.
[9] G. Lakshminarayanan and B. Venkataramani, "Optimization techniques for FPGA based wave-pipelined DSP blocks," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 7, pp. 783-793, July 2005.
[10] Xilinx documentation library, Xilinx Corporation, USA.
[11] M. J. S. Smith, "Application Specific Integrated Circuits," Pearson Education Asia Pvt. Ltd, Singapore, 2003.
[12] G. Seetharaman, B. Venkataramani and G. Lakshminarayanan, "Design and FPGA implementation of self-tuned wave-pipelined filters," IETE Journal of Research, vol. 52, no. 4, pp. 305-313, July-August 2006.
[13] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. on Electronic Computers, vol. EC-8, no. 3, pp. 330-334, Sept. 1959.
[14] J. S. Walther, "A unified algorithm for elementary functions," Spring Joint Computer Conf., pp. 379-385, 1971.
[15] R. Andraka, "A Survey Of CORDIC Algorithm For FPGAs," Proc. of ACM/SIGDA sixth international symposium of FPGAs (FPGA'98), Monterey, CA, pp. 191-200, Feb 22-24, 1998.
[16] W. Tuttlebee, "Software defined radio: Baseband technology for 3G," Wiley, 2004.
