Muravin Project

Design of Two Different 128-bit Adders
Project Report

By Vladislav Muravin
Concordia ID: 5505763

COEN6501: Digital Design & Synthesis
Offered by Professor Asim Al-Khalili

Concordia University
December 2004

Table of Contents
1 INTRODUCTION
1.1 REPORT ORGANIZATION
1.2 COMMON ADDER STRUCTURES
1.2.1 1-bit Full Adder
1.2.2 N-bit Ripple Carry Adder
1.2.3 Carry Skip Adder
1.2.4 Carry Select Adder
1.2.5 Carry Look Ahead Adder
1.2.6 Prefix Adders
1.2.7 Sklansky Prefix Adder
1.2.8 Kogge-Stone Prefix Adder
2 DESIGN FLOW & IMPLEMENTATION
2.1 MICRO ARCHITECTURE
2.1.1 Top Entity
2.1.2 Sub-Block Partitioning
2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)
2.1.2.2 Carry Generation Block
2.1.2.2.1 Carry Generation Block Sklansky Prefix Adder (cg_gen_sklansky)
2.1.2.2.2 Carry Generation Block Kogge-Stone Prefix Adder (cg_gen_kogge_stone)
2.1.2.3 Sum Bits Generation Block (sb_gen)
2.2 RTL CODING
2.3 VERIFICATION PLAN
2.4 SYNTHESIS, PLACE AND ROUTE
3 RESULTS
3.1 SIMULATION RESULTS
3.1.1 Initial Test Cases
3.1.2 General Test Case
3.2 SYNTHESIS RESULTS
3.2.1 Multiplexing I/O
3.2.1.1 Multiplexed Inputs
3.2.1.2 Multiplexed Outputs
3.2.1.3 Multiplexed Inputs and Outputs
3.2.2 Changing Target Device
4 DESIGN ENHANCEMENT - PIPELINING
5 SUMMARY AND CONCLUSIONS
6 REFERENCES

Table of Figures
FIGURE 1: 1-BIT FULL ADDER
FIGURE 2: N-BIT CARRY PROPAGATE ADDER
FIGURE 3: CARRY SKIP CONCEPT
FIGURE 4: CARRY SELECT CONCEPT
FIGURE 5: SKLANSKY PREFIX TREE
FIGURE 6: KOGGE-STONE PREFIX TREE
FIGURE 7: DESIGN FLOW
FIGURE 8: TOP LEVEL VIEW
FIGURE 9: FULL_ADDER SUB-BLOCK PARTITIONING
FIGURE 10: "CARRY GENERATE" AND "CARRY PROPAGATE" BLOCK IMPLEMENTATION
FIGURE 11: SUM BITS GENERATION BLOCK IMPLEMENTATION
FIGURE 12: TEST BENCH & VERIFICATION PLAN
FIGURE 13: INITIAL TEST CASE SIMULATION RESULTS
FIGURE 14: GENERAL TEST CASE - FULL ZOOM
FIGURE 15: GENERAL TEST CASE - EXAMPLE 1
FIGURE 16: GENERAL TEST CASE - EXAMPLE 2
FIGURE 17: FORWARD REGISTERS BALANCING (PIPELINING)
FIGURE 18: BACKWARD REGISTERS BALANCING (PIPELINING)

TABLE 1: SIGNAL DESCRIPTION
TABLE 2: SYNTHESIS RESULTS (NO PLACEMENT AND ROUTING): XC2V500-FG456-4 DEVICE
TABLE 3: SYNTHESIS RESULTS: XC2V1000-FF896-4 DEVICE
TABLE 4: PLACEMENT AND ROUTING RESULTS: FF896-4 DEVICE
TABLE 5: PLACEMENT AND ROUTING RESULTS OF PIPELINED SKLANSKY ADDER
TABLE 6: PLACEMENT AND ROUTING RESULTS OF PIPELINED KOGGE-STONE ADDER

1 Introduction
The objective of this project is to design two different 128-bit adders by going through
the full design cycle, from initial concept to structural RTL coding, simulation and
synthesis, for the Xilinx Virtex-2 FPGA family, device XC2V500.
1.1 Report Organization
The report is organized into six sections. Section 1 introduces common principles of
adder design and common adder structures, briefly describing the Carry Skip, Carry Select
and Carry Look-Ahead principles and then elaborating on parallel-prefix adders, two of
which, the Sklansky prefix adder and the Kogge-Stone prefix adder, are implemented in this
project. Section 2 describes the design flow, the micro architecture of the design and the
verification plan. Section 3 presents the simulation and synthesis results, and Section 4
describes pipelining as a design enhancement. Finally, Sections 5 and 6 close the report
with the conclusions and references, respectively.
1.2 Common Adder Structures
1.2.1 1-bit Full Adder
A 1-bit Full Adder is shown in Figure 1. The equations describing the outputs are:
$$S = A \oplus B \oplus C_{in}$$
$$C_{out} = A \cdot B + (A \oplus B) \cdot C_{in}$$

[Diagram: Full Adder block with inputs A, B and Cin and outputs S and Cout]

Figure 1: 1-bit Full Adder
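
As a quick sanity check, the full adder can be modelled in a few lines of Python, taking S = A xor B xor Cin and Cout = A·B + (A xor B)·Cin. This is only an illustrative sketch and not part of the project's VHDL; the function name full_adder is chosen here for illustration.

```python
from itertools import product


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (s, c_out) for one-bit inputs a, b and c_in."""
    s = a ^ b ^ c_in
    c_out = (a & b) | ((a ^ b) & c_in)
    return s, c_out


# Exhaustive check of all eight input combinations against integer addition.
for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    assert 2 * c_out + s == a + b + c_in
```

The loop confirms that the two equations together encode the 2-bit value a + b + c_in.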
1.2.2 N-bit Ripple Carry Adder
Building an N-bit adder iteratively leads to a cascade of N 1-bit full adders. This
concept is illustrated in Figure 2. As N increases, the critical path, which is the carry
path from $C_0$ to $C_{out}$, grows linearly with N.
[Diagram: chain of Full Adders for bit positions 0, i and n-1, with inputs $A_i$ and $B_i$, carries $C_0$, $C_i$ and $C_{out}$, and outputs $S_0$, $S_i$ and $S_{n-1}$]

Figure 2: N-bit Carry Propagate Adder
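
The cascading idea can be sketched as a bit-serial loop in which the carry of stage i feeds stage i+1. The Python model below is an illustration only (the name ripple_carry_add and its parameters are assumptions made for this sketch); the number of loop iterations mirrors the linear growth of the carry path.

```python
def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Add two n-bit integers one bit at a time; return (sum, carry_out)."""
    carry = c_in
    total = 0
    for i in range(n):                      # one full adder per bit position
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i     # sum bit of stage i
        carry = (ai & bi) | ((ai ^ bi) & carry)  # carry into stage i + 1
    return total, carry
```

For example, ripple_carry_add(2**128 - 1, 1, 128) has to propagate the carry through all 128 stages before the carry-out is known.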
1.2.3 Carry Skip Adder
Let $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, where p denotes "propagate" and g denotes "generate".
The basic carry-skip (or carry-bypass) adder divides an N-bit adder into $N/M$ blocks,
where each block contains M bits. This is shown in Figure 3. Within each block, a simple
M-bit ripple carry adder is realized (linear time Carry Skip Adder), where the "propagate"
and "generate" signals for the respective input bits are used to form the output sum bits
and the output carries. The multiplexer at the end of a block allows the input carry to
bypass the block when all of the "propagate" signals in that block are asserted. After the
carry generate delay of the first block, the bypassing of carries in subsequent blocks
results in the carry-propagate delay. If any of the "propagate" signals in a block is
unasserted, the carry out of that block does not depend on the input carries from the
previous blocks. The critical path delay is
$$t_{PD} = t_{setup} + M \cdot t_{FA} + \left(\frac{N}{M} - 1\right) t_{MUX} + (M - 1) \cdot t_{FA} + t_{SUM}$$
Section 1.2.4 explains how better performance can be achieved by varying the block size.
[Diagram: M-bit carry propagation blocks operating on M-bit slices of A and B and producing M-bit sums, each block followed by carry select logic; Cin enters the first block and Cout leaves the last]

Figure 3: Carry Skip Concept
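
To see how the block size M trades ripple delay inside a block against bypass delay between blocks, the delay can be evaluated numerically. The sketch below assumes the usual carry-bypass delay model t_PD = t_setup + M·t_FA + (N/M - 1)·t_MUX + (M - 1)·t_FA + t_SUM and, purely for illustration, unit delays for every component; the function name and the default values are hypothetical and not taken from the project.

```python
def carry_skip_delay(n: int, m: int, t_setup: float = 1.0, t_fa: float = 1.0,
                     t_mux: float = 1.0, t_sum: float = 1.0) -> float:
    """Critical-path delay of an n-bit carry-skip adder built from m-bit blocks."""
    blocks = n // m
    return t_setup + m * t_fa + (blocks - 1) * t_mux + (m - 1) * t_fa + t_sum


# Sweep the power-of-two block sizes for a 128-bit adder (unit delays assumed).
candidates = (1, 2, 4, 8, 16, 32, 64, 128)
best_m = min(candidates, key=lambda m: carry_skip_delay(128, m))
```

Under these unit-delay assumptions the minimum lies at M = 8 with 32 delay units, compared with 40 units for M = 4 or M = 16; real component delays would shift the optimum.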
1.2.4 Carry Select Adder
Despite the larger amount of hardware it needs, this type of adder has a very interesting
design concept. The linear Carry Select Adder is divided into $N/M$ blocks, where each
block contains M bits, just as in the Carry Skip Adder. In each block, the hardware is
replicated in order to calculate the sum and carry-out bits for both possible carry-ins.
Figure 4 illustrates this concept. The multiplexer at the end chooses between the two
carry-outs based on the carry-in from the previous stage. In this implementation, the
critical path delay comprises the carry-generate delay of the first block, followed by the
multiplexer delays of the successive blocks. This results in a linear time Carry Select
Adder.
Variable-sized blocks can yield higher performance [5]. For a carry-select adder, the
block sizes can increase so that the delay is minimized by allowing all the inputs to
arrive at each multiplexer at the same time. For example, if the multiplexer delay is
similar to the delay of a full adder, then the minimal carry delay can be achieved by
adding 1 bit in the first block, 2 in the second, and so on. Linearly increasing block
sizes result in a square-root number of block stages for the carry propagate delay, and
hence a square-root time Carry Select Adder (CSA). A similar approach can yield a
square-root time Carry Skip Adder (CSkA).
[Diagram: for each M-bit slice of A and B, two M-bit Adders computing the results for carry-in 0 and carry-in 1, with multiplexers selecting SUM(0) to SUM(M-1) and the carry-out; Cin enters the first block and Cout leaves the last]

Figure 4: Carry Select Concept
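
The effect of linearly increasing block sizes can be illustrated with a short calculation. The helper below is a sketch under the simplifying assumption stated in the text (block k holds k bits, i.e. multiplexer delay equal to full adder delay); the function name is hypothetical, and the last block is simply trimmed so that the sizes add up to the operand width.

```python
def linear_block_sizes(n: int) -> list[int]:
    """Block sizes 1, 2, 3, ... covering n bits; the last block is trimmed to fit."""
    sizes = []
    covered = 0
    size = 1
    while covered < n:
        current = min(size, n - covered)
        sizes.append(current)
        covered += current
        size += 1
    return sizes


sizes_128 = linear_block_sizes(128)
```

For 128 bits this gives 16 blocks (1 + 2 + ... + 15 = 120 bits plus a final 8-bit block), which is close to the square root of 2N, instead of the N/M blocks of the linear scheme.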
1.2.5 Carry Look Ahead Adder
The Ripple Carry Adder generates the carries sequentially, making the output carry of each
stage dependent on the input carry to that stage. In a Carry Look Ahead implementation,
each carry is computed directly from the operand bits and the input carry, without waiting
for the previous carries.
Let $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, where p denotes "propagate" and g denotes "generate".
Then $s_i = p_i \oplus c_i$ and $c_{i+1} = g_i + p_i c_i$.
Expanding these equations for an N-bit adder gives:
$$c_1 = g_0 + p_0 c_0$$
$$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$$
$$\vdots$$
$$c_n = g_{n-1} + p_{n-1} g_{n-2} + \dots + p_{n-1} p_{n-2} \cdots p_1 g_0 + p_{n-1} p_{n-2} \cdots p_0 c_0$$
Since no carry depends on the previous carries, each carry can be implemented as a
two-level sum of products, which results in less delay and, consequently, a higher speed.
Unfortunately, because CMOS gate delay increases non-linearly as the fan-in grows, the
Carry Look Ahead implementation is used in a modular way, cascading several 4-bit CLAs.
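
The expansion can be checked for one 4-bit CLA module by writing out the four carries as sums of products and comparing them with the recurrence c(i+1) = g(i) + p(i)·c(i). The Python below is an illustrative sketch (function names are assumptions of this example), not the report's implementation.

```python
def cla4_carries(a: int, b: int, c0: int = 0) -> list[int]:
    """Return [c1, c2, c3, c4] of a 4-bit adder from the expanded equations."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a: int, b: int, c0: int = 0) -> list[int]:
    """Same carries, computed sequentially with c[i+1] = g[i] + p[i] * c[i]."""
    carries = []
    c = c0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | ((ai ^ bi) & c)
        carries.append(c)
    return carries


# Both formulations agree for every 4-bit operand pair and carry-in.
assert all(cla4_carries(a, b, c) == ripple_carries(a, b, c)
           for a in range(16) for b in range(16) for c in (0, 1))
```

Note how the widest product term of c4 already needs five inputs, which illustrates why the look-ahead is limited to small modules.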
1.2.6 Prefix Adders
In simple terms, a parallel prefix algorithm takes n inputs $x_{n-1}, x_{n-2}, \dots, x_0$ and
produces in parallel the n outputs $x_{n-1} \circ x_{n-2} \circ \dots \circ x_0$, $x_{n-2} \circ \dots \circ x_0$, ..., $x_0$.
The analogy between carry computation and the prefix algorithm is that the carry
computation at a certain stage i depends on all inputs of the stages $i-1$ to 0.
Let $a_{n-1}, a_{n-2}, \dots, a_0$ and $b_{n-1}, b_{n-2}, \dots, b_0$ be n-bit binary numbers to be added.
Let $c_0$ designate the input carry and $c_n$ designate the output carry. For each bit,
"propagate" ($p_i$) and "generate" ($g_i$) signals are defined, as described in the previous
section.
Furthermore, for parallelizing the computation of a carry, two additional terms are
defined: Group Carry Generate ($G_{i:j}$) and Group Carry Propagate ($P_{i:j}$).
For each group of bits, the Group Carry Generate signal $G_{i:j}$ means that a carry is
generated somewhere between stages i and j and is propagated from that location to
stage i. This implies $c_{i+1} = 1$ and, in particular, if $j = 0$, then $c_{i+1} = G_{i:0}$.
For each group of bits, the Group Carry Propagate signal $P_{i:j}$ means that a carry is
propagated from stage j to stage i, i.e. $c_{i+1} = c_j$.
The formal definition of $G_{i:j}$ and $P_{i:j}$ is expressed using the following relationship:
$$(G_{i:j}, P_{i:j}) = (g_i, p_i) \quad \text{if } i = j$$
$$(G_{i:j}, P_{i:j}) = (G_{i:k}, P_{i:k}) \circ (G_{k:j}, P_{k:j}) \quad \text{if } i > j$$
where $j \le k \le i$ and the "$\circ$" operator is the one introduced by Brent and Kung [1].
Finally, once the final carries $G_{i:0}$ for all $i < n$ have been computed, the sum bits are
calculated as:
$$s_i = \begin{cases} p_i \oplus G_{i-1:0}, & n > i > 0 \\ p_i, & i = 0 \end{cases}$$
The traditional carry ripple adder (CRA) can be regarded as a serial prefix adder using
the above definitions.
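
The definitions can be made concrete with a serial prefix computation. The sketch below assumes the usual form of the Brent-Kung operator, (g, p) o (g', p') = (g + p·g', p·p'), a zero input carry, and non-overlapping groups (G(i:0) is built from bit i and G(i-1:0)); the function names are hypothetical and the code is an illustration, not the report's implementation.

```python
def op(left: tuple[int, int], right: tuple[int, int]) -> tuple[int, int]:
    """Brent-Kung operator on (G, P) pairs; `left` covers the higher bits."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a: int, b: int, n: int) -> tuple[int, int]:
    """Serial-prefix addition with c_0 = 0; returns (sum, carry_out)."""
    # Per-bit (g_i, p_i) pairs.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    group = []                                  # group[i] = (G_{i:0}, P_{i:0})
    for i in range(n):
        group.append(gp[i] if i == 0 else op(gp[i], group[i - 1]))
    total = gp[0][1]                            # s_0 = p_0
    for i in range(1, n):
        total |= (gp[i][1] ^ group[i - 1][0]) << i   # s_i = p_i xor G_{i-1:0}
    return total, group[n - 1][0]
```

Because the operator is associative, the same group signals can be produced by a tree instead of this serial chain, which is what the Sklansky and Kogge-Stone structures exploit.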

1.2.7 Sklansky Prefix Adder
The Sklansky prefix tree is shown in Figure 5 for a 16-bit adder. Its structure is the
simplest among the prefix adders. It is used for conditional-sum addition [2]. The fan-out
of such an adder grows exponentially from input to output along the critical path and
reaches $n/2$. This leads to a large delay as the operand width increases. Recursive
division of the blocks can construct a full adder using such a tree. The number of "o"
cells required is $\frac{n}{2}\log_2 n$ and the delay is $\log_2 n$ levels, where n is the
adder's width. The detailed implementation of the "o" cell is described in 2.1.2.2.

[Diagram: 16-bit Sklansky prefix tree over bit positions 0 to 15]

Figure 5: Sklansky Prefix Tree

1.2.8 Kogge-Stone Prefix Adder
The Kogge-Stone structure improves on the Sklansky structure in that its fan-out is
reduced to 2, at the expense of a larger number of "o" (circle) cells. It is obtained by
copying the tree structure of the most significant bit position [3]. Figure 6 shows this
prefix tree for 16-bit operands.
Just as in 1.2.7, recursive division of the blocks can construct a full adder using such a
tree. The number of "o" cells required for the implementation is $n\log_2 n - n + 1$ and the
delay is $\log_2 n$ levels, where n is the adder's width. The Kogge-Stone adder is therefore
expected to consume more resources than the Sklansky adder. For the 128-bit adders of this
project, the delay is 7 levels.

[Diagram: 16-bit Kogge-Stone prefix tree over bit positions 0 to 15]

Figure 6: Kogge-Stone Prefix Tree
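
As a worked example, substituting the operand width of this project, $n = 128$, into the cell-count expressions of 1.2.7 and 1.2.8 (namely $\frac{n}{2}\log_2 n$ for Sklansky and $n\log_2 n - n + 1$ for Kogge-Stone) gives:

$$\log_2 128 = 7$$
$$\frac{n}{2}\log_2 n = 64 \cdot 7 = 448$$
$$n\log_2 n - n + 1 = 896 - 128 + 1 = 769$$

Both trees therefore have 7 levels, while the Kogge-Stone tree needs roughly 1.7 times as many "o" cells as the Sklansky tree.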

2 Design Flow & Implementation
Figure 7 illustrates the design flow for the implementation of the prefix adders.

[Diagram: Design Specification, Macro Architecture, VHDL RTL Coding at the structural level (Emacs VHDL mode), Simulation (ModelSim 6.0 SE) and Synthesis, Place and Route (Xilinx ISE 6.3 SP3), leading to Compare Results; in parallel, Verification Plan, Test Case Specification and Test Bench with PRBS Generator feed the simulation, followed by Analyze Results; the steps produce results and reports]

Figure 7: Design Flow

2.1 Micro Architecture
2.1.1 Top Entity
Figure 8 illustrates the top-level view. The top entities are named
full_adder_sklansky and full_adder_kogge_stone, respectively, and have the ports listed
in Table 1.

[Diagram: full_adder_sklansky or full_adder_kogge_stone block with inputs operand1 (128 bits), operand2 (128 bits), sys_clk and reset_n and outputs result (128 bits) and carry_out]

Figure 8: Top Level View

| Signal Name | Width [bits] | Direction | Comments |
|---|---|---|---|
| operand1 | 128 | input | Number #1 to be added |
| operand2 | 128 | input | Number #2 to be added |
| sys_clk | 1 | input | System clock |
| reset_n | 1 | input | System reset (active low) |
| result | 128 | output | Result of an addition |
| carry_out | 1 | output | Output carry resulting from an addition |
Table 1: Signal Description

2.1.2 Sub-Block Partitioning
The top-level block is further partitioned into three sub-blocks, as shown in Figure 9.
Many block partitionings are possible. This one is chosen because the two adder designs
then differ in only one sub-block, the Carry Generation Block (cg_gen). Consequently, two
versions of this sub-block are designed: cg_gen_sklansky and cg_gen_kogge_stone.

[Diagram: pg_gen ("Carry Propagate" & "Carry Generate" Block) takes operand1[0..127] and operand2[0..127] and produces p[0..127] and g[0..127](0); cg_gen_sklansky or cg_gen_kogge_stone (2-D Carry Generation Block) produces g[0..127](M-1); sb_gen (Sum Bits Generation Block) produces s[0..127] and carry_out]

Figure 9: full_adder sub-block partitioning

The subsequent sections elaborate on each one of the sub-blocks.

2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)
This sub-block calculates the "carry propagate" signals p[i](0) and the "carry generate"
signals g[i](0) bitwise from operand1 and operand2, as defined in 1.2.5, namely:
$$p[i](0) = \text{operand1}[i] \oplus \text{operand2}[i]$$
$$g[i](0) = \text{operand1}[i] \cdot \text{operand2}[i]$$
The implementation is shown in Figure 10.
This block consumes 128 2-input AND gates and 128 2-input XOR gates.

[Diagram: for each bit position i from 0 to 127, operand1[i] and operand2[i] feed one AND gate producing g[i] and one XOR gate producing p[i]]

Figure 10: "Carry Generate" and "Carry Propagate" Block Implementation

2.1.2.2 Carry Generation Block
The signals p[i](0) and g[i](0) generated in the Precondition Block (pg_gen) are used
within the Carry Generation Block to calculate the g[i](M-1) signals, which can be
represented as a two-dimensional carry generate structure. The following sections describe
the implementation of the Carry Generation Block for each of the chosen designs.
2.1.2.2.1 Carry Generation Block Sklansky Prefix Adder (cg_gen_sklansky)
Following the Sklansky prefix tree (presented in 1.2.7) and assuming a two-dimensional
structure of j rows by i columns, the following is observed:
- In column i, cells occupy the nodes whose row coordinates j correspond to a "1" in
  the binary representation of i, i.e. they follow directly from the binary encoding of
  the index i. A coordinate corresponding to a "0" in the binary representation of i
  simply propagates p[i](j) and g[i](j).
- All "o" (circle) cells are of GP type, except those situated at the bottom border,
  where $\log_2 i < j$, which are of G type.

The output of a GP cell is defined as follows:
$$p[i](j) = p[i](j-1) \cdot p[i - (i \bmod 2^{j-1}) - 1](j-1)$$
$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$$

The output of a G cell is defined as follows:
$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$$

Following the prefix algorithm description, with n = 128 the implementation consumes
448 "o" cells, namely 448 2-input OR gates and the same number of 2-input AND gates.
The delay is 7 levels and the fan-out is 64.
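
The placement rule and the cell equations can be checked with a small behavioural model. The Python below is only a sketch and not the project's structural VHDL: it assumes a cell at (i, j) whenever bit j-1 of i is set, with partner column i - (i mod 2^(j-1)) - 1, treats every cell as a GP cell for simplicity, and takes the input carry as 0. The names sklansky_add, cells and reads are hypothetical.

```python
from collections import Counter


def sklansky_add(a: int, b: int, n: int = 128) -> tuple[int, int, int, int]:
    """Return (sum, carry_out, cell_count, max_fanout) of an n-bit Sklansky model."""
    levels = n.bit_length() - 1                         # log2(n), n a power of two
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # g[i](0)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # p[i](0)
    p0 = p[:]                                           # kept for the sum bits
    cells = 0
    reads = Counter()                                   # cells fed by node (j, k)
    for j in range(1, levels + 1):
        new_g, new_p = g[:], p[:]                       # default: pass through
        for i in range(n):
            if (i >> (j - 1)) & 1:                      # a cell sits at (i, j)
                k = i - (i % (1 << (j - 1))) - 1
                new_g[i] = g[i] | (p[i] & g[k])
                new_p[i] = p[i] & p[k]
                cells += 1
                reads[(j, k)] += 1
        g, p = new_g, new_p
    total = p0[0]                                       # s[0] = p[0], carry-in 0
    for i in range(1, n):
        total |= (p0[i] ^ g[i - 1]) << i                # s[i] = p[i] xor carry
    return total, g[n - 1], cells, max(reads.values())
```

For n = 128 this model places 448 cells and finds a worst-case fan-out of 64 (node 63 feeding columns 64 to 127 in the last level), in line with the figures quoted above.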
2.1.2.2.2 Carry Generation Block Kogge-Stone Prefix Adder (cg_gen_kogge_stone)
Following the Kogge-Stone prefix tree (presented in 1.2.8) and assuming a two-dimensional
structure of j rows by i columns, the nodes in the upper-left are populated with "o"
(circle) cells, while the rest of the two-dimensional array is empty, i.e. the "o"
(circle) cells are placed in the nodes whose coordinates satisfy the following relationship:
$$1 \le j \le M - 1 \quad \text{and} \quad 2^{j-1} \le i \le N - 1$$

The outputs of the placed cells are:
$$p[i](j) = p[i](j-1) \cdot p[i - 2^{j-1}](j-1)$$
$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - 2^{j-1}](j-1)$$

Following the prefix algorithm description, with n = 128 the implementation consumes
769 "o" cells, hence occupying 769 2-input OR gates and the same number of 2-input
AND gates.
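
The Kogge-Stone placement can be modelled in the same behavioural style. The sketch below assumes that level j combines column i with column i - 2^(j-1) for every i >= 2^(j-1), and a zero input carry; it is an illustration with hypothetical names, not the project's VHDL.

```python
def kogge_stone_add(a: int, b: int, n: int = 128) -> tuple[int, int, int]:
    """Return (sum, carry_out, cell_count) of an n-bit Kogge-Stone model."""
    levels = n.bit_length() - 1                         # log2(n), n a power of two
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # g[i](0)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # p[i](0)
    p0 = p[:]                                           # kept for the sum bits
    cells = 0
    for j in range(1, levels + 1):
        d = 1 << (j - 1)                                # distance to the partner
        new_g, new_p = g[:], p[:]                       # columns below d pass through
        for i in range(d, n):
            new_g[i] = g[i] | (p[i] & g[i - d])
            new_p[i] = p[i] & p[i - d]
            cells += 1
        g, p = new_g, new_p
    total = p0[0]                                       # s[0] = p[0], carry-in 0
    for i in range(1, n):
        total |= (p0[i] ^ g[i - 1]) << i
    return total, g[n - 1], cells
```

Each node feeds at most its own column and one column 2^(j-1) positions higher, which is where the fan-out of 2 comes from; the price is 769 cells for n = 128.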

2.1.2.3 Sum Bits Generation Block (sb_gen)
The sum bits are produced in the Sum Bits Generation Block by XORing the "carry
propagate" signals p[i](0), generated in the Precondition Block (pg_gen), with the "carry
generate" bits g[i](M-1). Figure 11 illustrates the implementation, which consumes 128
2-input XOR gates.

[Diagram: XOR gates producing s[1] from p[1] and g[0](M-1), and so on up to s[127] from p[127] and g[126](M-1); s[0] is produced from p[0] and carry_in]

Figure 11: Sum Bits Generation Block Implementation
2.2 RTL Coding
RTL coding is done in VHDL at the structural level. The basic cells are the 2-input AND
gate, 2-input OR gate, 2-input XOR gate and D-type positive-edge-triggered flip-flop.
The text editor used is Emacs version 20.7 with VHDL mode, since it provides many
templates and aligns VHDL code in a way that is easy to read. Each file has a header at
the top stating the entity name and its logical function.
2.3 Verification Plan
In general, a design (especially a large and complex one) is verified by describing the
same functionality in a high-level language, such as C/C++, or with verification tools,
such as Verisity Specman, so that the design can be checked in many scenarios with many
possible input combinations.
For the verification of the two full adders, the following is proposed (Figure 12).
A test bench, written in behavioral Verilog, instantiates both designs. Two 128-bit
numbers are generated using dedicated LFSRs (Linear Feedback Shift Registers) [4], which
generate pseudo-random bit streams.
Each clock cycle, the values of the two 128-bit numbers change in a pseudo-random way.
These values are summed using a '+' operation within the test bench and they are also
applied as inputs to both adders. The resulting output sum and carry of each adder are
compared with the result generated by the '+' addition within the test bench.
A test case passes when the result of the test bench matches the result of each adder.

[Diagram: within test_bench, 128-bit PRBS Generator 1 and 128-bit PRBS Generator 2 drive operand1[127:0] and operand2[127:0] of both full_adder_sklansky and full_adder_kogge_stone; the result[127:0] and carry_out of each adder are compared with operand1+operand2, producing match_sklansky and match_kogge_stone, which are written to the test_bench results file]

Figure 12: Test Bench & Verification Plan
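
The verification scheme can be sketched in software as a self-checking loop. The Python below is an illustration only: the real test bench is written in Verilog, and the tap positions (128, 126, 101, 99), the seeds, the number of cycles and all function names are assumptions of this sketch, since the report does not state its LFSR polynomial.

```python
MASK = (1 << 128) - 1
TAPS = (128, 126, 101, 99)      # assumed tap positions, not taken from the report


def lfsr_step(state: int) -> int:
    """Advance a 128-bit Fibonacci LFSR by one bit."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & MASK


def lfsr_word(state: int) -> int:
    """Advance by 128 bits so that every cycle sees a fresh 128-bit operand."""
    for _ in range(128):
        state = lfsr_step(state)
    return state


def run_test(adder, cycles: int = 100, seed1: int = 1, seed2: int = 0xACE1) -> bool:
    """Compare adder(op1, op2) with the '+' reference on pseudo-random operands."""
    op1, op2 = seed1, seed2
    for _ in range(cycles):
        op1, op2 = lfsr_word(op1), lfsr_word(op2)
        expected = op1 + op2
        if adder(op1, op2) != (expected & MASK, expected >> 128):
            return False            # mismatch: the test case fails
    return True


def reference_adder(a: int, b: int) -> tuple[int, int]:
    """Stand-in for a device under test; returns (result, carry_out)."""
    total = a + b
    return total & MASK, total >> 128
```

Here run_test(reference_adder) plays the role of one match signal: it stays true only as long as every cycle's result and carry agree with the '+' reference.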

2.4 Synthesis, Place and Route
Synthesis, placement and routing of the design are done using Xilinx ISE 6.3i software
with the latest service pack, SP3. The constraints are set for the best timing by
selecting the optimization goal "speed" with maximum effort. More details on the results,
as well as the problems encountered, are given in section 3.2.

3 Results
3.1 Simulation Results
3.1.1 Initial Test Cases
The initial test cases are defined as sums of the following pairs of 128-bit numbers.
The very first case verifies the sum of the following numbers:
- 64 zeros followed by 64 ones.
- 64 ones followed by 64 zeros.
The next case is:
- 32 repetitions of 0xA.
- 32 repetitions of 0x5.

In this way, possible bit swapping or incorrect index generation is tested.
Figure 13 illustrates the simulation results for the initial test case.
operand1 and operand2 are, effectively, the two 128-bit numbers to be added.
result and carry_out are outputs of each one of the adders marked by the appropriate
divider (Sklansky Adder and Kogge-Stone Adder, respectively).

Figure 13: Initial Test Case Simulation Results
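
The expected outputs of the two initial test cases can be worked out with a short reference calculation. The Python below is a sketch with hypothetical names that only models the arithmetic; it does not reproduce the simulation itself.

```python
MASK = (1 << 128) - 1


def expected(a: int, b: int) -> tuple[int, int]:
    """Reference (result, carry_out) for two 128-bit operands."""
    total = a + b
    return total & MASK, total >> 128


low_ones = (1 << 64) - 1            # 64 zeros followed by 64 ones
high_ones = low_ones << 64          # 64 ones followed by 64 zeros
all_a = int("A" * 32, 16)           # 32 repetitions of 0xA
all_5 = int("5" * 32, 16)           # 32 repetitions of 0x5

case1 = expected(low_ones, high_ones)
case2 = expected(all_a, all_5)
```

In both cases the operands have no set bits in common, so the reference result is all ones with carry_out = 0; a swapped bit or a wrong index in an adder would show up as a zero in the result or as a spurious carry.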
3.1.2 General Test Case
In the general test case, the data is generated in a pseudo-random way, as described in
section 2.3. Three snapshots of the simulation results are given in the following figures.
Figure 14 illustrates the entire simulation. The lowest divider separates the test bench
signals. operand1_prbs and operand2_prbs are the 128-bit PRBS data, which is applied
to the adders. operand1 and operand2 are the input numbers; result and carry_out are
the outputs of the adder circuits, marked by the corresponding divider (Sklansky Adder
and Kogge-Stone Adder, respectively). Two more very important test bench signals are
result_match_sklansky and result_match_kogge_stone, which are updated each clock
cycle depending on whether there is a match between the test bench result and the
respective result of the Sklansky adder and the Kogge-Stone adder.
Figure 15 and Figure 16 give two "zoom-in" examples of the same simulation.

Figure 14: General Test Case - Full Zoom

Figure 15: General Test Case - Example 1

Figure 16: General Test Case - Example 2

3.2 Synthesis Results
Both designs were successfully synthesized for the Virtex-2 device XC2V500. The synthesis
results are summarized in Table 2. As expected, the Kogge-Stone adder consumes more
resources than the Sklansky adder.

Results Explanation (Table 2): The inputs and outputs of the design were registered in
order to achieve a more realistic delay estimation. Furthermore, in the placement and
routing stage, an option that forces the flip-flops to be packed within the I/O buffers
is selected, so that the logic delay represents a true estimation of each adder's
processing delay in this FPGA implementation.

However, the design needs 387 I/O pins (two 128-bit operands, the 128-bit result,
carry_out, sys_clk and reset_n), while the maximum number of user I/O pins available for
this device is 264 (package FG456). Placement and routing of the design, and hence a true
estimation of its logic delay, is therefore not possible. Consequently, there are two
alternatives. One is multiplexing the I/Os in order to fit the design into the XC2V500
device. The other is to select a larger device, the XC2V1000.
Both alternatives are described in the following subsections.

Table 2: Synthesis Results (No placement and routing): XC2V500-FG456-4 device
| Design | LUTs Usage | 1-bit Registers Usage | Total Slices Usage | Maximum Frequency |
|---|---|---|---|---|
| Sklansky Adder | 829 (13%) | 385 (6%) | 453 (14%) | 85.6 MHz |
| Kogge-Stone Adder | 1449 (23%) | 385 (6%) | 751 (24%) | 100.5 MHz |
3.2.1 Multiplexing I/O
This alternative requires a complete redesign of the interface and a change in the overall
architecture of the design. Loading the numbers or outputting the result in a multiplexed
way each has advantages and disadvantages, which are summarized below. In addition,
handshaking signals, which designate the start of loading and the completion of the
addition, are required.
3.2.1.1 Multiplexed Inputs
In this case, the design latency (overall processing time) increases, since the input
numbers cannot be acquired all at once. However, two major advantages can be achieved.
First, the logic required for the addition can be reduced, since the adder cannot process
more bits than are present on the interface in a given cycle. Consequently, the addition
can be performed in a multiplexed fashion, especially if the least significant part of
the input numbers is loaded first. Second, the clock speed of the design is expected to
increase, since the complexity and the number of combinational logic levels decrease.
3.2.1.2 Multiplexed Outputs
In this case, the design latency (overall processing time) also increases, since the
output is not generated all at once. However, two major advantages can be achieved here
as well. First, the logic required for the addition can be reduced, since the adder does
not need to generate more bits per cycle than the width of the output (result) interface.
Consequently, the addition can be performed in a multiplexed fashion, processing the
least significant part of the input numbers first, i.e. the least significant part of the
output is generated earlier than the most significant one. Second, the clock speed of the
design is expected to increase, since the complexity and the number of combinational
logic levels decrease.
3.2.1.3 Multiplexed Inputs and Outputs
In general, this case combines the alternatives discussed in 3.2.1.1 and 3.2.1.2. The
design latency (overall processing time) increases here as well. Assuming that the inputs
are loaded least significant part first, the least significant part of the output can be
generated immediately. The same two major advantages can therefore be achieved in this
case. First, the logic required for the addition can be reduced. Second, the clock speed
of the design is expected to increase, since the complexity and the number of
combinational logic levels decrease. The best case is when the input and output widths
are the same. If they differ, this adds another level of complexity to the design, which
is left outside the scope of this project.
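
The multiplexed addition discussed in this section can be sketched as a word-serial loop that processes the least significant word first and carries over between cycles. The Python below is an illustration only; the 32-bit word width and the function name are assumptions of this sketch, and the handshaking signals are not modelled.

```python
def word_serial_add(a: int, b: int, width: int = 128,
                    word: int = 32) -> tuple[int, int, int]:
    """Add two `width`-bit numbers `word` bits per cycle, least significant first.

    Returns (result, carry_out, cycles).
    """
    mask = (1 << word) - 1
    carry = 0
    total = 0
    cycles = width // word
    for k in range(cycles):
        shift = k * word
        part = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        total |= (part & mask) << shift     # this word of the result is final
        carry = part >> word                # carried into the next cycle
    return total, carry, cycles
```

With 32-bit words, only a 32-bit adder is needed, but the 128-bit sum takes 4 cycles, which illustrates the latency versus logic trade-off described above.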
3.2.2 Changing Target Device
This alternative is the quickest solution because it requires no modifications to the
RTL design. The new target device is the XC2V1000 with package FF896, allowing up to
432 user I/O pins. The main disadvantage of this alternative is that the larger device
is a more costly solution. Table 3 and Table 4 present the synthesis results and the
placement and routing results, respectively, both with maximum effort on timing. The
results differ because the synthesis tool estimates the delays without knowing the
actual placement and routing.

Table 3: Synthesis Results: XC2V1000-FF896-4 device
| Design | LUTs Usage | 1-bit Registers Usage | Total Slices Usage | Maximum Frequency |
|---|---|---|---|---|
| Sklansky Adder | 829 (8%) | 385 (3%) | 453 (6%) | 85.6 MHz |
| Kogge-Stone Adder | 1449 (14%) | 385 (3%) | 751 (14%) | 100.5 MHz |

Table 4: Placement and Routing Results: FF896-4 device
| Design | Total Slices Usage | Maximum Delay / Frequency |
|---|---|---|
| Sklansky Adder | 585 (11%) | 15.428 ns / 64.8 MHz |
| Kogge-Stone Adder | 1042 (20%) | 14.149 ns / 70.1 MHz |
4 Design Enhancement - Pipelining
Pipelining is introduced in order to improve the design speed. There are two ways of
applying it. The first, manual pipelining, is to locate the point on the critical path
whose arrival time is exactly half the total delay of the critical path (or one third, if
two pipeline stages are inserted, and so on) and to insert a pipeline register there. The
second, automatic pipelining, is described below.
The location of the pipeline registers is chosen automatically by the Xilinx synthesis
tool. N pipeline stages are added to the inputs, the outputs, or both the inputs and the
outputs of the design, and the software optimizes the location of the pipeline registers
according to the specified timing requirements and synthesis effort by moving them
forward and backward. This is referred to as "forward/backward register balancing" in
Xilinx ISE [6] and as "retiming" in Synplicity Synplify Pro 7.xx [7], and it is
illustrated in Figure 17 and Figure 18. The software automatically determines Td1 and
Td2 according to the given timing constraints and synthesis effort.
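The effect of splitting Td into Td1 and Td2 can be illustrated with a small Python sketch. It is not the algorithm used by the Xilinx or Synplicity tools; the function name best_cut and the per-element delays are hypothetical. The sketch simply tries every position for one pipeline register along a path and keeps the one that minimizes max(Td1, Td2), which is the quantity that bounds the clock period.

```python
from itertools import accumulate


def best_cut(delays):
    """Pick the position of one pipeline register along a combinational path.

    delays: delays (ns) of the logic elements on the path, in order.
    Returns (index, td1, td2): the register goes after element index-1,
    so td1 = sum(delays[:index]) and td2 = sum(delays[index:]).
    """
    assert len(delays) >= 2
    total = sum(delays)
    arrival = list(accumulate(delays))       # arrival time after each element
    best = None
    for index, td1 in enumerate(arrival[:-1], start=1):
        td2 = total - td1                    # Td = Td1 + Td2
        period = max(td1, td2)               # slower half limits the clock
        if best is None or period < best[0]:
            best = (period, index, td1, td2)
    _, index, td1, td2 = best
    return index, td1, td2


if __name__ == "__main__":
    # Hypothetical element delays in ns, for illustration only.
    print(best_cut([2.0, 3.0, 1.5, 2.5, 3.0]))
```

For the assumed delays the balanced cut falls after the third element, giving Td1 = 6.5 ns and Td2 = 5.5 ns, close to the "half of the total delay" point used in manual pipelining.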

[Figure 17 diagram: pipeline stages clocked by sys_clk; the path delay Td between two
pipeline stages is divided by a relocated pipeline stage into Td1 and Td2, where
Td = Td1 + Td2.]

Figure 17: Forward Register Balancing (Pipelining)
[Figure 18 diagram: pipeline stages clocked by sys_clk; the path delay Td between two
pipeline stages is divided by a relocated pipeline stage into Td1 and Td2, where
Td = Td1 + Td2.]

Figure 18: Backward Register Balancing (Pipelining)
Table 5 gives the results of automatic pipelining of the Sklansky adder.
Table 6 gives the results of automatic pipelining of the Kogge-Stone adder.

From the results, it is observed that:
- Adding one output pipeline stage improves the timing, while adding a second one does
  not. The main reason is that the delay distribution consists of approximately 25%-30%
  logic delay and approximately 70% routing delay. Although adding two pipeline stages
  improves the flip-flop to flip-flop delay, the routing delay makes the total delay
  worse than with only one pipeline stage.
- Another important factor that may prevent good performance is the high usage of I/O
  pins, which imposes another level of complexity on the place and route tool.
- The faster a path is, the larger the share of its delay that is contributed by the
  actual logic delay.
- Multiple iterations of synthesis, place and route produce slightly different results.

Number of Pipeline Stages       Total Slices Usage   Maximum Delay / Frequency   Delay Distribution (Logic % / Routing %)
1 input stage                   551 (11%)            10.895 ns / 91.7 MHz        33 / 67
2 input stages                  746 (14%)            9.9 ns / 101 MHz            36 / 64
1 output stage                  603 (11%)            12.174 ns / 82.1 MHz        32 / 68
2 output stages                 630 (12%)            12.644 ns / 79.1 MHz        27 / 73
1 stage at input and output     571 (11%)            8.905 ns / 112.2 MHz        43 / 57
2 stages at input and output    777 (15%)            8.698 ns / 114.9 MHz        45 / 55
Table 5: Placement and Routing Results of Pipelined Sklansky Adder

Number of Pipeline Stages       Total Slices Usage   Maximum Delay / Frequency   Delay Distribution (Logic % / Routing %)
1 input stage                   838 (16%)            11.112 ns / 89.9 MHz        32 / 68
2 input stages                  948 (18%)            10.597 ns / 94.36 MHz       28 / 72
1 output stage                  852 (16%)            8.802 ns / 113.6 MHz        30 / 70
2 output stages                 933 (18%)            9.286 ns / 107.68 MHz       41 / 69
1 stage at input and output     888 (17%)            7.724 ns / 129.4 MHz        43 / 57
2 stages at input and output    1075 (%)             7.612 ns / 131.3 MHz        47 / 53
Table 6: Placement and Routing Results of Pipelined Kogge-Stone Adder
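
The observation about output pipeline stages can be checked by converting the percentages of Table 5 into absolute delays. The sketch below is illustrative only: split_delay is a hypothetical helper, and the input figures are copied from the two output-stage rows of Table 5 (Sklansky adder).

```python
def split_delay(total_ns, logic_pct, routing_pct):
    """Split a total path delay into its logic and routing parts (ns)."""
    return total_ns * logic_pct / 100.0, total_ns * routing_pct / 100.0


# (maximum delay in ns, logic %, routing %) taken from Table 5
table5_output_rows = {
    "1 output stage": (12.174, 32, 68),
    "2 output stages": (12.644, 27, 73),
}

if __name__ == "__main__":
    for name, (total_ns, logic_pct, routing_pct) in table5_output_rows.items():
        logic_ns, routing_ns = split_delay(total_ns, logic_pct, routing_pct)
        print(f"{name}: logic {logic_ns:.2f} ns, routing {routing_ns:.2f} ns")
```

For these rows the logic part drops from roughly 3.9 ns to 3.4 ns when the second output stage is added, while the routing part grows from roughly 8.3 ns to 9.2 ns, so the total delay gets worse even though the logic delay improves.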

5 Summary and Conclusions
Two different parallel prefix 128-bit adders were designed, analyzed and tested.

At the beginning of the design process, it was noted that the required device (XC2V500)
could not accommodate the design because of the limited number of available user I/O
pins. Two alternatives were considered for the next step of the design: using multiplexed
I/O, and hence reducing the overall number of I/Os used, or changing the target device to
the XC2V1000. The second alternative was chosen because it did not require a redesign or
introduce other levels of complexity.

It was expected that, due to the nature of the Kogge-Stone prefix network, the resource
usage of the Kogge-Stone adder would be greater than that of the Sklansky adder, and this
was confirmed by the results.
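
The difference in resource usage follows from the number of prefix operators in each network. The Python sketch below is a behavioural model written for illustration, not the VHDL of this project; all function names are hypothetical. It builds the textbook Sklansky and Kogge-Stone prefix networks, counts their prefix operators for 128 bits, and uses the same networks to perform an addition.

```python
def combine(hi, lo):
    """Prefix operator: (G, P) o (G', P') = (G + P*G', P*P')."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sklansky_pairs(n):
    """Levels of (i, j) pairs: node i is combined with node j (n a power of 2)."""
    levels, k = [], 0
    while (1 << k) < n:
        # In every upper half-block, combine with the top bit of the lower half.
        levels.append([(i, ((i >> k) << k) - 1) for i in range(n) if (i >> k) & 1])
        k += 1
    return levels


def kogge_stone_pairs(n):
    levels, d = [], 1
    while d < n:
        # Every bit i >= d is combined with the bit d positions below it.
        levels.append([(i, i - d) for i in range(d, n)])
        d *= 2
    return levels


def prefix_add(a, b, n, levels):
    """Add two n-bit integers with the given prefix network; returns (sum, carry_out)."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    for level in levels:
        prev = list(gp)                    # nodes of a level read the previous level
        for i, j in level:
            gp[i] = combine(prev[i], prev[j])
    carries = [0] + [g for g, _ in gp]     # carries[i] is the carry into bit i
    total = 0
    for i in range(n):
        total |= ((((a >> i) ^ (b >> i)) & 1) ^ carries[i]) << i
    return total, carries[n]


def operator_count(levels):
    return sum(len(level) for level in levels)


if __name__ == "__main__":
    print("Sklansky operators:   ", operator_count(sklansky_pairs(128)))
    print("Kogge-Stone operators:", operator_count(kogge_stone_pairs(128)))
```

For 128 bits both networks have seven levels, but the Sklansky network needs 448 prefix operators and the Kogge-Stone network 769, which points in the same direction as the LUT usage reported in Table 3.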

It was also observed that multiple iterations of the same design's synthesis sometimes
produce slightly different placement results in terms of logic resource usage and timing.
The reason is that the placement and routing algorithm used by the Xilinx tools is based
on randomized initial settings [6], [8], as opposed to Altera [7].

The designs were enhanced by inserting a number of pipeline stages, and the results were
analyzed. It turns out that pipelining does not necessarily improve the design speed. The
main reason is that in most cases the delay consists of approximately 20% to 40% logic
delay, with the remaining 80% to 60% being routing delay. It is therefore concluded that
adding more pipeline stages does not necessarily improve the total delay.

6 References
[1] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Trans.
Comput., Vol. C-31, No. 3, pp. 260-264, March 1982.

[2] J. Sklansky, "Conditional-Sum Addition Logic", IRE Transactions on Electronic
Computers, Vol. EC-9, No. 2, pp. 226-231, June 1960.

[3] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a
General Class of Recurrence Equations", IEEE Transactions on Computers, Vol. C-22, No. 8,
pp. 786-793, Aug. 1973.

[4] Paul H. Bardell, William H. McAnney, and Jacob Savir, "Built-In Test for VLSI:
Pseudorandom Techniques", John Wiley & Sons, New York, 1987.

[5] V. G. Oklobdzija and E. R. Barnes, "Some Optimal Schemes for ALU Implementation in
VLSI Technology", Proceedings of the 7th Symposium on Computer Arithmetic ARITH-7,
pp. 2-8. Reprinted in Computer Arithmetic, E. E. Swartzlander (editor), Vol. II,
pp. 137-142, 1985.

[6] Xilinx Programmable Logic Devices, PLD & FPGA, www.xilinx.com

[7] Synplicity Synplify Pro 7.02 User's Guide, www.synplicity.com

[8] Xilinx ISE 6.2 / 6.3 User's Manual, www.xilinx.com
