Goals of This Chapter: Designing For Performance, Area, or Power

B. Supmonchai, 12/4/2013, Arithmetic Building Blocks
Goals of This Chapter
Designing for performance, area, or power
Adders
Multipliers
Shifters
Logic and system optimizations for datapath modules
Power-delay trade-offs in datapaths

Review: A Generic Processor
[Figure: block diagram of a generic processor with four blocks]
Datapath: Adder, Multiplier, Shifter, Comparator, etc.
Memory: RAM, ROM, Shift Register
Control: FSM, PLA, Counter, Random Logic
Input/Output: Switches, Arbiters, Bus Drivers

Bit-Sliced Architecture
[Figure: datapath unit made of Register, Adder, Shifter, and Multiplexer stages under common Control, with n-bit Data In and n-bit Data Out; the unit is sliced into Bit 0, Bit 1, ..., Bit n-2, Bit n-1]
Identical processing elements
Modular
Easy to design and verify
Easy to expand
Potential to be fast

Example: Itanium Integer Datapath
Itanium has 6 integer execution units (ALUs)

One-Bit Binary Full Adder (FA)
[Figure: 1-bit full adder (FA) block with inputs A, B, C_in and outputs S, C_out]

A  B  C_in | C_out  S | Carry status
0  0  0    | 0      0 | kill
0  0  1    | 0      1 | kill
0  1  0    | 0      1 | propagate
0  1  1    | 1      0 | propagate
1  0  0    | 0      1 | propagate
1  0  1    | 1      0 | propagate
1  1  0    | 1      0 | generate
1  1  1    | 1      1 | generate

$S = A \oplus B \oplus C_{in}$
$C_{out} = AB + AC_{in} + BC_{in}$
A VERY common operation, so worth spending some time trying to optimize
Often in the critical path, so need to look at both logic level and circuit level optimizations

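The full adder truth table can be checked with a few lines of Python. This is only an illustrative sketch; the function names `full_adder` and `carry_status` are made up for this example and do not come from the slides.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Return (s, c_out) for one-bit inputs, using the sum-of-products forms."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def carry_status(a, b):
    """Classify the bit position by what it does to an incoming carry."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    # The two output bits must encode the arithmetic sum a + b + c_in.
    assert 2 * c_out + s == a + b + c_in
    print(a, b, c_in, c_out, s, carry_status(a, b))
```

The printed rows follow the column order A, B, C_in, C_out, S, carry status.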
Propagate, Generate, and Delete (Kill)
Define three new variables which depend ONLY on A and B:
Generate (G) = $AB$ (FA itself generates a carry)
Propagate (P) = $A \oplus B$ (FA passes along carry)
Delete (D) = $\overline{A}\,\overline{B}$ (FA stops propagation of carry)
Then we can write S and $C_{out}$ in terms of G, P, and $C_{in}$:
$S(G, P, C_{in}) = P \oplus C_{in}$
$C_{out}(G, P, C_{in}) = G + P C_{in}$
We can also write S and $C_{out}$ in terms of D, P, and $C_{in}$
Sometimes an alternative definition for P can be used: Propagate (P) = $A + B$. This form still gives the correct $C_{out}$, but the sum requires the XOR form.

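A brute-force check of the generate, propagate, and delete forms is sketched below. The D-based carry expression used here, $\overline{C_{out}} = D + P\,\overline{C_{in}}$, is one possible form derived for this example and is not taken from the slide; the helper names are hypothetical.

```python
from itertools import product

def gpd(a, b):
    """Return (G, P, D) with P defined as XOR."""
    g = a & b
    p = a ^ b
    d = (1 - a) & (1 - b)
    return g, p, d

def check_gpd_forms():
    for a, b, c_in in product((0, 1), repeat=3):
        g, p, d = gpd(a, b)
        s_ref = (a + b + c_in) & 1
        c_ref = (a + b + c_in) >> 1
        assert g + p + d == 1                      # exactly one of G, P, D is set
        assert p ^ c_in == s_ref                   # S = P xor C_in
        assert g | (p & c_in) == c_ref             # C_out = G + P C_in
        assert d | (p & (1 - c_in)) == 1 - c_ref   # assumed D form: !C_out = D + P !C_in
        assert g | ((a | b) & c_in) == c_ref       # P = A + B also works for the carry
    return True

# With P = A + B the sum expression breaks whenever A = B = 1.
or_form_sum_mismatches = [
    (a, b, c) for a, b, c in product((0, 1), repeat=3)
    if ((a | b) ^ c) != (a ^ b ^ c)
]
```

The mismatch list shows why the OR form of P is only used in the carry path.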
FA CMOS Implementation: First Try
[Figure: static CMOS full adder schematic with inputs A, B, C_in and outputs S, C_out]
32 transistors
The majority function Maj(A,B,C) outputs 0 or 1, whichever value appears more often at the inputs

Improved CMOS Implementation
A more compact design is based on the observation that S can be factored to reuse the $C_{out}$ term
$S = A B C_{in} + (A + B + C_{in})\,\overline{C_{out}}$
[Figure: gate-level diagram in which a minority function of A, B, C_in feeds both the carry output and the sum logic]

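The factored sum can be verified exhaustively. The sketch below assumes the factorization is $S = ABC_{in} + (A + B + C_{in})\,\overline{C_{out}}$, i.e. that the carry term is complemented; the function name is made up for illustration.

```python
from itertools import product

def sum_via_carry(a, b, c_in):
    """Compute S by reusing the complemented carry, as in the factored form."""
    c_out = (a & b) | (a & c_in) | (b & c_in)
    not_c_out = 1 - c_out
    s = (a & b & c_in) | ((a | b | c_in) & not_c_out)
    return s, c_out

# Compare against the XOR definition of the sum for all eight input combinations.
mismatches = [
    (a, b, c) for a, b, c in product((0, 1), repeat=3)
    if sum_via_carry(a, b, c)[0] != a ^ b ^ c
]
```

An empty `mismatches` list means the factored form and the XOR form agree everywhere.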
Improved CMOS Implementation II
[Figure: static CMOS full adder schematic with inputs A, B, $C_i$, internal node X, and outputs $C_o$ and S]
28 transistors

Notes on Improved CMOS FA
Note that the PMOS network is identical to the NMOS network rather than being the complement.
This is possible because of the inversion property, which says that the function of complemented inputs is equal to the complement of the function.
This simplification reduces the number of series transistors and makes the layout more uniform
This design has a greater delay to compute S than $C_{out}$
Most of the time the extra delay computing S has little effect on the critical path because carry is the signal that propagates
With proper sizing this delay on S can be minimized

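Inversion Property
[Figure: a full adder with inputs A, B, $C_i$ and outputs S, $C_o$ next to the same full adder with all inputs and outputs inverted]
$\overline{S}(A, B, C_i) = S(\overline{A}, \overline{B}, \overline{C_i})$
$\overline{C_o}(A, B, C_i) = C_o(\overline{A}, \overline{B}, \overline{C_i})$
The function must be symmetric
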
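The inversion property is easy to confirm by enumeration. The following is a minimal sketch with hypothetical helper names; it checks that inverting all three inputs of a full adder inverts both outputs.

```python
from itertools import product

def fa(a, b, c):
    """One-bit full adder: returns (s, c_out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def inv(x):
    """Complement of a single bit."""
    return 1 - x

def holds_inversion_property():
    for a, b, c in product((0, 1), repeat=3):
        s, c_out = fa(a, b, c)
        s_inv, c_out_inv = fa(inv(a), inv(b), inv(c))
        if (inv(s), inv(c_out)) != (s_inv, c_out_inv):
            return False
    return True
```

This is the property that lets the pull-up and pull-down networks share the same topology.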
TG-Based FA
[Figure: transmission-gate full adder built from two XORs and a 2-to-1 MUX, with signals A, B, P, $C_{in}$, S, and $C_{out}$]
16 transistors
Extra delay: slower

Complementary PT Logic (CPL) FA
[Figure: CPL full adder schematic with complementary inputs A, B, $C_{in}$ and complementary outputs S and $C_{out}$]
28 transistors, dual rail
Voltage drop problems
Faster, lower power, and smaller area than full static CMOS

Mirror Adder
[Figure: mirror adder schematic producing $\overline{C_{out}}$ and $\overline{S}$ from A, B, $C_{in}$, with the kill, generate, 0-propagate, and 1-propagate paths marked and transistor sizes annotated]
24+4 transistors
$S = A B C_{in} + (A + B + C_{in})\,\overline{C_{out}}$
$C_{out} = AB + AC_{in} + BC_{in}$
PUN and PDN are symmetrical, not complemented

Mirror Adder Features
The NMOS and PMOS chains are completely symmetrical with a maximum of two series transistors in the carry circuitry, guaranteeing identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
When laying out the cell, the most critical issue is the minimization of the capacitances at node $\overline{C_{out}}$ (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances).
Shared diffusions can reduce the stack node capacitances.
The transistors connected to $C_{in}$ are placed closest to the output.

Mirror Adder Sizing Issues
Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimal size.
Assume a PMOS/NMOS ratio of 2. Each input in the carry circuit has a logical effort of 2, so the optimal fan-out for each is also 2.
Since $\overline{C_{out}}$ drives 2 internal and 2 inverter transistor gates (to form $C_{out}$ for the bit adder), the carry circuit should be oversized

Mirror Adder Stick Diagram
[Figure: stick diagram of the mirror adder between $V_{DD}$ and GND, with inputs A, B, $C_i$ and outputs $C_o$ and S]

Ripple Carry Adder (RCA)
[Figure: four full adders in a chain; bit i takes $A_i$, $B_i$, and carry $C_i$ and produces $S_i$ and $C_{i+1}$, with $C_0 = C_{in}$ and $C_{out} = C_4$]
Worst-case delay: $t_{ripple} = O(N)$
$t_{ripple} \approx t_{FA}(A, B \to C_{out}) + (N - 2)\,t_{FA}(C_{in} \to C_{out}) + t_{FA}(C_{in} \to S)$
Slow!
Make the fastest possible carry path

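A bit-level model of the ripple carry adder, together with the worst-case delay estimate, might look like the sketch below. The function names and the unit delay defaults are assumptions made for illustration only.

```python
def to_bits(x, n):
    """Integer to a list of n bits, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """List of bits (LSB first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Chain one full adder per bit; the carry ripples from bit 0 upward."""
    s_bits = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s_bits.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return s_bits, carry

def ripple_delay(n, t_ab_cout=1.0, t_cin_cout=1.0, t_cin_s=1.0):
    """Worst-case delay estimate for n >= 2 bits, following the slide formula."""
    return t_ab_cout + (n - 2) * t_cin_cout + t_cin_s

s_bits, c_out = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s_bits), c_out)   # 11 + 6 = 17, i.e. sum bits 0001 with carry out 1
```

With all three full adder delays set to one unit, the estimate grows linearly: `ripple_delay(4)` is 4 units and `ripple_delay(64)` is 64 units.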
Exploiting the Inversion Property
[Figure: four-bit ripple carry adder from $C_0 = C_{in}$ to $C_{out} = C_4$ built from alternating regular and inverted FA cells]
Now need two flavors of FAs
Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
The transistors in the carry-chain portion of the mirror adder need to be made larger.

Fast Carry Chain Design
The key to fast addition is a low-latency carry network
What matters is whether in a given position a carry is
Generated: $G_i = A_i B_i$
Propagated: $P_i = A_i \oplus B_i$ (sometimes use $A_i + B_i$)
Annihilated (killed): $K_i = \overline{A_i}\,\overline{B_i}$
Giving a carry recurrence of
$C_{i+1} = G_i + P_i C_i$
$C_1 = G_0 + P_0 C_0$
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$

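The expanded carry expressions can be checked against the recurrence $C_{i+1} = G_i + P_i C_i$. The sketch below uses hypothetical helper names and tries both definitions of P (XOR and OR) on all 4-bit operand pairs.

```python
from itertools import product

def all_ones(bits):
    """1 if every bit in the list is 1 (an empty list counts as 1)."""
    return int(all(bits))

def carries_by_recurrence(g, p, c0):
    """C_{i+1} = G_i + P_i C_i, evaluated bit by bit."""
    carries = [c0]
    for g_i, p_i in zip(g, p):
        carries.append(g_i | (p_i & carries[-1]))
    return carries

def carries_expanded(g, p, c0):
    """Each C_k written directly in terms of G, P, and C_0 (no dependence on C_{k-1})."""
    carries = [c0]
    for k in range(1, len(g) + 1):
        total = c0 & all_ones(p[0:k])                  # P_{k-1} ... P_0 C_0
        for j in range(k):
            total |= g[j] & all_ones(p[j + 1:k])       # P_{k-1} ... P_{j+1} G_j
        carries.append(total)
    return carries

def gp_bits(a, b, n, use_xor=True):
    """Per-bit generate and propagate lists, LSB first."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    if use_xor:
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    else:
        p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    return g, p

for a, b, c0 in product(range(16), range(16), (0, 1)):
    for use_xor in (True, False):
        g, p = gp_bits(a, b, 4, use_xor)
        assert carries_expanded(g, p, c0) == carries_by_recurrence(g, p, c0)
```

The expanded form trades the serial dependence for wider gates, which is what a lookahead or Manchester chain exploits in hardware.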
Manchester Carry Chain
Switches controlled by $G_i$ and $P_i$
Components of total delay:
time to form the switch control signals $G_i$ and $P_i$
setup time for the switches
signal propagation delay through N switches in the worst case
[Figure: two Manchester carry gate schematics between $C_i$ and $C_o$: a static version controlled by $G_i$, $D_i$, and $P_i$, and a domino version controlled by $G_i$ and $P_i$]

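4-bit Sliced MCC Adder
[Figure: four bit slices, each forming G and P from $A_i$ and $B_i$ and producing $S_i$; a clocked Manchester chain takes $\overline{C_0}$ through $\overline{C_1}$, $\overline{C_2}$, $\overline{C_3}$ to $\overline{C_4}$]
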
Domino MCC Circuit
[Figure: clocked four-bit domino Manchester carry chain from $C_{i,0}$ to $C_{i,4}$ with switches controlled by $P_0$ to $P_3$ and $G_0$ to $G_3$; transistor sizes annotated]
The successive chain nodes evaluate to:
$G_0 + P_0 C_0$
$G_1 + P_1 G_0 + P_1 P_0 C_0$
$G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$

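MCC Stick Diagram
[Figure: stick diagram between $V_{DD}$ and GND with a Propagate/Generate Row and an Inverter/Sum Row; cells for bits i and i+1 take $P_i$, $G_i$, $P_{i+1}$, $G_{i+1}$ and pass the carry from $C_{i-1}$ through $C_i$ to $C_{i+1}$]
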
Notes on MCC Adder
When clock is low, the carry nodes precharge; when clock goes high, if $G_i$ is high, $C_{i+1}$ is asserted (goes low)
To prevent $G_i$ from affecting $C_i$, the signal $P_i$ must be computed as the XOR (rather than the OR), which is not a problem since we need the XOR of $A_i$ and $B_i$ for computing the sum anyway
Delay is roughly proportional to $n^2$ (as n pass transistors are connected in series)
We usually limit each group to 4 stages, then buffer the carry chain with an inverter between each group

Binary Adder Landscape
Synchronous Word Parallel Adders
  Ripple Carry Adders (RCA): $t = O(N)$, $A = O(N)$
  Carry Prop Min Adders
    Signed-Digit Adders and Residue Adders: $t = O(1)$, $A = O(N)$
    Fast Carry Prop Adders
      Manchester Carry Chain: $t = O(N)$, $A = O(N)$
      Carry Select and Carry Skip: $t = O(\sqrt{N})$, $A = O(N)$
      Parallel Prefix and Conditional Sum: $t = O(\log N)$, $A = O(N \log N)$
Bit-Serial Adders
Asynchronous Adders

Carry-Skip (Carry-Bypass) Adder
[Figure: four-bit ripple block of FAs from $C_{i,0}$ to $C_{o,3}$; a multiplexer with inputs labeled 0 and 1, controlled by BP, selects between the rippled carry and $C_{i,0}$]
Block Propagate: $BP = P_0 P_1 P_2 P_3$
If $P_0 \,\&\, P_1 \,\&\, P_2 \,\&\, P_3 = 1$ then $C_{o,3} = C_{i,0}$; otherwise the block itself kills or generates the carry internally

Carry-Skip Chain Implementation
[Figure: carry chain from $C_{in}$ to $C_{out}$ with switches $P_0$ to $P_3$ and $G_0$ to $G_3$, plus a BP (By-Pass) switch connecting the block carry-in to the block carry-out]
Only 10% to 20% area overhead
Only two gate delays to produce $C_{out}$ if skip occurs

4-bit Block Carry-Skip Adder
[Figure: 16-bit adder in four blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with Setup, Carry Propagation, and Sum stages; the delays $t_{setup}$, $t_{carry}$, $t_{skip}$, and $t_{sum}$ are marked along the critical path starting at $C_{i,0}$]
Worst-case delay, carry from bit 0 to bit 15: the carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups, and ripples in the last group from bit 12 to bit 15. With B the group size in bits:
$t_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$

Optimal Block Size and Time
Assuming one stage of ripple ($t_{carry}$) has the same delay as one skip logic stage ($t_{skip}$) and both are 1
$t_{CSkA} = 1 + B + (N/B - 1) + B + 1 = 2B + N/B + 1$
The five terms are, in order: $t_{setup}$, ripple in block 0, skips, ripple in last block, $t_{sum}$
So the optimal block size, B, is
$dt_{CSkA}/dB = 0 \Rightarrow B_{opt} = \sqrt{N/2}$
And the optimal time is
Optimal $t_{CSkA} = 2\sqrt{2N} + 1$

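The optimum can be sanity-checked numerically. The sketch below evaluates the unit-delay model $t_{CSkA} = 2B + N/B + 1$ over block sizes that divide N; the function names are hypothetical and the restriction to divisors is a simplification made for this example.

```python
import math

def t_cska(n, b):
    """Unit-delay carry-skip model: setup + ripple + skips + ripple + sum."""
    return 2 * b + n / b + 1

def best_integer_block(n):
    """Best block size among the divisors of n (a simplification of this sketch)."""
    candidates = [b for b in range(1, n + 1) if n % b == 0]
    return min(candidates, key=lambda b: t_cska(n, b))

n = 32
b_opt = math.sqrt(n / 2)            # closed-form optimum, 4.0 for n = 32
t_opt = 2 * math.sqrt(2 * n) + 1    # closed-form minimum delay, 17.0 for n = 32
print(b_opt, t_opt, best_integer_block(n), t_cska(n, best_integer_block(n)))
```

For N = 32 the closed form and the search agree on B = 4 and a delay of 17 units; for other N the real-valued optimum generally has to be rounded to a usable block size.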
Variations of Carry-Skip Adders I
Variable block sized Carry-Skip Adders
A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks
Hence a carry-skip adder can have bigger blocks for the inner carries without increasing the overall delay
[Figure: carry-skip adder from $C_{in}$ to $C_{out}$ made of $N_B$ blocks]
$t_{CSkA} = 2B + O(N_B)$

Variations of Carry-Skip Adders II
Multiple Levels of Skip Logic
Carry-skip adders with a large number of bits suffer from linear carry propagation delay.
With higher levels of skip logic added, a carry-skip adder can skip more blocks at a time.
[Figure: carry-skip adder from $C_{in}$ to $C_{out}$ with skip level 1 and skip level 2; the level-2 skip uses the AND of the first-level skip signals (BPs)]
$t_{CSkA} = 2B + O(\log_B N)$

Carry-Skip Adder Comparisons
[Figure: chart comparing RCA, CSkA, and VSkA at 8, 16, 32, 48, and 64 bits on a scale from 0 to 70, with block sizes B=2 through B=6 marked]

Carry Select Adders
Idea: Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel) and then select the correct one
More cost effective than the ripple carry adder
[Figure: 4-bit carry-select block: Setup forms the Ps and Gs from the As and Bs; a "0" Carry Propagation chain and a "1" Carry Propagation chain run in parallel; a Multiplexer driven by $C_{in}$ selects the Cs and $C_{out}$; Sum Generation produces the Ss]

Carry Select Adder: Critical Path
[Figure: 16-bit carry-select adder in four blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with Setup, 0-carry and 1-carry chains, Mux, and Sum Gen, chained from $C_{in}$ to $C_{out}$]
$t_{add} = t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$

Square Root Carry Select Adders
[Figure: carry-select adder with blocks of increasing size: Bit 0-1, Bit 2-4, Bit 5-8, Bit 9-13, Bit 14-19, producing $S_{0-1}$ through $S_{14-19}$ from $C_{i,0}$. In unit delays, setup completes at (1); the block carry chains are ready at (3), (4), (5), (6), (7); the multiplexer outputs at (4), (5), (6), (7), (8); the final sum at (9)]
$t_{add} = t_{setup} + 2\,t_{carry} + \sqrt{N}\,t_{mux} + t_{sum}$
Balance delay by making later blocks bigger

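The benefit of growing block sizes can be illustrated with a small unit-delay model. This is a sketch under simplifying assumptions (every setup, carry, mux, and sum stage costs one unit, and both carry chains of a block start right after setup); the function name is made up for this example.

```python
def carry_select_delay(block_sizes, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Arrival time of the last sum bit for a carry-select adder.

    Each block ripples its "0" and "1" carry chains in parallel after setup;
    the real carry then passes through one multiplexer per block.
    """
    select_ready = 0                                  # carry into the first block
    for size in block_sizes:
        chains_ready = t_setup + size * t_carry       # both precomputed chains done
        select_ready = max(chains_ready, select_ready) + t_mux
    return select_ready + t_sum

linear_20 = carry_select_delay([4, 4, 4, 4, 4])       # equal blocks, 20 bits
sqrt_20 = carry_select_delay([2, 3, 4, 5, 6])         # growing blocks, 20 bits
print(linear_20, sqrt_20)
```

In this model the equal-block adder takes 11 unit delays, matching $t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$ with B = 4, while the growing-block arrangement takes 9 because each block's chains finish just as the select signal arrives.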
Adder Delays - Comparison
[Figure: plot of $t_p$ (in unit delays, 0 to 50) versus N (0 to 60) for the ripple adder, linear select, and square root select]

LookAhead - Basic Idea
[Figure: each bit position k, from 0 to N-1, forms $P_k$ from $A_k$, $B_k$ and combines it with $C_{i,k}$ from a shared Carry Network to produce $S_k$]
$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$

Look-Ahead: Topology
By expanding carry generation all the way:
$C_1 = G_0 + P_0 C_0$
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$
[Figure: single gate connected to $V_{DD}$ computing $C_{o,3}$ from $C_{i,0}$, $P_0$ to $P_3$, and $G_0$ to $G_3$]

Logarithmic Look-Ahead Adder
[Figure: two ways to combine inputs $A_0$ to $A_7$ into F: a linear chain with $t_p \sim N$ and a balanced tree with $t_p \sim \log_2(N)$]

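Parallel Prefix Adders (PPAs)
Define carry operator $\bullet$ on (G,P) signal pairs
$(G'', P'') \bullet (G', P') = (G, P)$
where
$G = G'' + P''G'$
$P = P''P'$
$\bullet$ is associative, i.e.,
$[(g''', p''') \bullet (g'', p'')] \bullet (g', p') = (g''', p''') \bullet [(g'', p'') \bullet (g', p')]$
[Figure: circuit cell implementing the operator]
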
PPA General Structure
Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel
$(G_0, P_0) \bullet (G_1, P_1) \bullet (G_2, P_2) \bullet \cdots \bullet (G_{N-2}, P_{N-2}) \bullet (G_{N-1}, P_{N-1})$
Since $\bullet$ is associative, we can group them in any order, but note that it is not commutative
Structure:
$P_i$, $G_i$ logic (1 unit delay)
$C_i$ parallel prefix logic tree (1 unit delay per level)
$S_i$ logic (1 unit delay)
Measures to consider
number of cells
tree cell depth (time)
tree cell area
cell fan-in and fan-out
max wiring length
wiring congestion
delay path variation (glitching)

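The carry operator and one possible prefix network can be modeled in software. The sketch below assumes the operator is $(G'', P'') \bullet (G', P') = (G'' + P''G', P''P')$ with the more significant pair on the left, and uses a Kogge-Stone-style schedule (combining at distances 1, 2, 4, ...); the function names are hypothetical.

```python
from itertools import product

def dot(hi, lo):
    """Carry operator: hi is the more significant (G, P) pair, lo the less significant."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def kogge_stone_carries(a, b, n, c_in=0):
    """Return [C_1, ..., C_n] for the n-bit addition a + b + c_in."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    dist = 1
    while dist < n:
        # Every position combines with the group ending dist bits below it.
        gp = [dot(gp[i], gp[i - dist]) if i >= dist else gp[i] for i in range(n)]
        dist *= 2
    # gp[i] now holds the group (G, P) over bits i..0; fold in the carry input.
    return [g | (p & c_in) for g, p in gp]

pairs = list(product((0, 1), repeat=2))
is_associative = all(
    dot(dot(x, y), z) == dot(x, dot(y, z))
    for x in pairs for y in pairs for z in pairs
)
is_commutative = all(dot(x, y) == dot(y, x) for x in pairs for y in pairs)
```

Associativity is what allows the tree to be regrouped freely, while the lack of commutativity means the pairs must stay in bit order.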

Brent-Kung PPA
[Figure: 16-bit Brent-Kung prefix tree taking $(G_0, P_0)$ to $(G_{15}, P_{15})$ and $C_{in}$ and producing $C_1$ to $C_{16}$]


Kogge-Stone PPF Adder
[Figure: 16-bit Kogge-Stone prefix tree taking $(G_0, P_0)$ to $(G_{15}, P_{15})$ and $C_{in}$ and producing $C_1$ to $C_{16}$]

More Adder Comparisons
[Figure: chart comparing RCA, CSkA, VSkA, and KS PPA at 8, 16, 32, 48, and 64 bits on a scale from 0 to 70]

Adder Speed Comparisons
[Figure: chart comparing RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K at 16, 32, and 64 bits on a scale from 10 to 70]

Adder Average Power Comparisons
[Figure: chart comparing RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K at 16, 32, and 64 bits on a scale from 0 to 35]

Binary Multiplication - Basics
Given two unsigned binary numbers X (M bits) and Y (N bits)
$X = \sum_{i=0}^{M-1} X_i 2^i \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j$
where $X_i, Y_j \in \{0, 1\}$
The multiplication operation $Z = X \times Y$ is
$\sum_{k=0}^{M+N-1} Z_k 2^k = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\left(\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}\right)$

Binary Multiplication Operation
Binary Multiplication as repeated additions
            1 0 1 0 1 0    multiplicand
    x           1 0 1 1    multiplier
      -----------------
            1 0 1 0 1 0
          1 0 1 0 1 0      partial product array
        0 0 0 0 0 0        (can be formed in parallel)
      1 0 1 0 1 0
      -----------------
      1 1 1 0 0 1 1 1 0    double precision product
[Figure labels: N, M, 2N, N mark the dimensions of the operands, the partial product array, and the product]

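The worked example can be reproduced by forming the partial product rows explicitly and summing them. This is an illustrative sketch; `partial_products` is a name invented for this example.

```python
def partial_products(x, y, n_bits_y):
    """Row j is the multiplicand shifted left by j when multiplier bit j is 1, else zero."""
    return [(x << j) if (y >> j) & 1 else 0 for j in range(n_bits_y)]

multiplicand = 0b101010     # 42
multiplier = 0b1011         # 11
rows = partial_products(multiplicand, multiplier, 4)
total = sum(rows)

for row in rows:
    print(format(row, "09b"))
print(format(total, "09b"))   # 111001110, i.e. 462
```

All rows depend only on the operands, which is why the array can be formed in parallel before any addition starts.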
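Shift-and-Add Multiplication
Right Shift and Add (N bits x N bits)
[Figure: datapath with Multiplicand and Multiplier registers, a selector with inputs 1 and 0 choosing between the multiplicand and 0, an N-bit Adder with an (N+1)-bit result, and a right shift with one bit out per step]
*Left shift requires 2N-bit adder
$t_{shift\&add\_mult} = O(N\,t_{adder}) = O(N^2)$ for an RCA
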
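A software model of right-shift-and-add multiplication is sketched below. It assumes the usual arrangement in which the multiplier shares a register with the low half of the product; the variable names `high` and `low` are hypothetical and not taken from the slide.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two n-bit unsigned numbers using one n-bit addition per step."""
    mask = (1 << n) - 1
    assert 0 <= multiplicand <= mask and 0 <= multiplier <= mask
    high = 0             # upper half of the running product (n bits plus a carry bit)
    low = multiplier     # lower half: multiplier bits are consumed LSB first
    for _ in range(n):
        if low & 1:
            high += multiplicand                    # only the upper half is added to
        # Shift the (high, low) pair right by one bit.
        low = (low >> 1) | ((high & 1) << (n - 1))
        high >>= 1
    return (high << n) | low

print(shift_add_multiply(0b1010, 0b1011, 4))   # 10 * 11 = 110
```

Because the product is shifted right instead of shifting the multiplicand left, each step needs only an n-bit adder, and the n steps in sequence give the $O(N\,t_{adder})$ delay.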
Improving Multipliers
Making them faster (therefore, bigger area)
Use faster adders
Use higher radix (e.g., base 4) multiplication
Use multiplier recoding to simplify multiple formation
Form partial product array in parallel and add it in parallel
Making them smaller (i.e., slower)
Use array multipliers
Very regular structure with only short wires to nearest neighbor cells. Thus, very simple and efficient layout in VLSI
Can be easily and efficiently pipelined

Array (or Tree) Multiplier Structure
[Figure: multiplier organization in three steps: PP Generation (multiple forming circuits, with a mux per partial product selecting 0 or D, driven by Q (multiplier) and D (multiplicand)), PP Accumulation (partial product array reduction tree), and Final Addition (fast carry propagate adder, CPA), giving P (product)]
Delay: mux + reduction tree (log N) + CPA (log N)

Partial Product (PP) Generation
Each row in the partial-product array is either a copy of the multiplicand or a row of zeros
[Figure: one row of gates combining $X_7$ to $X_0$ with $Y_i$ to form $PP_7$ to $PP_0$]
Careful optimization of the PP generation can lead to some substantial delay and area reduction.
Booth's and modified Booth's recoding

Array Multiplier Implementation
[Figure: 4x4 array multiplier with inputs $X_3$ to $X_0$ and $Y_0$ to $Y_3$ and outputs $Z_0$ to $Z_7$; each row of HA and FA cells is the hardware for one partial product; two critical paths, CP1 and CP2, are marked]
HA: Half Adder
FA: Full Adder
CP: Critical Path
$t_{array\_mult} = [(M - 1) + (N - 2)]\,t_{carry} + (N - 1)\,t_{sum} + t_{and} = O(N)$
* Assume $t_{add} = t_{carry}$

Carry-Save Multiplier
[Figure: 4x4 carry-save multiplier array of HA and FA cells followed by a Vector Merging Adder; labels: 6 HAs, 6 FAs, unique and shorter CP]
The idea is to save the (PP) carry and add it in the next adder stage
In the final addition a fast carry-propagate (e.g., carry-lookahead) adder is used.
$t_{CSM} = (N - 1)\,t_{carry} + t_{merge} + t_{and} = O(N)$

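The carry-save idea can be modeled at the word level: each stage reduces three operands to a sum word and a carry word without propagating carries, and only the final merge performs a carry-propagate addition. The sketch below uses Python integers as bit vectors; the function names are made up for illustration.

```python
def compress_3_2(a, b, c):
    """Bitwise full adders on three words: returns (s, carry) with a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries move one position up
    return s, carry

def carry_save_multiply(x, y, n):
    """Accumulate the partial products in carry-save form, then merge once."""
    rows = [(x << j) if (y >> j) & 1 else 0 for j in range(n)]
    s, c = 0, 0
    for row in rows:
        s, c = compress_3_2(s, c, row)   # no carry ripples inside this step
    return s + c                         # stands in for the vector-merging adder

print(carry_save_multiply(0b1010, 0b1011, 4))   # 110
```

The loop body has constant depth in hardware, so the long carry chain appears only once, in the final addition.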
CSM Floorplan
[Figure: floorplan with rows of HA Multiplier Cells and FA Multiplier Cells (each with S and C outputs), a row of Vector Merging Cells, inputs $X_0$ to $X_3$ and $Y_0$ to $Y_3$, and outputs $Z_0$ to $Z_7$]
X and Y signals are broadcast through the complete array.
Regularity makes the generation of the structure amenable to automation

Wallace-Tree Multiplier
[Figure: dot diagrams over bit positions 6 to 0 showing the Partial Products, the rearranged PPs, the First Stage, the Second Stage, and the Final Adder, with HA and FA groupings marked]
GOAL: Minimize depth (number of stages) with the minimum number of adder elements
Any type of adder can be used for the final adder

Wallace-Tree Multiplier Implementation
3 HAs and 3 FAs for the reduction process (stage 1 + stage 2)
Any type of adder can be used for the final adder

Notes on Wallace-Tree Multiplier
Wallace tree substantially saves hardware for large multipliers
Number of partial products is reduced to two-thirds per stage
The propagation delay is found to be bounded by
$t_{WTM} = O(\log_{3/2}(N))$
Although substantially faster than CSM, the WTM structure is very irregular
Difficulty in finding efficient VLSI layout
Many of today's high-performance multipliers use higher-order (e.g., 4-2) compressors instead of 3-2 compressors (FAs)

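The logarithmic depth can be illustrated by counting reduction stages at the row level. This sketch assumes each stage groups the rows in threes, replaces every full group by two rows (a layer of 3-2 compressors), and passes leftover rows through; it ignores the per-column details of a real Wallace tree. The function name is hypothetical.

```python
def wallace_stages(rows):
    """Number of 3-2 reduction stages needed to get from `rows` operands down to two."""
    stages = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3   # each full group of three becomes two
        stages += 1
    return stages

for n in (4, 8, 16, 32, 64):
    print(n, wallace_stages(n))
```

Under this model a 4-row array needs two stages, consistent with the stage 1 plus stage 2 reduction of the 4x4 example, and doubling the operand width adds only about two more stages.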
Parallel Programmable Shifters
[Figure: Shifter block with a Control input that sets the shift amount, shift direction, and shift type (logical, arith, circular)]
Shifters are used in multipliers, floating point units
Consume lots of area if done in random logic gates
Shifting a data word left or right over a constant amount is a trivial hardware operation and is implemented by the appropriate signal wiring

A Programmable Binary Shifter
[Figure: bit-slice i of a shifter connecting $A_i$, $A_{i-1}$ to $B_i$, $B_{i-1}$ under the control signals Right, Left, and nop]
Exactly one control signal is active

A_i  A_{i-1} | right  nop  left | B_i  B_{i-1}
A_1  A_0     | 0      1    0    | A_1  A_0
A_1  A_0     | 1      0    0    | 0    A_1
A_1  A_0     | 0      0    1    | A_0  0

4-bit Barrel Shifter
[Figure: array of pass transistors connecting inputs $A_0$ to $A_3$ to outputs $B_0$ to $B_3$, with control lines Sh0, Sh1, Sh2, Sh3]
Arithmetic shift
Example:
Sh0 = 1: $B_3 B_2 B_1 B_0 = A_3 A_2 A_1 A_0$
Sh1 = 1: $B_3 B_2 B_1 B_0 = A_3 A_3 A_2 A_1$
Sh2 = 1: $B_3 B_2 B_1 B_0 = A_3 A_3 A_3 A_2$
Sh3 = 1: $B_3 B_2 B_1 B_0 = A_3 A_3 A_3 A_3$
Area dominated by wiring

Notes on Barrel Shifter
Note that the signal goes through at most one FET (so constant propagation delay, in theory)
Also note that the FET diffusion capacitance on an output wire increases linearly with the shift width, but the FET diffusion capacitance on the input data lines increases quadratically (i.e., $N^2$ for a circular shifter)
Size of cell is bounded by the pitch of the metal wires.
A decoder is usually needed for the shift control signals, since the amount of shift is normally given as an (encoded) binary number.

4-bit Barrel Shifter Layout
[Figure: layout with control lines Sh3, Sh2, Sh1, Sh0, a Buffer, data inputs $A_3$ to $A_0$, and the dimension $Width_{barrel}$ marked]
$Width_{barrel} \approx 2\,p_m N$
N = max shift distance, $p_m$ = metal pitch

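8-bit Logarithmic Shifter
log N stages
[Figure: cascade of stages from $A_3$ to $A_0$ to $B_3$ to $B_0$; each stage either passes or shifts its input under the complementary controls $\overline{Sh1}$/Sh1, $\overline{Sh2}$/Sh2, $\overline{Sh3}$/Sh3]
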
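A behavioral model of a logarithmic shifter is sketched below: stage k either passes its input or shifts it by $2^k$, depending on bit k of the shift amount, so $\log_2 N$ stages cover every shift from 0 to N-1. The function name and the choice of a right shift with optional sign fill are assumptions made for this example.

```python
def log_shift_right(bits, amount, arithmetic=False):
    """Shift a word right by `amount` using log2(N) conditional stages.

    `bits` is a list of N bits, least significant first; N must be a power of two.
    """
    n = len(bits)
    assert n > 0 and n & (n - 1) == 0 and 0 <= amount < n
    fill = bits[-1] if arithmetic else 0     # sign bit for an arithmetic shift
    out = list(bits)
    k = 0
    while (1 << k) < n:
        if (amount >> k) & 1:                # stage k is enabled
            step = 1 << k
            out = out[step:] + [fill] * step
        k += 1
    return out

word = [(0b10110100 >> i) & 1 for i in range(8)]   # 180, LSB first
shifted = log_shift_right(word, 3)
print(sum(bit << i for i, bit in enumerate(shifted)))   # 180 >> 3 = 22
```

Each bit passes through at most $\log_2 N$ stages, in contrast with the single pass transistor of the barrel shifter.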
8-bit Logarithmic Shifter Layout Slice
[Figure: layout slice with inputs $A_0$ to $A_3$, outputs $B_0$ to $B_3$, and stages shifting by 1, 2, and 4]
$Width_{log} \approx p_m\,(2K + (1 + 2 + \cdots + 2^{K-1})) = p_m\,(2^K + 2K - 1)$
$K = \log_2 N$


Shifter Implementation Comparisons

N        K | Barrel width | Barrel speed | Logarithmic width    | Logarithmic speed
(general)  | 2 N p_m      | 1 + N diffs  | p_m (2^K + 2K - 1)   | K + 2 diffs
8        3 | 16 p_m       | 1 + 8        | 13 p_m               | 3 + 2
16       4 | 32 p_m       | 1 + 16       | 23 p_m               | 4 + 2
32       5 | 64 p_m       | 1 + 32       | 41 p_m               | 5 + 2
64       6 | 128 p_m      | 1 + 64       | 75 p_m               | 6 + 2

Barrel Shifter is better for small shifters (faster, not much bigger) while Log Shifter is preferred for larger shifters.
Log Shifters are always smaller
For large shifters we may have to start worrying about the number of pass transistors in series.

Decoders
Decodes inputs to activate one of many outputs
[Figure: 2-to-4 Decoder with inputs $In_0$, $In_1$, and Enable, and outputs $Out_0$ to $Out_3$]
$Out_0 = \overline{In_0}\,\overline{In_1}$
$Out_1 = In_0\,\overline{In_1}$
$Out_2 = \overline{In_0}\,In_1$
$Out_3 = In_0\,In_1$
Cost of 2-to-4 Decoder
two inverters, four 2-input NAND gates, four inverters plus enable logic
How about the cost for a 3-to-8, 4-to-16, etc. decoder?

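A behavioral decoder and a rough gate count are sketched below. The count simply extends the 2-to-4 recipe (one inverter per input, one n-input NAND per output, one inverter per output, enable logic not counted); that extension is an assumption of this example rather than a figure from the slides, and the function names are hypothetical.

```python
def decode(inputs, enable=1):
    """n-to-2**n decoder with active-high one-hot outputs.

    `inputs` lists the address bits starting with In_0 (least significant).
    """
    n = len(inputs)
    index = sum(bit << i for i, bit in enumerate(inputs))
    return [int(enable and k == index) for k in range(2 ** n)]

def decoder_gate_count(n):
    """Gate count if the 2-to-4 structure is scaled up directly (assumed structure)."""
    return {
        "input_inverters": n,
        "nand_gates": 2 ** n,
        "nand_fan_in": n,
        "output_inverters": 2 ** n,
    }

print(decode([1, 0]))             # In_0 = 1, In_1 = 0 selects Out_1
print(decoder_gate_count(4))      # a 4-to-16 decoder under the same recipe
```

The number of NAND gates doubles with every added input and their fan-in grows too, which is one reason larger decoders are usually built hierarchically.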
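Dynamic NOR Decoder
[Figure: precharged 2-to-4 NOR decoder between $V_{dd}$ and GND, with true and complemented inputs $A_0$ and $A_1$, a precharge signal, and outputs $B_0$ to $B_3$; example signal values and conducting transistors are annotated]
Active HIGH outputs
Capacitance of the output wires increases linearly with the decoder size
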
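Dynamic NAND Decoder
[Figure: precharged 2-to-4 NAND decoder with series transistors to GND, true and complemented inputs $A_0$ and $A_1$, a precharge signal, and outputs $B_0$ to $B_3$; example signal values and conducting transistors are annotated]
Active LOW outputs
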
Notes on Dynamic Decoders
In the dynamic NOR decoder, the signal goes through at most one FET
So constant propagation delay (in theory)
However, some output wires may have two or more parallel paths to GND, effectively shortening the transition time
In contrast, the signal in the dynamic NAND decoder passes through a series of FETs
The number of FETs rises linearly with the decoder size
Thus it will be slower than the NOR implementation if the gate capacitance dominates diffusion capacitance
For the NAND decoder all the input signals must be low during precharge, else $V_{dd}$ and GND will be connected!

Building Bigger Decoders
[Figure: multi-level decoder for address bits $A_4$ to $A_0$ with an enable input, built from a 1x2 decoder and several 2x4 decoders; example signal values are annotated]
Active low enable, active low output
Need to catch the output that goes to zero before it precharges again

Layout of Bit-Sliced Datapaths
[Figure: bit-sliced datapath layout with the following annotations]
Must have enough drive capacity to handle large fan-out
Sized for peak current
Horizontal gap for feeding signals to the cells downstream

Optimizing Bit-sliced Datapaths
[Figure: three layouts of the same datapath compared by area]
Without feedthroughs or pitch matching (4.2m^2)
With feedthroughs (3.2m^2)
With feedthroughs and pitch matching (2.2m^2)
