6.111 Fall 2007, Lecture 13
Today: Arithmetic: Multiplication
1. Simple multiplication
2. Two's complement multiplication
3. Speed: CSA & pipelining
4. Booth recoding
5. Behavioral transformations: fixed-coefficient multiplication, canonical signed digits, retiming
Acknowledgements:
R. Katz, Contemporary Logic Design, Addison Wesley Publishing Company, Reading, MA, 1993. (Chapter 5)
J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2003.
Kevin Atkinson, Alice Wang, Rex Min
Unsigned Multiplication
                            A3    A2    A1    A0
                      x     B3    B2    B1    B0
                      --------------------------
                          A3B0  A2B0  A1B0  A0B0
                    A3B1  A2B1  A1B1  A0B1
              A3B2  A2B2  A1B2  A0B2
+       A3B3  A2B3  A1B3  A0B3
------------------------------------------------

Each row A·B_i is called a "partial product".
Multiplying an N-bit number by an M-bit number gives an (N+M)-bit result.
Easy part: forming partial products (just an AND gate, since B_i is either 0 or 1).
Hard part: adding M N-bit partial products.
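As a minimal sketch of the two parts named above, the module below forms the four partial products with AND gates and then adds them with their binary weights. The module and signal names (`pp_mult4`, `pp0`..`pp3`) are illustrative assumptions, not taken from the lecture, and the adders are left to the synthesis tool rather than drawn as an explicit array.

```verilog
// Sketch: 4x4 unsigned multiply as a sum of shifted partial products.
// Names are hypothetical; this is not the lecture's own circuit.
module pp_mult4 (
  input  wire [3:0] a,
  input  wire [3:0] b,
  output wire [7:0] z
);
  // Easy part: each partial product is A ANDed with one bit of B.
  wire [3:0] pp0 = a & {4{b[0]}};
  wire [3:0] pp1 = a & {4{b[1]}};
  wire [3:0] pp2 = a & {4{b[2]}};
  wire [3:0] pp3 = a & {4{b[3]}};

  // Hard part: add the rows. Row i has weight 2^i, so it is
  // zero-extended to 8 bits and shifted left by i before the add.
  assign z = {4'b0, pp0}
           + ({4'b0, pp1} << 1)
           + ({4'b0, pp2} << 2)
           + ({4'b0, pp3} << 3);
endmodule
```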
1. Simple Multiplication
Sequential Multiplier
Assume the multiplicand (A) has N bits and the
multiplier (B) has M bits. If we only want to invest
in a single N-bit adder, we can build a sequential
circuit that processes a single partial product at a
time and then cycle the circuit M times:
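[Figure: datapath with N-bit registers A and P, M-bit register B, and one N-bit adder. The (N+1)-bit adder output (carry plus S_{N-1}..S_0) is shifted right into the combined P/B register; the LSB of B selects whether A is added.]

Init: P ← 0, load A and B
Repeat M times {
  P ← P + (B_LSB == 1 ? A : 0)
  shift P/B right one bit
}
Done: (N+M)-bit result in P/B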
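The loop above can be sketched in Verilog roughly as follows. This is one possible rendering, not the lecture's implementation: the module name, the `start`/`done` handshake and the 32-bit cycle counter are assumptions added for illustration, and M is assumed to be at least 2.

```verilog
// Sketch of the shift-and-add sequential multiplier (assumes M >= 2).
// The start/done handshake and counter are hypothetical additions.
module seq_mult #(parameter N = 8, parameter M = 8) (
  input  wire           clk,
  input  wire           start,    // high for one cycle: load A and B, clear P
  input  wire [N-1:0]   a,
  input  wire [M-1:0]   b,
  output wire [N+M-1:0] product,  // valid when done is high
  output wire           done
);
  reg [N-1:0] a_reg;
  reg [N-1:0] p;
  reg [M-1:0] b_reg;
  reg [31:0]  count;
  initial count = 0;

  // Single N-bit adder: P + (B_LSB ? A : 0), keeping the carry out.
  wire [N:0] sum = {1'b0, p} + (b_reg[0] ? {1'b0, a_reg} : {(N+1){1'b0}});

  always @(posedge clk) begin
    if (start) begin
      a_reg <= a;
      b_reg <= b;
      p     <= {N{1'b0}};
      count <= M;
    end else if (count != 0) begin
      // Shift the (N+1)-bit sum and the remaining bits of B right by one.
      {p, b_reg} <= {sum, b_reg[M-1:1]};
      count      <= count - 1;
    end
  end

  assign product = {p, b_reg};
  assign done    = (count == 0);
endmodule
```

After `start` is pulsed, the result appears in `product` M clock cycles later.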
Combinational Multiplier
- Partial product computations are simple (single AND gates)
- Propagation delay ~2N
[Figure: 4x4 array multiplier built from half adders (HA) and full adders (FA); inputs x3..x0 and y3..y0, outputs z7..z0]
2's Complement Multiplication (Baugh-Wooley)

Notation: ~XiYj is the complement of the product bit XiYj. Subtraction uses -B = ~B + 1.

Step 1: two's complement operands, so the high-order bit has weight -2^(N-1). Must sign-extend partial products and subtract the last one.

                             X3    X2    X1    X0
                        *    Y3    Y2    Y1    Y0
                        -------------------------
   X3Y0  X3Y0  X3Y0  X3Y0  X3Y0  X2Y0  X1Y0  X0Y0
 + X3Y1  X3Y1  X3Y1  X3Y1  X2Y1  X1Y1  X0Y1
 + X3Y2  X3Y2  X3Y2  X2Y2  X1Y2  X0Y2
 - X3Y3  X3Y3  X2Y3  X1Y3  X0Y3
 ------------------------------------------------
     Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Step 2: don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1).

   X3Y0  X3Y0  X3Y0  X3Y0  X3Y0  X2Y0  X1Y0  X0Y0
 +                            1
 + X3Y1  X3Y1  X3Y1  X3Y1  X2Y1  X1Y1  X0Y1
 +                      1
 + X3Y2  X3Y2  X3Y2  X2Y2  X1Y2  X0Y2
 +                1
 +~X3Y3 ~X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
 +                            1
 +          1
 -          1     1     1     1

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

                          ~X3Y0  X2Y0  X1Y0  X0Y0
 +                  ~X3Y1  X2Y1  X1Y1  X0Y1
 +            ~X3Y2  X2Y2  X1Y2  X0Y2
 +       X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
 +                            1
 -          1     1     1     1

Step 4: finish computing the constants. Modulo 2^8, 2^3 - (2^6 + 2^5 + 2^4 + 2^3) = -112 = 2^7 + 2^4, so a 1 is added in columns 7 and 4.

                          ~X3Y0  X2Y0  X1Y0  X0Y0
 +                  ~X3Y1  X2Y1  X1Y1  X0Y1
 +            ~X3Y2  X2Y2  X1Y2  X0Y2
 +       X3Y3 ~X2Y3 ~X1Y3 ~X0Y3
 +    1                 1

Result: multiplying 2's complement operands takes just about the same amount of hardware as multiplying unsigned operands!
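One way to check the final Baugh-Wooley array is to write it out directly and compare it with Verilog's signed `*`. The sketch below assumes the usual 4x4 form: the product bits involving exactly one sign bit are complemented, and a constant 1 is added in columns 7 and 4. The module name `bw_mult4` and row names `r0`..`r3` are hypothetical.

```verilog
// Sketch: 4x4 Baugh-Wooley multiplier written as four rows plus a constant.
// Assumed form: complement the X3*Yj and Xi*Y3 terms (not X3*Y3),
// then add 1 in bit positions 7 and 4.
module bw_mult4 (
  input  wire signed [3:0] x,
  input  wire signed [3:0] y,
  output wire signed [7:0] z
);
  // Row j holds the partial product for Y[j], already shifted left by j.
  wire [7:0] r0 = {4'b0, ~(x[3] & y[0]), x[2] & y[0], x[1] & y[0], x[0] & y[0]};
  wire [7:0] r1 = {3'b0, ~(x[3] & y[1]), x[2] & y[1], x[1] & y[1], x[0] & y[1], 1'b0};
  wire [7:0] r2 = {2'b0, ~(x[3] & y[2]), x[2] & y[2], x[1] & y[2], x[0] & y[2], 2'b0};
  wire [7:0] r3 = {1'b0, x[3] & y[3], ~(x[2] & y[3]), ~(x[1] & y[3]), ~(x[0] & y[3]), 3'b0};

  // Constant from step 4: ones in columns 7 and 4.
  wire [7:0] k = 8'b1001_0000;

  // Plain unsigned adds; the 8-bit sum wraps modulo 2^8.
  assign z = r0 + r1 + r2 + r3 + k;
endmodule
```

For example, x = -8 and y = -8 give rows 8, 16, 32 and 120; adding the constant 144 yields 320, which wraps to 64.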
2's Complement Multiplication
[Figure: 4x4 two's complement array multiplier built from HA and FA cells, with two constant 1 inputs; inputs x3..x0 and y3..y0, outputs z7..z0]
Multiplication in Verilog
You can use the "*" operator to multiply two numbers:
  wire [9:0] a,b;
  wire [19:0] result = a*b; // unsigned multiplication!
If you want Verilog to treat your operands as signed two's complement numbers, add the keyword signed to your wire or reg declaration:
  wire signed [9:0] a,b;
  wire signed [19:0] result = a*b; // signed multiplication!
Remember: unlike addition and subtraction, you need different circuitry if your multiplication operands are signed vs. unsigned. The same is true of the >>> (arithmetic right shift) operator. To get signed operations, all operands must be signed.
To make a signed constant: 10'sh37C
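The "all operands must be signed" rule is easy to trip over, so here is a small sketch of what happens when one operand is unsigned. The module and port names are made up for illustration; the `$signed({1'b0, u})` idiom is one common way to bring an unsigned value into a signed expression without changing its value.

```verilog
// Sketch: effect of mixing signed and unsigned operands in a multiply.
// Module and port names are hypothetical.
module signed_mult_demo (
  input  wire signed [9:0]  a,
  input  wire signed [9:0]  b,
  input  wire        [9:0]  u,        // unsigned operand
  output wire signed [19:0] p_ss,     // signed * signed
  output wire signed [19:0] p_mixed,  // one unsigned operand: unsigned multiply
  output wire signed [19:0] p_fixed   // u zero-extended, then treated as signed
);
  assign p_ss    = a * b;
  // Because u is unsigned, the whole expression is unsigned and
  // a is zero-extended rather than sign-extended.
  assign p_mixed = a * u;
  // Prepending a 0 keeps u's value non-negative once it is cast to signed.
  assign p_fixed = a * $signed({1'b0, u});
endmodule
```

With a = -3 and b = u = 2, `p_ss` and `p_fixed` are -6, while `p_mixed` is 2042, because the 10-bit pattern for -3 is read as 1021.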
Multipliers in the Virtex II
The Virtex-II FPGA has hardware multiplier circuits.
[Figure: 18 x 18 multiplier block]
Note that the operands are signed 18-bit numbers.
The ISE tools will often use these hardware multipliers when you use the "*" operator in Verilog. Or you can instantiate them directly yourself:
wire signed [17:0] a,b;
wire signed [35:0] result;
MULT18X18 mymult(.A(a),.B(b),.P(result));
3. Faster Multipliers: Carry-Save Adder
Last stage is still a carry-propagate adder (CPA)
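Good for pipelining: delay through each partial product (except the last) is just t_PD,AND + t_PD,FA. No carry propagation time!
[Figure: array multiplier in which each row is a carry-save adder (CSA); carries are passed down to the next row instead of rippling across]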
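A carry-save adder is just a row of independent full adders, which a short sketch makes concrete. The module name `csa` and the parameter `W` are illustrative choices. Three inputs are reduced to a sum word and a carry word with no carry chain between bit positions.

```verilog
// Sketch: W-bit carry-save adder (a row of W independent full adders).
// Satisfies a + b + c == sum + (carry << 1).
module csa #(parameter W = 8) (
  input  wire [W-1:0] a,
  input  wire [W-1:0] b,
  input  wire [W-1:0] c,
  output wire [W-1:0] sum,    // per-bit sum, weight 1
  output wire [W-1:0] carry   // per-bit carry, weight 2 (shift left when adding)
);
  assign sum   = a ^ b ^ c;
  assign carry = (a & b) | (a & c) | (b & c);
endmodule
```

Only the final stage, which adds `sum` and the shifted `carry`, needs a carry-propagate adder.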
Increasing Throughput: Pipelining
[Figure: array multiplier divided into pipeline stages; the marks on the cut lines denote registers]
Idea: split processing across several clock cycles by dividing the circuit into pipeline stages separated by registers that hold values passing from one stage to the next.
Throughput = 1 result per clock cycle (period is now 4*t_PD,FA instead of 8*t_PD,FA)
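As a rough behavioral sketch of the idea, the module below splits an 8x8 unsigned multiply into two stages. The split (low and high halves of `b`) and the names `pipe_mult8`, `lo`, `hi` are assumptions chosen for brevity; the slide cuts the adder array itself, which this sketch does not reproduce.

```verilog
// Sketch: two-stage pipelined 8x8 unsigned multiplier.
// Latency is 2 clock cycles; a new pair of operands is accepted every cycle.
module pipe_mult8 (
  input  wire        clk,
  input  wire [7:0]  a,
  input  wire [7:0]  b,
  output reg  [15:0] z
);
  reg [11:0] lo, hi;  // pipeline registers between the two stages

  always @(posedge clk) begin
    // Stage 1: two independent 8x4 products.
    lo <= a * b[3:0];
    hi <= a * b[7:4];
    // Stage 2: combine the products registered on the previous cycle.
    z  <= {4'b0, lo} + {hi, 4'b0};
  end
endmodule
```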
Wallace Tree Multiplier
Wallace Tree: combine groups of three bits at a time.
[Figure: tree of CSAs reducing the M partial products, followed by a final CPA; tree depth is O(log_1.5 M)]
A full adder used this way is called a 3:2 counter by multiplier hackers: it counts the number of 1's on the 3 inputs and outputs a 2-bit result.
Higher fan-in adders can be used to further reduce delays for large M. 4:2 compressors and 5:3 counters are popular building blocks.
4. Booth Recoding: Higher-radix mult.
        A_{N-1} A_{N-2} ... A_4 A_3 A_2 A_1 A_0
    x   B_{M-1} B_{M-2} ... B_3 B_2 B_1 B_0
[Figure: partial-product array with M/2 rows, each row formed from 2 bits of B]

B_{K+1,K}*A = 0*A → 0
            = 1*A → A
            = 2*A → 4A - 2A
            = 3*A → 4A - A

Idea: if we could use, say, 2 bits of the multiplier in generating each partial product, we would halve the number of columns and halve the latency of the multiplier!
Booth's insight: rewrite the 2*A and 3*A cases, and leave 4A for the next partial product to do!
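A behavioral sketch of radix-4 Booth recoding may make the bookkeeping concrete. Each overlapping triple {B[K+1], B[K], B[K-1]} selects 0, ±A or ±2A, where a 1 in B[K-1] stands for the 4A deferred from the previous pair. The module name and internal signal names are hypothetical, M is assumed even, and the loop is written for simulation clarity rather than as an optimized datapath.

```verilog
// Sketch: radix-4 Booth multiplier for signed operands (assumes M is even).
module booth_mult #(parameter N = 8, parameter M = 8) (
  input  wire signed [N-1:0]   a,
  input  wire signed [M-1:0]   b,
  output reg  signed [N+M-1:0] z
);
  integer k;
  reg        [M:0]     b_ext;  // b with B[-1] = 0 appended on the right
  reg signed [N+M-1:0] a_ext;  // multiplicand sign-extended to result width
  reg signed [N+M-1:0] pp;     // recoded partial product

  always @* begin
    b_ext = {b, 1'b0};
    a_ext = a;                 // signed assignment sign-extends
    z     = 0;
    for (k = 0; k < M; k = k + 2) begin
      // b_ext index is offset by one: {B[K+1], B[K], B[K-1]}
      case ({b_ext[k+2], b_ext[k+1], b_ext[k]})
        3'b001, 3'b010: pp = a_ext;           // add A
        3'b011:         pp = a_ext <<< 1;     // add 2*A
        3'b100:         pp = -(a_ext <<< 1);  // sub 2*A
        3'b101, 3'b110: pp = -a_ext;          // sub A
        default:        pp = 0;               // 000 and 111: add 0
      endcase
      z = z + (pp <<< k);      // each bit pair is weighted by 2^K
    end
  end
endmodule
```

Only M/2 partial products are summed, which is the halving the slide refers to.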
Booth recoding
B_{K+1}  B_K  B_{K-1}   action
   0      0      0      add 0
   0      0      1      add A
   0      1      0      add A
   0      1      1      add 2*A
   1      0      0      sub 2*A
   1      0      1      sub A      (-2*A + A)
   1      1      0      sub A
   1      1      1      add 0      (-A + A)

B_{K+1}, B_K: current bit pair. B_{K-1}: from previous bit pair.
A "1" in B_{K-1} means the previous stage needed to add 4*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4*A in the previous stage is like adding A in this stage!

5. Behavioral Transformations
- There are a large number of implementations of the same functionality
- These implementations represent different points in the area-time-power design space
- Behavioral transformations allow exploring the design space at a high level
[Figure: design space with axes area, time, power]
Optimization metrics:
1. Area of the design
2. Throughput or sample time T_S
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task
6. ...
Fixed-Coefficient Multiplication
Conventional multiplication: Z = X · Y

                            X3    X2    X1    X0
                      x     Y3    Y2    Y1    Y0
                      --------------------------
                          X3Y0  X2Y0  X1Y0  X0Y0
                    X3Y1  X2Y1  X1Y1  X0Y1
              X3Y2  X2Y2  X1Y2  X0Y2
+       X3Y3  X2Y3  X1Y3  X0Y3
------------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Constant multiplication (becomes hardwired shifts and adds): Z = X · (1001)_2

                            X3    X2    X1    X0
                      x      1     0     0     1
                      --------------------------
                            X3    X2    X1    X0
+         X3    X2    X1    X0
------------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Y = (1001)_2 = 2^3 + 2^0, so Z = (X << 3) + X. The shift is implemented using wiring only.
[Figure: X feeds an adder directly and through a "<< 3" wiring shift; the adder output is Z]
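In Verilog the fixed-coefficient case reduces to a concatenation and one adder. The sketch below hardwires the slide's constant Y = (1001)_2; the module name `mult_by_9` is an illustrative assumption.

```verilog
// Sketch: multiply a 4-bit unsigned X by the constant (1001)_2 = 9.
// The shift by 3 is pure wiring (a concatenation); only one adder remains.
module mult_by_9 (
  input  wire [3:0] x,
  output wire [7:0] z
);
  assign z = {1'b0, x, 3'b000}   // X << 3
           + {4'b0000, x};       // X << 0
endmodule
```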
Transform: Canonical Signed Digits (CSD)
Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.

Iterative encoding: replace a string of consecutive 1's

  0 1 1 ... 1 1   →   1 0 0 ... 0 -1
  2^(N-2) + ... + 2^1 + 2^0   =   2^(N-1) - 2^0

Worst case: CSD has 50% non-zero bits.

Example (replace 1 with 2 - 1):

  0  1  1  0  1  1  1  1
  0  1  1  1  0  0  0 -1
  1  0  0 -1  0  0  0 -1

[Figure: X feeds "<< 7" and "<< 4" shifters whose outputs are combined with X to form Z]
Shift translates to re-wiring.
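To see what the CSD form buys in hardware, the sketch below multiplies by an illustrative constant, 111 = (01101111)_2, whose CSD form 2^7 - 2^4 - 2^0 uses the same `<< 7` and `<< 4` shifts as the figure. The constant, the module name and the bit widths are assumptions made for this example. The plain binary form would need five adders for its six nonzero bits; the CSD form needs two subtractors.

```verilog
// Sketch: multiply an 8-bit unsigned X by 111 using its CSD form.
// 111 = 128 - 16 - 1, so Z = (X << 7) - (X << 4) - X.
module mult_by_111 (
  input  wire [7:0]  x,
  output wire [14:0] z   // 255 * 111 = 28305 fits in 15 bits
);
  assign z = {x, 7'b0}         // X << 7 (wiring only)
           - {3'b0, x, 4'b0}   // X << 4 (wiring only)
           - {7'b0, x};        // X
endmodule
```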
Algebraic Transformations
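Commutativity: A + B = B + A
[Figure: an adder with inputs A, B is equivalent to one with inputs B, A]
Distributivity: (A + B) C = AC + BC
[Figure: one adder followed by one multiplier is equivalent to two multipliers followed by one adder]
Associativity: (A + B) + C = A + (B + C)
[Figure: two adder chains with the operands regrouped]
Common sub-expressions
[Figure: an operation on X and Y that appears twice is computed once and its result shared]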
Transforms for Efficient Resource Utilization
[Figure: dataflow graph with inputs A..I evaluated in two time steps (1, 2)]
Time multiplexing: mapped to 3 multipliers and 3 adders.
[Figure: the same graph after applying distributivity]
Reduce number of operators to 2 multipliers and 2 adders.
Retiming: A very useful transform
Retiming is the action of moving delays around in the system.
- Delays have to be moved from ALL inputs to ALL outputs or vice versa
Cutset retiming: a cutset intersects the edges such that cutting them would result in two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges or vice versa.
Benefits of retiming:
- Modify critical path delay
- Reduce total number of registers
[Figure: dataflow graphs showing delay elements (D) before and after retiming]
Pipelining, Just Another Transformation
(Pipelining = Adding Delays + Retiming)
How to pipeline:
1. Add extra registers at all inputs (or, equivalently, all outputs)
2. Retime
Contrary to retiming, pipelining adds extra registers to the system.
[Figure: original graph → add input registers → retime, showing the delay elements (D) at each step]
The Power of Transforms: Lookahead
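y(n) = x(n) + A y(n-1)
[Figure: first-order recursive structure with an adder, a delay D and a multiplier A in the feedback loop. Try pipelining this structure.]
Loop unrolling:
y(n) = x(n) + A[x(n-1) + A y(n-2)]
[Figure: the unrolled structure is transformed in turn by distributivity, associativity and retiming. The final structure has a feedback loop with delay 2D and a precomputed coefficient A^2, with the x(n-1) term formed through a delay D and a multiplier A.]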
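Writing out the algebra behind the loop-unrolling and distributivity steps may help. The first two lines are the slide's equations; the third is obtained by expanding the bracket and is not on the slide.

```latex
\begin{align*}
y(n) &= x(n) + A\,y(n-1) \\
     &= x(n) + A\,\bigl[x(n-1) + A\,y(n-2)\bigr] && \text{(loop unrolling)} \\
     &= x(n) + A\,x(n-1) + A^{2}\,y(n-2)         && \text{(distributivity)}
\end{align*}
```

The feedback term now spans two delays, and the coefficient A^2 is a constant that can be computed ahead of time, which is what leaves room for a pipeline register inside the loop.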
Summary
Simple multiplication:
  O(N) delay
  Two's complement easily handled (Baugh-Wooley)
Faster multipliers:
  Wallace Tree O(log N)
Booth recoding:
  Add using 2 bits at a time
Behavioral Transformations:
  Faster circuits using pipelining and algebraic properties