6.111 Lecture 13

Today: Arithmetic: Multiplication
1. Simple multiplication
2. Two's complement mult.
3. Speed: CSA & Pipelining
4. Booth recoding
5. Behavioral transformations: Fixed-coef. mult., Canonical Signed Digits, Retiming

Acknowledgements:
R. Katz, Contemporary Logic Design, Addison Wesley Publishing Company, Reading, MA, 1993. (Chapter 5)
J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2003.
Kevin Atkinson, Alice Wang, Rex Min

6.111 Fall 2007 Lecture 13, Slide 2

1. Simple Multiplication

Unsigned Multiplication

                      A3    A2    A1    A0
                    x B3    B2    B1    B0
    ----------------------------------------
                      A3B0  A2B0  A1B0  A0B0
                A3B1  A2B1  A1B1  A0B1
          A3B2  A2B2  A1B2  A0B2
  + A3B3  A2B3  A1B3  A0B3

Each row A*B_i is called a "partial product".

Multiplying an N-bit number by an M-bit number gives an (N+M)-bit result.

Easy part: forming partial products (just an AND gate, since B_i is either 0 or 1).
Hard part: adding M N-bit partial products.

Sequential Multiplier

Assume the multiplicand (A) has N bits and the multiplier (B) has M bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single partial product at a time and then cycle the circuit M times:

[Figure: sequential multiplier datapath with registers A, P and B (M bits). An N-bit adder adds A, gated by the LSB of B, to P and produces the (N+1)-bit sum S_N ... S_0.]

    Init: P <- 0, load A and B
    Repeat M times {
      P <- P + (B_LSB == 1 ? A : 0)
      shift P/B right one bit
    }
    Done: (N+M)-bit result in P/B

Combinational Multiplier

- Partial product computations are simple (single AND gates)

[Figure: 4x4 array multiplier built from half adders (HA) and full adders (FA); inputs x3..x0 and y3..y0, outputs z7..z0.]
- Propagation delay ~2N adder-cell delays for N-bit operands

2's Complement Multiplication

Step 1: two's complement operands, so the high-order bit has weight -2^(N-1). Must sign extend the partial products and subtract the last one.

                                  X3     X2     X1     X0
                            *     Y3     Y2     Y1     Y0
    -------------------------------------------------------
     X3Y0   X3Y0   X3Y0   X3Y0   X3Y0   X2Y0   X1Y0   X0Y0
  +  X3Y1   X3Y1   X3Y1   X3Y1   X2Y1   X1Y1   X0Y1
  +  X3Y2   X3Y2   X3Y2   X2Y2   X1Y2   X0Y2
  -  X3Y3   X3Y3   X2Y3   X1Y3   X0Y3
    -------------------------------------------------------
      Z7     Z6     Z5     Z4     Z3     Z2     Z1     Z0

Step 2: we don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert the subtraction into an add of (complement + 1), using -B = ~B + 1. Here ~ marks a complemented bit.

     X3Y0   X3Y0   X3Y0   X3Y0   X3Y0   X2Y0   X1Y0   X0Y0
  +                                1
  +  X3Y1   X3Y1   X3Y1   X3Y1   X2Y1   X1Y1   X0Y1
  +                         1
  +  X3Y2   X3Y2   X3Y2   X2Y2   X1Y2   X0Y2
  +                  1
  + ~X3Y3  ~X3Y3  ~X2Y3  ~X1Y3  ~X0Y3
  +                                1
  +           1
  -           1      1      1      1

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

                                ~X3Y0   X2Y0   X1Y0   X0Y0
  +                      ~X3Y1   X2Y1   X1Y1   X0Y1
  +               ~X3Y2   X2Y2   X1Y2   X0Y2
  +         X3Y3  ~X2Y3  ~X1Y3  ~X0Y3
  +                                1
  -           1      1      1      1

Step 4: finish computing the constants.

                                ~X3Y0   X2Y0   X1Y0   X0Y0
  +                      ~X3Y1   X2Y1   X1Y1   X0Y1
  +               ~X3Y2   X2Y2   X1Y2   X0Y2
  +         X3Y3  ~X2Y3  ~X1Y3  ~X0Y3
  +    1                    1

Result: multiplying 2's complement operands takes just about the same amount of hardware as multiplying unsigned operands! (Baugh-Wooley)

2's Complement Multiplication

[Figure: the 4x4 HA/FA array from the combinational multiplier, with one extra half adder and the two constant 1 inputs; inputs x3..x0 and y3..y0, outputs z7..z0.]

Multiplication in Verilog

You can use the "*" operator to multiply two numbers:

    wire [9:0] a,b;
    wire [19:0] result = a*b;   // unsigned multiplication!

If you want Verilog to treat your operands as signed two's complement numbers, add the keyword signed to your wire or reg declaration:

    wire signed [9:0] a,b;
    wire signed [19:0] result = a*b;   // signed multiplication!
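The Baugh-Wooley array from Step 4 can be checked exhaustively with a small behavioral model. This is only a sketch in Python, not hardware: the function names `to_signed` and `baugh_wooley_4x4` are made up for this illustration, and the model assumes the 4x4 case with complemented terms wherever exactly one sign bit (X3 or Y3) is involved, plus constant 1s at bit positions 7 and 4.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley_4x4(x, y):
    """Multiply two raw 4-bit patterns (0..15) as two's complement numbers.

    Returns the raw 8-bit pattern of the product, built only from AND terms,
    complemented AND terms and two constant 1s (the Step 4 array).
    """
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> j) & 1 for j in range(4)]
    total = 0
    for j in range(4):              # one partial-product row per bit of y
        for i in range(4):
            bit = xb[i] & yb[j]
            # Complement every term that involves exactly one sign bit.
            if (i == 3) != (j == 3):
                bit ^= 1
            total += bit << (i + j)
    total += (1 << 7) + (1 << 4)    # the two constants left after Step 4
    return total & 0xFF             # keep 8 result bits, drop carries out


if __name__ == "__main__":
    for x in range(16):
        for y in range(16):
            expected = to_signed(x, 4) * to_signed(y, 4)
            assert to_signed(baugh_wooley_4x4(x, y), 8) == expected
    print("all 256 operand pairs match")
```

For example, `baugh_wooley_4x4(0b1111, 0b1111)` returns 1, since (-1) * (-1) = 1, whereas the same bit patterns multiplied as unsigned numbers give 15 * 15 = 225.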
Remember: unlike addition and subtraction, you need different circuitry if your multiplication operands are signed vs. unsigned. The same is true of the >>> (arithmetic right shift) operator. To get signed operations, all operands must be signed.

To make a signed constant: 10'sh37C

Multipliers in the Virtex II

The Virtex FPGA has hardware multiplier circuits:

[Figure: Virtex-II hardware multiplier block.]

Note that the operands are signed 18-bit numbers.

The ISE tools will often use these hardware multipliers when you use the "*" operator in Verilog. Or you can instantiate them directly yourself:

    wire signed [17:0] a,b;
    wire signed [35:0] result;
    MULT18X18 mymult(.A(a),.B(b),.P(result));

3. Faster Multipliers: Carry-Save Adder

[Figure: multiplier array built from carry-save adders (CSA).]

Last stage is still a carry-propagate adder (CPA).

Good for pipelining: delay through each partial product (except the last) is just t_PD,AND + t_PD,FA. No carry propagation time!

Increasing Throughput: Pipelining

[Figure: pipelined multiplier array; the marked positions are registers.]

Idea: split processing across several clock cycles by dividing the circuit into pipeline stages separated by registers that hold values passing from one stage to the next.

Throughput = 1 result per clock cycle (period is now 4*t_PD,FA instead of 8*t_PD,FA)

Wallace Tree Multiplier

[Figure: tree of CSAs feeding a final CPA; tree depth is O(log_1.5 M).]

Wallace Tree: combine groups of three bits at a time.

A full adder used this way is called a 3:2 counter by multiplier hackers: it counts the number of 1's on its 3 inputs and outputs a 2-bit result.

Higher fan-in adders can be used to further reduce delays for large M. 4:2 compressors and 5:3 counters are popular building blocks.

4. Booth Recoding: Higher-radix mult.

      A_(N-1)  A_(N-2)  ...  A4  A3  A2  A1  A0
    x B_(M-1)  B_(M-2)  ...          B3  B2  B1  B0

[Figure: the multiplier bits are taken 2 at a time, giving M/2 partial products.]

    B_(K+1,K)*A = 0*A -> 0
                = 1*A -> A
                = 2*A -> 4A - 2A
                = 3*A -> 4A - A

Idea: If we could use, say, 2 bits of the multiplier in generating each partial product, we would halve the number of columns and halve the latency of the multiplier!

Booth's insight: rewrite the 2*A and 3*A cases, leave 4A for the next partial product to do!

Booth recoding

B_(K+1) and B_K are the current bit pair; B_(K-1) comes from the previous bit pair.

    B_(K+1)  B_K  B_(K-1)   action
       0      0      0      add 0
       0      0      1      add A
       0      1      0      add A
       0      1      1      add 2*A
       1      0      0      sub 2*A
       1      0      1      sub A     (-2*A + A)
       1      1      0      sub A
       1      1      1      add 0     (-A + A)

A "1" in the B_(K-1) bit means the previous stage needed to add 4*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4*A in the previous stage is like adding A in this stage!

- There are a large number of implementations of the same functionality
- Each implementation represents a different point in the area-time-power design space
- Behavioral transformations allow exploring the design space at a high level

[Figure: design space with axes area, time, power.]

Optimization metrics:
1. Area of the design
2. Throughput or sample time T_S
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task
6.
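Returning to the radix-4 Booth recoding table in section 4: each row of the table equals the digit -2*B_(K+1) + B_K + B_(K-1), so the recoding can be modelled in a few lines. The sketch below is illustrative Python, not the lecture's hardware; the function names are hypothetical, and it assumes an m-bit two's complement multiplier with m even and B_(-1) taken as 0.

```python
# Booth recoding table: (B_{K+1}, B_K, B_{K-1}) -> multiple of A to add.
ACTION = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_digits(b, m):
    """Recode the m-bit two's complement pattern b into radix-4 digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first.
    Assumes m is even.
    """
    digits = []
    prev = 0                          # B_{-1} is taken as 0
    for k in range(0, m, 2):
        bk = (b >> k) & 1
        bk1 = (b >> (k + 1)) & 1
        digits.append(ACTION[(bk1, bk, prev)])
        prev = bk1                    # becomes B_{K-1} of the next pair
    return digits


def booth_multiply(a, b, m):
    """Multiply integer a by the m-bit two's complement pattern b."""
    product = 0
    for k, digit in enumerate(booth_radix4_digits(b, m)):
        product += (digit * a) << (2 * k)   # each stage is shifted by 2 bits
    return product


if __name__ == "__main__":
    print(booth_radix4_digits(0b0110, 4))   # [-2, 2], i.e. -2 + 2*4 = 6
    print(booth_multiply(5, 0b0110, 4))     # 30
```

Only m/2 partial products are generated, and each one is 0, A, 2A or a negation of those, which is the point of the recoding: no 3*A term ever has to be formed.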
5. Behavioral Transformations

Fixed-Coefficient Multiplication

Conventional multiplication: Z = X * Y

                            X3    X2    X1    X0
                          x Y3    Y2    Y1    Y0
    ----------------------------------------------
                            X3Y0  X2Y0  X1Y0  X0Y0
                      X3Y1  X2Y1  X1Y1  X0Y1
                X3Y2  X2Y2  X1Y2  X0Y2
        + X3Y3  X2Y3  X1Y3  X0Y3
    ----------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Constant multiplication (becomes hardwired shifts and adds): Z = X * (1001)_2

                            X3    X2    X1    X0
                          x 1     0     0     1
    ----------------------------------------------
                            X3    X2    X1    X0
        + X3    X2    X1    X0
    ----------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0

Y = (1001)_2 = 2^3 + 2^0, so Z = (X << 3) + X. The shifts are done using wiring.

Transform: Canonical Signed Digits (CSD)

Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.

Iterative encoding: replace a string of consecutive 1's

    0  1  1  ...  1  1     =  2^(N-2) + ... + 2^1 + 2^0
    1  0  0  ...  0 -1     =  2^(N-1) - 2^0

Example (replace 1 with 2 - 1):

    0  1  1  0  1  1  1  1
    0  1  1  1  0  0  0 -1
    1  0  0 -1  0  0  0 -1

[Figure: Z formed from X using the shifted copies X << 7 and X << 4. Shift translates to re-wiring.]

Worst case CSD has 50% non-zero bits.

Algebraic Transformations

- Commutativity: A + B = B + A
- Distributivity: (A + B) C = AC + BC
- Associativity: (A + B) + C = A + (B + C)
- Common sub-expressions

[Figure: data-flow graphs before and after each transformation.]

Transforms for Efficient Resource Utilization

[Figure: data-flow graph with inputs A through I, before and after applying distributivity.]

Time multiplexing: mapped to 3 multipliers and 3 adders.
After distributivity: the number of operators is reduced to 2 multipliers and 2 adders.

Retiming: A very useful transform

Retiming is the action of moving delay around in the system.
- Delays have to be moved from ALL inputs to ALL outputs or vice versa

[Figure: delay elements (D) moved across a node.]

Cutset retiming: a cutset is a set of edges that, when cut, splits the graph into two disjoint partitions.
To retime, delays are moved from the ingoing to the outgoing edges or vice versa.

Benefits of retiming:
- Modify critical path delay
- Reduce total number of registers

Pipelining, Just Another Transformation
(Pipelining = Adding Delays + Retiming)

How to pipeline:
1. Add extra registers at all inputs (or, equivalently, all outputs)
2. Retime

[Figure: a data-flow graph after adding input registers, then after retiming.]

Contrary to retiming, pipelining adds extra registers to the system.

The Power of Transforms: Lookahead

    y(n) = x(n) + A y(n-1)

Try pipelining this structure.

Apply loop unrolling, then distributivity, associativity and retiming, with A^2 precomputed:

    y(n) = x(n) + A[x(n-1) + A y(n-2)]

[Figure: the first-order loop with one delay (D), and the transformed loop with two delays (2D) and the precomputed coefficient A^2.]

Summary

- Simple multiplication: O(N) delay
- Two's complement easily handled (Baugh-Wooley)
- Faster multipliers: Wallace Tree O(log N)
- Booth recoding: Add using 2 bits at a time
- Behavioral Transformations: Faster circuits using pipelining and algebraic properties
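The lookahead transformation can be checked numerically: the original recurrence y(n) = x(n) + A y(n-1) and the transformed one y(n) = x(n) + A x(n-1) + A^2 y(n-2) should produce the same output sequence. The Python sketch below is a software model only; the function names are hypothetical, and it assumes zero initial state (x and y are 0 before n = 0) and an integer coefficient so the comparison is exact.

```python
def iir_direct(x, a):
    """y(n) = x(n) + a*y(n-1), with y(-1) = 0."""
    y = []
    prev = 0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def iir_lookahead(x, a):
    """y(n) = x(n) + a*x(n-1) + a^2*y(n-2), with zero initial state."""
    a2 = a * a                        # precomputed coefficient A^2
    y = []
    for n, sample in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0
        y2 = y[n - 2] if n >= 2 else 0
        # The feedback path now spans two delays instead of one.
        y.append(sample + a * x1 + a2 * y2)
    return y


if __name__ == "__main__":
    samples = [1, 0, 0, 0]
    print(iir_direct(samples, 2))     # [1, 2, 4, 8]
    print(iir_lookahead(samples, 2))  # [1, 2, 4, 8]
```

In the transformed form the feedback loop contains two delays, so one of them can be retimed into the multiply-add path; that is what makes the loop pipelinable.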