Alu

Arithmetic and Logic Unit
Ajit Pal
Professor
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
INDIA-721302
Computer Architecture and Organization
Ajit Pal, IIT Kharagpur
Arithmetic and Logic Unit
ALLU
Arithmetic
Where we've been:
performance
abstractions
instruction set architecture
assembly language and machine language
What's up ahead:
implementing the architecture
32
32
32
operation
result
a
b
ALU
Bits are just bits (no inherent meaning)
conventions define relationship between bits and numbers
Binary integers (base 2)
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...

decimal: 0, , 2
n
-1
Of course it gets more complicated:
bit strings are finite, but
for some fractions and real numbers, finitely many bits is not
enough, so
overflow & approximation errors: e.g., represent 1/3 as binary!
negative integers
How do we represent negative integers?
which bit patterns will represent which integers?
Numbers
Sign Magnitude: One's Complement Two's Complement
000 = 0 000 = 0 000 = 0
001 = +1 001 = +1 001 = +1
010 = +2 010 = +2 010 = +2
011 = +3 011 = +3 011 = +3
100 = 0 100 = -3 100 = -4
101 = -1 101 = -2 101 = -3
110 = -2 110 = -1 110 = -2
111 = -3 111 = 0 111 = -1
Issues:
balance equal number of negatives and positives
ambiguous zero whether more than one zero representation
ease of arithmetic operations
Which representation is best? Can we get both balance and non-ambiguous
zero?
Possible Representations
a
m
b
i
g
u
o
u
s

z
e
r
o

a
m
b
i
g
u
o
u
s

z
e
r
o

Representation Formulae
Twos complement:

x
n
x
n-1
x
0
= x
n
* -2
n
+ x
n-1
* 2
n-1
+ + x
0
* 2
0

or

x
n
X = x
n
* -2
n
+ X (writing rightmost n bits x
n-1
x
0
as X)

= X, if x
n
= 0
-2
n
+ X, if x
n
= 1

Ones complement:

x
n
X = X, if x
n
= 0
-2
n
+ 1 + X, if x
n
= 1

32 bit signed numbers:

0000 0000 0000 0000 0000 0000 0000 0000
two
= 0
ten
0000 0000 0000 0000 0000 0000 0000 0001
two
= + 1
ten
0000 0000 0000 0000 0000 0000 0000 0010
two
= + 2
ten
...

0111 1111 1111 1111 1111 1111 1111 1110
two
= + 2,147,483,646
ten
0111 1111 1111 1111 1111 1111 1111 1111
two
= + 2,147,483,647
ten
1000 0000 0000 0000 0000 0000 0000 0000
two
= 2,147,483,648
ten
1000 0000 0000 0000 0000 0000 0000 0001
two
= 2,147,483,647
ten
1000 0000 0000 0000 0000 0000 0000 0010
two
= 2,147,483,646
ten
...

1111 1111 1111 1111 1111 1111 1111 1101
two
= 3
ten
1111 1111 1111 1111 1111 1111 1111 1110
two
= 2
ten
1111 1111 1111 1111 1111 1111 1111 1111
two
= 1
ten

MIPS 2s complement
maxint
minint
Negative integers are exactly those that have leftmost bit 1
Negation Shortcut: To negate any two's complement
integer (except for minint) invert all bits and add 1
note that negate and invert are different operations!
why does this work? Remember we dont know how to add in 2s
complement yet! Later!
Sign Extension Shortcut: To convert an n-bit integer into an
integer with more than n bits i.e., to make a narrow integer
fill a wider word replicate the most significant bit (msb) of
the original number to fill the new bits to its left
Example: 4-bit 8-bit
0010 = 0000 0010
1010 = 1111 1010
why is this correct? Prove!
Two's Complement Operations
MIPS Notes
lb vs. lbu
signed load sign extends to fill 24 left bits
unsigned load fills left bits with 0s
slt & slti
compare signed numbers
sltu & sltiu
compare unsigned numbers, i.e., treat
both operands as non-negative

Perform add just as in junior school (carry/borrow 1s)
Examples (4-bits):
0101 0110 1011 1001 1111
0001 0101 0111 1010 1110
Do these sums now!! Remember all registers are 4-bit including result
register!
So you have to throw away the carry-out from the msb!!
Have to beware of overflow : if the fixed number of bits (4, 8,
16, 32, etc.) in a register cannot represent the result of the
operation
terminology alert: overflow does not mean there was a carry-out
from the msb that we lost (though it sounds like that!) it means
simply that the result in the fixed-sized register is incorrect
as can be seen from the above examples there are cases when the
result is correct even after losing the carry-out from the msb
Twos Complement Addition
Twos Complement Addition: Verifying
Carry/Borrow method
Two (n+1)-bit integers: X = x
n
X, Y = y
n
Y

x
n
= 0, y
n
= 0 ok not ok(overflow!)
x
n
= 1, y
n
= 0 ok ok
x
n
= 0, y
n
= 1 ok ok
x
n
= 1, y
n
= 1 not ok(overflow!) ok

Prove the cases above!
Prove if there is one more bit (total n+2 then) available for the
result then there is no problem with overflow in add!

Carry/borrow
add X + Y
0 s X + Y < 2
n
(no CarryIn to last bit)
2
n
s X + Y < 2
n+1
1
(CarryIn to last bit)
Now verify the negation shortcut!
consider X + (X +1) = (X + X) + 1:
associative law but what if there is overflow in one of the adds on
either side, i.e., the result is wrong!
think minint !
Examples:
0101 = 1010 + 1 = 1011
1100 = 0011 + 1 = 0100
1000 = 0111 + 1 = 1000

No overflow when adding a positive and a negative number
No overflow when subtracting numbers with the same sign
Overflow occurs when the result has wrong sign (verify!):

Operation Operand A Operand B Result
Indicating Overflow

A + B > 0 > 0 < 0
A + B < 0 < 0 > 0
A B > 0 < 0 < 0
A B < 0 > 0 > 0

Consider the operations A + B, and A B
can overflow occur if B is 0 ?
can overflow occur if A is 0 ?

Detecting Overflow
If an exception (interrupt) occurs
control jumps to predefined address for exception
interrupted address is saved for possible resumption

Details based on software system/language
SPIM: see the EPC and Cause registers

Don't always want to cause exception on overflow
add, addi, sub cause exceptions on overflow
addu, addiu, subu do not cause exceptions on overflow

Effects of Overflow
Review: Basic Hardware

c = a . b b a
0 0 0
0 1 0
0 0 1
1 1 1
b
a
c
b
a
c
a c
c = a + b b a
0 0 0
1 1 0
1 0 1
1 1 1
1 0
0 1
c = a a
a 0
b 1
c d
0
1
a
c
b
d
1. AND gate (c = a . b)
2. OR gate (c = a + b)
3. Inverter (c = a)
4. Multiplexor
(if d = = 0, c = a;
else c = b)
Problem: Consider logic functions with three inputs: A, B, C.
output D is true if at least one input is true
output E is true if exactly two inputs are true
output F is true only if all three inputs are true

Show the truth table for these three functions
Show the Boolean equations for these three
functions
Show an implementation consisting of inverters,
AND, and OR gates.
Review: Boolean Algebra & Gates
To warm up let's build a logic unit to support the and and
or instructions for MIPS (32-bit registers)
we'll just build a 1-bit unit and use 32 of them

Possible implementation using a multiplexor :
A Simple Multi-Function Logic Unit
a
b
output
operation
selector
Selects one of the inputs to be the output
based on a control input

Lets build our ALU using a MUX (multiplexor):

Implementation with a Multiplexor
b
0
1
Result
Operation
a
.
.
.
Not easy to decide the best way to implement something
do not want too many inputs to a single gate
do not want to have to go through too many gates (= levels)
for our purposes, ease of comprehension is important
Let's look at a 1-bit ALU for addition:

How could we build a 1-bit ALU for add, and, and or?
How could we build a 32-bit ALU?

Implementations
c
out
= a.b + a.c
in
+ b.c
in

sum = a.b.c
in
+ a.b.c
in
+
a.b.c
in
+ a.b.c
in

= a b c
in

Sum
CarryIn
CarryOut
a
b
exclusive or (xor)
1-bit Adder Logic
Half-adder with one xor gate
Full-adder from 2 half-adders and
an or gate
Half-adder with the xor gate replaced
by primitive gates using the equation
AB = A.B +A.B
xor
Building a 32-bit ALU
b
0
2
Result
Operation
a
1
CarryIn
CarryOut
Result31
a31
b31
Result0
CarryIn
a0
b0
Result1
a1
b1
Result2
a2
b2
Operation
ALU0
CarryIn
CarryOut
ALU1
CarryIn
CarryOut
ALU2
CarryIn
CarryOut
ALU31
CarryIn
Ripple-Carry Logic for 32-bit ALU
1-bit ALU for AND, OR and add
Multiplexor control
line
Two's complement approach: just negate b and add.
How do we negate?
recall negation shortcut : invert each bit of b and set CarryIn to least
significant bit (ALU0) to 1
What about Subtraction (a b) ?
0
2
Result
Operation
a
1
CarryIn
CarryOut
0
1
Binvert
b
Tailoring the ALU to MIPS:
Test for Less-than and Equality
Need to support the set-on-less-than instruction
e.g., slt $t0, $t3, $t4
remember: slt is an R-type instruction that produces 1 if rs < rt
and 0 otherwise
idea is to use subtraction: rs < rt rs rt < 0. Recall msb of
negative number is 1
two cases after subtraction rs rt:
if no overflow then rs < rt most significant bit of rs rt = 1
if overflow then rs < rt most significant bit of rs rt = 0
why?
e.g., 5
ten
6
ten
= 0101 0110 = 0101 + 1010 = 1111 (ok!)
-7
ten
6
ten
= 1001 0110 = 1001 + 1010 = 0011 (overflow!)
therefore
set bit = msb of rs rt overflow bit
where set bit, which is output from ALU31, gives the result of slt
Fig. 4.17(lower) indicates set bit is the adder output not correct !!
set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other
Less bits are hardwired 0; so Less is the 32-bit result of slt

Supporting slt
0
3
Result
Operation
a
1
CarryIn
CarryOut
0
1
Binvert
b 2
Less
0
3
Result
Operation
a
1
CarryIn
0
1
Binvert
b 2
Less
Set
Overflow
detection
Overflow
a.
b.
Set
a31
0
ALU0 Result0
CarryIn
a0
Result1
a1
0
Result2
a2
0
Operation
b31
b0
b1
b2
Result31
Overflow
Binvert
CarryIn
Less
CarryIn
CarryOut
ALU1
Less
CarryIn
CarryOut
ALU2
Less
CarryIn
CarryOut
ALU31
Less
CarryIn
1- bit ALU for the 31 least significant bits
1-bit ALU for the most significant bit
Extra set bit, to be routed to the Less input of the least significant 1-bit
ALU, is computed from the most significant Result bit and the Overflow bit
(it is not the output of the adder as the figure seems to indicate)
Less input of
the 31 most
significant ALUs
is always 0
32-bit ALU from 31 copies of ALU at top left and 1 copy
of ALU at bottom left in the most significant position
Tailoring the ALU to MIPS:
Test for Less-than and Equality
What about logic for the overflow bit ?
overflow bit = carry in to msb carry out of msb
verify!
logic for overflow detection therefore can be put in to ALU31
Need to support test for equality
e.g., beq $t5, $t6, $t7
use subtraction: rs - rt = 0 rs = rt
do we need to consider overflow?

Supporting
Test for Equality

Set
a31
0
Result0
a0
Result1
a1
0
Result2
a2
0
Operation
b31
b0
b1
b2
Result31
Overflow
Bnegate
Zero
ALU0
Less
CarryIn
CarryOut
ALU1
Less
CarryIn
CarryOut
ALU2
Less
CarryIn
CarryOut
ALU31
Less
CarryIn
ALU Result
Zero
Overflow
a
b
ALU operation
CarryOut
ALU
control
lines

Bneg- Oper- Func-
ate ation tion

0 00 and
0 01 or
0 10 add
1 10 sub
1 11 slt

Symbol representing ALU
Output is 1 only if all Result bits are 0
Combine CarryIn
to least significant
ALU and Binvert to
a single control line
as both are always
either 1 or 0
32-bit MIPS ALU
Conclusion
We can build an ALU to support the MIPS instruction set
key idea: use multiplexor to select the output we want
we can efficiently perform subtraction using twos complement
we can replicate a 1-bit ALU to produce a 32-bit ALU
Important points about hardware
all gates are always working
speed of a gate depends number of inputs (fan-in) to the gate
speed of a circuit depends on number of gates in series
(particularly, on the critical path to the deepest level of logic)
Speed of MIPS operations
clever changes to organization can improve performance
(similar to using better algorithms in software)
well look at examples for addition, multiplication and division
I s a 32-bit ALU as fast as a 1-bit ALU? Why?
I s there more than one way to do addition? Yes:
one extreme: ripple-carry carry ripples through 32 ALUs, slow!
other extreme: sum-of-products for each CarryIn bit super fast!
CarryIn bits:
c
1
= b
0
.c
0
+ a
0
.c
0
+

a
0
.b
0

c
2
= b
1
.c
1
+ a
1
.c
1
+

a
1
.b
1

= a
1
.a
0
.b
0
+ a
1
.a
0
.c
0
+ a
1
.b
0
.c
0
(substituting for c
1
)

+ b
1
.a
0
.b
0
+ b
1
.a
0
.c
0
+ b
1
.b
0
.c
0
+ a
1
.b
1

c
3
= b
2
.c
2
+ a
2
.c
2
+

a
2
.b
2

= = sum of 15 4-term products
How fast? But not feasible for a 32-bit ALU! Why? Exponential
complexity!!

Problem: Ripple-carry Adder is Slow
Note: c
i
is CarryIn bit into i th ALU;
c
0
is the forced CarryIn into the
least significant ALU

An approach between our two extremes
Motivation:
if we didn't know the value of a carry-in, what could we do?
when would we always generate a carry? (generate) g
i
= a
i
. b
i

when would we propagate the carry? (propagate) p
i
= a
i
+ b
i

Express (carry-in equations in terms of generate/propagates)
c
1
= g
0
+ p
0
.c
0

c
2
= g
1
+ p
1
.c
1
= g
1
+ p
1
.g
0
+ p
1
.p
0
.c
0
c
3
= g
2
+ p
2
.c
2
= g
2
+ p
2
.g
1
+ p
2
.p
1
.g
0
+

p
2
.p
1
.p
0
.c
0
c
4
= g
3
+ p
3
.c
3
= g
3
+ p
3
.g
2
+ p
3
.p
2
.g
1
+

p
3
.p
2
.p
1
.g
0
+

p
3
.p
2
.p
1
.p
0
.c
0

Feasible for 4-bit adders with wider adders unacceptable
complexity.
solution: build a first level using 4-bit adders, then a second level on
top

Two-level Carry-lookahead Adder: First
Level
Two-level Carry-lookahead Adder:
Second Level for a 16-bit adder
Propagate signals for each of the four 4-bit adder blocks:
P
0
= p
3
.p
2
.p
1
.p
0
P
1
= p
7
.p
6
.p
5
.p
4
P
2
= p
11
.p
10
.p
9
.p
8
P
3
= p
15
.p
14
.p
13
.p
12

Generate signals for each of the four 4-bit adder blocks:
G
0
= g
3
+ p
3
.g
2
+ p
3
.p
2
.g
1
+ p
3
.p
2
.p
1
.g
0

G
1
= g
7
+ p
7
.g
6
+ p
7
.p
6
.g
5
+ p
7
.p
6
.p
5
.g
4
G
2
= g
11
+ p
11
.g
10
+ p
11
.p
10
.g
9
+ p
11
.p
10
.p
9
.g
8
G
3
= g
15
+ p
15
.g
14
+ p
15
.p
14
.g
13
+ p
15
.p
14
.p
13
.g
12
CarryIn signals for each of the four 4-bit adder blocks (see
earlier carry-in equations in terms of generate/propagates):

C
1
= G
0
+ P
0
.c
0

C
2
= G
1
+ P
1
.G
0
+ P
1
.P
0
.c
0
C
3
= G
2
+ P
2
.G
1
+ P
2
.P
1
.G
0
+

P
2
.P
1
.P
0
.c
0
C
4
= G
3
+ P
3
.G
2
+ P
3
.P
2
.G
1
+

P
3
.P
2
.P
1
.G
0
+

P
3
.P
2
.P
1
.P
0
.c
0

C a r r y I n
R e s u l t 0 - - 3
C a r r y I n
R e s u l t 4 - - 7
C a r r y I n
R e s u l t 8 - - 1 1
C a r r y I n
C a r r y O u t
R e s u l t 1 2 - - 1 5
C a r r y I n
C 1
C 2
C 3
C 4
P 0
G 0
P 1
G 1
P 2
G 2
P 3
G 3

a 0
b 0
a 1
b 1
a 2
b 2
a 3
b 3
a 4
b 4
a 5
b 5
a 6
b 6
a 7
b 7
a 8
b 8
a 9
b 9
a 1 0
b 1 0
a 1 1
b 1 1
a 1 2
b 1 2
a 1 3
b 1 3
a 1 4
b 1 4
a 1 5
b 1 5
L
o
g
i
c

t
o

c
o
m
p
u
t
e

C
1
,

C
2
,

C
3
,

C
4

4bAdder0
4bAdder1
4bAdder2
4bAdder3
16-bit carry-lookahead adder from four 4-bit
adders and one carry-lookahead unit
Carry-lookahead Logic
ALU0
ALU1
ALU2
ALU3
a1
a0
a2
a3
b1
b0
b2
b3
s0
s1
s2
s3
L
o
g
i
c

t
o

c
o
m
p
u
t
e

c
1
,

c
2
,

c
3
,

c
4
,

P
0
,

G
0

Blow-up of 4-bit adder:
(conceptually) consisting of
four 1-bit ALUs plus logic to
compute all CarryOut bits
and one super generate and
one super propagate bit.
Each 1-bit ALU is exactly as
for ripple-carry except c1, c2,
c3 for ALUs 1, 2, 3 comes
from the extra logic
CarryIn
Carry-lookahead Unit

Two-level carry-lookahead logic steps:
1. compute p
i
s and g
i
s at each 1-bit ALU
2. compute P
i
s and G
i
s at each 4-bit adder unit
3. compute C
i
s in carry-lookahead unit
4. compute c
i
s at each 4-bit adder unit
5. compute results (sum bits) at each 1-bit ALU
E.g., add using carry-lookahead logic:
0001 1010 0011 0011
1110 0101 1110 1011
Compare times for ripple-carry vs. carry-lookahead for a 16-bit
adder assuming unit delay at each gate

Multiply
Grade school shift-add method:
Multiplicand 1000
Multiplier 1001
1000
0000
0000
1000
Product 01001000
m bits x n bits = m+n bit product
Binary makes it easy:
multiplier bit 1 => copy multiplicand (1 x multiplicand)
multiplier bit 0 => place 0 (0 x multiplicand)
3 versions of multiply hardware & algorithm:

x
Shift-add Multiplier Version 1

64-bit ALU
Control test
Multiplier
Shift right
Product
Write
Multiplicand
Shift left
64 bits
64 bits
32 bits
Done
1. Test
Multiplier0
1a. Add multiplicand to product and
place the result in Product register
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition?
Start
Multiplier0 = 0 Multiplier0 = 1
No: < 32 repetitions
Yes: 32 repetitions Multiplicand register, product register, ALU are
64-bit wide; multiplier register is 32-bit wide
Algorithm
32-bit multiplicand starts at right half of multiplicand register
Product register is initialized at 0
Shift-add Multiplier Version1
Done
1. Test
Multiplier0
32nd repetition?
Start
Yes: 32 repetitions
Itera Step Multiplier Multiplicand Product
-tion
0 init 0011 0000 0010 0000 0000
values
1 1a 0011 0000 0010 0000 0010
2 0011 0000 0100 0000 0010
3 0001 0000 0100 0000 0010
2
Example: 0010 * 0011:
Algorithm
Observations on Multiply Version 1
1 step per clock cycle nearly 100 clock cycles to multiply two
32-bit numbers
Half the bits in the multiplicand register always 0
64-bit adder is wasted
0s inserted to right as multiplicand is shifted left
least significant bits of product never
change once formed

Intuition: instead of shifting multiplicand to left, shift product to
right
Multiplier
Shift right
Write
32 bits
64 bits
32 bits
Shift right
Multiplicand
32-bit ALU
Product Control test
Done
1. Test
Multiplier0
1a. Add multiplicand to the left half of
the product and place the result in
the left half of the Product register
2. Shift the Product register right 1 bit
32nd repetition?
Start
Yes: 32 repetitions
Multiplicand register, multiplier register, ALU
are 32-bit wide; product register is 64-bit wide;
multiplicand adds to left half of product register
Algorithm
Product register is initialized at 0

Done
1. Test
Multiplier0
32nd repetition?
Start
Yes: 32 repetitions
Itera Step Multiplier Multiplicand Product
-tion
0 init 0011 0010 0000 0000
values
1 1a 0011 0010 0010 0000
2 0011 0010 0001 0000
3 0001 0010 0001 0000
2
Example: 0010 * 0011:
Algorithm
Observations on Multiply
Version 2
Each step the product register wastes space that exactly
matches the current size of the multiplier

Intuition: combine multiplier register and product register


Control
test Write
32 bits
64 bits
Shift right
Product
Multiplicand
32-bit ALU
Done
1. Test
Product0
32nd repetition?
Start
Product0 = 0 Product0 = 1
Yes: 32 repetitions
No separate multiplier register; multiplier
placed on right side of 64-bit product register
Algorithm
Product register is initialized with multiplier on right

Done
1. Test
Product0
32nd repetition?
Start
Yes: 32 repetitions
Itera Step Multiplicand Product
-tion
0 init 0010 0000 0011
values
1 1a 0010 0010 0011
2 0010 0001 0001
2
Example: 0010 * 0011:
Algorithm
Observations on Multiply
Version 3
2 steps per bit because multiplier & product combined
What about signed multiplication?
easiest solution is to make both positive and remember whether to
negate product when done, i.e., leave out the sign bit, run for 31
steps, then negate if multiplier and multiplicand have opposite
signs

Booths Algorithm is an elegant way to multiply signed numbers
using same hardware it also often quicker
Motivating Booths algorithm

Example 0010 * 0110. Traditional:
0010
0110
0000 shift (0 in multiplier)
0010 add (1 in multiplier)
0010 add (1 in multiplier)
00001100
Same example. But observe there are two successive 1s in
multiplier
0110 = 2
2
+ 2
1
= 2
3
2
1
, so can replace successive 1s by subtract
and then add:
0010
0110
-0010 sub (first 1 in multiplier)
0000 shift (middle of string of 1s)
0010 add (previous step had last 1)
00001100

x

Motivating Booths Algorithm

Math idea: string of 1s 01110 has

value the sum 2
n
+ 2
n-1
+ + 2
m
= 2
n+1
2
m

Replace a string of 1s in multiplier with an initial subtract when
we first see a one and then later add after the last one
What if the string of 1s started from the left of the (2s complement) number,
e.g., 11110001 would the formula above have to be modified?!
0 1 1 1 1 0
beginning of run end of run
middle of run
successive
1s
bit = 2
n
bit = 2
m
Booth from Multiply Version 3

Modify Step 1 of the algorithm Multiply Version 3 to consider 2 bits
of the multiplier: the current bit and the bit to the right (i.e., the
current bit of the previous step). Instead of two outcomes, now
there are four:

Case Current Bit Bit to the Right Explanation Example Op
1a 0 0 Middle of run of 0s 0001111000 none
1b 0 1 End of run of 1s 0001111000 add
1c 1 0 Begins run of 1s 0001111000 sub
1d 1 1 Middle of run of 1s 0001111000 none

Modify Step 2 of Multiply Version 3 to sign extend when the
product is shifted right (arithmetic right shift, rather than logical
right shift) because the product is a signed number
Now draw the flowchart for Booths algorithm !
Multiply Version 3 and Booth share the same hardware, except
Booth requires one extra flipflop to remember the bit to the right of
the current bit in the product register which is the bit pushed out
by the preceding right shift


Booth Example (2 x 7)
1c. 0010 1110 0111 0 shift P (sign ext)
2. 0010 1111 0011 1 11 -> nop
1d. 0010 1111 0011 1 shift P (sign ext)
2. 0010 1111 1001 1 11 -> nop
1d. 0010 1111 1001 1 shift P (sign ext)
2. 0010 1111 1100 1 01 -> add P = P + M
1b. 0010 0001 1100 1 shift P (sign ext)
2. 0010 0000 1110 0 done

Operation Multiplicand Product next?
0. initial value 0010 0000 0111 0 10 -> sub P = P - M
Booth Algorithm (2 * -3)
Operation Multiplicand Product next?
0.initial value 0010 0000 1101 0 10 -> sub P = P - M
1c. 0010 1110 1101 0 shift P (sign ext)
2. 0010 1111 0110 1 01 -> add P = P + M
1b. 0010 0001 0110 1 shift P (sign ext)
2. 0010 0000 1011 0 10 -> sub P = P - M
1c. 0010 1110 1011 0 shift P
2. 0010 1111 0101 1 11 -> nop
1d. 0010 1111 0101 1 shift P
2. 0010 1111 1010 1 done

Verifying Booths Algorithm
multiplier a = a
31
a
32
a
0
, multiplicand = b
a
i
a
i-1
Operation
0 0 nop
0 1 add b
1 0 sub b
1 1 nop
0, nop
I.e., if a
i-1
a
i
= +1, add b
1, sub b
Therefore, Booth computes sum:
(a
1
a
0
) * b * 2
0
+ (a
0
a
1
) * b * 2
1
+ (a
1
a
2
) * b * 2
2

+ (a
30
a
31
) * b * 2
31

= simplify telescopic sum!

MIPS Notes
MIPS provides two 32-bit registers Hi and Lo to hold a 64-bit
product
mult, multu (unsigned) put the product of two 32-bit register
operands into Hi and Lo: overflow is ignored by MIPS but can
be detected by programmer by examining contents of Hi
mflo, mfhi moves content of Hi or Lo to a general-purpose
register
Pseudo-instructions mul (without overflow), mulo (with
overflow), mulou (unsigned with overflow) take three 32-bit
register operands, putting the product of two registers into the
third

Divide
1001 Quotient
Divisor 1000 1001010 Dividend
1000
10
101
1010
1000
10 Remainder

Junior school method: see how big a multiple of the divisor can
be subtracted, creating quotient digit at each step
Binary makes it easy first, try 1 * divisor; if too big, 0 * divisor
Dividend = (Quotient * Divisor) + Remainder
3 versions of divide hardware & algorithm:
Divide Version 1

64-bit ALU
Control
test
Quotient
Shift left
Remainder
Write
Divisor
Shift right
64 bits
64 bits
32 bits
Done
Test Remainder
2a. Shift the Quotient register to the left,
setting the new rightmost bit to 1
3. Shift the Divisor register right 1 bit
33rd repetition?
Start
Remainder < 0
Yes: 33 repetitions
2b. Restore the original value by adding
the Divisor register to the Remainder
register and place the sum in the
Remainder register. Also shift the
Quotient register to the left, setting the
new least significant bit to 0
1. Subtract the Divisor register from the
Remainder register and place the
result in the Remainder register
Remainder > 0
Divisor register, remainder register, ALU are

64-bit wide; quotient register is 32-bit wide
Algorithm
32-bit divisor starts at left half of divisor register
Remainder register is initialized with the dividend at right
Why 33? We shall see later
Quotient register is
initialized to be 0
Divide Version 1

Done
Test Remainder
2a. Shift the Quotient register to the left,
setting the new rightmost bit to 1
3. Shift the Divisor register right 1 bit
33rd repetition?
Start
Remainder < 0
Yes: 33 repetitions
the Divisor register to the Remainder
register and place the sum in the
Remainder register. Also shift the
Quotient register to the left, setting the
new least significant bit to 0
Remainder > 0
Itera- Step Quotient Divisor Remainder

tion
0 init 0000 0010 0000 0000 0111
1 1 0000 0010 0000 1110 0111
2b 0000 0010 0000 0000 0111
3 0000 0001 0000 0000 0111
2
3
4
5

Example: 0111 / 0010:
Algorithm

Observations on Divide Version 1
Half the bits in divisor always 0
1/2 of 64-bit adder is wasted
1/2 of divisor register is wasted
Intuition: instead of shifting divisor to right, shift remainder to
left

Step 1 cannot produce a 1 in quotient bit as all bits
corresponding to the divisor in the remainder register are 0
(remember all operands are 32-bit)
Intuition: switch order to shift first and then subtract can save
1 iteration

Divide Version 2
Control
test
Quotient
Shift left
Write
32 bits
64 bits
32 bits
Shift left
Divisor
32-bit ALU
Remainder
Divisor register, quotient register,
ALU are 32-bit wide; remainder
register is 64-bit wide
Remainder register is initialized
with the dividend at right
D o n e . S h i f t l e f t h a l f o f R e m a i n d e r r i g h t 1 b i t
T e s t R e m a i n d e r
3 a . S h i f t t h e R e m a i n d e r r e g i s t e r t o t h e
l e f t, setting the new rightmost bit to 0.
3 2 n d r e p e t i t i o n ?
S t a r t
R e m a i n d e r < 0
N o : < 3 2 r e p e t i t i o n s
Y e s : 3 2 r e p e t i t i o n s
3 b . R e s t o r e t h e o r i g i n a l v a l u e b y a d d i n g
t h e D i v i s o r r e g i s t e r t o t h e l e f t h a l f o f t h e
R e m a i n d e r r e g i s t e r a n d p l a c e t h e s u m
i n t h e l e f t h a l f o f t h e R e m a i n d e r r e g i s t e r .
A l s o s h i f t t h e R e m a i n d e r r e g i s t e r t o t h e
l e f t , s e t t i n g t h e n e w r i g h t m o s t b i t t o 0
2 . S u b t r a c t t h e D i v i s o r r e g i s t e r f r o m t h e
l e f t h a l f o f t h e R e m a i n d e r r e g i s t e r a n d
p l a c e t h e r e s u l t i n t h e l e f t h a l f o f t h e
R e m a i n d e r r e g i s t e r
R e m a i n d e r 0
1 . S h i f t t h e R e m a i n d e r r e g i s t e r l e f t 1 b i t

>
Also shift the Quotient register to the left
setting to the new rightmost bit to 1.
Also shift the Quotient register to the left
setting to the new rightmost bit to 0.
Algorithm
Why this correction step? We shall see later

Each step the remainder register wastes space that exactly
matches the current size of the quotient

Intuition: combine quotient register and remainder register

Divide Version 3

Done. Shift left half of Remainder right 1 bit
Test Remainder
3a. Shift the Remainder register to the
left, setting the new rightmost bit to 1
32nd repetition?
Start
Remainder < 0
Yes: 32 repetitions
the Divisor register to the left half of the
Remainder register and place the sum
in the left half of the Remainder register.
Also shift the Remainder register to the
left half of the Remainder register and
place the result in the left half of the
Remainder register
Remainder 0
1. Shift the Remainder register left 1 bit
>
Write
32 bits
64 bits
Shift left
Shift right
Remainder
32-bit ALU
Divisor
Control
test
No separate quotient register; quotient
is entered on the right side of the 64-bit
remainder register
Algorithm
Remainder register is initialized with the dividend at right
Why this correction step? We shall see later
Divide Version 3

Done. Shift left half of Remainder right 1 bit
Test Remainder
3a. Shift the Remainder register to the
32nd repetition?
Start
Remainder < 0
Yes: 32 repetitions
the Divisor register to the left half of the
Remainder register and place the sum
in the left half of the Remainder register.
Also shift the Remainder register to the
left half of the Remainder register and
place the result in the left half of the
Remainder register
Remainder 0
1. Shift the Remainder register left 1 bit
>
Example: 0111 / 0010:
Itera- Step Divisor Remainder
tion
0 init 0010 0000 0111
1 0010 0000 1110
1 2 0010 1110 1110
3b 0010 0001 1100
2
3
4
Algorithm

Number of Iterations
Why the extra iteration in Version 1?
Why the final correction step in Versions 2 & 3?

shift1 shift2 shift32 shift33

sub1 sub2 sub3 sub32 sub33

V1 starts loop
here: unnecessary
sub step
Critical situation! Only the quotient shift is
necessary as it corresponds to the
outcome of the previous sub.
So V1 is ok even though the last divisor
shift is redundant, as final divisor is ignored
any way; V2 & 3 must repair remainder
as it has shifted left one time too many
V2 & 3 start
loop here
V2 & 3 initial step:
before loop starts
One loop iteration
Ovals represent loop iterations
Shift: see the version descriptions
for which registers are shifted
Main insight sub(i+1) must actually follow shifti of
the divisor (or remainder, depending on version) and
the resulting bit in the quotient appears on shift(i+1)
Same hardware as Multiply Version 3

Signed divide:
make both divisor and dividend positive and perform division
negate the quotient if divisor and dividend were of opposite
signs
make the sign of the remainder match that of the dividend
this ensures always
dividend = (quotient * divisor) + remainder
quotient (x/y) = quotient (x/y) (e.g. 7 = 3*2 + 1 & 7 = 3*2 1)

MIPS Notes
div (signed), divu (unsigned), with two 32-bit register
operands, divide the contents of the operands and put
remainder in Hi register and quotient in Lo; overflow is ignored
in both cases

pseudo-instructions div (signed with overflow), divu
(unsigned without overflow) with three 32-bit register operands
puts quotients of two registers into third

Floating Point
We need a way to represent
numbers with fractions, e.g., 3.1416
very small numbers (in absolute value), e.g., .00000000023
very large numbers (in absolute value) , e.g., 3.15576 * 10
46
Representation:
scientific: sign, exponent, significand form:
(1)
sign
* significand * 2
exponent
. E.g., 101.001101 * 2
111001
more bits for significand gives more accuracy
more bits for exponent increases range
if 1 s significand < 10
two
(=2
ten
) then number is normalized, except
for number 0 which is normalized to significand 0
E.g., 101.001101 * 2
111001
= 1.01001101 * 2
111011
(normalized)
binary point
IEEE 754 Floating-point Standard
IEEE 754 floating point standard:
single precision: one word

double precision: two words

31
sign
bits 30 to 23
8-bit exponent
bits 22 to 0
23-bit significand
31
sign
bits 30 to 20
11-bit exponent
bits 19 to 0
upper 20 bits of 52-bit significand
bits 31 to 0
lower 32 bits of 52-bit significand
Sign bit is 0 for positive numbers, 1 for negative numbers

Number is assumed normalized and leading 1 bit of significand
left of binary point (for non-zero numbers) is assumed and not
shown
e.g., significand 1.1001 is represented as 1001,
exception is number 0 which is represented as all 0s (see next
slide)
for other numbers:
value = (1)
sign
* (1 + significand) * 2
exponent value

Exponent is biased to make sorting easier
all 0s is smallest exponent, all 1s is largest
bias of 127 for single precision and 1023 for double precision
therefore, for non-0 numbers:
value = (1)
sign
(exponent bias)

equals exponent value
Special treatment of 0:
if exponent is all 0 and significand is all 0, then the value is
0 (sign bit may be 0 or 1)
if exponent is all 0 and significand is not all 0, then the value is
(1)
sign
-127
therefore, all 0s is taken to be 0 and not 2
-127
(as would be for a non-zero
normalized number); similarly, 1 followed by all 0s is taken to be 0 and
not
- 2
-127

Example : Represent 0.75
ten
in IEEE 754 single precision

decimal: 0.75 = 3/4 = 3/2
2

binary: 11/100 = .11 = 1.1 x 2
-1
IEEE single precision floating point exponent = bias + exponent
value
= 127 + (-1) = 126
ten
= 01111110
two
IEEE single precision: 10111111010000000000000000000000

sign
exponen
t
significand

Floating Point
Addition Algorithm:

Done
2. Add the significands
4. Round the significand to the appropriate
number of bits
Still normalized?
Start
Yes
No
No
Yes
Overflow or
underflow?
Exception
3. Normalize the sum, either shifting right and
incrementing the exponent or shifting left
and decrementing the exponent
1. Compare the exponents of the two numbers.
Shift the smaller number to the right until its
exponent would match the larger exponent
Floating Point
Addition
Hardware:
0 1 0 1 0 1
Control
Small ALU
Big ALU
Sign Exponent Significand Sign Exponent Significand
Exponent
difference
Shift right
Shift left or right
Rounding hardware
Sign Exponent Significand
Increment or
decrement
0 1 0 1
Shift smaller
number right
Compare
exponents
Add
Normalize
Round
Floating Point
Multpication
Algorithm:
2. Multiply the significands
4. Round the significand to the appropriate
number of bits
Still normalized?
Start
Yes
No
No
Yes
Overflow or
underflow?
Exception
3. Normalize the product if necessary, shifting
it right and incrementing the exponent
1. Add the biased exponents of the two
numbers, subtracting the bias from the sum
to get the new biased exponent
Done
5. Set the sign of the product to positive if the
signs of the original operands are the same;
if they differ make the sign negative
Floating Point Complexities
In addition to overflow we can have underflow (number too
small)
Accuracy is the problem with both overflow and underflow
because we have only a finite number of bits to represent
numbers that may actually require arbitrarily many bits
limited precision rounding rounding error
IEEE 754 keeps two extra bits, guard and round
four rounding modes
positive divided by zero yields infinity
zero divide by zero yields not a number
other complexities
Implementing the standard can be tricky
Not implementing the standard can be even worse
see text for discussion of Pentium bug!
MIPS Floating Point
MIPS has a floating point coprocessor (numbered 1, SPIM) with
thirty-two 32-bit registers $f0 - $f31. Two of these are required to
hold doubles. Floating point instructions must use only even-
numbered registers (including those operating on single floats).
SPIM simulates MIPS floating point.

Floating point arithmetic: add.s (single addition), add.d (double
addition), sub.s, sub.d, mul.s, mul.d, div.s, div.d

Floating point comparison: c.x.s (single), c.x.d (double),
where x may be eq, neq, lt, le, gt, ge

Other instructions

Summary
Computer arithmetic is constrained by limited precision
Bit patterns have no inherent meaning but standards do exist:
twos complement
IEEE 754 floating point
Computer instructions determine meaning of the bit patterns.
Performance and accuracy are important so there are many
complexities in real machines (i.e., algorithms and
implementation)

Read Computer Arithmetic Algorithms by I. Koren
it is easy-to-read and shows new algorithms for arithmetic
there will be assignment and projects based on Korens material

TU/e Processor
Design 5Z032
72
Agenda
Arithmetic
Signed and unsigned numbers
Addition and Subtraction
Logical operations
ALU: arithmetic and logic unit
Multiply
Divide
Floating Point
notation
add
multiply

73
32
32
32
operation
result
a
b
ALU
Arithmetic
Where we've been:
Performance (seconds, cycles, instructions)
Abstractions:
Instruction Set Architecture
Assembly Language and Machine Language
Implementing the Architecture
74
Bits have no inherent meaning (no semantics)
Decimal number system, e.g.:
4382 = 4x10
3
+ 3x10
2
+ 8x10
1
+ 2x10
0

Can use arbitrary base g; value of digit c at position i:
c x g
i

Binary numbers (base 2)
n-1 n-2 1 0

a
n-1
a
n-2

a
1
a
0

2
n-1
2
n-2

2
1
2
0

(a
n-1
a
n-2
... a
1
a
0
)
two
= a
n-1
x 2
n-1
+ a
n-2
x 2
n-2
+ + a
0
x 2
0

Binary numbers (1)
position

digit

weight
75
Binary numbers (2)
So far numbers are unsigned
With n bits 2
n
possible combinations

a
0
: least significant bit (lsb)
a
n-1
: most significant bit (msb)
1 bit 2 bits 3 bits 4 bits decimal value
0 00 000 0000 0
1 01 001 0001 1
10 010 0010 2
11 011 0011 3
100 0100 4
101 0101 5
110 0110 6
111 0111 7
1000 8
1001 9
TU/e Processor
Design 5Z032
76
Binary numbers (base 2)
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...
decimal: 0...2
n
-1
Of course it gets more complicated:
- numbers are finite (overflow)
- fractions and real numbers
- negative numbers
e.g., no MIPS subi instruction;
However, addi can add a negative number

How do we represent negative numbers?
i.e., which bit patterns will represent which
numbers?
Binary numbers (3)
77
Conversion
Decimal -> binary
Divide by 2 Remainder

Hexadecimal: base 16. Octal: base 8
1010 1011 0011 1111
two
= ab3f
hex

4382
2191 0
1095 1
547 1
273 1
136 1
68 0
34 0
17 0
8 1
4 0
2 0
1 0
0 1
4382
ten
=
1 0001 0001 1110
two

Sign Magnitude: One's Complement Two's
Complement
000 = +0 000 = +0 000 = +0
001 = +1 001 = +1 001 = +1
010 = +2 010 = +2 010 = +2
011 = +3 011 = +3 011 = +3
100 = -0 100 = -3 100 = -4
101 = -1 101 = -2 101 = -3
110 = -2 110 = -1 110 = -2
111 = -3 111 = -0 111 = -1

Signed binary numbers
Possible representations:
32 bit signed numbers:

0000 0000 0000 0000 0000 0000 0000 0000
two
= 0
ten
0000 0000 0000 0000 0000 0000 0000 0001
two
= + 1
ten
0000 0000 0000 0000 0000 0000 0000 0010
two
= + 2
ten
...

0111 1111 1111 1111 1111 1111 1111 1110
two
= + 2,147,483,646
ten
0111 1111 1111 1111 1111 1111 1111 1111
two
= + 2,147,483,647
ten
1000 0000 0000 0000 0000 0000 0000 0000
two
= 2,147,483,648
ten
1000 0000 0000 0000 0000 0000 0000 0001
two
= 2,147,483,647
ten
1000 0000 0000 0000 0000 0000 0000 0010
two
= 2,147,483,646
ten
...

1111 1111 1111 1111 1111 1111 1111 1101
two
= 3
ten
1111 1111 1111 1111 1111 1111 1111 1110
two
= 2
ten
1111 1111 1111 1111 1111 1111 1111 1111
two
= 1
ten

Range [-2
31
.. 2
31
-1]
(a
n-1
a
n-2
... a
1
a
0
)
2s-compl
= -a
n-1
x 2
n-1
+ a
n-2
x 2
n-2
+ + a
0
x
2
0
= - 2
n
+ a
n-1
x 2
n-1
+ + a
0
x 2
0

maxint
minint
Twos complement
Negating a two's complement number: invert all
bits and add 1
remember: negate and invert are quite
different!

Proof:
a + a = 1111.1111b = -1 d = -a = a + 1
Converting n bit numbers into numbers with more than n bits:
MIPS 8 bit, 16 bit values / immediates converted to 32 bits
Copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
MIPS "sign extension" example instructions:
lb load byte (signed)
lbu load byte (unsigned)
slti set less than immediate (signed)
sltiu set less than immediate (unsigned)
Addition & Subtraction
Just like in grade school (carry/borrow 1s)
0111 0111 0110
+ 0110 - 0110 - 0101
Two's complement operations easy
subtraction using addition of negative numbers
0110 0110
- 0101 + 1010
Overflow (result too large for finite computer word):
e.g., adding two n-bit numbers does not yield an
n-bit number
0111
+ 0001 note that overflow term is somewhat
misleading,
1000 it does not mean a carry overflowed
No overflow when adding a positive and a negative number
No overflow when signs are the same for subtraction
Overflow occurs when the value affects the sign:
overflow when adding two positives yields a negative
or, adding two negatives gives a positive
or, subtract a negative from a positive and get a negative
or, subtract a positive from a negative and get a positive
Consider the operations A + B, and A B
Can overflow occur if B is 0 ?
Can overflow occur if A is 0 ?
Detecting Overflow
When an exception (interrupt) occurs:
Control jumps to predefined address for exception
(interrupt vector)
Interrupted address is saved for possible
resumption in exception program counter (EPC);
new instruction: mfc0
(move from coprocessor0)
Interrupt handler handles exception (part of OS).
registers $k0 and $k1 reserved for OS
Details based on software system / language
C ignores integer overflow; FORTRAN not
Don't always want to detect overflow
new MIPS instructions: addu, addiu, subu
note: addiu and sltiu still sign-extends!
Effects of Overflow
Logic operations
Sometimes operations on individual bits needed:
Logic operation C operation MIPS instruction
Shift left logical << sll
Shift right logical >> srl
Bit-by-bit AND & and, andi
Bit-by-bit OR | or, ori
and and andi can be used to turn off some bits;
or and ori turn on certain bits
Of course, AND en OR can be used for logic operations.
Note: Language Cs logical AND (&&) and OR (||) are
conditional
andi and ori perform no sign extension !
Given: 3-input logic function of A, B and C, 2-outputs
Output D is true if at least 2 inputs are true
Output E is true if odd number of inputs true

Give truth-table

Give logic equations

Give implementation with AND and OR gates, and
Inverters.
Exercise: gates
Let's build an ALU to support the andi and ori
instructions
we'll just build a 1 bit ALU, and use 32 of them

b
a
operation
result
An ALU (arithmetic logic unit)
Selects one of the inputs to be the output, based on a
control input

Lets build our ALU and use a MUX to select the
outcome for the chosen operation
S
C
A
B
0
1
Review: The Multiplexor
note: we call this a 2-input mux
even though it has 3 inputs!
Not easy to decide the best way to build
something
Don't want too many inputs to a single gate
Dont want to have to go through too many gates
For our purposes, ease of comprehension is
important
Let's look at a 1-bit ALU for addition (= full-adder):
Different Implementations
c
out
= a b + a c
in
+ b c
in
sum = a xor b xor c
in
CarryOut
CarryIn
Sum
a
b
+
Building a 32 bit ALU
b
0
2
Result
Operation
a
1
CarryIn
CarryOut
Result31
a31
b31
Result0
CarryIn
a0
b0
Result1
a1
b1
Result2
a2
b2
Operation
ALU0
CarryIn
CarryOut
ALU1
CarryIn
CarryOut
ALU2
CarryIn
CarryOut
ALU31
CarryIn
Two's complement approach: just negate b and add
How do we negate?
A very clever solution:
What about subtraction (a b) ?
0
2
Result
Operation
a
1
CarryIn
CarryOut
0
1
Binvert
b
Need to support the set-on-less-than instruction (slt)
remember: slt rd,rs,rt is an arithmetic instruction
produces a 1 if rs < rt and 0 otherwise
use subtraction: (a-b) < 0 implies a < b

Need to support test for equality
beq $t5, $t6, label
jump to label if $t5 = $t6
use subtraction: (a-b) = 0 implies a = b
Tailoring the ALU to the MIPS
Supporting Set on less than'
Can we figure out the idea?
0
3
Result
Operation
a
1
CarryIn
CarryOut
0
1
Binvert
b 2
Less
0
3
Result
Operation
a
1
CarryIn
0
1
Binvert
b 2
Less
Set
Overflow
detection
Overflow
a.
b.
bits 0-30
bit 31
Set
a31
0
ALU0 Result0
CarryIn
a0
Result1
a1
0
Result2
a2
0
Operation
b31
b0
b1
b2
Result31
Overflow
Binvert
CarryIn
Less
CarryIn
CarryOut
ALU1
Less
CarryIn
CarryOut
ALU2
Less
CarryIn
CarryOut
ALU31
Less
CarryIn
Supporting the slt operation
Test for equality
a-b = 0 a=b
Notice control
lines:

000 = and
001 = or
010 = add
110 = subtract
111 = slt

Note: signal Zero is a 1 when the
result is zero!
The Zero output is always calculated
Set
a31
0
Result0
a0
Result1
a1
0
Result2
a2
0
Operation
b31
b0
b1
b2
Result31
Overflow
Bnegate
Zero
ALU0
Less
CarryIn
CarryOut
ALU1
Less
CarryIn
CarryOut
ALU2
Less
CarryIn
CarryOut
ALU31
Less
CarryIn
ALU symbol
ALU
zero
result
overflow
operation
a
b
carry-out
32
32
32
97
Conclusions
We can build an ALU to support the MIPS instruction
set
key idea: use multiplexor to select the output we
want
we can efficiently perform subtraction using twos
complement
we can replicate a 1-bit ALU to produce a 32-bit
ALU
Important points about hardware
all of the gates are always working
not efficient from energy perspective !!
the speed of a gate is affected by the number of
connected outputs it has to drive (so-called Fan-Out)
the speed of a circuit is affected by the number of
gates in series (on the critical path or the deepest
level of logic)
Unit of measure: FO4 = inverter with Fan-Out of 4
P4 (heavily superpipelined) has about 15 FO4 critical
path

Conclusions
Is a 32-bit ALU as fast as a 1-bit ALU?
Is there more than one way to do addition?
Two extremes: ripple carry and sum-of-products
How many logic layers do we need for these two
extremes?

Can you see the ripple? How could you get rid of it?

c
1
= b
0
c
0
+ a
0
c
0
+

a
0
b
0
c
2
= b
1
c
1
+ a
1
c
1
+

a
1
b
1
c
2
= (..subst c
1
..)

c
3
= b
2
c
2
+ a
2
c
2
+

a
2
b
2
c
3
=

c
4
= b
3
c
3
+ a
3
c
3
+

a
3
b
3
c
4
=

Not feasible! Why not?

Problem: Ripple carry adder is slow
An approach in-between our two extremes
Motivation:
If we didn't know the value of carry-in, what could
we do?
When would we always generate a carry?
g
i
= a
i
b
i

When would we propagate the carry?
p
i
= a
i
+ b
i

Carry-lookahead adder (1)
Cin
Cout
Cout = Gi + Pi Cin
a
b
Did we get rid of the ripple?

c
1
= g
0
+ p
0
c
0

c
2
= g
1
+ p
1
c
1
c
2
= g
1
+ p
1
(g
0
+ p
0
c
0
)

c
3
= g
2
+ p
2
c
2
c
3
=

c
4
= g
3
+ p
3
c
3
c
4
=
Feasible ?

a0
b0
a1
b1
a2
b2
a3
b3
Cin
P0
G0
ALU
Result0-3
4
P0 = p
0
.p
1
.p
2
.p
3

G0= g
3
+(p
3
.g
2
)+(p
3
.p
2
.g
1
)+(p
3
.p
2
.p
1
.g
0
)
Use principle to build
bigger adders
Cant build a 16 bit adder
this way... (too big)
Could use ripple carry of
4-bit CLA adders
Better: use the CLA
principle again!

CarryIn
Result0--3
ALU0
CarryIn
Result4--7
ALU1
CarryIn
Result8--11
ALU2
CarryIn
CarryOut
Result12--15
ALU3
CarryIn
C1
C2
C3
C4
P0
G0
P1
G1
P2
G2
P3
G3
pi
gi
pi + 1
gi + 1
ci + 1
ci + 2
ci + 3
ci + 4
pi + 2
gi + 2
pi + 3
gi + 3
a0
b0
a1
b1
a2
b2
a3
b3
a4
b4
a5
b5
a6
b6
a7
b7
a8
b8
a9
b9
a10
b10
a11
b11
a12
b12
a13
b13
a14
b14
a15
b15
Carry-lookahead unit
More complicated than addition
accomplished via shifting and addition
More time and more area
Let's look at 3 versions based on gradeschool
algorithm
0010 (multiplicand)
__*_1011 (multiplier)
Negative numbers: convert and multiply
there are better techniques, we wont look at them
now
Multiplication (1)
Multiplication (2)
Done
1. Test
Multiplier0
32nd repetition?
Start
Yes: 32 repetitions
64-bit ALU
Control test
Multiplier
Shift right
Product
Write
Multiplicand
Shift left
64 bits
64 bits
32 bits
First implementation
Product initialized to 0
105
Multiplication (3)
Multiplier
Shift right
Write
32 bits
64 bits
32 bits
Shift right
Multiplicand
32-bit ALU
Product Control test
Done
1. Test
Multiplier0
32nd repetition?
Start
Yes: 32 repetitions
Second version
Multiplication (4)
Control
test Write
32 bits
64 bits
Shift right
Product
Multiplicand
32-bit ALU
Done
1. Test
Product0
32nd repetition?
Start
Yes: 32 repetitions
Final version
Product initialized with multiplier
Fast multiply: Booths Algorithm
Exploit the fact that: 011111 = 100000 - 1
Therefore we can replace multiplier, e.g.:

0001111100 = 0010000000 - 100
Rules:
Current
bit
Bit to the
right
Explanation Operation
1 0 Begin 1s Subtract
multiplicand
1 1 Middle of 1s nothing
0 1 End of 1s Add
multiplicand
0 0 Middle of 0s nothing
Booths Algorithm (2)
Booths algorithm works for signed 2s
complement as well (without any modification)
Proof: lets multiply b * a
(a
i-1
- a
i
) indicates what to do:

0 : do nothing
+1: add b
-1 : subtract
We get b*a =
(
- + - -
= - -
=
=
i
i
i
i
i
i
i
a a b
b a a
2 2
2 ) (
30
0
31
31
31
0
1
This is exactly what we need !
Division
Similar to multiplication: repeated subtract

Divide (1)
Well known algorithm:
Dividend
Divisor 1000/1001010\1001 Quotient
-1000
10
101
1010
-1000
10 Remainder
Division (2)
Implementation:
6 4 - b i t A L U
C o n t r o l t e s t
W r i t e
6 4 b i t s
6 4 b i t s
3 2 b i t s
Divisor
Shift right
Remainder
Quotient
Shift left
Start
1. Substract the Divisor register from the
Test Remainder
2.a Shift the Quotient register
to the left, setting the
rightmost bit to 1
2.b Restore the original value by
adding the Divisor register. Also,
shift a 1 into the Quotient register
Shift Divisor Register right 1 bit
Done
33rd repetition?
>= 0 < 0
yes
no
Multiply / Divide in MIPS
MIPS provides a separate pair of 32-bit registers for
the result of a multiply and divide: Hi and Lo

mult $s2,$s3 # Hi,Lo = $s2 * $s3
div $s2,$s3 # Hi,Lo = $s2 mod $s3,
$s2 / $s3

Copy result to general purpose register
mfhi $s1 # $s1 = Hi
mflo $s1 # $s1 = Lo

There are also unsigned variants of mult and div:
multu and divu
Shift instructions

sll
srl
sra

Why not sla instruction ?

Shift: a quick way to multiply and divide with power of 2
(strength reduction). Is this always allowed?
Floating Point (a brief look)
We need a way to represent
numbers with fractions, e.g., 3.1416
very small numbers, e.g., .000000001
very large numbers, e.g., 3.15576 10
9
Representation:
sign, exponent, significand: (1)
sign
significand
2
exponent
more bits for significand gives more accuracy
more bits for exponent increases range
IEEE 754 floating point standard:
single precision : 8 bit exponent, 23 bit significand
double precision: 11 bit exponent, 52 bit significand
IEEE 754 floating-point standard
Leading 1 bit of significand is implicit
Exponent is biased to make sorting easier
all 0s is smallest exponent all 1s is largest
bias of 127 for single precision and 1023 for double
precision
summary: (1)
sign
(1+significand) 2
exponent bias

Example:

decimal: -.75 = -3/4 = -3/2
2

binary : -.11 = -1.1 x 2
-1
floating point: exponent = -1+bias = 126 = 01111110
IEEE single precision:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Floating Point Complexities
Operations more complicated: align, renormalize, ...
In addition to overflow we can have underflow
Accuracy can be a big problem
IEEE 754 keeps two extra bits, guard and round,
and additional sticky bit (indicating if one of the
remaining bits unequal zero)
four rounding modes
positive divided by zero yields infinity
zero divide by zero yields not a number
other complexities
Implementing the standard can be tricky
Not using the standard can be even worse
see text for description of 80x86 and Pentium
bug!
Floating Point on MIPS
Separate register file for floats: 32 single precision registers; can
be used as 16 doubles
MIPS-1 floating point instruction set (pg 288/291)
addition add.f (f =s (single) or f=d (double))
subtraction sub.f
multiplication mul.f
division div.f
comparison c.x.f where x=eq, neq, lt, le, gt or ge
sets a bit in (implicit) condition reg. to true or false
branch bc1t (branch if true) and bclf (branch if false)
c1 means instruction from coprocessor one !
load and store: lwc1, swc1
Study examples on page 293, and 294-296
Floating Point on MIPS
MIPS has 32 single-precision FP registers ($f0,$f1,
,$f31) or 16 double-precision ($f0,$f2,...)
MIPS FP instructions:
FP add single add.s $f0,$f1,$f2 $f0 = $f1+$f2
FP substract single sub.s $f0,$f1,$f2 $f0 = $f1-$f2
FP multiply single mul.s $f0,$f1,$f2 $f0 = $f1x$f2
FP divide single div.s $f0,$f1,$f2 $f0 = $f1/$f2
FP add double add.d $f0,$f2,$f4 $f0 = $f2+$f4
FP substract double sub.d $f0,$f2,$f4 $f0 = $f2-$f4
FP multiply double mul.d $f0,$f2,$f4 $f0 = $f2x$f4
FP divide double div.d $f0,$f2,$f4 $f0 = $f2/$f4
load word coprocessor 1 lwc1 $f0,100($s1) $f0 = Memory[$s1+100]
store word coprocessor 1 swc1 $f0,100($s1) Memory[$s1+100] = $f0
branch on copr.1 true bc1t 25 if (cond) goto PC+4+100
branch on copr.1 false bc1f 25 if (!cond) goto PC+4+100
FP compare single c.lt.s $f0,$f1 cond = ($f0 < $f1)
FP compare double c.ge.d $f0,$f2 cond = ($f0 >= $f2)
Conversion: decimal IEEE 754 FP
Decimal number (base 10)
123.456 = 1x10
2
+2x10
1
+3x10
0
+4x10
-1
+5x10
-2
+6x10
-3
Binary number (base 2)

101.011 = 1x2
2
+0x2
1
+1x2
0
+0x2
-1
+1x2
-2
+1x2
-3
Example conversion: 5.375
Multiply with power of 2, to get rid of fraction:
5.375 = 5.375x16 / 16 = 86 x 2
-4

Convert to binary, and normalize to 1.xxxxx
86 x 2
-4
= 1010110 x 2
-4
= 1.01011 x 2
2
Add bias (127 for single precision) to exponent:
exponent field = 2 + 127 = 129 = 1000 0001
IEEE single precision format (remind the leading 1
bit):

sign exponent significand
0 10000001 01011000000000000000000
Floating point on Intel 80x86
8087 coprocessor announced in 1980 as an
extension of the 8086
(similar 80287, 80387)
80 bit internal precision (extended double format)
8 entry stack architecture
addressing modes:
one operand = TOS
other operand is TOS, ST(i) or in Memory
Four types of instructions:
data transfer, arithmetic, compare
transcendental (tan, sin, cos, arctan, exponent,
logarithm)
Fallacies and pitfalls
Associative law does not hold:

(x+y) + z is not always equal x + (y+z)

Summary
Computer arithmetic is constrained by limited precision
Bit patterns have no inherent meaning but standards do
exist
twos complement
IEEE 754 floating point

Computer instructions determine meaning of the bit
patterns

Performance and accuracy are important so there are many
complexities in real machines (i.e., algorithms and
implementation).

We are ready to move on (and implement the processor)

Case Study
MIPS Instruction Set Processor
Introduction
Starting point:
The specification of the MIPS instruction set drives the design of the
hardware.
Will restrict design to integer type instructions
Identify common functions to all instructions, and within instruction
classes easy to do in a RISC architecture
Instruction fetch
Access one or more registers
Use ALU
Asserted signals a high or low level of a signal which implies a logically
true condition an action level. The text will only assert a logically
high level, ie., a 1.
Clocking
Assume edge triggered clocking (as opposed to level sensitive).
A storage circuit or flip-flop stores a value on the clock transition
edge.
Model is flip-flops with combinational logic between them
Propagation delay through combinations logic between storage
elements determines clock cycle length.
Single clock cycle vs. multi-clock cycle design approach
Single Versus Multi-clock Cycle Design
Start out with a single long clock cycle for each instruction .
Entire instruction gets executed in a single clock pulse
Controller is pure combinational logic
Design is simple
You would think that a single clock cycle per instruction
execution would give us super high performance but not so:
Slowest instruction determines speed of all instructions.
Because various phases of the instructions need the same
hardware resource, and all is needed at the same time (clock
pulse)
Some hardware is redundant another disadvantage of
single phase
Examples:
2 memories: instruction and data memory
2 adders and an ALU
Design Summary
Has a performance bottleneck
The clock cycle time is determined by the longest path
in the machine
The simple jmp instruction will take as long as the load
word (lw)
The instruction which uses the longest data path
dictates the time for all others.
What about a variable time clock design?
Still a single clock
Clock pulse interval is a function of the opcode
Average time for instruction theoretically improves
But
It difficult to implement - lots of overhead to overcome

Lets start simple with a single clock cycle design for
simplicity reasons and later convert to multi-clock cycle.

Basic Abstract View of the Data Path
Shows common functions for most instructions
Registers
Register #
Data
Register #
Data
memory
Address
Data
Register #
PC Instruction ALU
Instruction
memory
Address
Data Path for Instruction Fetching
PC
Instruction
memory
Read
address
Instruction
4
Add
Basic Data Path for R-type Instruction
Red lines are for control signals
generated by the controller
Instruction
Registers
Write
register
Read
data 1
Read
data 2
Read
register 1
Read
register 2
Write
data
ALU
result
ALU
Zero
RegWrite
ALU operation
3
Adding the Data Path for lw & sw Instruction
Implements:
lw $t1, offset_value($t2)
sw $t1, offset_value($t2)
The offset value is a 16-bit signed immediate
field & must be sign extended to 32 bits
Immediate
offset data

Instruction
16 32
Registers
Wri te
register
Read
data 1
Read
data 2
Read
register 1
Read
register 2
Data
memory
Write
data
Read
data
Wri te
data
Sign
extend
ALU
result
Zero
ALU
Address
MemRead
MemWrite
RegWrite
ALU operation
3
Adding the Data Path for beq Instruction
Implements beq $t1, $t2, offset
Offset is a signed 16 bit
immediate field, & thus must be
sign extended. In addition we
shift left by 2 (make low bits are 00)
to address to a word boundary
To PC
16 32
Sign
extend
Zero ALU
Sum
Shift
left 2
To branch
control logic
Branch target
PC + 4 frominstruction datapath
Instruction
Add
Registers
Write
register
Read
data 1
Read
data 2
Read
register 1
Read
register 2
Write
data
ALU operation
3
RegWrite
Putting It All Together
j instruction to be added later
Need controls circuits to drive control lines in red.
Two control units will be designd: ALU Control & Main Control
Incremented PC
or beq branch
address
unsuccessful branch
Successful branch
PC
Instruction
memory
Read
address
Instruction
16 32
Add ALU
result
M
u
x
Registers
Write
register
Write
data
Read
data 1
Read
data 2
Read
register 1
Read
register 2
Shift
left 2
4
M
u
x
ALU operation
3
RegWrite
MemRead
MemWrite
PCSrc
ALUSrc
MemtoReg
ALU
result
Zero
ALU
Data
memory
Address
Write
data
Read
data
M
u
x
Sign
extend
Add
Instruction RegDst RegWrite ALUSrc MemRea
d
MemWri
te
MemToReg PCSrc ALU
operation
R-format 1 1 0 0 0 0 0 0000
(and)
0001
(or)
0010
(add)
0110
(sub)
lw 0 1 1 1 0 1 0 0010
(add)
sw X 0 1 0 1 X 0 0010
(add)
beq x 0 0 0 0 X 1 or 0 0110
(sub)
Control
We next add the control unit that generates
write signal for each state element
control signals for each multiplexer
ALU control signal
Input to control unit: instruction opcode and function
code
Control Unit
Divided into two parts
Main Control Unit
Input: 6-bit opcode
Output: all control signals for Muxes, RegWrite,
MemRead, MemWrite and a 2-bit ALUOp signal
ALU Control Unit
Input: 2-bit ALUOp signal generated from Main
Control Unit and 6-bit instruction function code
Output: 4-bit ALU control signal
Truth Table for Main Control Unit
Main Control Unit
ALU Control Unit
Must describe hardware to compute 4-bit ALU
control input given
2-bit ALUOp signal from Main Control Unit
function code for arithmetic
Describe it using a truth table (can turn into gates):

ALU Control bits
0010
0010
0110
0010
0110
0000
0001
0111
Putting It All Together Again
PC
Instruction
memory
Read
address
Instruction
[310]
Instruction [20 16]
Instruction [25 21]
Add
Instruction [5 0]
MemtoReg
ALUOp
MemWrite
RegWrite
MemRead
Branch
RegDst
ALUSrc
Instruction [31 26]
4
16 32
Instruction [15 0]
0
0
M
u
x
0
1
Control
Add
ALU
result
M
u
x
0
1
Registers
Write
register
Write
data
Read
data 1
Read
data 2
Read
register 1
Read
register 2
Sign
extend
M
u
x
1
ALU
result
Zero
PCSrc
Data
memory
Write
data
Read
data
M
u
x
1
Instruction [15 11]
ALU
control
Shift
left 2
ALU
Address
Use this for R-type, memory, & beq instructions scenarios.
Addition of the Unconditional Jump
We now add one more op code to our single cycle design:
Op code 2: j
The format is op field 28-31 is a 2
Remaining 26 low bits is the immediate target address

The full 32 bit target address is computed by concatenating:
Upper 4 bits of PC+4
26 bit immediate field of the jump instruction
Bits 00 in the lowest positions (word boundary)
See text chapter 3, p. 150

An additional control line from the main controller will have to
be generated to select this new instruction

A two bit shifter is also added to get the two low order zeros
Final Design with jump Instruction
Shift
left 2
PC
Instruction
memory
Read
address
Instruction
[310]
Data
memory
Read
data
Write
data
Registers
Write
register
Write
data
Read
data 1
Read
data 2
Read
register 1
Read
register 2
Instruction[15 11]
Instruction[20 16]
Instruction[25 21]
Add
ALU
result
Zero
Instruction[5 0]
MemtoReg
ALUOp
MemWrite
RegWrite
MemRead
Branch
Jump
RegDst
ALUSrc
Instruction[31 26]
4
M
u
x
Instruction[250] Jumpaddress[310]
PC+4 [31 28]
Sign
extend
16 32
Instruction[15 0]
1
M
u
x
1
0
M
u
x
0
1
M
u
x
0
1
ALU
control
Control
Add
ALU
result
M
u
x
0
1 0
ALU
Shift
left 2
26 28
Address
Thanks!

Alu

Transféré par

Informations du document

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Alu

Transféré par

Droits d'auteur :

Formats disponibles

Arithmetic and Logic Unit

Divisor register, remainder register, ALU are

Itera- Step Quotient Divisor Remainder

Vous aimerez peut-être aussi