
Fractional Numbers

Ver. 1.4
2010 - Claudio Fornaro

Fractional Number Notations

Fractional numbers have the form:

  xxxxxxxxx.yyyyyyyyy

where the x's constitute the integer part of the value and the y's the fractional part.

There are two main methods to encode fractional numbers:

  fixed-point notation
  floating-point notation
Fixed-point Notation

Fixed-point notation splits the available n bits into 2 portions:

  one for the integer part
  one for the fractional part

  | integer | fractional |

The radix point is not stored (it does not use up bits): its position is simply known.
The number of bits for the integer part and for the fractional part is chosen before making any calculation.

Fixed-point Notation

If needed:

  the integer part must be padded with 0s on the left
  the fractional part must be padded with 0s on the right

Examples (the radix point is implied, not stored):

  5.25 in FX on 4+4 bits: 01010100 (i.e. 0101.0100)
  5.25 in FX on 6+2 bits: 00010101 (i.e. 000101.01)

Fixed-point Notation

For signed (relative) fractional values, both sign-magnitude (SM) and two's complement (2C) notations can be used.
The n bits are then divided into 3 parts:

  sign (1 bit)
  integer part (m bits)
  fractional part (n-m-1 bits)

E.g. 1+7+8 means 1 bit for the sign, 7 for the integer part and 8 for the fractional part.
Operations are the same as those seen for integer values, provided that the values have the same format.

Examples

Convert value +12.25 to FX 2C 1+4+3:

  01100010

Convert value -12.25 to FX 2C 1+4+3:

  01100010 -> 10011110

Note: when using the 1st 2C-operation method (invert the bits, then add 1), the 1 must be added to the LSB, not to the units place:

  01100010      (2C operation: invert the bits)
  10011101 +
         1 =
  10011110
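
The conversions above can be cross-checked with a short Python sketch. The function names `to_fixed_2c` and `from_fixed_2c` are illustrative, not part of the slides, and rounding to the nearest LSB is an assumption (the worked solutions in this deck truncate instead; for values that fit exactly, such as 12.25, the two agree).

```python
def to_fixed_2c(value, int_bits, frac_bits):
    """Encode value as a 2C fixed-point bit string: 1 sign bit + int_bits + frac_bits."""
    total = 1 + int_bits + frac_bits
    # Express the value in units of one LSB (assumption: round to nearest).
    scaled = int(round(value * (1 << frac_bits)))
    if not -(1 << (total - 1)) <= scaled < (1 << (total - 1)):
        raise OverflowError("value does not fit the format")
    # Masking a negative integer yields its two's complement pattern.
    return format(scaled & ((1 << total) - 1), "0{}b".format(total))


def from_fixed_2c(bits, frac_bits):
    """Decode a 2C fixed-point bit string with frac_bits fractional bits."""
    raw = int(bits, 2)
    if bits[0] == "1":          # sign bit set: the value is negative
        raw -= 1 << len(bits)
    return raw / (1 << frac_bits)


print(to_fixed_2c(12.25, 4, 3))    # 01100010
print(to_fixed_2c(-12.25, 4, 3))   # 10011110
print(from_fixed_2c("10011110", 3))  # -12.25
```

Scaling by 2^frac_bits turns the fixed-point value into an ordinary integer, which is why the 1 of the 2C operation lands on the LSB rather than on the units place.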

Exercises

Convert the values as requested:

  -151.0 to FX 2C on 16 bits (1+8+7)
  -151.25 to FX 2C on 16 bits (1+8+7)
  111100101010 from FX 2C (1+7+4) to decimal
  100110011000 from FX 2C (1+6+5) to decimal

Calculate on FX 2C 16 bits (1+7+8) and identify any overflow:

  (111.6 - 44.57) / 2
  (68.22 - 71.25) * 64

Exercises

Solutions (the radix point is implied between the integer and the fractional field)

-151.0
  +151.0  = 0 10010111 0000000
  -151.0  = 1 01101001 0000000

-151.25
  +151.25 = 0 10010111 0100000
  -151.25 = 1 01101000 1100000
  Note that the integer part is not the same as for -151.0.

111100101010
  2C operation -> 0 0001101 0110 = 13.375
  value: -13.375

100110011000
  2C operation -> 0 110011 01000 = 51.25
  value: -51.25

Exercises

Solutions (fractional parts are truncated to 8 bits)

(111.6 - 44.57) / 2
  111.6  = 1101111.10011001 (binary) -> 0110111110011001 (2C)
  44.57  = 101100.10010001 (binary)  -> 0010110010010001 (2C)
  -44.57                             -> 1101001101101111 (2C)

    0110111110011001 +
    1101001101101111 =
    0100001100001000      (the carry out of the MSB is discarded)

  Division by 2 (right shift by 1): 0010000110000100 = +33.515625

(68.22 - 71.25) * 64
  68.22  = 1000100.00111000 (binary) -> 0100010000111000 (2C)
  71.25  = 1000111.01 (binary)       -> 0100011101000000 (2C)
  -71.25                             -> 1011100011000000 (2C)

    0100010000111000 +
    1011100011000000 =
    1111110011111000

  Multiplication by 64 (left shift by 6): 0011111000000000
  The sign bit changes from 1 to 0: OVERFLOW

Fixed-point Uses

Fixed-point notation is sometimes used by simulating it with the integer notation that microprocessors use (i.e. 2C).
This allows faster computations than operations using floating-point notation (intrinsically slower).
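
As an illustration of this simulation, the sketch below stores each value as a plain integer scaled by 2^8. The 8 fractional bits and the names `FRAC_BITS`, `to_fx` and `fx_mul` are assumptions chosen for the example, not something the slides prescribe.

```python
FRAC_BITS = 8              # assumed format: 8 fractional bits
ONE = 1 << FRAC_BITS       # the integer that represents 1.0


def to_fx(x):
    """Convert to the scaled integer (truncates toward zero)."""
    return int(x * ONE)


def fx_mul(a, b):
    """Multiply two scaled integers; the raw product has 16 fractional bits, so drop 8."""
    return (a * b) >> FRAC_BITS


a = to_fx(5.25)    # 1344
b = to_fx(2.5)     # 640
print((a + b) / ONE)        # 7.75    (addition is a plain integer addition)
print(fx_mul(a, b) / ONE)   # 13.125
```

Only integer additions, multiplications and shifts are involved, which is where the speed advantage over floating-point operations comes from.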

Fixed-point Problems

Suppose the following (unsigned) values have to be coded in fixed-point notation on a total of 8 bits:

  37.25      = 100101.01
  12.625     = 1100.1010
  5.4375     = 101.01110
  1.2890625  = 1.0100101

All of them can be coded in 8 bits, but no single position of the radix point is suitable for all of them.

Fixed-point Problems

Suppose you have to represent some fractional values $0 \le x < 8$ using a fixed-point coding on 4+4 bits:

  7.2732
  2.3748
  5.4375
  1.2890

The first bit is always 0, and the fractional part is rounded to 4 bits.
If we could move the radix point 1 position to the left, we would have 1 more bit of precision.

Fixed-point Problems

Suppose you have to represent some values with fractional part x.0, x.5 or x.25 only, using a fixed-point coding on 4+4 bits.
The last two bits are always 00, and the integer part is limited to 15.
If we could move the radix point 2 positions to the right, we would have 2 more bits for the integer part (values up to 63).

Fixed-point Problems

The problem with fixed-point notation is the fixed position of the radix point.
To solve this problem, the radix point must be made movable (floating); this requires that its position be stored along with each number.

Exponential Notation

Exponential notation represents a number as a value (mantissa or significand) that multiplies a whole power of the base (exponent).

Examples (in decimal), written as mantissa $\times 10^{\text{exponent}}$:

  $123.45678 = 0.12345678 \times 10^{3}$
  $0.0087654321 = 0.87654321 \times 10^{-2}$
  $87655678 = 0.87655678 \times 10^{8}$

Exponential Notation

Very big and very small values are obtained by just varying the exponent.
The same value can be expressed in many forms:

  $123.45 = 0.12345 \times 10^{3} = 12345 \times 10^{-2}$

Among these forms, the form 0.x (with $x \ne 0$) is chosen in order to have a unique representation for each value; this is called the normalized form.

Exponential Notation

When the number of digits is not enough to store the whole number, only the most important (leftmost) digits are stored.
The most significant digits are thus preserved, but approximation errors are introduced because of truncation.

Example (only 4 decimal digits):

  $0.001234567 \rightarrow 0.1234 \times 10^{-2} = 0.001234000$
  $876543 \rightarrow 0.8765 \times 10^{6} = 876500$

Exponential Notation

The maximum representation error $\varepsilon$ with n digits is $10^{-n}$ relative to the power of the whole part.
If the whole part power is m:

  $\varepsilon = 10^{-n} \times 10^{m} = 10^{m-n}$

which is the power of the rightmost digit (LSD).

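Exponential Notation

Example

Suppose the value has only 4 decimal digits:

  $876543 \rightarrow 0.8765 \times 10^{6}$ (normalized)
  The whole part is $0 \times 10^{6} \Rightarrow m = 6$
  $\varepsilon = 10^{-4} \times 10^{6} = 10^{2}$ (maximum error)

This can also be seen by writing the value as a sum of powers:

  $0 \times 10^{6} + 8 \times 10^{5} + 7 \times 10^{4} + 6 \times 10^{3} + 5 \times 10^{2}$

For this value, the error is:

  $|876543 - 876500| = 43$ (which is $< 10^{2}$)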
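
The normalization and truncation steps can be reproduced with Python's `decimal` module, which keeps decimal digits exact. This is only a sketch: the helper name `truncate_normalized` and its return layout are hypothetical, and it assumes a positive input.

```python
from decimal import Decimal


def truncate_normalized(value, n):
    """Truncate a positive value to n digits in the form 0.d1d2...dn x 10^m.

    Returns (mantissa, m, stored_value, max_error).
    """
    d = Decimal(value)
    m = d.adjusted() + 1             # exponent that gives a mantissa 0.d1d2... with d1 != 0
    lsd = Decimal(1).scaleb(m - n)   # weight of the rightmost kept digit: 10^(m-n)
    stored = (d // lsd) * lsd        # drop every digit to the right of the LSD
    return stored.scaleb(-m), m, stored, lsd


for text in ("876543", "0.001234567"):
    mantissa, m, stored, eps = truncate_normalized(text, 4)
    error = Decimal(text) - stored
    print(mantissa, m, stored, error, error < eps)
```

For both example values the actual error stays below the bound $10^{m-n}$, i.e. below the weight of the last stored digit.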

Exponential Notation

Example

Suppose the value has only 4 decimal digits:

  $0.001234567 \rightarrow 0.1234 \times 10^{-2} = 0.001234000$
  The whole part is $0 \times 10^{-2} \Rightarrow m = -2$
  $\varepsilon = 10^{-4} \times 10^{-2} = 10^{-6}$ (maximum error)

Writing the value as a sum of powers:

  $0 \times 10^{-2} + 1 \times 10^{-3} + 2 \times 10^{-4} + 3 \times 10^{-5} + 4 \times 10^{-6}$

For this value, the error is:

  $|0.001234567 - 0.001234000| = 0.000000567$ (which is $< 10^{-6}$)

IEEE-P754 Floating-Point

The IEEE-P754 standard describes the most common notations used by computer FPUs (Floating-Point Units) to compute floating-point values.
The two exponential binary floating-point notations described have the form mantissa $\times 2^{\text{exponent}}$ and are:

  Single precision (SP)
  Double precision (DP)

IEEE-P754 Single Precision

Single precision values use 32 bits divided into 3 parts:

  sign: 1 bit
  exponent field: 8 bits
  mantissa (or significand) field: 23 bits

  | s | exponent | mantissa |

The sign bit is defined as follows:

  0 is used for values $\ge 0$
  1 is used for values $\le 0$ (negative zero!)

IEEE-P754 Single Precision

The mantissa (or significand) is in the normalized form 1.xxxxx, where the 1 before the radix point is the leftmost 1 (MSB) of the binary representation.
Only the fractional part of the binary mantissa is stored in the mantissa field: the leftmost 1 is already known to be present (it is called the hidden bit); this allows for one more bit of precision (23 bits stored + 1 hidden = 24 bits effective).

IEEE-P754 Single Precision

The exponent is a signed (relative) integer value on 8 bits. The IEEE-P754 SP standard does not use SM or 2C notations, but a biased notation called excess 127: the FP exponent field is computed by adding the constant value 127 (bias constant) to the exponent of the normalized value.

IEEE-P754 Single Precision

Excess notation is efficient, especially for number comparison.
The offset value is $2^{n-1} - 1$ (n is the number of bits), so that the first half of the range is considered as negative numbers.

IEEE-P754 Single Precision

Example: $+13.25_{10}$ -> IEEE-P754 (SP)

  the sign is positive: sign bit = 0
  convert the value to binary:
    $13.25_{10} = 1101.01_2$
  normalize the value (note: base 2):
    $1101.01_2 = 1.10101_2 \times 2^{3}$
  compute the exponent field by adding 127 to the real base-2 exponent:
    3 + 127 = 130 = 10000010
  compose the pieces, padding the mantissa with 0s on the right:
    0 10000010 10101000000000000000000
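
The hand-made encoding can be compared with what the machine stores. The sketch below uses the standard `struct` module to obtain the 32-bit pattern of a value; the helper name `sp_fields` is illustrative.

```python
import struct


def sp_fields(x):
    """Return the sign, exponent and mantissa fields of x stored in single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))   # reinterpret the 4 bytes as an integer
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, exponent, mantissa = sp_fields(13.25)
print(sign, exponent, mantissa)   # 0 10000010 10101000000000000000000
print(int(exponent, 2) - 127)     # 3 (the real base-2 exponent)
```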

IEEE-P754 Single Precision

Example: convert from IEEE-P754 (SP)

  1 01100000 01000000000000000000000

  sign bit = 1: the value is negative
  extract the mantissa, add the hidden bit, and convert to decimal:
    $1.01_2 = 1.25_{10}$
  compute the real exponent by subtracting 127 from the extracted exponent field:
    $01100000_2 = 96$, $96 - 127 = -31$
  compose the parts:
    $-1.25 \times 2^{-31} = -5.82 \times 10^{-10}$
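
The decoding steps translate almost line by line into code. This sketch (the function name `decode_sp` is an assumption) handles normalized values only, i.e. exponent fields from 1 to 254.

```python
def decode_sp(bits):
    """Decode a normalized single-precision bit string (spaces allowed) by hand."""
    bits = bits.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127          # remove the excess-127 bias
    mantissa = 1 + int(bits[9:], 2) / 2**23     # restore the hidden bit
    return sign * mantissa * 2.0**exponent


value = decode_sp("1 01100000 01000000000000000000000")
print(value == -1.25 * 2.0**-31)   # True
print(value)                       # about -5.82e-10
```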

IEEE-P754 Single Precision

The SP decimal range is:

  $(1.4 \times 10^{-45} \ldots 3.4 \times 10^{+38})$

The decimal exponent varies from -45 to +38, corresponding to a binary exponent from -126 to 127.
Values are approximated to 7 decimal digits (corresponding to the 24 bits used by the mantissa).

IEEE-P754 Single Precision

The representation error $\varepsilon$ is the absolute weight of the LSB.
This is computed by multiplying the weight of the integer part (hidden bit) by the relative weight of the mantissa LSB (i.e. the weight of the LSB with respect to the integer part).
This results in adding the exponents:

  $1.10010\ldots1 \times 2^{0}$:    $\varepsilon = 2^{0-23} = 2^{-23}$
  $1.10010\ldots1 \times 2^{5}$:    $\varepsilon = 2^{5-23} = 2^{-18}$
  $1.10010\ldots1 \times 2^{94}$:   $\varepsilon = 2^{94-23} = 2^{71}$

IEEE-P754 Single Precision

The binary exponent varies from -126 to 127, corresponding to excess-127 values from 1 to 254.
Exponent values 00000000 (0) and 11111111 (255) are used for special numbers:

  Zeroes
  Infinities
  NaNs
  Denormalized values

IEEE-P754 Single Precision

Zero

  Exponent = 00000000, Mantissa = 0
  0/1 00...00 00...00
  by definition, not by computation, because there is no 1 to normalize on
  Positive and negative zero are considered equivalent

Infinity

  Exponent = 11111111, Mantissa = 0
  0/1 11...11 00...00
  Operations with infinities are well defined

IEEE-P754 Single Precision

Not a Number (NaN)

  Exponent = 11111111, Mantissa $\ne 0$
  0/1 11...11 <not 00...00>
  NaNs are used to indicate values that do not represent real numbers
  There are 2 types of NaNs:
    Quiet NaNs: denote indeterminate operations (mantissa MSB set), where the result of an operation is not mathematically defined
    Signalling NaNs: denote an invalid operation (mantissa MSB clear)

Special Operations

  N / INF         = 0
  INF x INF       = INF
  N / 0 (N != 0)  = INF
  INF + INF       = INF
  0 / 0           = NaN
  INF - INF       = NaN
  INF / INF       = NaN
  INF x 0         = NaN

Any operation with NaN yields a NaN result.
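
Most rows of the table can be observed directly with Python floats, which are IEEE double precision values. One caveat: Python itself raises `ZeroDivisionError` for `N / 0.0` and `0.0 / 0.0` instead of returning INF or NaN, so those two rows are not shown in this sketch.

```python
import math

inf, nan = math.inf, math.nan

print(5.0 / inf)                # 0.0
print(inf * inf)                # inf
print(inf + inf)                # inf
print(math.isnan(inf - inf))    # True
print(math.isnan(inf / inf))    # True
print(math.isnan(inf * 0.0))    # True
print(math.isnan(nan + 1.0))    # True: NaN propagates through operations
```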

IEEE-P754 Single Precision

The IEEE-P754 standard allows values in non-normalized form too (denormalized):

  Exponent = 00000000, Mantissa $\ne 0$
  The hidden bit is now 0 and not 1
  The exponent value is considered to be -126
  Value is: $0.\text{mantissa} \times 2^{-126}$

IEEE-P754 Double Precision

Double precision notation just extends the SP notation to use 64 bits.
The differences are:

  exponent bits: 11
  mantissa bits: 52
  bias constant: 1023
  exponent range: -1022, +1023
  equivalent decimal range: $(4.9 \times 10^{-324} \ldots 1.7 \times 10^{+308})$ with 15 decimal digits
  denormalized exponent: -1022

IEEE-P754 Compact Notation

For ease of writing and copying, floating-point numbers (like any other bit sequence) can be translated to base 16 as if they were (they are not!) a pure binary number:

  0 10000000 0010000...  ->  40100000
  1 01111111 1100000...  ->  BFE00000
  C3C41000  ->  1100001111000100000100...

IEEE-P754 Exercises

Convert the following values to/from IEEE-P754:

  -1324.25 to SP and DP
  0.02324 to SP and DP with an absolute precision of 1/1000
  0 10000000 0010000... to decimal
  1 01111111 1100000... to decimal
  EB141000 to decimal

IEEE-P754 Exercises

Solutions

-1324.25

  $1324.25_{10} = 10100101100.01_2 = 1.010010110001_2 \times 2^{10}$
  SP exponent field: 10 + 127 = 137 = 10001001
  DP exponent field: 10 + 1023 = 1033 = 10000001001

then:

  SP: 1 10001001 01001011000100...
      in compact form: C4A58800
  DP: 1 10000001001 01001011000100...
      in compact form: C094B10000000000

IEEE-P754 Exercises

Solutions

0.02324

  $\varepsilon = 1/1000 \Rightarrow n = 10$ (fractional bits)
  $0.02324_{10} \approx 0.0000010111_2 = 1.0111_2 \times 2^{-6}$
  SP exponent field: -6 + 127 = 121 = 01111001
  DP exponent field: -6 + 1023 = 1017 = 01111111001

then:

  SP: 0 01111001 011100...
      in compact form: 3CB80000
  DP: 0 01111111001 011100...
      in compact form: 3F97000000000000
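
The compact (hexadecimal) forms of both solutions can be checked with `struct`. In this sketch the names `sp_hex` and `dp_hex` are illustrative, and 0.02324 is entered as 23/1024, i.e. already truncated to the 10 fractional bits used in the solution.

```python
import struct


def sp_hex(x):
    """Compact form of x in single precision."""
    return struct.pack(">f", x).hex().upper()


def dp_hex(x):
    """Compact form of x in double precision."""
    return struct.pack(">d", x).hex().upper()


print(sp_hex(-1324.25), dp_hex(-1324.25))   # C4A58800 C094B10000000000

x = 23 / 1024    # 0.0000010111 in binary: 0.02324 truncated to 10 fractional bits
print(sp_hex(x), dp_hex(x))                 # 3CB80000 3F97000000000000
```

Packing the untruncated 0.02324 gives different low-order mantissa digits, because the machine keeps all 24 (or 53) significant bits instead of stopping at the requested precision.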

IEEE-P754 Exercises

Solutions

  0 10000000 0010000...
    $+1.001_2 \times 2^{128-127} = 10.01_2 = +2.25$

  1 01111111 1100000...
    $-1.11_2 \times 2^{127-127} = -1.75$

  EB141000 = 1 11010110 0010100000100...
    keeping only the leading mantissa bits 1.00101:
    $-1.00101_2 \times 2^{214-127} = -1.15625 \times 2^{87} = -1.15625 \times 2^{80} \times 2^{7}$
    $\approx -1 \times 10^{24} \times 10^{2} = -10^{26}$ (approx.)
    the non-approximated value of $-1.15625 \times 2^{87}$ is:
    $-1.78921021302965117856514048 \times 10^{26}$

Floating-point Addition

To add two FP values, these must have the same exponent before their mantissas are added: the smaller value is converted to have the same exponent as the greater one (it is de-normalized).
As the exponent is increased (e.g. by 3), the mantissa must decrease (right shift by 3 bits) so that the overall value does not change:

  $1.01000_2 \times 2^{16} + 1.101000_2 \times 2^{13}$
  $= 1.01000_2 \times 2^{16} + 0.001101_2 \times 2^{16}$

Underflow

If the conversion of the smaller value shifts away all of the mantissa bits (including the hidden bit), the value is approximated to 0; thus the result of the operation is equal to the greater value, while the smaller one is simply ignored.
There is an underflow condition when, adding 2 values, the result is equal to the greater of them.

Underflow

Example in SP:

  $1.101_2 \times 2^{43} + 1.01_2 \times 2^{18}$

$1.01_2 \times 2^{18}$ must be converted to the form $xxx \times 2^{43}$; this causes a right shift of 25 bits on the mantissa, thus shifting away all the 24 mantissa bits and resulting in 0.

When adding up many small values, a partial sum may become so big as to cause underflow for each of the subsequent values (only the first part of the values is added up).
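
This effect can be reproduced by emulating single precision in Python: the helper `f32` (a name assumed here) rounds a double to the nearest SP value by packing and unpacking it with `struct`. Real hardware rounds to nearest rather than truncating the shifted bits, but for this example the outcome is the same.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


big = 1.625 * 2.0**43     # 1.101 (binary) x 2^43
small = 1.25 * 2.0**18    # 1.01 (binary) x 2^18
print(f32(big + small) == big)   # True: the smaller addend is lost

# Once the partial sum is big enough, every further small addend is ignored.
total = big
for _ in range(1000):
    total = f32(total + small)
print(total == big)              # True
```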

IEEE-P754 Exercises

Calculate the following operations (IEEE-P754) and express the result in the same compact form; identify any Overflow/Underflow:

  2B1A5F20 + 4F1A3BB0
  C4A58000 + C2B80000
  63AB102F - 709B1BC2
  7F600000 + 7F100000

IEEE-P754 Exercises

Solution N.1:

  2B1A5F20
    0 01010110 00110100101111100100000
    E = 01010110 = 86
  4F1A3BB0
    0 10011110 00110100011101110110000
    E = 10011110 = 158
  Difference of exponents = 72
  72 > 24 -> UNDERFLOW
  Result: 4F1A3BB0

IEEE-P754 Exercises

Solution N.2:

  C4A58000
    1 10001001 01001011000000000000000
    E = 10001001 = 137 (non biased: 10)
    M = 1.01001011
  C2B80000
    1 10000101 01110000000000000000000
    E = 10000101 = 133 (non biased: 6)
    M = 1.0111
  Difference of exponents: 137 - 133 = 4

IEEE-P754 Exercises

Solution N.2 (continuation):

  De-normalize the mantissa of the 2nd value to have exponent = 10 (4 right shifts):
    0.00010111
  Addition (both values are negative, so the magnitudes are added and the sign is kept):
    $1.01001011_2 \times 2^{10}$ +
    $0.00010111_2 \times 2^{10}$ =
    $1.01100010_2 \times 2^{10}$
  Result:
    1 10001001 01100010000000000000000
    C4B10000

IEEE-P754 Exercises

Solution N.3

  63AB102F - 709B1BC2
  63AB102F
    0 11000111 01010110001000000101111
    E = 199
  709B1BC2
    0 11100001 00110110001101111000010
    E = 225
  Difference of exponents: 225 - 199 = 26
  26 > 24 -> UNDERFLOW
  Result: F09B1BC2 (SIGN CHANGED!)

IEEE-P754 Exercises

Solution N.4

  7F600000
    0 11111110 11000000000000000000000
    E = 254 (non biased = 127)
  7F100000
    0 11111110 00100000000000000000000
    E = 254 (non biased = 127)
  Difference of exponents: 0

IEEE-P754 Exercises

Solution N.4 (continuation)

    $1.110_2 \times 2^{127}$ +
    $1.001_2 \times 2^{127}$ =
    $10.111_2 \times 2^{127}$
  Renormalization: $1.0111_2 \times 2^{128}$
  The maximum exponent is 127 -> OVERFLOW
  Result: +Infinity
    0 11111111 00000000000000000000000
    7F800000
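
The four results can be cross-checked with a small sketch built on `struct`. The function names are hypothetical. Note one assumption that is specific to Python: `struct.pack` raises `OverflowError` for a value beyond the SP range, where an IEEE FPU would return infinity, so the sketch maps that case to the infinity pattern explicitly.

```python
import struct


def hex_to_float(h):
    """Interpret an 8-digit hex string as a single-precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


def float_to_hex(x):
    """Round x to single precision and return its compact form."""
    try:
        return struct.pack(">f", x).hex().upper()
    except OverflowError:
        # Outside the SP range: an IEEE FPU would produce +/- infinity here.
        return "FF800000" if x < 0 else "7F800000"


def sp_add(a_hex, b_hex, subtract=False):
    a, b = hex_to_float(a_hex), hex_to_float(b_hex)
    return float_to_hex(a - b if subtract else a + b)


print(sp_add("2B1A5F20", "4F1A3BB0"))                  # 4F1A3BB0
print(sp_add("C4A58000", "C2B80000"))                  # C4B10000
print(sp_add("63AB102F", "709B1BC2", subtract=True))   # F09B1BC2
print(sp_add("7F600000", "7F100000"))                  # 7F800000
```

The operation itself is carried out in double precision and then rounded once to SP, which for a single addition or subtraction of two SP values gives the same result as a native SP operation.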

IEEE-P754 Exercises

Calculate in the IEEE-P754 SP format the following operation with DECIMAL numbers, and identify any Overflow/Underflow:

  $920000000_{10} - 920000001_{10}$

IEEE-P754 Puzzles

Solution:

  The two values differ only in the LSB
  The two numbers have 9 decimal digits, corresponding to about $9 \times 3 = 27$ bits
  After normalization, the relative weight of the LSB is $2^{-27}$
  Having only 24 bits, the power $2^{-27}$ is discarded
  The two values are considered equal
  Result is 0
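
A quick check of this result, emulating single precision with `struct` (the helper name `f32` is an assumption of this sketch):

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


a = f32(920000000.0)
b = f32(920000001.0)
print(a == b)    # True: both decimal values map to the same SP value
print(a - b)     # 0.0
```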

IEEE-P754 SP Ranges

Maximum normalized positive number is $1.111\ldots111_2 \times 2^{127}$ with 23 fractional bits.

  If all the bits were available, the value would be $1.111\ldots111_2 \times 2^{127}$ with 127 fractional bits:
    $1.111\ldots111_2 \times 2^{127} = 2^{128} - 1$
  Having just 23 fractional bits, the value is approximated to $1.111\ldots1\,000\ldots0_2 \times 2^{127}$, with 23 fractional bits set to 1 and the rightmost $127 - 23 = 104$ bits set to 0
  104 bits set to 1 are value $2^{104} - 1$

IEEE-P754 SP Ranges

Maximum normalized positive value:

  $1.111\ldots1\,000\ldots0_2 \times 2^{127} = (2^{128} - 1) - (2^{104} - 1) = 2^{128} - 2^{104}$
  = 3.4028234663852885981170418348452e+38

Minimum normalized positive number:

  $1.000\ldots000_2 \times 2^{-126}$
  = 1.1754943508222875079687365372222e-38

IEEE-P754 SP Ranges

Maximum denormalized positive number is $0.111\ldots111_2 \times 2^{-126}$ with 23 fractional bits:

  the rightmost bit power is: $-126 - 23 = -149$
  $(2^{-126} - 1) - (2^{-149} - 1) = 2^{-126} - 2^{-149}$

Minimum denormalized positive number is $0.000\ldots001_2 \times 2^{-126}$ with 23 fractional bits:

  the rightmost bit power is: $-126 - 23 = -149$
  $2^{-149}$
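
These four limits can be compared with the values the machine decodes. The hex patterns below are not given in the slides; they are derived here from the field definitions (sign, exponent field, mantissa field), and `sp_value` is an illustrative helper name.

```python
import struct


def sp_value(hex_word):
    """Decode an 8-digit hex string as a single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]


# 0 11111110 111...1 : maximum normalized
print(sp_value("7F7FFFFF") == 2.0**128 - 2.0**104)     # True
# 0 00000001 000...0 : minimum normalized
print(sp_value("00800000") == 2.0**-126)               # True
# 0 00000000 111...1 : maximum denormalized
print(sp_value("007FFFFF") == 2.0**-126 - 2.0**-149)   # True
# 0 00000000 000...1 : minimum denormalized
print(sp_value("00000001") == 2.0**-149)               # True
```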

IEEE-P754 Puzzles

Determine the difference between value 44A58800 and the next one (44A58801).

  Value in binary is:
    0 10001001 01001011000100... = $1.01001011000100\ldots_2 \times 2^{10}$
  The next one differs in just the LSB:
    $1.01001011000100\ldots01_2 \times 2^{10}$
  Difference is 1 LSB weight = $2^{10-23} = 2^{-13}$

IEEE-P754 Puzzles

Determine the (absolute) representation error for value $N = 6 \times 10^{18}$ in IEEE-P754 SP.

  $N = 6 \times 10^{18} \approx 6 \times 2^{60}$, which requires 63 bits
  $N = 1.xxx \times 2^{62}$
  In SP there are only 23 bits for the mantissa
  The weight of the LSB is $2^{62-23} = 2^{39}$
  The representation error is $2^{39}$

IEEE-P754 Puzzles

Determine the range of the consecutive integer values in SP.

  Values are in the form 1.xxxx with 23 fractional bits (denormals are not integers)
  24 bits (hidden bit included) result in $2^{24}$ combinations of bits (0 to $2^{24} - 1$); each corresponds to a value, and an appropriate exponent makes it an integer value
  $2^{24}$ is represented too
  Range: $-2^{24} \ldots +2^{24}$
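
The three puzzle answers can be checked numerically with a few lines of Python. The helpers `sp_value` and `f32` are names assumed for this sketch; `f32` rounds to the nearest SP value, so the actual error it shows for N is at most half of the LSB weight computed above.

```python
import struct


def sp_value(hex_word):
    """Decode an 8-digit hex string as a single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# Distance between two consecutive SP values around 1324.25
print(sp_value("44A58801") - sp_value("44A58800") == 2.0**-13)   # True

# Representation error for N = 6 x 10^18 stays below the LSB weight 2^39
print(abs(f32(6e18) - 6e18) < 2.0**39)                           # True

# Integers are consecutive up to 2^24; 2^24 + 1 is the first one that is missing
print(f32(2.0**24 - 1) == 2.0**24 - 1)    # True
print(f32(2.0**24) == 2.0**24)            # True
print(f32(2.0**24 + 1) == 2.0**24 + 1)    # False
```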
