3 Fractional Numbers
2010 - Claudio Fornaro
Ver. 1.4

Fractional Number Notations
Fractional numbers have the form:

xxxxxxxxx.yyyyyyyyy
where the x's constitute the integer part of the value and the y's the fractional part.
There are two main methods to encode fractional numbers:
  fixed-point notation
  floating-point notation
Fixed-point Notation
Fixed-point notation splits the available n bits into 2 portions:
  one for the integer part
  one for the fractional part

| integer | fractional |

The radix point is not stored (it does not use up bits): its position is just known.
The number of bits for the integer part and for the fractional part is chosen before making any calculation.
If needed:
  the integer part must be padded with 0s on the left
  the fractional part must be padded with 0s on the right
Examples
5.25 in FX on 4+4 bits: 0101.0100 is stored as 01010100
5.25 in FX on 6+2 bits: 000101.01 is stored as 00010101
The radix points are implied here: they are not part of the stored bits.
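As a minimal sketch (not part of the original slides), the two encodings above can be reproduced by scaling the value by 2 raised to the number of fractional bits and writing the result as a plain integer. The function name `to_fixed` and its parameters are illustrative assumptions.

```python
def to_fixed(value, int_bits, frac_bits):
    """Encode a non-negative value as an unsigned fixed-point bit string."""
    total_bits = int_bits + frac_bits
    scaled = round(value * (1 << frac_bits))  # move the radix point to the right
    if not 0 <= scaled < (1 << total_bits):
        raise OverflowError("value does not fit in the chosen format")
    return format(scaled, "0{}b".format(total_bits))


print(to_fixed(5.25, 4, 4))  # 01010100
print(to_fixed(5.25, 6, 2))  # 00010101
```

The radix point never appears in the output: only the chosen split (4+4 or 6+2) tells the reader where it is.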
For relative (signed) fractional values, both SM and 2C notations can be used.
The n bits are then divided into 3 parts:
  sign (1 bit)
  integer part (m bits)
  fractional part (n-m-1 bits)
E.g. 1+7+8 means 1 bit for the sign, 7 for the integer part and 8 for the fractional part.
Operations are the same as those seen for integer values, provided that the values have the same format.
Examples
Convert value +12.25 to FX 2C 1+4+3:
01100010
Convert value −12.25 to FX 2C 1+4+3:
01100010 → (2C-operation) → 10011110
Note: when using the 1st 2C-operation method, 1 must be added to the LSB, not to the unity place:
01100010   (+12.25)
10011101 + (all bits complemented)
       1 =
10011110   (−12.25)
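The signed conversion can be sketched in the same way. This is an illustrative example under the assumption that the value is scaled to an integer first and the 2C pattern is then obtained by masking; the names `to_fixed_2c` and `from_fixed_2c` are hypothetical helpers, not part of the slides.

```python
def to_fixed_2c(value, int_bits, frac_bits):
    """Encode a signed value in 2C fixed point (1 sign bit + int_bits + frac_bits)."""
    total_bits = 1 + int_bits + frac_bits
    scaled = round(value * (1 << frac_bits))
    if not -(1 << (total_bits - 1)) <= scaled < (1 << (total_bits - 1)):
        raise OverflowError("value does not fit in the chosen format")
    # Masking a negative Python int yields its two's-complement bit pattern.
    return format(scaled & ((1 << total_bits) - 1), "0{}b".format(total_bits))


def from_fixed_2c(bits, frac_bits):
    """Decode a 2C fixed-point bit string that has frac_bits fractional bits."""
    raw = int(bits, 2)
    if bits[0] == "1":          # sign bit set: the value is negative
        raw -= 1 << len(bits)
    return raw / (1 << frac_bits)


print(to_fixed_2c(12.25, 4, 3))       # 01100010
print(to_fixed_2c(-12.25, 4, 3))      # 10011110
print(from_fixed_2c("10011110", 3))   # -12.25
```

Because the whole bit string is treated as one integer, the +1 of the 2C-operation automatically lands on the LSB, which is the point made in the note above.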
Exercises
Convert the values as requested:
  −151.0 to FX 2C on 16 bits (1+8+7)
  −151.25 to FX 2C on 16 bits (1+8+7)
  111100101010 from FX 2C (1+7+4) to decimal
  100110011000 from FX 2C (1+6+5) to decimal
Calculate on FX 2C 16 bits (1+7+8) and identify any overflow:
  (111.6 − 44.57) / 2
  (68.22 − 71.25) * 64
Solutions
−151.0
+151.0 = 0 10010111 0000000
−151.0 = 1 01101001 0000000
−151.25
+151.25 = 0 10010111 0100000
−151.25 = 1 01101000 1100000
Note that the integer part is not the same as for −151.0
111100101010
after the 2C-operation: 0 0001101 0110 = 13.375
value: $-13.375_{10}$
100110011000
after the 2C-operation: 0 110011 01000 = 51.25
value: $-51.25_{10}$
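Solutions
(111.6 − 44.57) / 2
111.6 = $1101111.10011001_2$ → $0110111110011001_{2C}$
44.57 = $101100.10010001_2$ → $0010110010010001_{2C}$
−44.57 → $1101001101101111_{2C}$
  0110111110011001 +
  1101001101101111 =
  0100001100001000   (the carry out of the MSB is discarded)
Division by 2 (right shift): 0010000110000100
Result: +33.515625
(68.22 − 71.25) * 64
68.22 = $1000100.00111000_2$ → $0100010000111000_{2C}$
71.25 = $1000111.01_2$ → $0100011101000000_{2C}$
−71.25 → $1011100011000000_{2C}$
  0100010000111000 +
  1011100011000000 =
  1111110011111000
Multiplication by 64 (left shift by 6): 0011111000000000
The sign bit changes: OVERFLOW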
Fixed-point Uses
Fixed-point notation is sometimes used by simulating it with the integer notation that microprocessors use (i.e. 2C).
This allows faster computations than floating-point operations, which are intrinsically slower.
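A minimal sketch of this simulation, assuming the 1+7+8 format used in the exercises: values are scaled by 2^8, stored as ordinary integers, and combined with integer operations only. The helper names (`to_q`, `fits`, `to_real`) are illustrative, and since Python integers are unbounded the 16-bit range check has to be made explicitly.

```python
FRAC_BITS = 8    # 1+7+8 format: 8 fractional bits
WORD_BITS = 16   # total width, sign bit included


def to_q(x):
    """Scale to an integer, truncating the extra fractional bits."""
    return int(x * (1 << FRAC_BITS))


def fits(raw):
    """True if raw is representable as a 16-bit 2C integer."""
    return -(1 << (WORD_BITS - 1)) <= raw < (1 << (WORD_BITS - 1))


def to_real(raw):
    """Convert the scaled integer back to its real value."""
    return raw / (1 << FRAC_BITS)


diff = to_q(111.6) - to_q(44.57)    # plain integer subtraction
half = diff >> 1                    # division by 2 is a right shift
print(to_real(half))                # 33.515625

diff2 = to_q(68.22) - to_q(71.25)
times64 = diff2 << 6                # multiplication by 64 is a left shift
print(fits(diff2), fits(times64))   # True False
```

The second result does not fit in 16 bits, which is the overflow condition of the second exercise calculation.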
Fixed-point Problems
Suppose the following (unsigned) values have to be coded using Fixed-point on a total of 8 bits:
37.25      = 100101.01
12.625     = 1100.1010
5.4375     = 101.01110
1.2890625  = 1.0100101
All of them can be coded in 8 bits, but no single position of the radix point is suitable for all of them.
Suppose you have to represent some fractional values 0 ≤ x < 8 using a Fixed-point coding 4+4 bits:
7.2732
2.3748
5.4375
1.2890
The first bit is always 0, and the fractional part is rounded to 4 bits.
If we could move the radix point 1 position to the left, we could have 1 more bit for precision.
Suppose you have to represent some values with fractional part x.0, x.5 or x.25 only, using a Fixed-point coding 4+4 bits.
The last two bits are always 00, and the integer part is limited to 15.
If we could move the radix point 2 positions to the right, we could have 2 more bits for the integer part (values up to 63).
The problem with Fixed-point notation is the fixed position of the radix point.
To solve this problem, the radix point must be made movable (floating); this requires that its position be stored along with each number.
Exponential Notation
Exponential notation represents a number as a value (mantissa or significand) that multiplies a whole power of the base (exponent).
Examples (in decimal), in the form $\text{mantissa} \times 10^{\text{exponent}}$:
$123.45678 = 0.12345678 \times 10^{3}$
$0.0087654321 = 0.87654321 \times 10^{-2}$
$87655678 = 0.87655678 \times 10^{8}$
Very big and very small values are obtained by just varying the exponent.
The same value can be expressed in many forms:
$123.45 = 0.12345 \times 10^{3} = 12345 \times 10^{-2}$
Among these forms, the form 0.x (with first fractional digit x ≠ 0) is chosen to have a unique representation for values; this is called the normalized form.
When the number of digits is not enough to store the whole number, only the most important (leftmost) digits are stored.
The most significant digits are thus preserved, but approximation errors are introduced because of truncation.
Example (only 4 decimal digits):
$0.001234567 \rightarrow 0.1234 \times 10^{-2} = 0.001234000$
$876543 \rightarrow 0.8765 \times 10^{6} = 876500$
The maximum representation error $\varepsilon$ with n digits is $10^{-n}$ relative to the power of the whole part.
If the whole part power is m:
$\varepsilon = 10^{-n} \times 10^{m} = 10^{m-n}$
which is the power of the rightmost digit (LSD).
Example
Suppose the value has only 4 decimal digits:
$876543 \rightarrow 0.8765 \times 10^{6}$ (normalized)
The whole part is $0 \times 10^{6}$, so m = 6
$\varepsilon = 10^{-4} \times 10^{6} = 10^{2}$ (maximum error)
This can also be seen by writing the value as a sum of powers:
$0 \times 10^{6} + 8 \times 10^{5} + 7 \times 10^{4} + 6 \times 10^{3} + 5 \times 10^{2}$
For this value, the error is:
$|876543 - 876500| = 43$ ($< 10^{2}$)
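The truncation to 4 decimal digits can be reproduced with Python's `decimal` module. This is only a sketch: the helper name `truncate` is an assumption, and `ROUND_DOWN` is used because the slides describe truncation rather than rounding. It is applied to the two values used in the 4-digit examples of this section (876543 and 0.001234567).

```python
from decimal import Decimal, Context, ROUND_DOWN


def truncate(value, digits):
    """Keep only the `digits` most significant decimal digits (truncation)."""
    ctx = Context(prec=digits, rounding=ROUND_DOWN)
    return ctx.plus(Decimal(value))


for text in ("876543", "0.001234567"):
    exact = Decimal(text)
    stored = truncate(text, 4)
    # Print the stored value and the absolute error caused by truncation.
    print(stored, exact - stored)
```

In both cases the printed error stays below the bound $10^{m-n}$, i.e. below $10^{2}$ for the first value and below $10^{-6}$ for the second.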
Example
Suppose the value has only 4 decimal digits:
$0.001234567 \rightarrow 0.1234 \times 10^{-2} = 0.001234000$
The whole part is $0 \times 10^{-2}$, so m = −2
$\varepsilon = 10^{-4} \times 10^{-2} = 10^{-6}$ (maximum error)
Writing the value as a sum of powers:
$0 \times 10^{-2} + 1 \times 10^{-3} + 2 \times 10^{-4} + 3 \times 10^{-5} + 4 \times 10^{-6}$
For this value, the error is:
$|0.001234567 - 0.001234000| = 0.000000567$ ($< 10^{-6}$)

IEEE-P754 Floating-Point
The IEEE-P754 standard describes the most common notations used by computer FPUs (Floating-Point Units) to compute floating-point values.
The two exponential binary floating-point notations described have the form $\text{mantissa} \times 2^{\text{exponent}}$ and are:
  Single precision (SP)
  Double precision (DP)
IEEE-P754 Single Precision
Single precision values use 32 bits divided into 3 parts:
  sign: 1 bit
  exponent field: 8 bits
  mantissa (or significand) field: 23 bits

| s | exponent | mantissa |

The sign bit is defined as follows:
  0 is used for values ≥ 0
  1 is used for values ≤ 0 (negative zero!)
The mantissa (or significand) is in the normalized form 1.xxxxx, where the 1 before the radix point is the leftmost 1 (MSB) in the binary representation.
Only the fractional part of the binary mantissa is stored in the mantissa field: the leftmost 1 is already known to be present (it is called the hidden bit). This allows for one more bit of precision (23 bits stored + 1 hidden = 24 bits effective).
The exponent is a relative (signed) integer value on 8 bits. The IEEE-P754 SP standard does not use SM or 2C notations, but a biased notation called excess 127: the FP exponent field is computed by adding the constant value 127 (bias constant) to the exponent of the normalized value.
Excess notation is efficient, especially for number comparison.
The offset value is $2^{n-1} - 1$ (n is the number of bits), so that the first half of the range represents negative numbers.
Example: convert $+13.25_{10}$ to IEEE-P754 (SP)
sign is positive: sign bit = 0
convert the value to binary:
$13.25_{10} = 1101.01_2$
normalize the value (note: base 2):
$1101.01_2 = 1.10101_2 \times 2^{3}$
compute the exponent field by adding 127 to the real base 2 exponent:
$3 + 127 = 130 = 10000010_2$
compose the pieces, padding the mantissa with 0s on the right:
0 10000010 10101000000000000000000
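The result can be cross-checked with Python's standard `struct` module, which packs a value into the 32-bit single precision format. The helper name `sp_fields` is an illustrative assumption; it only splits the 32 bits into the three fields described above.

```python
import struct


def sp_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x in single precision."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]  # reinterpret the 32 bits as an integer
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


print(sp_fields(13.25))  # ('0', '10000010', '10101000000000000000000')
```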
Example: convert from IEEE-P754 (SP)
1 01100000 01000000000000000000000
sign bit = 1: the value is negative
extract the mantissa, add the hidden bit, and convert to decimal:
$1.01_2 = 1.25_{10}$
compute the real exponent by subtracting 127 from the extracted exponent:
$01100000_2 = 96$, $96 - 127 = -31$
compose the parts: $-1.25 \times 2^{-31} = -5.82 \times 10^{-10}$
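The decoding steps can be sketched as a small function. It is a simplified illustration that handles normalized values only (exponent field from 1 to 254); the name `sp_decode` and its string-based interface are assumptions made for readability.

```python
def sp_decode(sign, exponent, mantissa):
    """Decode a normalized single precision value from its three bit-string fields."""
    real_exponent = int(exponent, 2) - 127          # remove the bias (excess 127)
    significand = 1 + int(mantissa, 2) / 2**23      # add the hidden bit
    return (-1) ** int(sign) * significand * 2.0 ** real_exponent


value = sp_decode("1", "01100000", "01000000000000000000000")
print(value)  # about -5.82e-10
```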
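The SP decimal range is:
($1.4 \times 10^{-45}$ … $3.4 \times 10^{+38}$)
The decimal exponent varies from −45 to +38. Normalized values have a binary exponent from −126 to +127; the smallest magnitudes, down to $10^{-45}$, are reached only by denormalized values.
Values are approximated to 7 decimal digits (corresponding to the 24 bits used by the mantissa).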
The representation error is the absolute weight of the LSB.
This is computed by multiplying the weight of the integer part (hidden bit) by the relative weight of the mantissa LSB (i.e. the weight of the LSB with respect to the integer part).
This results in adding the exponents:
$1.10010..1 \times 2^{0}$: $\varepsilon = 2^{0-23} = 2^{-23}$
$1.10010..1 \times 2^{5}$: $\varepsilon = 2^{5-23} = 2^{-18}$
$1.10010..1 \times 2^{94}$: $\varepsilon = 2^{94-23} = 2^{71}$
The binary exponent varies from −126 to +127, corresponding to excess 127 values from 1 to 254.
Exponent values 00000000 (0) and 11111111 (255) are used for special numbers:
  Zeroes
  Infinities
  NaNs
  Denormalized values
Zero
Exponent=00000000, Mantissa=0
0/1 00000000 00000000000000000000000
by definition, not by computation, because there is not any 1 for normalization
Positive and negative zero are considered equivalent
Infinity
Exponent=11111111, Mantissa=0
0/1 11111111 00000000000000000000000
Operations with infinities are well defined
Not a Number (NaN)
Exponent=11111111, Mantissa≠0
0/1 11111111 <not all 0s>
NaNs are used to indicate values that do not represent real numbers
There are 2 types of NaNs:
  Quiet NaNs: denote indeterminate operations (mantissa MSB set), where the result of an operation is not mathematically defined
  Signalling NaNs: denote an invalid operation (mantissa MSB clear)

Special Operations
N / INF     = 0
INF × INF   = INF
N / 0       = INF
INF + INF   = INF
0 / 0       = NaN
INF − INF   = NaN
INF / INF   = NaN
INF × 0     = NaN
Any operation with NaN yields a NaN result
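Most of these rules can be observed directly in Python. Two caveats apply to this sketch: Python floats are double precision (the rules for the special values are the same), and Python raises `ZeroDivisionError` for a float division by zero instead of returning INF or NaN, so the N / 0 and 0 / 0 rows are left out.

```python
import math

inf = math.inf

print(5.0 / inf)    # 0.0
print(inf * inf)    # inf
print(inf + inf)    # inf
print(inf - inf)    # nan
print(inf / inf)    # nan
print(inf * 0.0)    # nan

nan = inf - inf
print(nan + 1.0)    # nan: any operation with NaN yields NaN
print(nan == nan)   # False: a NaN is not equal even to itself
```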
The IEEE-P754 standard allows values in non-normalized form too (denormalized):
Exponent=00000000, Mantissa≠0
Hidden bit is now 0 and not 1
The exponent value is considered to be −126
Value is:
$0.\text{mantissa} \times 2^{-126}$
IEEE-P754 Double Precision
Double precision notation just extends the SP notation to use 64 bits.
The differences are:
  exponent bits: 11
  mantissa bits: 52
  bias constant: 1023
  exponent range: −1022, +1023
  equivalent decimal range: ($4.9 \times 10^{-324}$ … $1.7 \times 10^{+308}$) with 15 decimal digits
  denormalized exponent: −1022
IEEE-P754 Compact Notation
For ease of writing and copying, floating-point numbers (like any other bit sequence) can be translated to base 16 as if they were (they are not!) a pure binary number:
0 10000000 0010000… → 40100000
1 01111111 1100000… → BFE00000
C3C41000 → 1100001111000100000100…

IEEE-P754 Exercises
Convert the following values to/from IEEE-P754:
  −1324.25 to SP and DP
  0.02324 to SP and DP with an absolute precision of 1/1000
  0 10000000 0010000… to decimal
  1 01111111 1100000… to decimal
  EB141000 to decimal
Solutions
−1324.25
$1324.25_{10} = 10100101100.01_2 = 1.010010110001_2 \times 2^{10}$
SP exponent: $10 + 127 = 137 = 10001001_2$
DP exponent: $10 + 1023 = 1033 = 10000001001_2$
then:
SP: 1 10001001 01001011000100…
in compact form: C4A58800
DP: 1 10000001001 01001011000100…
in compact form: C094B10000000000
Solutions
0.02324
$\varepsilon = 1/1000$ → n = 10 (fractional bits), since $2^{-10} < 1/1000$
$0.02324_{10} \approx 0.0000010111_2 = 1.0111_2 \times 2^{-6}$
SP exponent: $-6 + 127 = 121 = 01111001_2$
DP exponent: $-6 + 1023 = 1017 = 01111111001_2$
then:
SP: 0 01111001 011100…
in compact form: 3CB80000
DP: 0 01111111001 011100…
in compact form: 3F97000000000000
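The compact forms of these two solutions can be cross-checked with the standard `struct` module. The helper names `sp_hex` and `dp_hex` are illustrative. For 0.02324 the check uses the 10-bit truncated value $0.0000010111_2$, because that is the value the solution actually encodes.

```python
import struct


def sp_hex(x):
    """Compact (hexadecimal) form of x in single precision."""
    return struct.pack(">f", x).hex().upper()


def dp_hex(x):
    """Compact (hexadecimal) form of x in double precision."""
    return struct.pack(">d", x).hex().upper()


print(sp_hex(-1324.25), dp_hex(-1324.25))  # C4A58800 C094B10000000000

approx = 0b10111 / 2**10   # 0.0000010111 in binary, i.e. 0.0224609375
print(sp_hex(approx), dp_hex(approx))      # 3CB80000 3F97000000000000
```

Encoding 0.02324 itself would fill the whole mantissa with the bits of its binary expansion, so it would not give the short patterns shown in the solution.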
Solutions
0 10000000 0010000…
$+1.001_2 \times 2^{128-127} = 10.01_2 = +2.25$
1 01111111 1100000…
$-1.11_2 \times 2^{127-127} = -1.75$
EB141000 = 1 11010110 0010100000100…
keeping only the leading mantissa bits: $-1.00101_2 \times 2^{214-127} = -1.15625 \times 2^{87}$
$= -1.15625 \times 2^{80} \times 2^{7} \approx -1 \times 10^{24} \times 10^{2} = -10^{26}$ (approx.)
the non-approximated value of $-1.15625 \times 2^{87}$ is:
$-1.78921021302965117856514048 \times 10^{26}$

Floating-point Addition
To add two FP values, these must have the same exponent before their mantissas are added: the smaller value is converted to have the same exponent as the greater (it is de-normalized).
As the exponent is increased (e.g. by 3), the mantissa must decrease (right shift by 3 bits) so that the overall value does not change:
$1.01000_2 \times 2^{16} + 1.101000_2 \times 2^{13}$
$= 1.01000_2 \times 2^{16} + 0.001101_2 \times 2^{16}$
Underflow
If the conversion of the smaller value shifts away all of the mantissa bits (including the hidden bit), the value is approximated to 0; thus the result of the operation is equal to the greater value, while the smaller is just ignored.
There is an underflow condition when, adding 2 values, the result is equal to the greater of them.
Example in SP
$1.101_2 \times 2^{43} + 1.01_2 \times 2^{18}$
$1.01_2 \times 2^{18}$ must be converted to the form $xxx \times 2^{43}$; this causes a right shift of 25 bits on the mantissa, thus shifting away all the 24 mantissa bits and resulting in 0.
When adding up many small values, a partial sum may become so big as to cause underflow for each of the subsequent values (only the first part of the values is added up).
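The alignment step and its effect can be sketched with integer significands. This is a simplified model under stated assumptions: both values are positive, each significand is a 24-bit integer whose MSB is the hidden bit, and the bits shifted out are simply dropped, as described above (real hardware rounds instead of truncating). The function name `add_aligned` is hypothetical.

```python
def add_aligned(sig_a, exp_a, sig_b, exp_b, bits=24):
    """Add two positive values given as (significand, exponent) pairs.

    Each significand is a `bits`-bit integer with the hidden bit as its MSB,
    so the value is sig * 2**(exp - (bits - 1)).
    """
    if exp_a < exp_b:                    # make (sig_a, exp_a) the greater value
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b >>= exp_a - exp_b              # de-normalize: the low bits are lost
    total = sig_a + sig_b
    if total >> bits:                    # carry out of the hidden bit: renormalize
        total >>= 1
        exp_a += 1
    return total, exp_a


# 1.01 x 2^16 + 1.101 x 2^13 = 1.011101 x 2^16
sig, exp = add_aligned(0b101 << 21, 16, 0b1101 << 20, 13)
print(format(sig, "024b"), exp)   # 101110100000000000000000 16

# 1.101 x 2^43 + 1.01 x 2^18: the smaller value is shifted away completely
sig, exp = add_aligned(0b1101 << 20, 43, 0b101 << 21, 18)
print(sig == 0b1101 << 20, exp)   # True 43
```

In the second call the result is identical to the greater operand, which is the underflow condition defined in this section.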
IEEE-P754 Exercises
Calculate the following operations (IEEE-P754) and express the result in the same compact form; identify any Overflow/Underflow:
  2B1A5F20 + 4F1A3BB0
  C4A58000 + C2B80000
  63AB102F − 709B1BC2
  7F600000 + 7F100000
Solution N.1:
2B1A5F20
0 01010110 00110100101111100100000
E=01010110=86
4F1A3BB0
0 10011110 00110100011101110110000
E=10011110=158
Difference of exponents = 72
72 > 24 → UNDERFLOW
Result: 4F1A3BB0
Solution N.2:
C4A58000
1 10001001 01001011000000000000000
E=10001001=137 (non biased: 10)
M=1.01001011
C2B80000
1 10000101 01110000000000000000000
E=10000101=133 (non biased: 6)
M=1.0111
Difference of exponents: 137 − 133 = 4
De-normalized mantissa of the 2nd value to have exponent=10 (4 right shifts):
0.00010111
Addition (both values are negative, so the magnitudes are added and the sign is kept):
$1.01001011_2 \times 2^{10}$ +
$0.00010111_2 \times 2^{10}$ =
$1.01100010_2 \times 2^{10}$
Result:
1 10001001 01100010000000000000000
C4B10000
Solution N.3
63AB102F − 709B1BC2
63AB102F
0 11000111 01010110001000000101111
E=199
709B1BC2
0 11100001 00110110001101111000010
E=225
Difference of exponents: 225 − 199 = 26
26 > 24 → UNDERFLOW
Result: F09B1BC2 (SIGN CHANGED!)
Solution N.4
7F600000
0 11111110 11000000000000000000000
E=254 (non biased=127)
7F100000
0 11111110 00100000000000000000000
E=254 (non biased=127)
Difference of exponents: 0
$1.110_2 \times 2^{127}$ +
$1.001_2 \times 2^{127}$ =
$10.111_2 \times 2^{127}$
Renormalization: $1.0111_2 \times 2^{128}$
Max exponent is 127 → OVERFLOW
Result: (+Infinity)
0 11111111 00000000000000000000000
7F800000
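The four operations can be checked numerically with the standard `struct` module. This is a sketch under stated assumptions: the sum is computed in double precision and then rounded to single precision, which matches a true single precision addition for these operands but is not guaranteed to do so in general; `struct.pack` raises `OverflowError` for a magnitude too large for SP, which is mapped here to infinity. The names `from_hex` and `to_hex` are illustrative.

```python
import math
import struct


def from_hex(h):
    """Decode a compact (hexadecimal) single precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


def to_hex(x):
    """Round x to single precision and return its compact form."""
    try:
        return struct.pack(">f", x).hex().upper()
    except OverflowError:   # magnitude too large for SP: the result is infinity
        return struct.pack(">f", math.copysign(math.inf, x)).hex().upper()


print(to_hex(from_hex("2B1A5F20") + from_hex("4F1A3BB0")))  # 4F1A3BB0
print(to_hex(from_hex("C4A58000") + from_hex("C2B80000")))  # C4B10000
print(to_hex(from_hex("63AB102F") - from_hex("709B1BC2")))  # F09B1BC2
print(to_hex(from_hex("7F600000") + from_hex("7F100000")))  # 7F800000
```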
Calculate in the IEEE-P754 SP format the following operation with DECIMAL numbers and identify any Overflow/Underflow:
$920000000_{10} - 920000001_{10}$
Solution:
The values differ only in the LSB
The two numbers have 9 decimal digits, corresponding to about 9×3=27 bits or more (30 bits for these values)
After normalization, the relative weight of the LSB is $2^{-27}$ or smaller ($2^{-29}$ for these values)
Having only 24 bits, such a small power is discarded
The two values are considered equal
Result is 0

IEEE-P754 SP Ranges
Maximum normalized positive number is $1.111\ldots111_2 \times 2^{127}$ with 23 fractional bits
If all the bits were there, the value would be $1.111\ldots111_2 \times 2^{127}$ with 127 fractional bits:
$1.111\ldots111_2 \times 2^{127} = 2^{128} - 1$
Having just 23 fractional bits, the value is approximated to $1.111\ldots1\,000\ldots0_2 \times 2^{127}$, with 23 fractional bits set to 1 and the rightmost 127 − 23 = 104 bits set to 0
104 bits set to 1 are value $2^{104} - 1$
Maximum normalized positive value:
$1.111\ldots1\,000\ldots0_2 \times 2^{127} = (2^{128} - 1) - (2^{104} - 1) = 2^{128} - 2^{104}$
= 3.4028234663852885981170418348452e+38
Minimum normalized positive number:
$1.000\ldots000_2 \times 2^{-126}$
= 1.1754943508222875079687365372222e-38
Maximum denormalized positive number is $0.111\ldots111_2 \times 2^{-126}$ with 23 fractional bits
the rightmost bit power is: −126 − 23 = −149
$0.111\ldots111_2 = 1 - 2^{-23}$, so the value is $(1 - 2^{-23}) \times 2^{-126} = 2^{-126} - 2^{-149}$
Minimum denormalized positive number is $0.000\ldots001_2 \times 2^{-126}$ with 23 fractional bits
the rightmost bit power is: −126 − 23 = −149
the value is $2^{-149}$
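These four limits can be verified by decoding the corresponding bit patterns. The hexadecimal patterns below are not given in the slides: they are derived here from the field layout (for instance, 0 11111110 followed by 23 ones is 7F7FFFFF). The comparisons are exact because all four values fit in a double precision float.

```python
import struct


def from_hex(h):
    """Decode a compact (hexadecimal) single precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


print(from_hex("7F7FFFFF") == 2.0**128 - 2.0**104)    # maximum normalized
print(from_hex("00800000") == 2.0**-126)              # minimum normalized
print(from_hex("007FFFFF") == 2.0**-126 - 2.0**-149)  # maximum denormalized
print(from_hex("00000001") == 2.0**-149)              # minimum denormalized
```

Each line prints True.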
IEEE-P754 Puzzles
Determine the difference between value 44A58800 and the next one (44A58801)
Value in binary is:
0 10001001 01001011000100…0 = $1.01001011000100\ldots0_2 \times 2^{10}$
The next one differs in just the LSB:
$1.01001011000100\ldots1_2 \times 2^{10}$
Difference is 1 LSB weight = $2^{10-23} = 2^{-13}$
Determine the (absolute) representation error for value N = $6 \times 10^{18}$ in IEEE-P754 SP.
N = $6 \times 10^{18} \approx 6 \times 2^{60}$, which requires 63 bits
N = $1.xxx \times 2^{62}$
In SP there are only 23 bits for the mantissa
The weight of the LSB is $2^{62-23} = 2^{39}$
The representation error is $2^{39}$
Determine the range of the consecutive integer values in SP.
Values are in the form 1.xxxx with 23 fractional bits (denormals are not integers)
24 bits (hidden bit included) result in $2^{24}$ combinations of bits (0 to $2^{24} - 1$); each corresponds to a value, and an appropriate exponent makes it an integer value
$2^{24}$ is represented too
Range: $-2^{24}$ … $+2^{24}$
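The three puzzle answers can be checked numerically. This is a sketch using the standard `struct` module; the helpers `from_hex` and `to_f32` are illustrative names, and `to_f32` rounds a double precision value to the nearest single precision value.

```python
import struct


def from_hex(h):
    """Decode a compact (hexadecimal) single precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


def to_f32(x):
    """Round x to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# Distance between 44A58800 and the next representable value
step = from_hex("44A58801") - from_hex("44A58800")
print(step == 2.0**-13)

# 2^24 is representable, 2^24 + 1 is not
print(to_f32(2.0**24) == 2.0**24, to_f32(2.0**24 + 1) == 2.0**24 + 1)

# The error on 6 x 10^18 does not exceed the LSB weight 2^39
n = 6e18
print(abs(to_f32(n) - n) <= 2.0**39)
```

The first and last lines print True, while the middle line prints True False: the first integer that cannot be stored exactly is $2^{24} + 1$.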
