
DSE-1: Numerical Techniques-Unit-1

Floating point representation and computer arithmetic: Computers use binary
arithmetic, representing each number as a binary number. Computers use two
formats for numbers. Fixed-point numbers are used to store integers that have a
limited range. The alternative is floating-point numbers, which approximate
real numbers.

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard
for floating-point computation which was established in 1985 by the Institute of
Electrical and Electronics Engineers (IEEE). The standard addressed many
problems found in the diverse floating-point implementations that made them difficult
to use reliably and reduced their portability. IEEE Standard 754 floating point is the
most common representation today for real numbers on computers, including
Intel-based PCs, Macs, and most Unix platforms.

There are several ways to represent floating-point numbers, but IEEE 754 is the most
efficient in most cases. IEEE 754 has 3 basic components:
1. The Sign of Mantissa – This is as simple as the name. 0 represents a positive
number while 1 represents a negative number.
2. The Biased Exponent – The exponent field needs to represent both positive and
negative exponents. A bias (127 for single precision, 1023 for double precision)
is added to the actual exponent in order to get the stored exponent.
3. The Normalised Mantissa – The mantissa is the part of a number in scientific
notation or a floating-point number that consists of its significant digits. Here we
have only 2 digits, i.e. 0 and 1. So a normalised mantissa has only one 1 to the
left of the binary point. Because this leading 1 is always present, it is not
stored; only the bits to the right of the binary point are kept.
There are two ways of representing floating-point numbers in the IEEE 754 standard:
single precision and double precision.

Single precision (32 bits)
Sign     Exponent    Mantissa
1 bit    8 bits      23 bits
(IEEE 754 Single Precision Floating Point Representation)

Double precision (64 bits)
Sign     Exponent    Mantissa
1 bit    11 bits     52 bits
(IEEE 754 Double Precision Floating Point Representation)
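
The field layout above can be inspected directly. The following is a minimal sketch, assuming Python's standard struct module is used to obtain the raw bits; the helper names float32_fields and float64_fields are hypothetical and only serve this illustration.

```python
import struct

def float32_fields(x):
    """Return (sign, biased exponent, mantissa) of x in single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, bias 127
    mantissa = bits & 0x7FFFFF        # 23 bits, leading 1 not stored
    return sign, exponent, mantissa

def float64_fields(x):
    """Return (sign, biased exponent, mantissa) of x in double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, bias 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 bits, leading 1 not stored
    return sign, exponent, mantissa

print(float32_fields(1.0))    # (0, 127, 0)   since 1.0 = 1.0 x 2^0
print(float64_fields(-2.0))   # (1, 1024, 0)  since -2.0 = -1.0 x 2^1
```

Subtracting the bias from the returned exponent (127 or 1023) gives back the actual exponent.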

Manas Ku Mishra, Asst. Prof. of Comp. Sc., FM (A) College, BLS. Page 1 of 3
Example – Represent 85.125 in IEEE 754 Single Precision and Double Precision
format.
85 = 1010101 (binary)
0.125 = 0.001 (binary)
85.125 = 1010101.001 = 1.010101001 x 2^6
sign = 0

1. Single precision:
biased exponent 127+6=133
133 = 10000101
Normalised mantissa = 010101001
we will add 0's to complete the 23 bits
The IEEE 754 Single precision is:
0 10000101 01010100100000000000000
This can be written in hexadecimal form as 42AA4000

2. Double precision:
biased exponent 1023+6=1029
1029 = 10000000101
Normalised mantissa = 010101001
we will add 0's to complete the 52 bits
The IEEE 754 Double precision is:
0 10000000101 0101010010000000000000000000000000000000000000000000
This can be written in hexadecimal form as 4055480000000000
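
The hand conversion of 85.125 can be cross-checked by machine. This is a small sketch, assuming Python's standard struct module; the variable names are chosen for this illustration only.

```python
import struct

value = 85.125

# Big-endian byte order so the hex string reads sign, exponent, mantissa.
single_hex = struct.pack(">f", value).hex().upper()
double_hex = struct.pack(">d", value).hex().upper()

print(single_hex)   # 42AA4000
print(double_hex)   # 4055480000000000

# Split the 32-bit pattern into its three fields.
single_bits = format(int(single_hex, 16), "032b")
print(single_bits[0], single_bits[1:9], single_bits[9:])
# 0 10000101 01010100100000000000000
```

85.125 is stored exactly in both formats because its binary expansion needs only 10 significant bits.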

Special Values: IEEE 754 reserves some bit patterns for values that would otherwise
cause ambiguity.
• Zero – Zero is a special value denoted with an exponent and mantissa of 0. -0
and +0 are distinct values, though they compare as equal.
• Infinity – The values +infinity and -infinity are denoted with an exponent of all
ones and a mantissa of all zeros. The sign bit distinguishes between negative
infinity and positive infinity. Operations with infinite values are well defined in
IEEE 754.
• Not A Number (NAN) – The value NAN is used to represent a value that is an
error. It is represented by an exponent field of all ones together with a
mantissa that is not zero. This is a special value that might be used
to denote a variable that doesn't yet hold a value.
EXPONENT    MANTISSA    VALUE
0           0           0
255         0           Infinity
255         not 0       Not a number (NAN)
Similar for double precision, 255 is replaced with 2047.
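
The reserved patterns in the table can be observed in practice. The following is a hedged sketch, assuming Python's struct and math modules; bits32 is a hypothetical helper name, and the exact mantissa bits of a NaN may vary by platform, so only "not 0" is relied on.

```python
import math
import struct

def bits32(x):
    """Bit pattern of x stored in single precision, as a 32-character string."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    return format(n, "032b")

samples = [("+0", 0.0), ("-0", -0.0), ("+inf", math.inf),
           ("-inf", -math.inf), ("nan", math.nan)]

for name, value in samples:
    b = bits32(value)
    print(f"{name:>5}: sign={b[0]} exponent={b[1:9]} mantissa={b[9:]}")

print(0.0 == -0.0)            # True: distinct bit patterns, equal values
print(math.nan == math.nan)   # False: NaN is not equal to anything, itself included
```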

Significant digits: Significant digits or precision or resolution of a number in
positional notation are digits in the number that are absolutely necessary to
indicate the quantity of something. The significant digits of a number include all
the digits except the following:
• All leading zeros: For example, 013 kg has two significant figures, 1 and 3,
and the leading zero is not significant since it is not necessary to indicate the
mass; 013 kg = 13 kg so 0 is not necessary. 0.056 m has two insignificant
leading zeros since 0.056 m = 56 mm so the leading zeros are not absolutely
necessary to indicate the length.
• Trailing zeros when they are merely placeholders to indicate the scale of the
number.
• Spurious digits introduced, for example, by calculations carried out to
greater precision than that of the original data, or measurements reported to a
greater precision than the equipment supports.

Errors:
Errors arise when a number is rounded or chopped (truncated).
For example π = 3.14159 is approximated as 3.141 (chopping to 4 significant
digits) or 3.142 (rounding to 3 decimal places).
Error = Exact value – Approximate value
Absolute error = | Error |
Relative error = Absolute error / | Exact value |
Percentage error = Relative error × 100
Here, absolute error = | 3.14159 – 3.141 | = 0.00059 for chopping or truncation
Absolute error = | 3.14159 – 3.142 | = 0.00041 for rounding
Relative error = 0.00059 / 3.14159 ≈ 0.000188 for chopping
Relative error = 0.00041 / 3.14159 ≈ 0.000131 for rounding
Percentage error = (0.00059 / 3.14159) × 100 ≈ 0.0188 % for chopping
Percentage error = (0.00041 / 3.14159) × 100 ≈ 0.0131 % for rounding
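
The error formulas above can be evaluated as follows. This is a minimal sketch in Python; the function names chop and error_measures are hypothetical, and 3.14159 is taken as the "exact" value only because the worked example does so.

```python
import math

def chop(x, decimals):
    """Truncate x to the given number of decimal places, without rounding."""
    factor = 10 ** decimals
    return math.trunc(x * factor) / factor

def error_measures(exact, approx):
    """Return (absolute error, relative error, percentage error)."""
    absolute = abs(exact - approx)
    relative = absolute / abs(exact)
    return absolute, relative, relative * 100

exact = 3.14159
cases = (("chopping", chop(exact, 3)), ("rounding", round(exact, 3)))

for label, approx in cases:
    absolute, relative, percent = error_measures(exact, approx)
    print(f"{label}: approx={approx}, absolute={absolute:.5f}, "
          f"relative={relative:.6f}, percentage={percent:.4f}%")
```

Since the computation itself runs in binary floating point, the printed values are formatted to a fixed number of places rather than shown in full.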
Local truncation error vs global truncation error: Local truncation error is
the amount of truncation error that occurs in one step of
a numerical approximation. Global truncation error is the amount of truncation
error that accumulates when a numerical approximation is used to solve a problem,
which may involve several steps.
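
The distinction can be illustrated numerically. The sketch below is an assumption-laden illustration, not part of these notes: it uses the forward Euler method on the test equation y' = y with y(0) = 1 (exact solution e^t), both chosen here purely for demonstration, and the function names are hypothetical.

```python
import math

def euler_step(y, h):
    """One forward Euler step for the assumed test equation y' = y."""
    return y + h * y

def local_error(h):
    """Error of a single step that starts from the exact value y(0) = 1."""
    return abs(math.exp(h) - euler_step(1.0, h))

def global_error(h, t_end=1.0):
    """Error at t_end after taking all the steps of size h from t = 0."""
    steps = round(t_end / h)
    y = 1.0
    for _ in range(steps):
        y = euler_step(y, h)
    return abs(math.exp(t_end) - y)

for h in (0.1, 0.05, 0.025):
    print(f"h={h}: local={local_error(h):.6f}, global={global_error(h):.6f}")
```

In this particular setting, halving the step size cuts the one-step (local) error by roughly a factor of four, while the accumulated (global) error only roughly halves.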
Convergence: In numerical techniques, the speed at which the solution to a
numerical problem is reached is called the rate of convergence.
