Chapter 1

Numerical Methods Preliminaries
Professor PhD Henry Arguello Fuentes
Universidad Industrial de Santander

Colombia
March 9, 2017
High Dimensional Signal Processing Group

www.hdspgroup.com
henarfu@uis.edu.co
LP 304
Professor PhD Henry Arguello Fuentes Numerical methods March 9, 2017 1 / 78

Outline
1 Introduction
2 Binary numbers
3 Error Analysis

Introduction: numerical methods applications
(a) Model the probable evolution of a pathology
(b) Model and simulate the growth of a tumor
(c) Microscopy super-resolution
(d) Thermal management

Contents
1 Introduction
2 Binary numbers
Base 2 numbers
Base 2 representation of the integer N
Sequences and Series
Binary Fractions
Binary shifting
Scientific Notation
Machine Numbers
3 Error Analysis

Base 2 numbers
Base 10 numbers: expanded form of the number 1563
$$1563 = (1 \times 10^3) + (5 \times 10^2) + (6 \times 10^1) + (3 \times 10^0).$$
Let $N$ denote a positive integer; then digits $a_0, a_1, \ldots, a_k$ exist so that $N$ has the base 10 expansion
$$N = (a_k \times 10^k) + (a_{k-1} \times 10^{k-1}) + \cdots + (a_1 \times 10^1) + (a_0 \times 10^0), \tag{1}$$
where each digit $a_j$ is chosen from $\{0, 1, \ldots, 8, 9\}$.

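Base 2 numbers: expanded form of the number 1563
$$1563 = (1 \times 2^{10}) + (1 \times 2^9) + (0 \times 2^8) + (0 \times 2^7) + (0 \times 2^6) + (0 \times 2^5) + (1 \times 2^4) + (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0),$$
so that
$$1563 = 1024 + 512 + 16 + 8 + 2 + 1.$$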

Base 2 numbers
Let $N$ denote a positive integer; then digits $b_0, b_1, \ldots, b_J$ exist so that $N$ has the base 2 expansion
$$N = (b_J \times 2^J) + (b_{J-1} \times 2^{J-1}) + \cdots + (b_1 \times 2^1) + (b_0 \times 2^0), \tag{2}$$
where each digit $b_j$ is either 0 or 1. Thus $N$ is expressed in binary notation as
$$N = (b_J b_{J-1} \cdots b_2 b_1 b_0)_{\text{two}}. \tag{3}$$

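Base 2 representation of the integer N
Process: Generate sequences $\{Q_k\}$ and $\{R_k\}$ of quotients and remainders, respectively, by repeated integer division by 2. End the process when $Q_k = 0$ for some integer $k = J$. The remainders are the binary digits: $b_k = R_k$.
Example: N = 1563

k     Division    Q_k    R_k
0     1563/2      781    1
1     781/2       390    1
2     390/2       195    0
3     195/2       97     1
4     97/2        48     1
5     48/2        24     0
6     24/2        12     0
7     12/2        6      0
8     6/2         3      0
9     3/2         1      1
10    1/2         0      1

Reading the remainders from last to first gives the digits from the most significant bit (MSB) to the least significant bit (LSB):

b10  b9  b8  b7  b6  b5  b4  b3  b2  b1  b0
1    1   0   0   0   0   1   1   0   1   1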
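The repeated-division process above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `to_binary` is an assumption introduced here, not something defined in the slides.

```python
def to_binary(n: int) -> str:
    """Return the base 2 digits of a positive integer n, MSB first.

    Hypothetical helper: it follows the quotient/remainder process,
    collecting R_0, R_1, ... until the quotient Q_k becomes 0.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    remainders = []
    q = n
    while q != 0:
        q, r = divmod(q, 2)   # Q_k and R_k of the current step
        remainders.append(r)
    # R_0 is the LSB, so the list is reversed to print the MSB first.
    return "".join(str(r) for r in reversed(remainders))


print(to_binary(1563))
```

For N = 1563 the printed digits should match the row b10 ... b0 of the table.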

Exercise 1: Find the base 2 representation of 697.

Start by dividing the integer N by 2 to calculate $Q_0$ and $R_0$:
$$697/2 = 348.5 \;\rightarrow\; Q_0 = 348 \text{ and } R_0 = 1.$$
Continue the process until $Q_k = 0$ for some integer $k = J$, where $Q_k$ is the integer part of $Q_{k-1}/2$ and $R_k$ is the corresponding remainder.

k        Division    Q_k    R_k
0        697/2       348    1
1 to 9   (to be completed)

The digits to find are b9 b8 b7 b6 b5 b4 b3 b2 b1 b0.

Solution

k     Division    Q_k    R_k
0     697/2       348    1
1     348/2       174    0
2     174/2       87     0
3     87/2        43     1
4     43/2        21     1
5     21/2        10     1
6     10/2        5      0
7     5/2         2      1
8     2/2         1      0
9     1/2         0      1

b9  b8  b7  b6  b5  b4  b3  b2  b1  b0
1   0   1   0   1   1   1   0   0   1

Then $697_{10} = 1010111001_2$.

Sequences and Series
When a rational number is expressed in decimal form, it often requires infinitely many digits.
For example, in $\frac{1}{3} = 0.\overline{3}$, the symbol $\overline{3}$ means that the digit 3 is repeated forever to form an infinite repeating decimal.
The number $\frac{1}{3}$ is therefore shorthand notation for the infinite series $S$:
$$S = (3 \times 10^{-1}) + (3 \times 10^{-2}) + \cdots + (3 \times 10^{-n}) + \cdots = \sum_{k=1}^{\infty} 3(10)^{-k} = \frac{1}{3}.$$

Definition 1.
The infinite series
$$S = \sum_{n=0}^{\infty} c r^n = c + cr + cr^2 + \cdots + cr^n + \cdots, \tag{4}$$
where $c \neq 0$ and $r \neq 0$, is called a geometric series with ratio $r$.
Theorem 1. (Geometric Series)
The geometric series has the following properties:
If $|r| < 1$, then
$$\sum_{n=0}^{\infty} c r^n = \frac{c}{1-r}. \tag{5}$$
If $|r| \geq 1$, then the series diverges.

Example: The series $S$ is given by
$$S = 7\left(\frac{1}{7}\right)^1 + 7\left(\frac{1}{7}\right)^2 + \cdots = \sum_{n=1}^{\infty} 7\left(\frac{1}{7}\right)^n,$$
which is equal to
$$-7 + \sum_{n=0}^{\infty} 7\left(\frac{1}{7}\right)^n,$$
and according to (5),
$$S = -7 + \frac{7}{1 - \frac{1}{7}} = \frac{7}{6} \approx 1.1667.$$
Then $\frac{7}{6}$ is the shorthand notation for the infinite series $S$.
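The closed form in (5) can be checked against partial sums. The sketch below uses exact rational arithmetic; the helper name `geometric_partial_sum` and the cut-off of 20 terms are assumptions chosen only for illustration.

```python
from fractions import Fraction


def geometric_partial_sum(c, r, first, last):
    """Sum c * r**n for n = first, ..., last (hypothetical helper)."""
    return sum(c * r**n for n in range(first, last + 1))


c, r = Fraction(7), Fraction(1, 7)

# Theorem 1 gives the sum from n = 0; the series S starts at n = 1,
# so the n = 0 term (equal to c) is subtracted.
closed_form = -c + c / (1 - r)
partial = geometric_partial_sum(c, r, 1, 20)

print(closed_form)                    # the exact value of S
print(float(closed_form - partial))   # remainder left after 20 terms
```

Because $|r| = 1/7 < 1$, the remainder shrinks by a factor of 7 with every extra term.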

Binary Fractions
A binary fraction is a sum of negative powers of 2, which is used to express a real number $R$ that lies in the range $0 < R < 1$:
$$R = (d_1 \times 2^{-1}) + (d_2 \times 2^{-2}) + \cdots + (d_n \times 2^{-n}) + \cdots, \tag{6}$$
where $d_j \in \{0, 1\}$ and $0 < R < 1$.
Binary fraction representation of R
$$R = (0.d_1 d_2 \cdots d_n \cdots)_{\text{two}}, \qquad R = \sum_{j=1}^{\infty} d_j\, 2^{-j}.$$

Binary Fractions: decimal to binary
Process: Generate sequences $\{d_j\}$ and $\{F_j\}$ by repeatedly multiplying by two. At each step the current fractional part is doubled to give $F_j$; the integer part of $F_j$ is the digit $d_j$, and the fractional part of $F_j$ is carried to the next step.
Example: R = 0.7

j    F_j                 d_j    frac
1    (0.7)(2) = 1.4      1      0.4
2    (0.4)(2) = 0.8      0      0.8
3    (0.8)(2) = 1.6      1      0.6
4    (0.6)(2) = 1.2      1      0.2
5    (0.2)(2) = 0.4      0      0.4
6    (0.4)(2) = 0.8      0      0.8
7    (0.8)(2) = 1.6      1      0.6
8    (0.6)(2) = 1.2      1      0.2
9    (0.2)(2) = 0.4      0      0.4
...

     d1  d2  d3  d4  d5  d6  d7  d8  d9
0.   1   0   1   1   0   0   1   1   0   ...

The digits repeat with period four, so $0.7 = 0.1\overline{0110}_2$.
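The multiply-by-two process above can be sketched as follows. The function name `fraction_to_binary` is an assumption made for this illustration, and `Fraction` is used instead of floats so that the doubling steps are exact.

```python
from fractions import Fraction


def fraction_to_binary(r: Fraction, digits: int) -> str:
    """Return the first `digits` binary digits d_1 d_2 ... of r, 0 < r < 1.

    Hypothetical helper following the doubling process: the integer part
    of each product is the next digit, the fractional part is carried on.
    """
    if not 0 < r < 1:
        raise ValueError("r must satisfy 0 < r < 1")
    bits = []
    frac = r
    for _ in range(digits):
        product = 2 * frac
        d = int(product)       # integer part: 0 or 1
        bits.append(str(d))
        frac = product - d     # fractional part for the next step
    return "".join(bits)


print(fraction_to_binary(Fraction(7, 10), 9))
```

For 0.7 the nine printed digits should agree with the d_j column of the table.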

Exercise 2: Calculate the binary fraction for 0.6.
Start by multiplying 0.6 by 2 to generate the sequences $\{d_j\}$ and $\{F_j\}$.

j        F_j                 d_j    frac
1        (0.6)(2) = 1.2      1      0.2
2 to 9   (to be completed)
...

The digits to find are d1 d2 d3 d4 d5 d6 d7 d8 d9 ...
Solution

j    F_j                 d_j    frac
1    (0.6)(2) = 1.2      1      0.2
2    (0.2)(2) = 0.4      0      0.4
3    (0.4)(2) = 0.8      0      0.8
4    (0.8)(2) = 1.6      1      0.6
5    (0.6)(2) = 1.2      1      0.2
6    (0.2)(2) = 0.4      0      0.4
7    (0.4)(2) = 0.8      0      0.8
8    (0.8)(2) = 1.6      1      0.6
9    (0.6)(2) = 1.2      1      0.2
...

$0.6 = 0.\overline{1001}_2$

Binary Fractions: binary to decimal
The base 10 rational number $R_{10}$ associated with a base 2 binary fraction $R_2$ can be found using geometric series.

Example:
$$0.\overline{01}_2 = (0 \times 2^{-1}) + (1 \times 2^{-2}) + (0 \times 2^{-3}) + (1 \times 2^{-4}) + \cdots$$
The expression above is written as
$$\sum_{k=1}^{\infty} (2^{-2})^k = -1 + \sum_{k=0}^{\infty} (2^{-2})^k = -1 + \frac{1}{1 - \frac{1}{4}} = -1 + \frac{4}{3} = \frac{1}{3}.$$
Then $\frac{1}{3}$ is the base 10 rational number associated with $0.\overline{01}_2$.
Binary shifting
Let $R$ be
$$R = 0.00000\overline{11000}_2. \tag{7}$$
Multiplying both sides of (7) by $2^5 = 32$ shifts the binary point 5 places to the right:
$$32R = 0.\overline{11000}_2.$$
Multiplying both sides of (7) by $2^{10} = 1024$ shifts the binary point 10 places to the right:
$$1024R = 11000.\overline{11000}_2.$$
Taking the difference, $1024R - 32R = 11000.\overline{11000}_2 - 0.\overline{11000}_2$, we obtain $992R = 11000_2$. Given that $11000_2 = 24_{10}$, we find that $992R = 24$. Therefore $R = \left(\frac{3}{124}\right)_{10}$.
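The shifting argument generalizes to any repeating binary fraction: if the non-repeating prefix has $m$ digits and the repeating block has $p$ digits, subtracting $2^m R$ from $2^{m+p} R$ cancels the repeating tail. A minimal sketch, assuming a hypothetical helper `repeating_binary_to_fraction` that takes the two digit strings:

```python
from fractions import Fraction


def repeating_binary_to_fraction(prefix: str, block: str) -> Fraction:
    """Value of the binary fraction 0.<prefix><block><block>... (hypothetical helper).

    With m = len(prefix) and p = len(block), the shifting argument gives
    (2**(m+p) - 2**m) * tail = block, where tail is the repeating part.
    """
    m, p = len(prefix), len(block)
    prefix_value = Fraction(int(prefix, 2) if prefix else 0, 2**m)
    tail_value = Fraction(int(block, 2), 2**m * (2**p - 1))
    return prefix_value + tail_value


# Prefix 00000 and repeating block 11000, as in equation (7).
print(repeating_binary_to_fraction("00000", "11000"))
```

With the prefix and block of equation (7), the denominator before reduction is $2^5 (2^5 - 1) = 992$, matching the difference $1024R - 32R$.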

Scientific Notation
Scientific notation is a standard way to present a real number. It is obtained by shifting the decimal point and supplying an appropriate power of 10.
Examples
$0.0000747 = 7.47 \times 10^{-5}$
$31.4159265 = 3.14159265 \times 10$
$9{,}700{,}000{,}000 = 9.7 \times 10^{9}$
Avogadro's constant, used in chemistry: $6.02252 \times 10^{23}$.
The quantity $1\text{K} = 1.024 \times 10^{3}$, used in computer science.

Machine Numbers
A mathematical quantity $x$ is stored in a computer as a binary approximation given by
$$x \approx \pm q \times 2^{n}. \tag{8}$$
The finite binary number $q$ is the mantissa, where $1/2 \leq q < 1$.
The integer $n$ is the exponent.

Floating-point format
A real number is stored in a computer as a set of binary fields expressing:
The sign
The exponent
The mantissa

Sign   Exponent   Mantissa

The sign is always one bit, where S = 0 if x > 0 and S = 1 if x < 0.
The number of bits for the exponent and the mantissa depends on the precision of the machine.

Floating-point format: IEEE 754 standard

Precision   Total     Sign    Exponent   Mantissa   Exponent bias
Single      32 bits   1 bit   8 bits     23 bits    127
Double      64 bits   1 bit   11 bits    52 bits    1023

Note: Biasing is done because exponents have to be signed values to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder.
The exponent is therefore biased by adjusting its value.
The exponent bias is calculated as $\text{bias} = 2^{exp-1} - 1$, where $exp$ indicates the number of bits for the exponent.
Example:
If $exp = 15$ bits, then $\text{bias} = 2^{15-1} - 1 = 16383$.

Possible cases:

Sign (S)   Exponent (E)          Mantissa (M)   Value
0 or 1     all 0 < E < all 1     M              $(-1)^S (1.M) \times 2^{E-\text{bias}}$
0          E = all 1             M = 0          $+\infty$
1          E = all 1             M = 0          $-\infty$
0 or 1     E = all 1             M ≠ 0          NaN
0 or 1     E = all 0             M = 0          0
0 or 1     E = all 0             M ≠ 0          $(-1)^S (0.M) \times 2^{1-\text{bias}}$

Example: Determine the floating-point format used to store the number $59.1875_{10}$ in a computer with 32 bits of precision.

1. Find the binary representation of the number $59.1875_{10}$

Integer part $59_{10}$ (repeated division by 2):

Division   Quotient   Remainder
59/2       29         1   (LSB)
29/2       14         1
14/2       7          0
7/2        3          1
3/2        1          1
1/2        0          1   (MSB)

Fractional part $0.1875_{10}$ (repeated multiplication by 2):

Product                 Digit
0.1875 × 2 = 0.375      0   (MSB)
0.375 × 2 = 0.75        0
0.75 × 2 = 1.5          1
0.5 × 2 = 1.0           1   (LSB)

$59.1875_{10} = 111011.0011_2$

2. Do the proper binary shifting
$$111011.0011_2 = 1.110110011_2 \times 2^{5}$$
3. Calculate the bias
$$\text{bias} = 2^{8-1} - 1 = 127$$
4. Determine the mantissa
$$\text{Mantissa} = 110110011_2$$
5. Determine the exponent
$$exp = 5 + \text{bias} = 5 + 127 = 132_{10} = 10000100_2$$
The stored pattern, with the mantissa padded with zeros to 23 bits, is

S   E          M
0   10000100   11011001100000000000000
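The hand computation above can be cross-checked with Python's standard `struct` module, which packs a number as an IEEE 754 single-precision value. The helper name `float32_fields` is an assumption introduced for this sketch.

```python
import struct


def float32_fields(x: float):
    """Split the IEEE 754 single-precision encoding of x into (S, E, M) bit strings.

    Hypothetical helper: packs x as a big-endian 32-bit float, reinterprets
    the same 4 bytes as an unsigned integer, and slices its 32 bits.
    """
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]   # 1 sign bit, 8 exponent bits, 23 mantissa bits


sign, exponent, mantissa = float32_fields(59.1875)
print(sign, exponent, mantissa)
```

Since 59.1875 needs only 9 mantissa bits, it is stored exactly and the printed fields should coincide with the S, E, M pattern derived by hand.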

Example: Repeat the procedure for the number $132.28125_{10}$, again with 32 bits of precision.

1. Find the binary representation of the number $132.28125_{10}$

Integer part $132_{10}$ (repeated division by 2):

Division   Quotient   Remainder
132/2      66         0   (LSB)
66/2       33         0
33/2       16         1
16/2       8          0
8/2        4          0
4/2        2          0
2/2        1          0
1/2        0          1   (MSB)

Fractional part $0.28125_{10}$ (repeated multiplication by 2):

Product                   Digit
0.28125 × 2 = 0.5625      0   (MSB)
0.5625 × 2 = 1.125        1
0.125 × 2 = 0.25          0
0.25 × 2 = 0.5            0
0.5 × 2 = 1.0             1   (LSB)

$132.28125_{10} = 10000100.01001_2$

2. Do the proper binary shifting
$$10000100.01001_2 = 1.000010001001_2 \times 2^{7}$$
3. Calculate the bias
$$\text{bias} = 2^{8-1} - 1 = 127$$
4. Determine the mantissa
$$\text{Mantissa} = 000010001001_2$$
5. Determine the exponent
$$exp = 7 + \text{bias} = 7 + 127 = 134_{10} = 10000110_2$$

S   E          M
0   10000110   00001000100100000000000
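The real value associated with a given 32-bit pattern is calculated as
$$\text{value} = (-1)^{S} \left(1 + \sum_{i=1}^{23} d_{23-i}\, 2^{-i}\right) \times 2^{(E-127)},$$
where
S = the sign bit
E = the stored exponent
127 = the bias
$d_j$ = the bits of the mantissa ($d_{22}$ is the leftmost bit and $d_0$ the rightmost)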

Exercise: Find the real value for the binary data:

S   E          M
0   01010010   01101000000100100000000

$$\text{value} = (-1)^{S} \left(1 + \sum_{i=1}^{23} d_{23-i}\, 2^{-i}\right) \times 2^{(E-127)}$$
In this example:
$S = 0$
$1 + \sum_{i=1}^{23} d_{23-i}\, 2^{-i} = 1 + 2^{-2} + 2^{-3} + 2^{-5} + 2^{-12} + 2^{-15} = 1.4065246582$
$2^{(E-127)} = 2^{((2^1 + 2^4 + 2^6) - 127)} = 2^{82-127} = 2^{-45}$
Thus
$$\text{value} = 1.4065246582 \times 2^{-45}$$
Example: Find the real value for the binary data:

S   E          M
1   10000100   01000000000000000000000

Bit positions: S is bit 31, E occupies bits 30 to 23, and M occupies bits 22 to 0.
In this example:
$S = 1$
$1 + \sum_{i=1}^{23} d_{23-i}\, 2^{-i} = 1 + 2^{-2} = 1.25$
$2^{(E-127)} = 2^{(132-127)} = 2^{5}$
Thus
$$\text{value} = (-1)^{1} \times 1.25 \times 2^{5} = -40.$$
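The decoding formula can be written as a short function. This is a sketch for the normalized case only (exponent field neither all zeros nor all ones); the name `decode_float32` and its string-based interface are assumptions made for illustration.

```python
def decode_float32(sign: str, exponent: str, mantissa: str) -> float:
    """Evaluate (-1)^S * (1 + sum of mantissa bits * 2^-i) * 2^(E - 127).

    Hypothetical helper: covers only the normalized case of the
    single-precision format, with fields given as bit strings.
    """
    if len(sign) != 1 or len(exponent) != 8 or len(mantissa) != 23:
        raise ValueError("expected 1 + 8 + 23 bits")
    s = int(sign, 2)
    e = int(exponent, 2)
    if not 0 < e < 255:
        raise ValueError("only normalized numbers are handled in this sketch")
    # The i-th mantissa bit from the left contributes 2^-i.
    significand = 1 + sum(int(bit) * 2.0**-i
                          for i, bit in enumerate(mantissa, start=1))
    return (-1) ** s * significand * 2.0 ** (e - 127)


print(decode_float32("1", "10000100", "01" + "0" * 21))
```

The call shown uses sign 1, exponent 10000100, and a mantissa whose only nonzero bit is the second one, which is the pattern of the last example.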

Contents
1 Introduction
2 Binary numbers
3 Error Analysis
Absolute and relative error
Truncation Error
Round-off Error
Loss of Significance
Order of Approximation
Propagation of Error

Definition 2.
Suppose that $\hat{p}$ is an approximation to $p$. The absolute error is $E_p = |p - \hat{p}|$, and the relative error is $R_p = |p - \hat{p}|/|p|$, provided that $p \neq 0$.
The absolute error is the magnitude of the difference between the true value and the approximate value.
The relative error expresses the error as a fraction of the true value.

Example: Find the absolute and relative error in the following three cases:

True value p          $x = 3.141592$                $y = 1{,}000{,}000$           $z = 0.000012$
Approximation p̂      $\hat{x} = 3.14$              $\hat{y} = 999{,}996$         $\hat{z} = 0.000009$
Absolute error E_p    $E_x = |x - \hat{x}|$         $E_y = |y - \hat{y}|$         $E_z = |z - \hat{z}|$
                      $= 0.001592$                  $= 4$                         $= 0.000003$
Relative error R_p    $R_x = E_x/|x|$               $R_y = E_y/|y|$               $R_z = E_z/|z|$
                      $= 5.067 \times 10^{-4}$      $= 0.000004$                  $= 0.25$

Observe that as $|p|$ moves away from 1 (greater than or less than), the relative error $R_p$ is a better indicator than $E_p$ of the accuracy of the approximation.

Definition 3.
The number $\hat{p}$ is said to approximate $p$ to $d$ significant digits if $d$ is the largest nonnegative integer for which
$$\frac{|p - \hat{p}|}{|p|} < \frac{10^{1-d}}{2}.$$

Example:
Let $\hat{w} = 2.16$ be the approximation for $w = 2.1645$; then
$$\frac{|2.1645 - 2.16|}{|2.1645|} = 2.07900 \times 10^{-3}.$$
if $d = 0$: $2.07900 \times 10^{-3} < \frac{10^{1-0}}{2} = 5$, satisfied. However, since we need the largest integer $d$, we continue.
if $d = 1$: $2.07900 \times 10^{-3} < \frac{10^{1-1}}{2} = 0.5$, satisfied
if $d = 2$: $2.07900 \times 10^{-3} < \frac{10^{1-2}}{2} = 0.05$, satisfied
if $d = 3$: $2.07900 \times 10^{-3} < \frac{10^{1-3}}{2} = 0.005$, satisfied
if $d = 4$: $2.07900 \times 10^{-3} < \frac{10^{1-4}}{2} = 0.0005$, not satisfied
Then $\hat{w}$ approximates $w$ to 3 significant digits.

Other examples:
If $x = 3.141592$ and $\hat{x} = 3.14$, then $|x - \hat{x}|/|x| = 0.000507 < 10^{-2}/2$. Therefore, $\hat{x}$ approximates $x$ to three significant digits.
If $y = 1{,}000{,}000$ and $\hat{y} = 999{,}996$, then $|y - \hat{y}|/|y| = 0.000004 < 10^{-5}/2$. Therefore, $\hat{y}$ approximates $y$ to six significant digits.
If $z = 0.000012$ and $\hat{z} = 0.000009$, then $|z - \hat{z}|/|z| = 0.25 < 10^{0}/2$. Therefore, $\hat{z}$ approximates $z$ to one significant digit.
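Definitions 2 and 3 translate directly into code. In this sketch the names `relative_error` and `significant_digits`, and the cap `max_digits`, are assumptions introduced for illustration; the search simply tries d = 0, 1, 2, ... as in the worked example.

```python
def relative_error(p: float, p_hat: float) -> float:
    """R_p = |p - p_hat| / |p|, defined for p != 0."""
    return abs(p - p_hat) / abs(p)


def significant_digits(p: float, p_hat: float, max_digits: int = 15):
    """Largest d with R_p < 10**(1 - d) / 2, searched up to max_digits.

    Returns None if even d = 0 fails. The cap max_digits is an assumption
    that keeps the loop finite when p_hat equals p exactly.
    """
    rel = relative_error(p, p_hat)
    best = None
    for d in range(max_digits + 1):
        if rel < 10.0 ** (1 - d) / 2:
            best = d
        else:
            break
    return best


print(significant_digits(2.1645, 2.16))
print(significant_digits(1_000_000, 999_996))
```

The two calls correspond to the approximations of w and y discussed above.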

Truncation Error
Truncation error refers to errors introduced when a more complicated mathematical expression is "replaced" with a more elementary formula.
For example, the infinite Taylor series
$$e^{x^2} = 1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!} + \cdots + \frac{x^{2n}}{n!} + \cdots$$
might be replaced with just the first five terms $1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$.
Then a truncation error appears.

Truncation Error
Example: Given $p = \int_0^{1/2} e^{x^2}\,dx = 0.544987104184$, determine the accuracy of the approximation obtained by replacing the integrand $f(x) = e^{x^2}$ with the truncated Taylor series $P_8(x) = 1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$.
Determine $\int_0^{1/2} P_8(x)\,dx$:
$$\int_0^{1/2} \left(1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}\right) dx = \left[x + \frac{x^3}{3} + \frac{x^5}{5(2!)} + \frac{x^7}{7(3!)} + \frac{x^9}{9(4!)}\right]_{x=0}^{x=1/2}$$
$$= \frac{1}{2} + \frac{1}{24} + \frac{1}{320} + \frac{1}{5376} + \frac{1}{110592} = \frac{2109491}{3870720} = 0.544986720817 = \hat{p}.$$
Since
$$\frac{|p - \hat{p}|}{|p|} = 7.03442 \times 10^{-7} < \frac{10^{1-6}}{2} = 5 \times 10^{-6},$$
the approximation $\hat{p}$ agrees with the true value to 6 significant digits.
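The term-by-term integration can be reproduced exactly with rational arithmetic. In the sketch below, the function name `integral_of_truncated_series` is an assumption, and the reference value of p is the 12-digit figure quoted in the example rather than something computed here.

```python
from fractions import Fraction
from math import factorial


def integral_of_truncated_series(upper: Fraction, terms: int) -> Fraction:
    """Integrate sum_{n=0}^{terms-1} x^(2n)/n! from 0 to `upper`, term by term.

    Hypothetical helper: each term x^(2n)/n! integrates to
    x^(2n+1) / ((2n+1) * n!).
    """
    return sum(upper ** (2 * n + 1) / ((2 * n + 1) * factorial(n))
               for n in range(terms))


p_hat = integral_of_truncated_series(Fraction(1, 2), 5)   # the five terms of P_8
p = 0.544987104184                                        # value quoted in the example
rel = abs(p - float(p_hat)) / abs(p)

print(p_hat, float(p_hat))
print(rel)
```

The printed relative error can then be compared against the bound $10^{1-d}/2$ for successive values of d.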
Round-off Error
The accuracy of the representation of a real number stored in a computer is determined by the precision of the mantissa.
The error that occurs because of the limited mantissa precision is the round-off error.
The actual number that is stored in the computer may undergo chopping or rounding of the last digit.
Because the computer hardware works with a limited number of digits in machine numbers, errors are introduced and propagated in successive computations.

Chopping Off versus Rounding Off
Consider $p$ expressed in normalized decimal form:
$$p = \pm 0.d_1 d_2 d_3 \cdots d_k d_{k+1} \cdots \times 10^{n},$$
where $1 \leq d_1 \leq 9$ and $0 \leq d_j \leq 9$ for $j > 1$.
If $k$ is the maximum number of decimal digits, then the real number $p$ is represented by $fl_{chop}(p)$, which is given by
$$fl_{chop}(p) = \pm 0.d_1 d_2 d_3 \cdots d_k \times 10^{n}, \tag{9}$$
where $1 \leq d_1 \leq 9$ and $0 \leq d_j \leq 9$ for $1 < j \leq k$. The number $fl_{chop}(p)$ is called the chopped floating-point representation of $p$.
On the other hand, the rounded floating-point representation $fl_{round}(p)$ is given by
$$fl_{round}(p) = \pm 0.d_1 d_2 d_3 \cdots r_k \times 10^{n}, \tag{10}$$
where $1 \leq d_1 \leq 9$ and $0 \leq d_j \leq 9$ for $1 < j < k$, and the last digit, $r_k$, is obtained by rounding the number $d_k.d_{k+1} d_{k+2} \cdots$ to the nearest integer.

Example:
The real number $p = \frac{22}{7} = 3.142857142857142857\ldots$ has the following six-digit representations:
$$fl_{chop}(p) = 0.314285 \times 10^{1},$$
$$fl_{round}(p) = 0.314286 \times 10^{1}.$$
For common purposes the chopped and rounded values would be written as 3.14285 and 3.14286, respectively.
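Python's standard `decimal` module can imitate k-digit chopping and rounding. The sketch below is only one way to model it: the helper names `fl_chop` and `fl_round` mirror the notation of the text but are assumptions, and `ROUND_HALF_UP` is chosen here as the rounding rule.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP


def fl_chop(numerator: int, denominator: int, k: int) -> Decimal:
    """k-digit chopped value of numerator/denominator (hypothetical helper)."""
    ctx = Context(prec=k, rounding=ROUND_DOWN)     # discard digits beyond the k-th
    return ctx.divide(Decimal(numerator), Decimal(denominator))


def fl_round(numerator: int, denominator: int, k: int) -> Decimal:
    """k-digit rounded value of numerator/denominator (hypothetical helper)."""
    ctx = Context(prec=k, rounding=ROUND_HALF_UP)  # round the k-th digit
    return ctx.divide(Decimal(numerator), Decimal(denominator))


print(fl_chop(22, 7, 6))
print(fl_round(22, 7, 6))
```

With k = 6 the two results differ only in the last digit, as in the example for 22/7.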

Loss of Significance
Consider $p = 3.1415926536$ and $q = 3.1415957341$, which are nearly equal and both carry 11 decimal digits of precision.
Their difference is formed: $p - q = -0.0000030805$. Since the first six digits of $p$ and $q$ are the same, their difference $p - q$ contains only five decimal digits of precision.
This phenomenon is called loss of significance.

Example:
Compare the results of calculating $f(500)$ and $g(500)$ using six digits and rounding, where
$$f(x) = x\left(\sqrt{x+1} - \sqrt{x}\right) \quad \text{and} \quad g(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}.$$
For the first function,
$$f(500) = 500\left(\sqrt{501} - \sqrt{500}\right) = 500(22.3830 - 22.3607) = 500(0.0223) = 11.1500.$$
For $g(x)$,
$$g(500) = \frac{500}{\sqrt{501} + \sqrt{500}} = \frac{500}{22.3830 + 22.3607} = \frac{500}{44.7437} = 11.1748.$$
The second function, $g(x)$, is algebraically equivalent to $f(x)$, but the answer, $g(500) = 11.1748$, involves less error and is the same as that obtained by rounding the true answer $11.174755300747198\ldots$ to six digits.
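The six-digit computation can be imitated with the `decimal` module by rounding every intermediate result to six significant digits. This is a sketch under that assumption; the names `f6` and `g6` are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Every operation performed through ctx is rounded to 6 significant digits.
ctx = Context(prec=6, rounding=ROUND_HALF_UP)


def f6(x: int) -> Decimal:
    """x * (sqrt(x + 1) - sqrt(x)) in 6-digit arithmetic (hypothetical helper)."""
    xd = Decimal(x)
    difference = ctx.subtract(ctx.sqrt(ctx.add(xd, Decimal(1))), ctx.sqrt(xd))
    return ctx.multiply(xd, difference)


def g6(x: int) -> Decimal:
    """x / (sqrt(x + 1) + sqrt(x)) in 6-digit arithmetic (hypothetical helper)."""
    xd = Decimal(x)
    total = ctx.add(ctx.sqrt(ctx.add(xd, Decimal(1))), ctx.sqrt(xd))
    return ctx.divide(xd, total)


print(f6(500))
print(g6(500))
```

The subtraction in `f6` cancels the leading digits of the two square roots, which is where the significant digits are lost; `g6` avoids the subtraction entirely.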
Example: Compare the results of calculating $f(0.01)$ and $P(0.01)$ using six digits and rounding, where
$$f(x) = \frac{e^{x} - 1 - x}{x^{2}} \quad \text{and} \quad P(x) = \frac{1}{2} + \frac{x}{6} + \frac{x^{2}}{24}.$$
The function $P(x)$ is the Taylor polynomial of degree $n = 2$ for $f(x)$ expanded about $x = 0$.
For the first function,
$$f(0.01) = \frac{e^{0.01} - 1 - 0.01}{(0.01)^{2}} = \frac{1.010050 - 1 - 0.01}{0.0001} = 0.5.$$
For the second function,
$$P(0.01) = \frac{1}{2} + \frac{0.01}{6} + \frac{0.0001}{24} = 0.5 + 0.001667 + 0.000004 = 0.501671.$$
The answer $P(0.01) = 0.501671$ contains less error and is the same as that obtained by rounding the true answer $0.5016708416805\ldots$ to six digits.

$O(h^n)$ Order of Approximation
For functions
Definition 4.
The function $f(h)$ is said to be big Oh of $g(h)$, denoted $f(h) = O(g(h))$, if there exist constants $C$ and $c$ such that
$$|f(h)| \leq C|g(h)| \quad \text{whenever } h \geq c. \tag{11}$$
Example: Consider $f(x) = x^2 + 1$ and $g(x) = x^3$. Since $x^2 \leq x^3$ and $1 \leq x^3$ for $x \geq 1$, it follows that $x^2 + 1 \leq 2x^3$ for $x \geq 1$. Therefore, $f(x) = O(g(x))$, with $C = 2$ and $c = 1$.
The big Oh notation provides a useful way of describing the rate of growth of a function in terms of well-known elementary functions ($x^n$, $x^{1/n}$, $a^x$, $\log_a(x)$, etc.).
For sequences
Definition 5.
Let $\{x_n\}_{n=1}^{\infty}$ and $\{y_n\}_{n=1}^{\infty}$ be two sequences. The sequence $\{x_n\}$ is said to be of order big Oh of $\{y_n\}$, denoted $x_n = O(y_n)$, if there exist constants $C$ and $N$ such that
$$|x_n| \leq C|y_n| \quad \text{whenever } n \geq N. \tag{12}$$
Example:
$$\frac{n^2 - 1}{n^3} = O\left(\frac{1}{n}\right), \quad \text{since} \quad \frac{n^2 - 1}{n^3} \leq \frac{n^2}{n^3} = \frac{1}{n} \quad \text{whenever } n \geq 1.$$

Definition 6.
Assume that $f(h)$ is approximated by the function $p(h)$ and that there exist a real constant $M > 0$ and a positive integer $n$ so that
$$\frac{|f(h) - p(h)|}{|h^n|} \leq M \quad \text{for sufficiently small } h. \tag{13}$$
We say that $p(h)$ approximates $f(h)$ with order of approximation $O(h^n)$ and write
$$f(h) = p(h) + O(h^n). \tag{14}$$
When relation (13) is rewritten in the form $|f(h) - p(h)| \leq M|h^n|$, we see that the notation $O(h^n)$ stands in place of the error bound $M|h^n|$.

Theorem 2. Order of approximation for basic operations
Assume that $f(h) = p(h) + O(h^n)$, $g(h) = q(h) + O(h^m)$, and $r = \min(m, n)$. Then
$$f(h) + g(h) = p(h) + q(h) + O(h^r), \tag{15}$$
$$f(h)g(h) = p(h)q(h) + O(h^r), \tag{16}$$
and
$$\frac{f(h)}{g(h)} = \frac{p(h)}{q(h)} + O(h^r) \quad \text{provided that } g(h) \neq 0 \text{ and } q(h) \neq 0. \tag{17}$$

Theorem 3. (Taylor's Theorem).
Assume $f \in C^{n+1}[a, b]$. If both $x_0$ and $x = x_0 + h$ lie in $[a, b]$, then
$$f(x_0 + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} h^k + O(h^{n+1}). \tag{18}$$
Additional properties:
(i) $O(h^p) + O(h^p) = O(h^p)$,
(ii) $O(h^p) + O(h^q) = O(h^r)$, where $r = \min(p, q)$, and
(iii) $O(h^p)O(h^q) = O(h^s)$, where $s = p + q$.

Example:
Consider the Taylor polynomial expansions
$$e^h = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4) \quad \text{and} \quad \cos(h) = 1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6).$$
Determine the order of approximation for their sum and product.

For the sum we have
$$e^h + \cos(h) = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4) + 1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6) = 2 + h + \frac{h^3}{3!} + O(h^4) + \frac{h^4}{4!} + O(h^6).$$
Since $O(h^4) + \frac{h^4}{4!} = O(h^4)$ and $O(h^4) + O(h^6) = O(h^4)$, this reduces to
$$e^h + \cos(h) = 2 + h + \frac{h^3}{3!} + O(h^4),$$
and the order of approximation is $O(h^4)$.

The product is treated similarly:
$$e^h \cos(h) = \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4)\right)\left(1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6)\right)$$
$$= \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!}\right)\left(1 - \frac{h^2}{2!} + \frac{h^4}{4!}\right) + \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!}\right)O(h^6) + \left(1 - \frac{h^2}{2!} + \frac{h^4}{4!}\right)O(h^4) + O(h^4)O(h^6)$$
$$= 1 + h - \frac{h^3}{3} - \frac{5h^4}{24} - \frac{h^5}{24} + \frac{h^6}{48} + \frac{h^7}{144} + O(h^6) + O(h^4) + O(h^4)O(h^6).$$
Since $O(h^4)O(h^6) = O(h^{10})$ and
$$-\frac{5h^4}{24} - \frac{h^5}{24} + \frac{h^6}{48} + \frac{h^7}{144} + O(h^6) + O(h^4) + O(h^{10}) = O(h^4),$$
the preceding equation simplifies to
$$e^h \cos(h) = 1 + h - \frac{h^3}{3} + O(h^4),$$
and the order of approximation is $O(h^4)$.
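An order of approximation can be checked numerically: if the error behaves like $M h^4$, halving $h$ should divide it by roughly $2^4 = 16$. The sketch below applies this to the product $e^h \cos(h)$ and the cubic polynomial $1 + h - h^3/3$; the function name `product_error` and the step sizes 0.1 and 0.05 are assumptions chosen for illustration.

```python
import math


def product_error(h: float) -> float:
    """|e^h cos(h) - (1 + h - h^3/3)| for a given step h (hypothetical helper)."""
    approximation = 1 + h - h**3 / 3
    return abs(math.exp(h) * math.cos(h) - approximation)


# Halving h should shrink an O(h^4) error by a factor close to 2**4.
ratio = product_error(0.1) / product_error(0.05)
print(product_error(0.1), product_error(0.05), ratio)
```

A ratio near 16 is consistent with an $O(h^4)$ error term; it is a numerical indication, not a proof.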

Order of Convergence of a Sequence
Definition 7.
Suppose that $\lim_{n \to \infty} x_n = x$ and $\{r_n\}_{n=1}^{\infty}$ is a sequence with $\lim_{n \to \infty} r_n = 0$. We say that $\{x_n\}_{n=1}^{\infty}$ converges to $x$ with the order of convergence $O(r_n)$ if there exists a constant $K \geq 0$ such that
$$\frac{|x_n - x|}{|r_n|} \leq K \quad \text{for } n \text{ sufficiently large}. \tag{19}$$
This is indicated by writing $x_n = x + O(r_n)$, or $x_n \to x$ with order of convergence $O(r_n)$.
Example:
Let $x_n = \cos(n)/n^2$ and $r_n = 1/n^2$; then $\lim_{n \to \infty} x_n = 0$ with a rate of convergence $O(1/n^2)$. This follows immediately from the relation
$$\frac{|\cos(n)/n^2|}{|1/n^2|} = |\cos(n)| \leq 1 \quad \text{for all } n.$$

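Propagation of Error
Addition: consider two numbers $p$ and $q$ (the true values) with the approximate values $\hat{p}$ and $\hat{q}$, which contain errors $\epsilon_p$ and $\epsilon_q$, respectively. Starting with $p = \hat{p} + \epsilon_p$ and $q = \hat{q} + \epsilon_q$, the sum is
$$p + q = (\hat{p} + \epsilon_p) + (\hat{q} + \epsilon_q) = (\hat{p} + \hat{q}) + (\epsilon_p + \epsilon_q). \tag{20}$$
Hence, for addition, the error in the sum is the sum of the errors in the addends:
$$\epsilon_s = \epsilon_p + \epsilon_q.$$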

The propagation of error in multiplication is more complicated. The product is
$$pq = (\hat{p} + \epsilon_p)(\hat{q} + \epsilon_q) = \hat{p}\hat{q} + \hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q. \tag{21}$$
Hence, if $\hat{p}$ and $\hat{q}$ are larger than 1 in absolute value, the terms $\hat{p}\epsilon_q$ and $\hat{q}\epsilon_p$ show that there is a possibility of magnification of the original errors $\epsilon_p$ and $\epsilon_q$. Insights are gained if we look at the relative error. Rearrange the terms in (21) to get
$$pq - \hat{p}\hat{q} = \hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q. \tag{22}$$
Suppose that $p \neq 0$ and $q \neq 0$; then we can divide (22) by $pq$ to obtain the relative error in the product $pq$:
$$R_{pq} = \frac{pq - \hat{p}\hat{q}}{pq} = \frac{\hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q}{pq} = \frac{\hat{p}\epsilon_q}{pq} + \frac{\hat{q}\epsilon_p}{pq} + \frac{\epsilon_p\epsilon_q}{pq}. \tag{23}$$

Furthermore, suppose that $\hat{p}$ and $\hat{q}$ are good approximations for $p$ and $q$; then $\hat{p}/p \approx 1$, $\hat{q}/q \approx 1$, and $R_p R_q = (\epsilon_p/p)(\epsilon_q/q) \approx 0$ ($R_p$ and $R_q$ are the relative errors in the approximations $\hat{p}$ and $\hat{q}$). Making these substitutions in (23) yields the simplified relationship
$$R_{pq} = \frac{pq - \hat{p}\hat{q}}{pq} \approx \frac{\epsilon_q}{q} + \frac{\epsilon_p}{p} + 0 = R_q + R_p. \tag{24}$$
This shows that the relative error in the product $pq$ is approximately the sum of the relative errors in the approximations $\hat{p}$ and $\hat{q}$.
A quality that is desirable for any numerical process is that a small error in the initial conditions will produce small changes in the final result. An algorithm with this feature is called stable; otherwise, it is called unstable.

Definition 8.
Suppose that $\epsilon$ represents an initial error and $\epsilon(n)$ represents the growth of the error after $n$ steps. If $|\epsilon(n)| \approx n\epsilon$, the growth of error is said to be linear. If $|\epsilon(n)| \approx K^n \epsilon$, the growth of error is called exponential. If $K > 1$, the exponential error grows without bound as $n \to \infty$, and if $0 < K < 1$, the exponential error diminishes to zero as $n \to \infty$.

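Propagation of error
Example: Show that the following three schemes can be used with infinite-precision (exact) arithmetic to recursively generate the terms in the sequence $\{1/3^n\}_{n=0}^{\infty}$.
$$r_0 = 1 \quad \text{and} \quad r_n = \frac{1}{3} r_{n-1} \quad \text{for } n = 1, 2, \ldots, \tag{25}$$
$$p_0 = 1, \quad p_1 = \frac{1}{3}, \quad \text{and} \quad p_n = \frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} \quad \text{for } n = 2, 3, \ldots, \tag{26}$$
$$q_0 = 1, \quad q_1 = \frac{1}{3}, \quad \text{and} \quad q_n = \frac{10}{3} q_{n-1} - q_{n-2} \quad \text{for } n = 2, 3, \ldots. \tag{27}$$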

Formula (25) is obvious. In (26) the difference equation has the general solution $p_n = A(1/3^n) + B$. This can be verified by direct substitution:
$$\frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} = \frac{4}{3}\left(\frac{A}{3^{n-1}} + B\right) - \frac{1}{3}\left(\frac{A}{3^{n-2}} + B\right) = \left(\frac{4}{3^n} - \frac{3}{3^n}\right)A + \left(\frac{4}{3} - \frac{1}{3}\right)B = A\frac{1}{3^n} + B = p_n.$$
Setting $A = 1$ and $B = 0$ will generate the desired sequence. In (27) the difference equation has the general solution $q_n = A(1/3^n) + B3^n$. This too is verified by substitution:
$$\frac{10}{3} q_{n-1} - q_{n-2} = \frac{10}{3}\left(\frac{A}{3^{n-1}} + B3^{n-1}\right) - \left(\frac{A}{3^{n-2}} + B3^{n-2}\right) = \left(\frac{10}{3^n} - \frac{9}{3^n}\right)A + (10 - 1)3^{n-2}B = A\frac{1}{3^n} + B3^n = q_n.$$
Setting $A = 1$ and $B = 0$ again generates the desired sequence.

Example:
Generate approximations to the sequence $\{x_n\} = \{1/3^n\}$ using the schemes
$$r_0 = 0.99996 \quad \text{and} \quad r_n = \frac{1}{3} r_{n-1} \quad \text{for } n = 1, 2, \ldots, \tag{28}$$
$$p_0 = 1, \quad p_1 = 0.33332, \quad \text{and} \quad p_n = \frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} \quad \text{for } n = 2, 3, \ldots, \tag{29}$$
$$q_0 = 1, \quad q_1 = 0.33332, \quad \text{and} \quad q_n = \frac{10}{3} q_{n-1} - q_{n-2} \quad \text{for } n = 2, 3, \ldots. \tag{30}$$
In (28) the initial error in $r_0$ is 0.00004, and in (29) and (30) the initial errors in $p_1$ and $q_1$ are 0.000013. Investigate the propagation of error for each scheme.
Table: The sequence $x_n = 1/3^n$ and the approximations $r_n$, $p_n$, and $q_n$
n xn rn pn qn
0 1.0000000000 0.9999600000 1.0000000000 1.0000000000
1 0.3333333333 0.3333200000 0.3333200000 0.3333200000
2 0.1111111111 0.1111066667 0.1110933333 0.1110666667
3 0.0370370370 0.0370355556 0.0370177778 0.0369022222
4 0.0123456790 0.0123451852 0.0123259259 0.0119407407
5 0.0041152263 0.0041150617 0.0040953086 0.0029002469
6 0.0013717421 0.0013716872 0.0013517695 -0.0022732510
7 0.0004572474 0.0004572291 0.0004372565 -0.0104777503
8 0.0001524158 0.0001524097 0.0001324188 -0.0326525834
9 0.0000508053 0.0000508032 0.0000308063 -0.0983641945
10 0.0000169351 0.0000169344 -0.0000030646 -0.2952280648

Table: Error sequences $x_n - r_n$, $x_n - p_n$, and $x_n - q_n$
n xn − rn xn − pn xn − qn
0 0.0000400000 0.0000000000 0.0000000000
1 0.0000133333 0.0000133333 0.0000133333
2 0.0000044444 0.0000177778 0.0000444444
3 0.0000014815 0.0000192593 0.0001348148
4 0.0000004938 0.0000197531 0.0004049383
5 0.0000001646 0.0000199177 0.0012149794
6 0.0000000549 0.0000199726 0.0036449931
7 0.0000000183 0.0000199909 0.0109349977
8 0.0000000061 0.0000199970 0.0328049992
9 0.0000000020 0.0000199990 0.0984149997
10 0.0000000007 0.0000199997 0.2952449999
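The two tables can be regenerated with a few lines of code. The sketch below uses exact rational arithmetic so that only the perturbed starting values contribute to the error, not floating-point rounding; the function name `run_schemes` is an assumption introduced for this illustration.

```python
from fractions import Fraction


def run_schemes(steps: int):
    """Return (x, r, p, q) for n = 0..steps, using the perturbed starting values.

    Hypothetical helper: x is the exact sequence 1/3^n, and r, p, q follow
    the three recurrences started from 0.99996 and 0.33332.
    """
    x = [Fraction(1, 3**n) for n in range(steps + 1)]
    r = [Fraction(99996, 100000)]
    p = [Fraction(1), Fraction(33332, 100000)]
    q = [Fraction(1), Fraction(33332, 100000)]
    for n in range(1, steps + 1):
        r.append(r[n - 1] / 3)
    for n in range(2, steps + 1):
        p.append(Fraction(4, 3) * p[n - 1] - Fraction(1, 3) * p[n - 2])
        q.append(Fraction(10, 3) * q[n - 1] - q[n - 2])
    return x, r, p, q


x, r, p, q = run_schemes(10)
for n in range(11):
    print(n, float(x[n] - r[n]), float(x[n] - p[n]), float(x[n] - q[n]))
```

Each printed row gives n followed by the three errors, in the same order as the columns of the error table.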

Figure: Error sequences $x_n - r_n$ and $x_n - p_n$ (vertical axes scaled by $10^{-5}$) and $x_n - q_n$, each plotted against $n$ for $n = 0, \ldots, 10$.
The error for $\{r_n\}$ is stable and decreases exponentially. The error for
$\{p_n\}$ is stable. The error for $\{q_n\}$ is unstable and grows at an exponential rate.
Although the error for $\{p_n\}$ is stable, the terms $p_n \to 0$ as $n \to \infty$, so the
error eventually dominates and the terms past $p_8$ have no significant digits.
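
These three behaviours can also be checked in closed form. The following worked sketch assumes the standard general solutions of the two difference equations, $p_n = A/3^n + B$ and $q_n = A/3^n + B\,3^n$, and fits the constants $A$ and $B$ to the starting values given in the example; the constants are derived here and do not appear in the original slides.

```latex
\begin{align*}
r_n &= \frac{0.99996}{3^n}
  &&\Longrightarrow\quad x_n - r_n = \frac{0.00004}{3^n}, \\[4pt]
p_n &= \frac{1.00002}{3^n} - 0.00002
  &&\Longrightarrow\quad x_n - p_n = 0.00002\left(1 - \frac{1}{3^n}\right), \\[4pt]
q_n &= \frac{1.000005}{3^n} - 0.000005 \cdot 3^n
  &&\Longrightarrow\quad x_n - q_n = 0.000005\left(3^n - \frac{1}{3^n}\right).
\end{align*}
```

For $n = 10$ these give errors of about $0.0000000007$, $0.0000199997$, and $0.2952449999$, which agree with the last row of the error table: the first error decays like $1/3^n$, the second levels off at $0.00002$, and the third grows like $3^n$.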

