
Numerical Methods Preliminaries

Professor PhD Henry Arguello Fuentes

Universidad Industrial de Santander


Colombia

March 9, 2017

High Dimensional Signal Processing Group


www.hdspgroup.com
henarfu@uis.edu.co
LP 304

Professor PhD Henry Arguello Fuentes Numerical methods March 9, 2017 1 / 78


Outline

1 Introduction

2 Binary numbers

3 Error Analysis



Introduction: numerical methods applications

(a) Model the probable evolution of a pathology
(b) Model and simulate the growth of a tumor
(c) Microscopy super-resolution
(d) Thermal management


Contents

1 Introduction

2 Binary numbers
Base 2 numbers
Base 2 representation of the integer N
Sequences and Series
Binary Fractions
Binary shifting
Scientific Notation
Machine Numbers

3 Error Analysis



Base 2 numbers

Base 10 numbers: expanded form of the number 1563

$$1563 = (1 \times 10^3) + (5 \times 10^2) + (6 \times 10^1) + (3 \times 10^0).$$

Let $N$ denote a positive integer; then the digits $a_0, a_1, \ldots, a_k$ exist so that $N$ has the base 10 expansion

Base 10 expansion

$$N = (a_k \times 10^k) + (a_{k-1} \times 10^{k-1}) + \cdots + (a_1 \times 10^1) + (a_0 \times 10^0), \tag{1}$$

where the digits $a_j$ are chosen from $0, 1, \ldots, 8, 9$.


Base 2 numbers

Base 2 numbers: expanded form of the number 1563

$$\begin{aligned}
1563 = {} & (1 \times 2^{10}) + (1 \times 2^9) + (0 \times 2^8) + (0 \times 2^7) + (0 \times 2^6) + (0 \times 2^5) \\
& + (1 \times 2^4) + (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0).
\end{aligned}$$

So that:

$$1563 = 1024 + 512 + 16 + 8 + 2 + 1.$$


Base 2 numbers

Let $N$ denote a positive integer; the digits $b_0, b_1, \ldots, b_J$ exist so that $N$ has the base 2 expansion

Base 2 expansion

$$N = (b_J \times 2^J) + (b_{J-1} \times 2^{J-1}) + \cdots + (b_1 \times 2^1) + (b_0 \times 2^0), \tag{2}$$

where each digit $b_j$ is either 0 or 1. Thus $N$ is expressed in binary notation as

$$N = b_J b_{J-1} \cdots b_2 b_1 b_{0\,\text{two}}. \tag{3}$$




Base 2 representation of the integer N
Process: Generate sequences $Q_k$ and $R_k$ of quotients and remainders, respectively, by repeated division by 2: $N = 2Q_0 + R_0$ and $Q_{k-1} = 2Q_k + R_k$. End the process when $Q_k = 0$, for some integer $k = J$. The remainders are the binary digits: $b_k = R_k$.
Example:

k     Division    Q_k    R_k
0     1563/2 =    781    1
1     781/2 =     390    1
2     390/2 =     195    0
3     195/2 =     97     1
4     97/2 =      48     1
5     48/2 =      24     0
6     24/2 =      12     0
7     12/2 =      6      0
8     6/2 =       3      0
9     3/2 =       1      1
10    1/2 =       0      1

Reading the remainders from the last to the first:

 1    1    0    0    0    0    1    1    0    1    1
b10   b9   b8   b7   b6   b5   b4   b3   b2   b1   b0

$b_{10}$ is the Most Significant Bit (MSB) and $b_0$ is the Least Significant Bit (LSB).
Base 2 representation of the integer N
Exercise 1: Find the base 2 representation of 697
Start by dividing the integer $N$ by 2 to calculate $Q_0$ and $R_0$.
$697/2 = 348.5 \rightarrow Q_0 = 348$ and $R_0 = 1$
Continue the process until finding $Q_k = 0$, for some integer $k = J$, where $Q_k$ is the integer part of $Q_{k-1}/2$.

k     Division    Q_k    R_k
0     697/2 =     348    1
1
2
3
4
5
6
7
8
9

b9   b8   b7   b6   b5   b4   b3   b2   b1   b0


Base 2 representation of the integer N
Solution

k     Division    Q_k    R_k
0     697/2 =     348    1
1     348/2 =     174    0
2     174/2 =     87     0
3     87/2 =      43     1
4     43/2 =      21     1
5     21/2 =      10     1
6     10/2 =      5      0
7     5/2 =       2      1
8     2/2 =       1      0
9     1/2 =       0      1

 1    0    1    0    1    1    1    0    0    1
b9   b8   b7   b6   b5   b4   b3   b2   b1   b0

Then, $697_{10} = 1010111001_2$.
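The repeated-division process can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `to_binary` is an assumption introduced here, not something taken from the slides.

```python
def to_binary(n: int) -> str:
    """Return the base 2 digits of a non-negative integer by repeated division by 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    remainders = []
    q = n
    while q > 0:
        q, r = divmod(q, 2)       # quotient Q_k and remainder R_k
        remainders.append(str(r))
    # R_0 is the least significant bit, so the list is read backwards.
    return "".join(reversed(remainders))


if __name__ == "__main__":
    print(to_binary(1563))
    print(to_binary(697))
```

The result can be cross-checked against Python's built-in `bin`, which returns the same digits prefixed by `0b`.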


Sequences and Series

Commonly, when you express a rational number in decimal form, you require infinitely many digits.

For example, in $\frac{1}{3} = 0.\overline{3}$, the symbol $\overline{3}$ means that the digit 3 is repeated forever to form an infinite repeating decimal.

The number $\frac{1}{3}$ is thus shorthand notation for the infinite series $S$:

$$S = (3 \times 10^{-1}) + (3 \times 10^{-2}) + \cdots + (3 \times 10^{-n}) + \cdots$$

$$S = \sum_{k=1}^{\infty} 3(10)^{-k} = \frac{1}{3}.$$


Sequences and Series

Definition 1.
The infinite series $S$

$$S = \sum_{n=0}^{\infty} c r^n = c + cr + cr^2 + \cdots + cr^n + \cdots, \tag{4}$$

where $c \neq 0$ and $r \neq 0$, is called a geometric series with ratio $r$.

Theorem 1. (Geometric Series)

The geometric series has the following properties:

$$\text{If } |r| < 1, \text{ then } \sum_{n=0}^{\infty} c r^n = \frac{c}{1-r}. \tag{5}$$

If $|r| \geq 1$, then the series diverges.


Sequences and Series

Example: The series $S$ is given by

$$S = 7\left(\frac{1}{7}\right)^1 + 7\left(\frac{1}{7}\right)^2 + \cdots = \sum_{n=1}^{\infty} 7\left(\frac{1}{7}\right)^n,$$

which is equal to

$$-7 + \sum_{n=0}^{\infty} 7\left(\frac{1}{7}\right)^n,$$

and according to (5)

$$S = -7 + \frac{7}{1 - \frac{1}{7}} = -7 + \frac{49}{6} = \frac{7}{6} = 1.1\overline{6}.$$

Then, $\frac{7}{6}$ is the shorthand notation for the infinite series $S$.
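As a quick numerical check of the geometric series formula, the sketch below compares partial sums with the closed form $c/(1-r)$ using exact rational arithmetic. The function names are illustrative assumptions, not part of the slides.

```python
from fractions import Fraction


def geometric_partial_sum(c, r, n_terms):
    """Sum of the first n_terms terms c + c*r + ... + c*r**(n_terms - 1)."""
    return sum(c * r**n for n in range(n_terms))


def geometric_sum(c, r):
    """Closed-form value c / (1 - r); only valid when |r| < 1."""
    if abs(r) >= 1:
        raise ValueError("the geometric series diverges for |r| >= 1")
    return c / (1 - r)


if __name__ == "__main__":
    c, r = Fraction(7), Fraction(1, 7)
    # The example series starts at n = 1, so the n = 0 term (equal to c) is removed.
    print(geometric_sum(c, r) - c)
    print(float(geometric_partial_sum(c, r, 20) - c))
```

Using `Fraction` keeps the closed-form result exact, so it can be compared directly with 7/6.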




Binary Fractions

A binary fraction is a sum of negative powers of 2, which is used to express a real number $R$ that lies in the range $0 < R < 1$.

Binary fractions

$$R = (d_1 \times 2^{-1}) + (d_2 \times 2^{-2}) + \cdots + (d_n \times 2^{-n}) + \cdots, \tag{6}$$

where $d_j \in \{0, 1\}$ and $0 < R < 1$.

Binary fraction representation of R

$$R = 0.d_1 d_2 \cdots d_n \cdots_{\text{two}}, \qquad R = \sum_{j=1}^{\infty} d_j (2)^{-j}$$


Binary Fractions-Decimal to binary
Process: Generate sequences $d_j$ and $F_j$ by repeatedly multiplying by two. At each step, the current fractional part is multiplied by 2 to give $F_j$; the integer part of $F_j$ is the digit $d_j$, and the fractional part of $F_j$ is carried to the next step.
Example:

j    Product        F_j    d_j    frac
1    (0.7)(2) =     1.4    1      0.4
2    (0.4)(2) =     0.8    0      0.8
3    (0.8)(2) =     1.6    1      0.6
4    (0.6)(2) =     1.2    1      0.2
5    (0.2)(2) =     0.4    0      0.4
6    (0.4)(2) =     0.8    0      0.8
7    (0.8)(2) =     1.6    1      0.6
8    (0.6)(2) =     1.2    1      0.2
9    (0.2)(2) =     0.4    0      0.4

     d1   d2   d3   d4   d5   d6   d7   d8   d9
0.   1    0    1    1    0    0    1    1    0    ...

$$0.7 = 0.1\overline{0110}_2$$


Binary Fractions-Decimal to binary
Exercise 2: Calculate the binary fraction for 0.6.
Start by multiplying 0.6 by 2, to generate the sequences $d_j$ and $F_j$.

j    Product        F_j    d_j    frac
1    (0.6)(2) =     1.2    1      0.2
2
3
4
5
6
7
8
9

d1   d2   d3   d4   d5   d6   d7   d8   d9


Binary Fractions-Decimal to binary

Solution

j    Product        F_j    d_j    frac
1    (0.6)(2) =     1.2    1      0.2
2    (0.2)(2) =     0.4    0      0.4
3    (0.4)(2) =     0.8    0      0.8
4    (0.8)(2) =     1.6    1      0.6
5    (0.6)(2) =     1.2    1      0.2
6    (0.2)(2) =     0.4    0      0.4
7    (0.4)(2) =     0.8    0      0.8
8    (0.8)(2) =     1.6    1      0.6
9    (0.6)(2) =     1.2    1      0.2

d1   d2   d3   d4   d5   d6   d7   d8   d9
1    0    0    1    1    0    0    1    1

$$0.6 = 0.\overline{1001}_2$$
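The multiply-by-two process can be sketched as follows. The function name is an assumption made for illustration, and `Fraction` is used instead of `float` so that decimal inputs such as 0.7 are handled exactly rather than through their own binary approximation.

```python
from fractions import Fraction


def fraction_to_binary_digits(r, n_digits):
    """Return the first n_digits binary digits d_1, d_2, ... of a number 0 <= r < 1."""
    frac = Fraction(r)
    if not 0 <= frac < 1:
        raise ValueError("r must satisfy 0 <= r < 1")
    digits = []
    for _ in range(n_digits):
        product = 2 * frac        # F_j in the tables
        d = int(product)          # integer part of F_j: the digit d_j (0 or 1)
        digits.append(d)
        frac = product - d        # fractional part carried to the next step
    return digits


if __name__ == "__main__":
    print(fraction_to_binary_digits("0.7", 9))
    print(fraction_to_binary_digits("0.6", 9))
```

Passing the number as a string (for example `"0.7"`) lets `Fraction` build the exact rational 7/10.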


Binary Fractions-Binary to decimal
The base 10 rational number $R_{10}$ associated with a base 2 binary fraction $R_2$ can be found using geometric series.
Example:

$$0.\overline{01}_2 = (0 \times 2^{-1}) + (1 \times 2^{-2}) + (0 \times 2^{-3}) + (1 \times 2^{-4}) + \cdots$$

The expression above is written as

$$\sum_{k=1}^{\infty} (2^{-2})^k = -1 + \sum_{k=0}^{\infty} (2^{-2})^k = -1 + \frac{1}{1 - \frac{1}{4}} = -1 + \frac{4}{3} = \frac{1}{3}.$$

Then, $\frac{1}{3}$ is the base 10 rational number associated with $0.\overline{01}_2$.


Binary shifting

Let $R$ be

$$R = 0.00000\overline{11000}_2. \tag{7}$$

Multiplying both sides of (7) by $2^5 = 32$ will shift the binary point 5 places to the right:

$$32R = 0.\overline{11000}_2.$$

Multiplying both sides of (7) by $2^{10} = 1024$ will shift the binary point 10 places to the right:

$$1024R = 11000.\overline{11000}_2.$$

Taking the difference $1024R - 32R = 11000.\overline{11000}_2 - 0.\overline{11000}_2$, we obtain $992R = 11000_2$. Given that $11000_2 = 24_{10}$, we find that $992R = 24$. Therefore $R = \frac{3}{124}$ in base 10.
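The shifting argument generalizes to any repeating binary fraction: shift once past the non-repeating prefix, shift again past one full period, and subtract. The sketch below assumes the fraction is supplied as two bit strings (prefix and period); that interface and the function name are illustrative choices, not taken from the slides.

```python
from fractions import Fraction


def repeating_binary_to_fraction(prefix: str, period: str) -> Fraction:
    """Exact value of the binary fraction 0.<prefix><period><period>..."""
    if not period:
        raise ValueError("period must contain at least one bit")
    m, p = len(prefix), len(period)
    head = int(prefix, 2) if prefix else 0    # integer part of 2**m * R
    shifted = int(prefix + period, 2)         # integer part of 2**(m + p) * R
    # (2**(m + p) - 2**m) * R = shifted - head, because the repeating tails cancel.
    return Fraction(shifted - head, 2 ** (m + p) - 2 ** m)


if __name__ == "__main__":
    print(repeating_binary_to_fraction("00000", "11000"))
    print(repeating_binary_to_fraction("", "01"))
```

With prefix `00000` and period `11000` the denominator is $2^{10} - 2^5 = 992$ and the numerator is 24, matching the worked example.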




Scientific Notation

Scientific notation is a standard way to present a real number. It is obtained by shifting the decimal point and supplying the appropriate power of 10.
Examples
$0.0000747 = 7.47 \times 10^{-5}$
$31.4159265 = 3.14159265 \times 10$
$9{,}700{,}000{,}000 = 9.7 \times 10^{9}$
Avogadro's constant, used in chemistry: $6.02252 \times 10^{23}$.
The quantity $1\text{K} = 1.024 \times 10^{3}$, used in computer science.




Machine Numbers

A mathematical quantity $x$ is stored in a computer as a binary approximation given by

$$x \approx \pm q \times 2^n. \tag{8}$$

The finite binary number $q$ is the mantissa, where $1/2 \leq q < 1$.

The integer $n$ is the exponent.


Floating-point format

A real number is stored in a computer as a set of binary numbers expressing:
The sign
The exponent
The mantissa

Sign | Exponent | Mantissa

The sign is always one bit, where $S = 0$ if $x > 0$ and $S = 1$ if $x < 0$.
The number of bits for the exponent and the mantissa depends on the precision of the machine.


Floating-point format-IEEE 754 standard

Precision    Total      Sign     Exponent    Mantissa    Exponent bias
Single       32 bits    1 bit    8 bits      23 bits     127
Double       64 bits    1 bit    11 bits     52 bits     1023

Note: Biasing is done because exponents have to be signed values to be able to represent both tiny and huge values, but storing them in two's complement would make comparing floating-point numbers harder.
Instead, the exponent is biased: a fixed offset is added so that the stored exponent is always a non-negative integer.
The exponent bias is calculated as $\text{bias} = 2^{\text{exp}-1} - 1$, where exp indicates the number of bits for the exponent.
Example:
if exp = 15 bits, then $\text{bias} = 2^{15-1} - 1 = 16383$


Floating-point format-IEEE 754 standard

Possible cases:

Sign (S)    Exponent (E)          Mantissa (M)    Value
0 or 1      all 0 < E < all 1     M               $(-1)^S (1.M)(2^{E-\text{bias}})$
0           E = all 1             M = 0           $+\infty$
1           E = all 1             M = 0           $-\infty$
0 or 1      E = all 1             M ≠ 0           NaN
0 or 1      E = all 0             M = 0           0
0 or 1      E = all 0             M ≠ 0           $(-1)^S (0.M)(2^{1-\text{bias}})$


Floating-point format
Example: Determine the floating-point format used to store the number $59.1875_{10}$ in a computer with 32 bits of precision.

1. Find the binary representation of the number $59.1875_{10}$

Integer part $59_{10}$ (repeated division by 2; the remainders are read from the last to the first):

Division    Quotient    Remainder
59/2        29          1   (LSB)
29/2        14          1
14/2        7           0
7/2         3           1
3/2         1           1
1/2         0           1   (MSB)

Decimal part $0.1875_{10}$ (repeated multiplication by 2; the integer parts are read from the first to the last):

Product                 Digit
0.1875 × 2 = 0.375      0   (MSB)
0.375 × 2 = 0.75        0
0.75 × 2 = 1.5          1
0.5 × 2 = 1.0           1   (LSB)

$$59.1875_{10} = 111011.0011_2$$


Floating-point format
2. Do the proper binary shifting

$$111011.0011_2 = 1.110110011_2 \times 2^5$$

3. Calculate the bias

$$\text{bias} = 2^{8-1} - 1 = 127$$

4. Determine the mantissa

$$\text{Mantissa} = 110110011_2$$

5. Determine the stored exponent

$$E = 5 + \text{bias} = 5 + 127 = 132_{10} = 10000100_2$$

S    E           M
0    10000100    11011001100000000000000
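The hand computation can be checked against the machine's own single-precision encoding. The sketch below uses Python's standard `struct` module to pack a number as a big-endian IEEE 754 single and split the 32 bits into the three fields; the helper name `float32_fields` is an assumption introduced for illustration.

```python
import struct


def float32_fields(x: float):
    """Return the (sign, exponent, mantissa) bit strings of x stored as an IEEE 754 single."""
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]


def exponent_bias(exponent_bits: int) -> int:
    """Bias 2**(exp - 1) - 1 for an exponent field of the given width."""
    return 2 ** (exponent_bits - 1) - 1


if __name__ == "__main__":
    print(float32_fields(59.1875))
    print(exponent_bias(8), exponent_bias(11))
```

Numbers that are not exactly representable in 23 mantissa bits are rounded by `struct.pack`, so the fields then describe the nearest single-precision value rather than the original number.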
Floating-point format
Example: Determine the floating-point format used to store the number $132.28125_{10}$ in a computer with 32 bits of precision.

1. Find the binary representation of the number $132.28125_{10}$

Integer part $132_{10}$ (repeated division by 2; the remainders are read from the last to the first):

Division    Quotient    Remainder
132/2       66          0   (LSB)
66/2        33          0
33/2        16          1
16/2        8           0
8/2         4           0
4/2         2           0
2/2         1           0
1/2         0           1   (MSB)

Decimal part $0.28125_{10}$ (repeated multiplication by 2; the integer parts are read from the first to the last):

Product                   Digit
0.28125 × 2 = 0.5625      0   (MSB)
0.5625 × 2 = 1.125        1
0.125 × 2 = 0.25          0
0.25 × 2 = 0.5            0
0.5 × 2 = 1.0             1   (LSB)

$$132.28125_{10} = 10000100.01001_2$$


Floating-point format
2. Do the proper binary shifting

$$10000100.01001_2 = 1.000010001001_2 \times 2^7$$

3. Calculate the bias

$$\text{bias} = 2^{8-1} - 1 = 127$$

4. Determine the mantissa

$$\text{Mantissa} = 000010001001_2$$

5. Determine the stored exponent

$$E = 7 + \text{bias} = 7 + 127 = 134_{10} = 10000110_2$$

S    E           M
0    10000110    00001000100100000000000
Floating-point format

The real value associated with a given 32-bit binary word is calculated as

$$\text{value} = (-1)^S \left(1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i}\right) \times 2^{(E-127)}$$

where
$S$ = the sign bit
$E$ = the stored exponent
127 = the bias
$d_j$ = the bits of the mantissa ($d_{22}$ is the leftmost bit and $d_0$ the rightmost)


Floating-point format
Exercise: Find the real value for the binary data:

S    E           M
0    01010010    01101000000100100000000

$$\text{value} = (-1)^S \left(1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i}\right) \times 2^{(E-127)}$$

In this example:
$S = 0$
$1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i} = 1 + 2^{-2} + 2^{-3} + 2^{-5} + 2^{-12} + 2^{-15} = 1.4065246582$
$2^{(E-127)} = 2^{((2^1 + 2^4 + 2^6) - 127)} = 2^{82-127} = 2^{-45}$
Thus

$$\text{value} = 1.4065246582 \times 2^{-45}$$
Floating-point format

Example: Find the real value for the binary data:

S    E           M
1    10000100    01000000000000000000000

Bit positions: the sign is bit 31, the exponent occupies bits 30 to 23, and the mantissa occupies bits 22 to 0.

In this example:
$S = 1$
$1 + \sum_{i=1}^{23} d_{(23-i)}\, 2^{-i} = 1 + 2^{-2} = 1.25$
$2^{(E-127)} = 2^{(132-127)} = 2^5$
Thus

$$\text{value} = (-1)^1 \times 1.25 \times 2^5 = -40.$$
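The decoding formula can be written out directly. The sketch below handles only the normalized case (exponent field neither all zeros nor all ones); the function name and the choice to pass the three fields as bit strings are assumptions made for illustration.

```python
def decode_float32(sign: str, exponent: str, mantissa: str, bias: int = 127) -> float:
    """Value of a normalized IEEE 754 single given its three bit fields as strings."""
    if len(sign) != 1 or len(exponent) != 8 or len(mantissa) != 23:
        raise ValueError("expected 1 sign bit, 8 exponent bits and 23 mantissa bits")
    s = int(sign)
    e = int(exponent, 2)
    if e == 0 or e == 255:
        raise ValueError("zero, subnormal, infinity and NaN are not handled in this sketch")
    # The leftmost mantissa bit has weight 2**-1, the next 2**-2, and so on.
    fraction = sum(int(bit) * 2.0 ** -i for i, bit in enumerate(mantissa, start=1))
    return (-1) ** s * (1 + fraction) * 2.0 ** (e - bias)


if __name__ == "__main__":
    print(decode_float32("1", "10000100", "01000000000000000000000"))
    print(decode_float32("0", "01010010", "01101000000100100000000"))
```

The special cases (all-zero and all-one exponent fields) are rejected on purpose, since they follow different value rules.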


Contents

1 Introduction

2 Binary numbers

3 Error Analysis
Absolute and relative error
Truncation Error
Round-off Error
Loss of Significance
Order of Approximation
Propagation of Error



Absolute and relative error

Definition 2.
Suppose that $\hat{p}$ is an approximation to $p$. The absolute error is $E_p = |p - \hat{p}|$, and the relative error is $R_p = |p - \hat{p}|/|p|$, provided that $p \neq 0$.

The absolute error is the magnitude of the difference between the true value and the approximate value.
The relative error expresses the error as a fraction of the true value (often quoted as a percentage).


Absolute and relative error

Example: Find the absolute and relative error in the following three cases:

True value $p$            $x = 3.141592$                      $y = 1{,}000{,}000$            $z = 0.000012$
Approximation $\hat{p}$   $\hat{x} = 3.14$                    $\hat{y} = 999{,}996$          $\hat{z} = 0.000009$
Absolute error $E_p$      $E_x = |x - \hat{x}| = 0.001592$    $E_y = |y - \hat{y}| = 4$      $E_z = |z - \hat{z}| = 0.000003$
Relative error $R_p$      $R_x = E_x/|x| = 5.067 \times 10^{-4}$   $R_y = E_y/|y| = 0.000004$     $R_z = E_z/|z| = 0.25$

Observe that as $|p|$ moves away from 1 (greater than or less than), the relative error $R_p$ is a better indicator than $E_p$ of the accuracy of the approximation.
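The two error measures translate directly into code. The following is a minimal sketch with illustrative function names (not taken from the slides); it reproduces the three cases of the example.

```python
def absolute_error(p, p_hat):
    """E_p = |p - p_hat|."""
    return abs(p - p_hat)


def relative_error(p, p_hat):
    """R_p = |p - p_hat| / |p|, defined only for p != 0."""
    if p == 0:
        raise ValueError("relative error is undefined for p = 0")
    return abs(p - p_hat) / abs(p)


if __name__ == "__main__":
    cases = [(3.141592, 3.14), (1_000_000, 999_996), (0.000012, 0.000009)]
    for p, p_hat in cases:
        print(p, p_hat, absolute_error(p, p_hat), relative_error(p, p_hat))
```

Because the inputs are ordinary floats, the printed values may differ from the table in the last few digits.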


Absolute and relative error

Definition 3.
The number $\hat{p}$ is said to approximate $p$ to $d$ significant digits if $d$ is the largest nonnegative integer for which

$$\frac{|p - \hat{p}|}{|p|} < \frac{10^{1-d}}{2}.$$


Absolute and relative error

Example:
Let $\hat{w} = 2.16$ be the approximation for $w = 2.1645$; then

$$\frac{|2.1645 - 2.16|}{|2.1645|} = 2.07900 \times 10^{-3}$$

if $d = 0$: $2.07900 \times 10^{-3} < \frac{10^{1-0}}{2} = 5$ ✓ satisfied. However, as we need to find the largest integer $d$, we need to continue.
if $d = 1$: $2.07900 \times 10^{-3} < \frac{10^{1-1}}{2} = 0.5$ ✓ satisfied
if $d = 2$: $2.07900 \times 10^{-3} < \frac{10^{1-2}}{2} = 0.05$ ✓ satisfied
if $d = 3$: $2.07900 \times 10^{-3} < \frac{10^{1-3}}{2} = 0.005$ ✓ satisfied
if $d = 4$: $2.07900 \times 10^{-3} < \frac{10^{1-4}}{2} = 0.0005$ ✗ not satisfied
Then, $\hat{w}$ approximates $w$ to 3 significant digits.


Absolute and relative error

Other examples:

If $x = 3.141592$ and $\hat{x} = 3.14$, then $|x - \hat{x}|/|x| = 0.000507 < 10^{-2}/2$. Therefore, $\hat{x}$ approximates $x$ to three significant digits.

If $y = 1{,}000{,}000$ and $\hat{y} = 999{,}996$, then $|y - \hat{y}|/|y| = 0.000004 < 10^{-5}/2$. Therefore, $\hat{y}$ approximates $y$ to six significant digits.

If $z = 0.000012$ and $\hat{z} = 0.000009$, then $|z - \hat{z}|/|z| = 0.25 < 10^{-0}/2$. Therefore, $\hat{z}$ approximates $z$ to one significant digit.
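The search for the largest $d$ in Definition 3 can be automated by testing $d = 0, 1, 2, \ldots$ until the inequality fails. The sketch below is one possible way to do this; the function name and the `max_digits` cap (needed because an exact approximation satisfies the inequality for every $d$) are assumptions introduced here.

```python
def significant_digits(p, p_hat, max_digits=16):
    """Largest d in 0..max_digits with |p - p_hat| / |p| < 10**(1 - d) / 2, or None."""
    if p == 0:
        raise ValueError("relative error is undefined for p = 0")
    rel = abs(p - p_hat) / abs(p)
    best = None
    for d in range(max_digits + 1):
        if rel < 10.0 ** (1 - d) / 2:
            best = d          # the inequality still holds, so keep searching
        else:
            break             # the bound only shrinks, so no larger d can succeed
    return best


if __name__ == "__main__":
    print(significant_digits(2.1645, 2.16))
    print(significant_digits(1_000_000, 999_996))
```

A return value of `None` means the inequality fails even for $d = 0$, that is, the relative error is 5 or more.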




Truncation Error

Truncation error refers to errors introduced when a more complicated mathematical expression is "replaced" with a more elementary formula.

For example, the infinite Taylor series

$$e^{x^2} = 1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!} + \cdots + \frac{x^{2n}}{n!} + \cdots$$

might be replaced with just the first five terms $1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$. Then a truncation error appears.


Truncation Error

Example: Given $p = \int_0^{1/2} e^{x^2}\,dx = 0.544987104184$, determine the accuracy of the approximation obtained by replacing the integrand $f(x) = e^{x^2}$ with the truncated Taylor series $P_8(x) = 1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}$.

Determine $\int_0^{1/2} P_8(x)\,dx$:

$$\begin{aligned}
\int_0^{1/2} \left(1 + x^2 + \frac{x^4}{2!} + \frac{x^6}{3!} + \frac{x^8}{4!}\right) dx
&= \left[x + \frac{x^3}{3} + \frac{x^5}{5(2!)} + \frac{x^7}{7(3!)} + \frac{x^9}{9(4!)}\right]_{x=0}^{x=1/2} \\
&= \frac{1}{2} + \frac{1}{24} + \frac{1}{320} + \frac{1}{5376} + \frac{1}{110592} \\
&= \frac{2109491}{3870720} = 0.544986720817 = \hat{p}
\end{aligned}$$

Since

$$\frac{|p - \hat{p}|}{|p|} = 7.03442 \times 10^{-7} < \frac{10^{1-6}}{2} = 5 \times 10^{-6},$$

the approximation $\hat{p}$ agrees with the true value to 6 significant digits.
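The term-by-term integration can be reproduced with exact rational arithmetic. In the sketch below the function name is an illustrative assumption, and the reference value of $p$ is simply the one quoted in the example, not something computed here.

```python
from fractions import Fraction
from math import factorial


def integral_of_truncated_series(upper, n_terms=5):
    """Integrate sum_{n < n_terms} x**(2n) / n! from 0 to upper, term by term."""
    upper = Fraction(upper)
    # The antiderivative of x**(2n) / n! is x**(2n + 1) / ((2n + 1) * n!).
    return sum(upper ** (2 * n + 1) / ((2 * n + 1) * factorial(n)) for n in range(n_terms))


if __name__ == "__main__":
    p = 0.544987104184                      # value of the integral quoted in the example
    p_hat = integral_of_truncated_series(Fraction(1, 2))
    relative = abs(p - float(p_hat)) / abs(p)
    print(p_hat, float(p_hat), relative)
```

Raising `n_terms` adds further Taylor terms, which gives a rough feel for how quickly the truncation error shrinks.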


Round-off Error

The accuracy of the representation of a real number stored in a


computer is determined by the precision of the mantissa.

Professor PhD Henry Arguello Fuentes Numerical methods March 9, 2017 48 / 78


Round-off Error

The accuracy of the representation of a real number stored in a


computer is determined by the precision of the mantissa.

The error occurred due to the mantissa precision is the round-off


error.

Professor PhD Henry Arguello Fuentes Numerical methods March 9, 2017 48 / 78


Round-off Error

The accuracy of the representation of a real number stored in a


computer is determined by the precision of the mantissa.

The error occurred due to the mantissa precision is the round-off


error.

The actual number that is stored in the computer may be


chopping or rounding of the last digit.

Professor PhD Henry Arguello Fuentes Numerical methods March 9, 2017 48 / 78


Round-off Error

The accuracy of the representation of a real number stored in a computer is determined by the precision of the mantissa.

The error that arises from the limited precision of the mantissa is the round-off error.

The number actually stored in the computer may be obtained by chopping or by rounding the last digit.

Because the computer hardware works with a limited number of digits in machine numbers, errors are introduced and propagated in successive computations.
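A small illustration of round-off accumulating over successive operations, assuming IEEE double-precision floats (Python's `float`). The decimal number 0.1 has no exact binary representation, so ten additions of it do not give exactly 1.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1            # each addition rounds to the nearest machine number

print(total == 1.0)         # False: round-off has accumulated
print(abs(total - 1.0))     # size of the accumulated error

# math.fsum tracks the lost low-order bits and returns the correctly rounded sum.
compensated = math.fsum([0.1] * 10)
print(compensated == 1.0)   # True
```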


Chopping Off versus Rounding Off

Example:
Consider $p$ expressed in normalized decimal form:

$$p = \pm 0.d_1 d_2 d_3 \cdots d_k d_{k+1} \cdots \times 10^n,$$

where $1 \le d_1 \le 9$ and $0 \le d_j \le 9$ for $j > 1$.

If $k$ is the maximum number of decimal digits carried, then the real number $p$ is represented by $fl_{chop}(p)$, which is given by

$$fl_{chop}(p) = \pm 0.d_1 d_2 d_3 \cdots d_k \times 10^n, \tag{9}$$

where $1 \le d_1 \le 9$ and $0 \le d_j \le 9$ for $1 < j \le k$. The number $fl_{chop}(p)$ is called the chopped floating-point representation of $p$.

On the other hand, the rounded floating-point representation $fl_{round}(p)$ is given by

$$fl_{round}(p) = \pm 0.d_1 d_2 d_3 \cdots r_k \times 10^n, \tag{10}$$

where $1 \le d_1 \le 9$ and $0 \le d_j \le 9$ for $1 < j < k$, and the last digit, $r_k$, is obtained by rounding the number $d_k.d_{k+1} d_{k+2} \cdots$ to the nearest integer.

Example:
The real number $p = \frac{22}{7} = 3.142857142857142857\ldots$ has the following six-digit representations:

$$fl_{chop}(p) = 0.314285 \times 10^1,$$

$$fl_{round}(p) = 0.314286 \times 10^1.$$

For common purposes the chopped and rounded values would be written as 3.14285 and 3.14286, respectively.
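The two representations can be reproduced with Python's `decimal` module. This is a minimal sketch, not a model of any particular machine: the helper names `fl`, `fl_chop`, and `fl_round` are hypothetical, and the case where rounding carries into a new leading digit (for example 9.999996) is not handled.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(p, k, rounding):
    """Keep k significant decimal digits of p using the given rounding mode."""
    p = Decimal(p)
    if p == 0:
        return p
    # adjusted() is the exponent of the leading digit; the last kept digit
    # therefore sits at 10**(adjusted - (k - 1)).
    exponent = p.adjusted() - (k - 1)
    return p.quantize(Decimal(1).scaleb(exponent), rounding=rounding)

def fl_chop(p, k):
    return fl(p, k, ROUND_DOWN)      # discard digits d_{k+1}, d_{k+2}, ...

def fl_round(p, k):
    return fl(p, k, ROUND_HALF_UP)   # round the k-th digit to nearest

p = Decimal(22) / Decimal(7)
print(fl_chop(p, 6))    # 3.14285
print(fl_round(p, 6))   # 3.14286
```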


Loss of Significance

Consider $p = 3.1415926536$ and $q = 3.1415957341$, which are nearly equal and both carry 11 decimal digits of precision.

Their difference is $p - q = -0.0000030805$. Since the first six digits of $p$ and $q$ are the same, their difference $p - q$ contains only five decimal digits of precision.

This phenomenon is called loss of significance.
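The digit count can be confirmed with exact decimal arithmetic. The sketch takes $p = 3.1415926536$ and $q = 3.1415957341$, the 11-digit values consistent with the stated difference, and counts the significant digits left after subtraction.

```python
from decimal import Decimal

p = Decimal("3.1415926536")
q = Decimal("3.1415957341")

diff = p - q                                   # exact in decimal arithmetic
significant_digits = len(diff.as_tuple().digits)

print(diff)                 # -0.0000030805
print(significant_digits)   # 5
```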

Example:
Compare the results of calculating $f(500)$ and $g(500)$ using six digits and rounding, where

$$f(x) = x\left(\sqrt{x+1} - \sqrt{x}\right) \quad \text{and} \quad g(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}.$$

For the first function,

$$f(500) = 500\left(\sqrt{501} - \sqrt{500}\right) = 500(22.3830 - 22.3607) = 500(0.0223) = 11.1500.$$

For $g(x)$,

$$g(500) = \frac{500}{\sqrt{501} + \sqrt{500}} = \frac{500}{22.3830 + 22.3607} = \frac{500}{44.7437} = 11.1748.$$

The second function, $g(x)$, is algebraically equivalent to $f(x)$, but the answer $g(500) = 11.1748$ involves less error: it is the same as that obtained by rounding the true value $11.174755300747198\ldots$ to six digits.
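The six-digit computation can be imitated by setting the precision of Python's `decimal` context to 6, so that every operation (including the square roots) is rounded to six significant digits. This is a sketch of the hand calculation, not of binary machine arithmetic.

```python
from decimal import Decimal, localcontext

def f(x):
    # Subtracts two nearly equal square roots: prone to loss of significance.
    return x * ((x + 1).sqrt() - x.sqrt())

def g(x):
    # Algebraically equivalent form that avoids the subtraction.
    return x / ((x + 1).sqrt() + x.sqrt())

with localcontext() as ctx:
    ctx.prec = 6                  # six significant digits, rounded
    x = Decimal(500)
    f_500 = f(x)
    g_500 = g(x)

print(f_500)   # 11.1500
print(g_500)   # 11.1748
```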

Example: Compare the results of calculating $f(0.01)$ and $P(0.01)$ using six digits and rounding, where

$$f(x) = \frac{e^x - 1 - x}{x^2} \quad \text{and} \quad P(x) = \frac{1}{2} + \frac{x}{6} + \frac{x^2}{24}.$$

The function $P(x)$ is the Taylor polynomial of degree $n = 2$ for $f(x)$ expanded about $x = 0$.

For the first function,

$$f(0.01) = \frac{e^{0.01} - 1 - 0.01}{(0.01)^2} = \frac{1.010050 - 1 - 0.01}{0.0001} = 0.5.$$

For the second function,

$$P(0.01) = \frac{1}{2} + \frac{0.01}{6} + \frac{0.0001}{24} = 0.5 + 0.001667 + 0.000004 = 0.501671.$$

The answer $P(0.01) = 0.501671$ contains less error: it is the same as that obtained by rounding the true answer $0.5016708416805\ldots$ to six digits.
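The same comparison can be sketched with a six-digit `decimal` context. Here every intermediate result is rounded to six significant digits, which differs slightly from the hand calculation (which rounds some terms to six decimal places) but leads to the same final values.

```python
from decimal import Decimal, localcontext

def f(x):
    # e^x - 1 - x cancels almost all significant digits for small x.
    return (x.exp() - 1 - x) / (x * x)

def P(x):
    # Degree-2 Taylor polynomial of f about x = 0: no cancellation.
    return Decimal(1) / 2 + x / 6 + x * x / 24

with localcontext() as ctx:
    ctx.prec = 6
    x = Decimal("0.01")
    f_val = f(x)
    P_val = P(x)

print(f_val)   # 0.5
print(P_val)   # 0.501671
```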

$O(h^n)$ Order of Approximation
For functions
Definition 4.
The function $f(h)$ is said to be big Oh of $g(h)$, denoted $f(h) = O(g(h))$, if there exist constants $C$ and $c$ such that

$$|f(h)| \le C|g(h)| \quad \text{whenever } h \ge c. \tag{11}$$

Example: Consider $f(x) = x^2 + 1$ and $g(x) = x^3$.

Since $x^2 \le x^3$ and $1 \le x^3$ for $x \ge 1$, it follows that $x^2 + 1 \le 2x^3$ for $x \ge 1$.

Therefore $f(x) = O(g(x))$, with $C = 2$ and $c = 1$.

The big Oh notation provides a useful way of describing the rate of growth of a function in terms of well-known elementary functions ($x^n$, $x^{1/n}$, $a^x$, $\log_a(x)$, etc.).
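A numerical spot check of the example with $C = 2$ and $c = 1$. A finite grid cannot prove the bound, so this is only a sanity check under the assumption that sampling $x$ in steps of 0.5 is representative.

```python
def f(x):
    return x**2 + 1

def g(x):
    return x**3

C, c = 2, 1
samples = [c + 0.5 * i for i in range(200)]   # x = 1.0, 1.5, ..., 100.5

bound_holds = all(abs(f(x)) <= C * abs(g(x)) for x in samples)
ratios = [f(x) / g(x) for x in (1, 10, 100, 1000)]

print(bound_holds)   # True on the sampled points
print(ratios)        # the ratio f/g never exceeds C and shrinks as x grows
```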

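For sequences
Definition 5.
Let $\{x_n\}_{n=1}^{\infty}$ and $\{y_n\}_{n=1}^{\infty}$ be two sequences. The sequence $\{x_n\}$ is said to be of order big Oh of $\{y_n\}$, denoted $x_n = O(y_n)$, if there exist constants $C$ and $N$ such that

$$|x_n| \le C|y_n| \quad \text{whenever } n \ge N. \tag{12}$$

Example:

$$\frac{n^2 - 1}{n^3} = O\left(\frac{1}{n}\right), \quad \text{since} \quad \frac{n^2 - 1}{n^3} \le \frac{n^2}{n^3} = \frac{1}{n} \quad \text{whenever } n \ge 1.$$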

Definition 6.
Assume that $f(h)$ is approximated by the function $p(h)$ and that there exist a real constant $M > 0$ and a positive integer $n$ so that

$$\frac{|f(h) - p(h)|}{|h^n|} \le M \quad \text{for sufficiently small } h. \tag{13}$$

We say that $p(h)$ approximates $f(h)$ with order of approximation $O(h^n)$ and write

$$f(h) = p(h) + O(h^n). \tag{14}$$

When relation (13) is rewritten in the form $|f(h) - p(h)| \le M|h^n|$, we see that the notation $O(h^n)$ stands in place of the error bound $M|h^n|$.

Theorem 2. Order of approximation for basic operations

Assume that $f(h) = p(h) + O(h^n)$, $g(h) = q(h) + O(h^m)$, and $r = \min(m, n)$. Then

$$f(h) + g(h) = p(h) + q(h) + O(h^r), \tag{15}$$

$$f(h)g(h) = p(h)q(h) + O(h^r), \tag{16}$$

and

$$\frac{f(h)}{g(h)} = \frac{p(h)}{q(h)} + O(h^r) \quad \text{provided that } g(h) \ne 0 \text{ and } q(h) \ne 0. \tag{17}$$

Theorem 3. (Taylor's Theorem).

Assume $f \in C^{n+1}[a, b]$. If both $x_0$ and $x = x_0 + h$ lie in $[a, b]$, then

$$f(x_0 + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} h^k + O(h^{n+1}). \tag{18}$$

Additional properties:
(i) $O(h^p) + O(h^p) = O(h^p)$,
(ii) $O(h^p) + O(h^q) = O(h^r)$, where $r = \min(p, q)$, and
(iii) $O(h^p)O(h^q) = O(h^s)$, where $s = p + q$.

Example:
Consider the Taylor polynomial expansions

$$e^h = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4) \quad \text{and} \quad \cos(h) = 1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6).$$

Determine the order of approximation for their sum and product.

For the sum we have

$$e^h + \cos(h) = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4) + 1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6)$$

$$= 2 + h + \frac{h^3}{3!} + O(h^4) + \frac{h^4}{4!} + O(h^6).$$

Since $O(h^4) + \frac{h^4}{4!} = O(h^4)$ and $O(h^4) + O(h^6) = O(h^4)$, this reduces to

$$e^h + \cos(h) = 2 + h + \frac{h^3}{3!} + O(h^4),$$

and the order of approximation is $O(h^4)$.

The product is treated similarly:

$$e^h \cos(h) = \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + O(h^4)\right)\left(1 - \frac{h^2}{2!} + \frac{h^4}{4!} + O(h^6)\right)$$

$$= \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!}\right)\left(1 - \frac{h^2}{2!} + \frac{h^4}{4!}\right) + \left(1 + h + \frac{h^2}{2!} + \frac{h^3}{3!}\right)O(h^6) + \left(1 - \frac{h^2}{2!} + \frac{h^4}{4!}\right)O(h^4) + O(h^4)O(h^6)$$

$$= 1 + h - \frac{h^3}{3} - \frac{5h^4}{24} - \frac{h^5}{24} + \frac{h^6}{48} + \frac{h^7}{144} + O(h^6) + O(h^4) + O(h^4)O(h^6).$$

Since $O(h^4)O(h^6) = O(h^{10})$ and

$$-\frac{5h^4}{24} - \frac{h^5}{24} + \frac{h^6}{48} + \frac{h^7}{144} + O(h^6) + O(h^4) + O(h^{10}) = O(h^4),$$

the preceding equation is simplified to yield

$$e^h \cos(h) = 1 + h - \frac{h^3}{3} + O(h^4),$$

and the order of approximation is $O(h^4)$.
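One way to see the order numerically: if the error is $O(h^4)$, the ratio $|\text{error}|/h^4$ should stay bounded as $h$ shrinks. The sketch below uses the polynomials $2 + h + h^3/6$ for the sum and $1 + h - h^3/3$ for the product (the cubic obtained by multiplying out the two Taylor polynomials). The step sizes are arbitrary choices for illustration.

```python
import math

def sum_error_ratio(h):
    approx = 2 + h + h**3 / 6
    return abs(math.exp(h) + math.cos(h) - approx) / h**4

def product_error_ratio(h):
    approx = 1 + h - h**3 / 3
    return abs(math.exp(h) * math.cos(h) - approx) / h**4

# If the order is O(h^4), these ratios remain bounded as h decreases.
for h in (0.1, 0.05, 0.01):
    print(h, sum_error_ratio(h), product_error_ratio(h))
```

In this run the ratios settle near $1/12$ for the sum and $1/6$ for the product, which are the coefficients of the first neglected $h^4$ terms.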


Order of Convergence of a Sequence

Convergence of a sequence
Definition 7.
Suppose that $\lim_{n \to \infty} x_n = x$ and $\{r_n\}_{n=1}^{\infty}$ is a sequence with $\lim_{n \to \infty} r_n = 0$. We say that $\{x_n\}_{n=1}^{\infty}$ converges to $x$ with the order of convergence $O(r_n)$ if there exists a constant $K \ge 0$ such that

$$\frac{|x_n - x|}{|r_n|} \le K \quad \text{for } n \text{ sufficiently large}. \tag{19}$$

This is indicated by writing $x_n = x + O(r_n)$, or $x_n \to x$ with order of convergence $O(r_n)$.

Example:
Let $x_n = \cos(n)/n^2$ and $r_n = 1/n^2$; then

$$\lim_{n \to \infty} x_n = 0$$

with a rate of convergence $O(1/n^2)$. This follows immediately from the relation

$$\frac{|\cos(n)/n^2|}{|1/n^2|} = |\cos(n)| \le 1 \quad \text{for all } n.$$


Propagation of Error

Consider first addition. Let $p$ and $q$ be two numbers (the true values) with approximate values $\hat{p}$ and $\hat{q}$, which contain errors $\epsilon_p$ and $\epsilon_q$, respectively. Starting with $p = \hat{p} + \epsilon_p$ and $q = \hat{q} + \epsilon_q$, the sum is

$$p + q = (\hat{p} + \epsilon_p) + (\hat{q} + \epsilon_q) = (\hat{p} + \hat{q}) + (\epsilon_p + \epsilon_q). \tag{20}$$

Hence, for addition, the error in the sum is the sum of the errors in the addends:

$$\epsilon_s = \epsilon_p + \epsilon_q.$$

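The propagation of error in multiplication is more complicated. The product is

$$pq = (\hat{p} + \epsilon_p)(\hat{q} + \epsilon_q) = \hat{p}\hat{q} + \hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q. \tag{21}$$

Hence, if $\hat{p}$ and $\hat{q}$ are larger than 1 in absolute value, the terms $\hat{p}\epsilon_q$ and $\hat{q}\epsilon_p$ show that there is a possibility of magnification of the original errors $\epsilon_p$ and $\epsilon_q$. Insights are gained if we look at the relative error. Rearrange the terms in (21) to get

$$pq - \hat{p}\hat{q} = \hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q. \tag{22}$$

Suppose that $p \ne 0$ and $q \ne 0$; then we can divide (22) by $pq$ to obtain the relative error in the product $pq$:

$$R_{pq} = \frac{pq - \hat{p}\hat{q}}{pq} = \frac{\hat{p}\epsilon_q + \hat{q}\epsilon_p + \epsilon_p\epsilon_q}{pq} = \frac{\hat{p}\epsilon_q}{pq} + \frac{\hat{q}\epsilon_p}{pq} + \frac{\epsilon_p\epsilon_q}{pq}. \tag{23}$$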

Furthermore, suppose that $\hat{p}$ and $\hat{q}$ are good approximations for $p$ and $q$; then $\hat{p}/p \approx 1$, $\hat{q}/q \approx 1$, and $R_p R_q = (\epsilon_p/p)(\epsilon_q/q) \approx 0$ ($R_p$ and $R_q$ are the relative errors in the approximations $\hat{p}$ and $\hat{q}$). Making these substitutions yields the simplified relationship

$$R_{pq} = \frac{pq - \hat{p}\hat{q}}{pq} \approx \frac{\epsilon_q}{q} + \frac{\epsilon_p}{p} + 0 = R_q + R_p. \tag{24}$$

This shows that the relative error in the product $pq$ is approximately the sum of the relative errors in the approximations $\hat{p}$ and $\hat{q}$.
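A quick numerical illustration of this relationship. The values of `p`, `q`, `p_hat`, and `q_hat` below are hypothetical, chosen only so that the approximations are good to about five digits.

```python
# Hypothetical true values and approximations (not from the slides).
p, q = 3.1415926536, 2.7182818285
p_hat, q_hat = 3.1416, 2.7183

R_p = (p - p_hat) / p                      # relative error in p_hat
R_q = (q - q_hat) / q                      # relative error in q_hat
R_pq = (p * q - p_hat * q_hat) / (p * q)   # relative error in the product

print(R_p, R_q)
print(R_pq, R_p + R_q)   # the two values agree up to a term of size R_p * R_q
```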

A quality that is desirable for any numerical process is that a small error
in the initial conditions will produce small changes in the final result.
An algorithm with this feature is called stable; otherwise, it is called
unstable.

Definition 8.
Suppose that $\epsilon$ represents an initial error and $\epsilon(n)$ represents the growth of the error after $n$ steps. If $|\epsilon(n)| \approx n\epsilon$, the growth of error is said to be linear. If $|\epsilon(n)| \approx K^n \epsilon$, the growth of error is called exponential. If $K > 1$, the exponential error grows without bound as $n \to \infty$, and if $0 < K < 1$, the exponential error diminishes to zero as $n \to \infty$.

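Example: Show that the following three schemes can be used with infinite-precision arithmetic to recursively generate the terms in the sequence $\{1/3^n\}_{n=0}^{\infty}$.

$$r_0 = 1 \quad \text{and} \quad r_n = \frac{1}{3} r_{n-1} \quad \text{for } n = 1, 2, \ldots, \tag{25}$$

$$p_0 = 1, \quad p_1 = \frac{1}{3}, \quad \text{and} \quad p_n = \frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} \quad \text{for } n = 2, 3, \ldots, \tag{26}$$

$$q_0 = 1, \quad q_1 = \frac{1}{3}, \quad \text{and} \quad q_n = \frac{10}{3} q_{n-1} - q_{n-2} \quad \text{for } n = 2, 3, \ldots. \tag{27}$$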

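Formula (25) is obvious. In (26) the difference equation has the general solution $p_n = A(1/3^n) + B$. This can be verified by direct substitution:

$$\frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} = \frac{4}{3}\left(\frac{A}{3^{n-1}} + B\right) - \frac{1}{3}\left(\frac{A}{3^{n-2}} + B\right)$$

$$= \left(\frac{4}{3^n} - \frac{3}{3^n}\right)A + \left(\frac{4}{3} - \frac{1}{3}\right)B = A\frac{1}{3^n} + B = p_n.$$

Setting $A = 1$ and $B = 0$ will generate the sequence desired. In (27) the difference equation has the general solution $q_n = A(1/3^n) + B3^n$. This too is verified by substitution:

$$\frac{10}{3} q_{n-1} - q_{n-2} = \frac{10}{3}\left(\frac{A}{3^{n-1}} + B3^{n-1}\right) - \left(\frac{A}{3^{n-2}} + B3^{n-2}\right)$$

$$= \left(\frac{10}{3^n} - \frac{9}{3^n}\right)A + (10 - 1)3^{n-2}B = A\frac{1}{3^n} + B3^n = q_n.$$

Again, setting $A = 1$ and $B = 0$ generates the sequence desired.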

Example:
Generate approximations to the sequence $\{x_n\} = \{1/3^n\}$ using the schemes

$$r_0 = 0.99996 \quad \text{and} \quad r_n = \frac{1}{3} r_{n-1} \quad \text{for } n = 1, 2, \ldots, \tag{28}$$

$$p_0 = 1, \quad p_1 = 0.33332, \quad \text{and} \quad p_n = \frac{4}{3} p_{n-1} - \frac{1}{3} p_{n-2} \quad \text{for } n = 2, 3, \ldots, \tag{29}$$

$$q_0 = 1, \quad q_1 = 0.33332, \quad \text{and} \quad q_n = \frac{10}{3} q_{n-1} - q_{n-2} \quad \text{for } n = 2, 3, \ldots. \tag{30}$$

In (28) the initial error in $r_0$ is 0.00004, and in (29) and (30) the initial errors in $p_1$ and $q_1$ are 0.000013. Investigate the propagation of error for each scheme.

Table: Sequence $x_n = 1/3^n$ and the approximations $r_n$, $p_n$, and $q_n$

n xn rn pn qn
0 1.0000000000 0.9999600000 1.0000000000 1.0000000000
1 0.3333333333 0.3333200000 0.3333200000 0.3333200000
2 0.1111111111 0.1111066667 0.1110933333 0.1110666667
3 0.0370370370 0.0370355556 0.0370177778 0.0369022222
4 0.0123456790 0.0123451852 0.0123259259 0.0119407407
5 0.0041152263 0.0041150617 0.0040953086 0.0029002469
6 0.0013717421 0.0013716872 0.0013517695 -0.0022732510
7 0.0004572474 0.0004572291 0.0004372565 -0.0104777503
8 0.0001524158 0.0001524097 0.0001324188 -0.0326525834
9 0.0000508053 0.0000508032 0.0000308063 -0.0983641945
10 0.0000169351 0.0000169344 -0.0000030646 -0.2952280648

Table: Error sequences xn − rn , xn − pn , and xn − qn

n xn − rn xn − pn xn − qn
0 0.0000400000 0.0000000000 0.0000000000
1 0.0000133333 0.0000133333 0.0000133333
2 0.0000044444 0.0000177778 0.0000444444
3 0.0000014815 0.0000192593 0.0001348148
4 0.0000004938 0.0000197531 0.0004049383
5 0.0000001646 0.0000199177 0.0012149794
6 0.0000000549 0.0000199726 0.0036449931
7 0.0000000183 0.0000199909 0.0109349977
8 0.0000000061 0.0000199970 0.0328049992
9 0.0000000020 0.0000199990 0.0984149997
10 0.0000000007 0.0000199997 0.2952449999

Figure: Error sequences $x_n - r_n$ (vertical scale $\times 10^{-5}$), $x_n - p_n$ (vertical scale $\times 10^{-5}$), and $x_n - q_n$, each plotted against $n$ for $n = 0, \ldots, 10$.

The error for $\{r_n\}$ is stable and decreases in an exponential manner. The error for $\{p_n\}$ is stable. The error for $\{q_n\}$ is unstable and grows at an exponential rate. Although the error for $\{p_n\}$ is stable, the terms $p_n \to 0$ as $n \to \infty$, so that the error eventually dominates and the terms past $p_8$ have no significant digits.
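The error sequences can be regenerated with a few lines of Python. This is a sketch in double-precision floats, using the perturbed starting values $r_0 = 0.99996$ and $p_1 = q_1 = 0.33332$ and applying the two-term recursions from $n = 2$ onward. The function name `run_schemes` is an assumption of this example.

```python
def run_schemes(n_max=10):
    """Return the exact sequence x_n = 1/3^n and the three approximations."""
    x = [1 / 3**n for n in range(n_max + 1)]
    r = [0.99996]
    p = [1.0, 0.33332]
    q = [1.0, 0.33332]
    for n in range(1, n_max + 1):
        r.append(r[n - 1] / 3)                        # one-term recursion
    for n in range(2, n_max + 1):
        p.append(4 / 3 * p[n - 1] - 1 / 3 * p[n - 2])  # error tends to a constant
        q.append(10 / 3 * q[n - 1] - q[n - 2])         # error grows like 3^n
    return x, r, p, q

x, r, p, q = run_schemes()
print(" n     x_n - r_n     x_n - p_n     x_n - q_n")
for n in range(len(x)):
    print(f"{n:2d}  {x[n] - r[n]: .10f}  {x[n] - p[n]: .10f}  {x[n] - q[n]: .10f}")
```

The three error columns shrink, level off near $2 \times 10^{-5}$, and roughly triple at each step, respectively, matching the stable, stable, and unstable behavior described above.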