
SALES AND MARKETING Department

MATHEMATICS

2nd Semester

________ Bivariate statistics ________

SOLUTIONS of tutorials and exercises

Online document: http://jff-dut-tc.weebly.com section DUT Maths S2

____________________________________________________________________________
IUT de Saint-Etienne – Département TC –J.F.Ferraris – Math – S2 – Stat2Var – TExCorr – Rev2020
Exercise 1. (Tutorial for lesson page 5)
Are people’s behaviour in relation to tobacco and their gender related, at a 10% significance level?
Here are the results of a survey made on a sample of 51 men and 66 women:
G: variable "gender"          B: variable "behaviour in relation to tobacco"
   Gm: men                       Bn: never smoked
   Gw: women                     Bs: smoke
                                 Bss: stopped smoking

Observed frequencies:
         Gm     Gw    total
Bn       12     23      35
Bs       31     26      57
Bss       8     17      25
total    51     66     117

Theoretical frequencies according to H0:
         Gm      Gw     total
Bn      15.26   19.74     35
Bs      24.85   32.15     57
Bss     10.90   14.10     25
total   51      66       117

Detailed Chi-squares and total:
         Gm        Gw
Bn      0.69507   0.53710
Bs      1.52417   1.17777
Bss     0.77038   0.59529
total: 5.300

1) Place the subtotals and the general total in the first table, and in the second one, identically.
2) Fill the second table (6 central theoretical values) following proportional calculations.
3) Table #3: calculate the six partial Chi-squares, then add them to get the value χ²calc.

4) Test writing:
Null hypothesis: H0 : Gender and tobacco behaviour are independent
Observed χ²
Value of the variable χ² between the observed and the theoretical samples: χ²calc = 5.3
Rejection area
Significance level: α = 10 %
Number of dof: (r-1)(k-1) = (3 – 1)(2 - 1) = 2
Limit value of χ² beyond which H0 is rejected: χ²lim = 4.61
Comparison and decision:
As χ²calc > χ²lim , H0 can be rejected, at a 10% significance level.

In other words, we can claim, with less than a 10% risk of being wrong, that men and women behave
differently towards tobacco. However, we could not reject the null hypothesis at a 5% significance level:
χ²lim is then 5.99, which χ²calc does not reach, so claiming dependence carries more than a 5% risk
of being wrong.
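
The calculation steps of this tutorial (totals, theoretical frequencies, partial Chi-squares) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_statistic` is hypothetical, and the limit value still has to be read from the χ² table.

```python
def chi_square_statistic(observed):
    """Return (chi2, dof, expected) for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Theoretical count under H0 (independence): row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Rows: Bn, Bs, Bss; columns: Gm, Gw (data of Exercise 1)
observed = [[12, 23], [31, 26], [8, 17]]
chi2, dof, expected = chi_square_statistic(observed)
print(round(chi2, 3), dof)  # to be compared with the limit read from the chi-square table
```

The same function applies unchanged to the larger tables of the following exercises.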

Exercise 2.
Two candidates A and B compete for a presidential election. In a little town, there are 500 voters. 100 are
retired people, 50 are unemployed and 350 are employees. There, the vote results are:
voters \ candidates     A      B    blank/abstention
unemployed             24     16        10
employees             122    148        80
retired                36     27        37
1) Decide, with a 1% significance level, whether people’s opinion depends on their social group or not.
* H0: "The type of vote is independent of the social group"
* Let’s perform the necessary calculations in order to get χ²calc:
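Observations:
               A       B     blank/abst.   total
unemployed    24      16       10            50
employees    122     148       80           350
retired       36      27       37           100
total        182     191      127           500

In theory (independence):
               A       B     blank/abst.   total
unemployed    18.2    19.1     12.7          50
employees    127.4   133.7     88.9         350
retired       36.4    38.2     25.4         100
total        182     191      127           500

Chi-squares:
               A       B     blank/abst.
unemployed    1.848   0.503    0.574
employees     0.229   1.529    0.891
retired       0.004   3.284    5.298
Chi²calc = 14.16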

* Rejection area: with α = 1 %, and with 4 degrees of freedom : Chi²lim = 13.28


* Decision: as Chi²calc > Chi²lim, we can reject H0 (that is, claim that people’s opinion depends on their
social group) with less than a 1% risk of being wrong.

2) What can we say if we do not include blank votes and abstentions?

Let’s redo the analysis, excluding blank votes and abstentions:
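* Observations:
               A       B     total
unemployed    24      16       40
employees    122     148      270
retired       36      27       63
total        182     191      373

  In theory (independence):
               A        B      total
unemployed    19.52    20.48     40
employees    131.7    138.3     270
retired       30.74    32.26     63
total        182      191       373

  Chi-squares:
               A       B
unemployed    1.03    0.981
employees     0.72    0.687
retired       0.90    0.858
  Chi²calc = 5.175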

* with 2 dof : Chi²lim = 5.991 with α = 5 % and Chi²lim = 4.605 with α = 10 %.


We can assert that people’s opinion depends on their social group with a 10% risk of being wrong, but we
could not assert it if we were willing to take only a 5% risk of being wrong.

Exercise 3.
The table shows attendance in two stores A and B: how many people made at least one purchase. These
clients are sorted by age group (10 to 15 years old, and so on).

age          store A   store B
10 - 15        46        24
15 - 20        29        35
20 - 40        14        17
> 40           12        18

1) Say, with a 5% significance level, whether the chosen store depends on the age of a client.

* Observed frequencies (obs):
age          store A   store B   total
10 to 15       46        24        70
15 to 20       29        35        64
20 to 40       14        17        31
40 +           12        18        30
total         101        94       195

  Theoretical frequencies (th):
age          store A   store B   total
10 to 15      36.26     33.74      70
15 to 20      33.15     30.85      64
20 to 40      16.06     14.94      31
40 +          15.54     14.46      30
total        101        94        195

  Chi-squares (χ²):
age          store A   store B   total
10 to 15     2.6185    2.8135    5.4320
15 to 20     0.5192    0.5579    1.0771
20 to 40     0.2634    0.2830    0.5464
40 +         0.8058    0.8658    1.6716
total        4.2069    4.5202    8.727
* with 3 dof and a 5% level, the table gives χ²lim = 7.815.
* Thus, this limit value has been exceeded. With a 5 % significance level, we can reject the hypothesis that
the choice of the store and the age group are independent.

2) What age group mostly contributes to the previous result? Explain.


The age group "10 to 15 years old" contributes most to the total χ² (5.432 out of 8.727). People over
15 years old show much the same purchasing behaviour from one group to the next. By contrast, the first
age group shows a very different frequency distribution (first table) compared to the other customers.

3) Give the meaning of the “5% significance level” on your first answer.
We claim that age and chosen store are dependent, with at most a 5% risk of being wrong.

4) According to your Chi² table, can you be more accurate about the chance taken in this statement (your first
answer)?
To reach a 2% level, χ²calc would have had to exceed 9.837, which it does not. So, the χ² table
(formula sheet) doesn’t allow us to say more than “the risk is between 2% and 5%”.

Exercise 4.
In a survey, 100 people were asked about their age and their attendance at theatres (cinema). We name X the
variable "age" and Y the variable "number of annual cinema shows". The survey result is the following table of
counts:
Y \ X       [15 ; 25[   [25 ; 50[   ≥ 50
none            4           6        13
1 to 11        10          16        15
12 to 23       13           8         4
≥ 24            6           3         2

1) By a χ² independence test, with a 2% significance level, decide whether there’s a link or not between the
age and the level of attendance at the cinema.
Y \ X       [15 ; 25[             [25 ; 50[             50 and more           total
            obs    th     χ²      obs    th     χ²      obs    th     χ²      obs   th    χ²
none          4   7.59   1.698      6   7.59   0.333     13   7.82   3.431     23    23   5.462
1 to 11      10  13.53   0.921     16  13.53   0.451     15  13.94   0.081     41    41   1.453
12 to 23     13   8.25   2.735      8   8.25   0.008      4   8.50   2.382     25    25   5.125
≥ 24          6   3.63   1.547      3   3.63   0.109      2   3.74   0.810     11    11   2.466
total        33  33      6.901     33  33      0.901     34  34      6.704    100   100  14.51
With 6 dof and α = 2%, the χ² table gives Chi²lim = 15.03.
Our Chi²calc (14.51) doesn’t exceed it. So, at a 2% significance level, we can’t reject the idea that age and
level of attendance at the cinema are independent.
2) Using your χ² table (formula sheet), discuss the level of confidence you can assign to the assertion:
“they are dependent”.
Our Chi²calc (14.51) is located between both Chi²lim of levels 2% and 5%. Thus, we can assume dependence
with more than 95% confidence, but with less than 98% confidence.
3) Identify the most important partial Chi-2s and give the meaning of these high values.
The biggest partial Chi² is obtained for people aged 50 and more whose attendance is zero: the
observed frequency (13) is much higher than the expected one (7.82).
The partial Chi² of people aged 50 and more whose attendance is “between 12 and 23 times a year” is big
too: the observed frequency (4) is much lower than the theoretical one (8.5).
The partial Chi² of the 15 to 25 year-olds whose attendance is “between 12 and 23 times a year” is big too:
the observed frequency (13) is much higher than the theoretical one (8.25).
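
To spot the largest partial Chi-squares without scanning the table by eye, one possible approach is to compute every cell's contribution and sort them. The sketch below uses the Exercise 4 counts; the labels and variable names are only illustrative.

```python
rows = ["none", "1 to 11", "12 to 23", ">= 24"]
cols = ["[15;25[", "[25;50[", ">= 50"]
observed = [[4, 6, 13], [10, 16, 15], [13, 8, 4], [6, 3, 2]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

# One tuple per cell: (partial chi-square, row label, column label, position of obs vs expected)
cells = []
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        side = "above" if obs > exp else "below"
        cells.append(((obs - exp) ** 2 / exp, rows[i], cols[j], side))

cells.sort(reverse=True)  # largest contribution first
for part, r, c, side in cells[:3]:
    print(f"{r:9s} {c:8s} partial chi2 = {part:.3f} (observed {side} expected)")
```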

Exercise 5.
Using the data series introduced in exercise 11, decide, by means of a Chi-square test, whether both
variables are independent or not.

X \ Y       [0 ; 15[               [15 ; 25[               [25 ; 40[              total
            obs    th      χ²      obs    th      χ²       obs    th      χ²      obs   th    χ²
1            23   60.06   22.87     92   84.63   0.642      80   50.31   17.52    195   195   41.03
2            77   59.75   4.979     84   84.20   0.0005     33   50.05   5.809    194   194   10.79
3            42   27.72   7.356     35   39.06   0.422      13   23.22   4.498     90    90   12.28
4            12   6.468   4.731      6   9.114   1.064       3   5.418   1.079     21    21   6.875
total       154  154      39.93    217  217      2.128     129  129      28.91    500   500   70.97
With 6 dof and α = 1%, the χ² table gives Chi²lim = 16.8.
Our Chi²calc (70.97) is much bigger. We can claim that both variables are dependent with more than 99% confidence.

Exercise 6. (Tutorial for lesson page 6)
Let’s have a close look at the evolution of a company’s turnover through time.
Year       N                   N+1                 N+2                 N+3
quarter    Q1   Q2   Q3   Q4   Q1   Q2   Q3   Q4   Q1   Q2   Q3   Q4   Q1   Q2   Q3   Q4
(M€)       28   45   49   36   30   44   48   40   28   46   52   37   31   42   54   39
Though there are big seasonal variations, due to its particular activity, is it possible to identify an overall
trend over several years?

Let’s decide to calculate and display the moving averages over 4 consecutive quarters:


(do it as a group job: divide the set of calculations with your neighbours and share your results)
quarters   1-4    2-5    3-6    4-7    5-8    6-9    7-10   8-11   9-12   10-13  11-14  12-15  13-16
X          2.5    3.5    4.5    5.5    6.5    7.5    8.5    9.5    10.5   11.5   12.5   13.5   14.5
Y          39.5   40     39.75  39.5   40.5   40     40.5   41.5   40.75  41.5   40.5   41     41.5
calculations:
The values of X (on the graph) correspond to the number of quarters since the beginning:
1st quarter of year N → x = 1 ; 2nd quarter of year N → x = 2 ; and so on. We deduce that the values of X to be
entered in the table are 2.5, 3.5, 4.5, and so on: 1st value = mean of 1,2,3,4 = 2.5 ; 2nd value = mean of
2,3,4,5 = 3.5 ; and so on until the 13th value, which is the mean of 13,14,15,16, that equals 14.5.
The values of Y calculated in the table above are the average turnovers of the four quarters considered.
1st value of Y = mean of 28,45,49,36 = 39.5 ; 2nd value of Y = mean of 45,49,36,30 = 40 ; and so on.
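
The sliding-window calculation described above can be sketched as follows. The helper name `moving_average` is hypothetical; the window of 4 quarters is the one chosen in this exercise.

```python
def moving_average(values, window):
    """Means of every run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Quarterly turnover (M€), years N to N+3
turnover = [28, 45, 49, 36, 30, 44, 48, 40, 28, 46, 52, 37, 31, 42, 54, 39]
quarters = list(range(1, len(turnover) + 1))

x_smooth = moving_average(quarters, 4)   # abscissas 2.5, 3.5, ...
y_smooth = moving_average(turnover, 4)   # smoothed turnovers
for x, y in zip(x_smooth, y_smooth):
    print(x, y)
```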

Exercise 7. (Tutorial for lesson page 7)


Let’s return to one of the examples introduced page 3 (lessons doc): effect of the amount of fertilizer on the
harvested production.
plot #    fertilizer X (kg.ha⁻¹)    harvest Y (q.ha⁻¹)
1              150                       46
2               80                       37
3              120                       46
4              220                       51
5              100                       43

1) For each half-cloud, determine the mean points coordinates.


Half-clouds have to be defined: since there are 5 pairs of results, let’s choose a cut with 3 points on the left
and 2 points on the right (the reverse would have been allowed too), always separating them by the X values:

1st half-cloud: (80, 37), (100, 43), (120, 46); mean point: G1(100, 42)
2nd half-cloud: (150, 46), (220, 51); mean point: G2(185, 48.5)
2) Determine the expression of the Mayer’s line (G1G2).
slope: a = (48.5 − 42) / (185 − 100) = 6.5 / 85 ≈ 0.07647
y = 0.07647 x + b can be written with the coordinates of G1 (for instance): 42 = 0.07647×100 + b,
which gives us b = 34.35.
Expression of the Mayer’s line: y’ = 0.07647 x + 34.35
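
As a cross-check of the manual calculation, Mayer's method can be sketched in Python. The function name `mayer_line` and its `left_size` parameter (number of points in the first half-cloud) are assumptions made for this illustration.

```python
def mayer_line(points, left_size):
    """Slope and intercept of the line through the mean points of two half-clouds."""
    pts = sorted(points)  # sort the pairs by increasing x
    left, right = pts[:left_size], pts[left_size:]

    def mean_point(group):
        xs, ys = zip(*group)
        return sum(xs) / len(xs), sum(ys) / len(ys)

    (x1, y1), (x2, y2) = mean_point(left), mean_point(right)
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1  # the line passes through G1
    return a, b


# (fertilizer, harvest) pairs; 3 points on the left, 2 on the right
data = [(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)]
a, b = mayer_line(data, left_size=3)
print(round(a, 5), round(b, 2))
```

Calling `mayer_line(data, left_size=2)` gives the other admissible cut.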
3) On a graph, plot the initial table and draw this line.

Exercise 8.
Determine the expression of the Mayer’s line, returning to the case given in exercise 6.
The 16 values are split into two groups of 8: years N and N+1 on one side, years N+2 and N+3 on the other.
xG1 = (1 + 2 + ... + 8) / 8 = 4.5            xG2 = (9 + 10 + ... + 16) / 8 = 12.5
yG1 = (28 + 45 + ... + 40) / 8 = 40          yG2 = (28 + 46 + ... + 39) / 8 = 41.125
slope: a = (41.125 − 40) / (12.5 − 4.5) = 1.125 / 8 = 0.140625
y’ = 0.140625 x + b can be written with the coordinates of G1 (for instance): 40 = 0.140625×4.5 + b,
which gives us b = 39.367.
Expression of the Mayer’s line: y’ = 0.140625 x + 39.367

Exercise 9. (Tutorial for lesson page 8)


Calculate or display on your calculator: the means and standard deviations; the covariance.
1) Taking the data of exercise 7 (fertilizer/harvest)
x̄ = 134 kg.ha⁻¹ and ȳ = 44.6 q.ha⁻¹ ; σ(X) = 48.826 kg.ha⁻¹ and σ(Y) = 4.5869 q.ha⁻¹ (Stat mode).
Cov(X, Y) = (Σ xᵢyᵢ)/n − x̄·ȳ = 30900/5 − 134 × 44.6 = 203.6
2) Taking the data of exercise 4 (age/# of cinema shows) – choose 60 as average age for the class 50 and more;
choose 36 as average number of shows for the class 24 and more.
x̄ = 39.375 years and ȳ = 10.795 shows ; σ(X) = 16.422 years and σ(Y) = 10.833 shows (Stat mode).
Cov(X, Y) = (Σ xᵢyᵢ)/n − x̄·ȳ = 36890/100 − 39.375 × 10.795 = −56.15
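
For readers without a calculator at hand, the quantities of part 1) can be reproduced with a short sketch. It assumes that the calculator's σ is the population standard deviation (division by n), which matches the values quoted above; the helper names are illustrative.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: mean of the products minus product of the means."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def std_dev(values):
    """Population standard deviation (divides by n, assumed to match Stat mode)."""
    return sqrt(covariance(values, values))


fertilizer = [150, 80, 120, 220, 100]   # X, kg per ha
harvest = [46, 37, 46, 51, 43]          # Y, q per ha

print(mean(fertilizer), mean(harvest))
print(round(std_dev(fertilizer), 3), round(std_dev(harvest), 4))
print(round(covariance(fertilizer, harvest), 1))
```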

Exercise 10. (Tutorial for lesson page 9)
Let’s consider the following time series: a company’s annual expenses in advertising.
X : year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Y : expense (k€) 41 60 55 66 87 61 90 95 82 120 125 118
The corresponding scatter plot is represented, with year 1 standing for 2007.
Determine the expression of the Y on X fitting line, following the least squares method; then, draw it.
(D) : y’ = 7.0629 x + 37.42
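
The coefficients returned by the calculator can be reproduced with the usual formulas a = Cov(X, Y)/V(X) and b = ȳ − a·x̄. The sketch below is one way to write them; `least_squares_line` is a hypothetical helper name.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the Y on X least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    a = cov_xy / var_x
    b = y_bar - a * x_bar  # the line passes through the mean point
    return a, b


ranks = list(range(1, 13))  # year rank: 1 for 2007, ..., 12 for 2018
expenses = [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118]  # k€
a, b = least_squares_line(ranks, expenses)
print(round(a, 4), round(b, 2))
```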

Exercise 11.
500 people, having passed their driving license exam, are sorted in the table below.
They are distributed with respect to the number X of times they took the exam before passing it and to the
number Y of hours of driving lessons before their first attempt.
                     Y
X        [0 ; 15[   [15 ; 25[   [25 ; 40[
1           23         92          80
2           77         84          33
3           42         35          13
4           12          6           3
1) Define a marginal frequency. Then, give an example from the table.
A marginal frequency is the total number of individuals associated with one value of one of the variables.
e.g.: 195 people (marginal frequency) passed their exam on their first attempt (value: X = 1).
2) Describe, shortly, the way to enter the data set in your calculator.
We usually enter the frequencies in List3 (12 values here); List1 and List2 are used for entering the
corresponding X and Y values.
3) Calculate the covariance of the pair (X, Y) and give a concrete comment about this value.
Cov(X, Y) = 16815/500 − 1.874 × 19.375 = −2.679, negative. Overall, the more hours of driving lessons one
takes, the fewer attempts one needs to pass the exam.
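
With a table of frequencies, each product x·y must be weighted by the number of people in the cell. The sketch below assumes that each class of Y is represented by its centre (7.5, 20 and 32.5 hours); this choice is not stated in the exercise, but it reproduces the figures used in the answer.

```python
x_values = [1, 2, 3, 4]            # number of attempts
y_centres = [7.5, 20.0, 32.5]      # assumed class centres for [0;15[, [15;25[, [25;40[
counts = [
    [23, 92, 80],
    [77, 84, 33],
    [42, 35, 13],
    [12, 6, 3],
]

n = sum(sum(row) for row in counts)
# Means weighted by the marginal frequencies
x_bar = sum(x * sum(row) for x, row in zip(x_values, counts)) / n
y_bar = sum(y * sum(col) for y, col in zip(y_centres, zip(*counts))) / n
# Sum of x*y over all individuals: each cell counts `f` times
sum_xy = sum(
    x * y * f
    for x, row in zip(x_values, counts)
    for y, f in zip(y_centres, row)
)
cov_xy = sum_xy / n - x_bar * y_bar
print(n, sum_xy, x_bar, y_bar, round(cov_xy, 3))
```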
4) Among those who took between 15 and 25 hours of driving lessons, what is the rate of those who passed
their exam on the third attempt? 35/217 = 16.13 %
5) Among those who passed their exam on the third attempt, what is the rate of those who took between 15
and 25 hours of driving lessons? 35/90 = 38.89 %

Exercise 12.
A sales agent wishes to analyse his (or her) activity and efficiency. For each appointment with a prospect,
the length of the product presentation (X, in minutes) and the quantity sold (Y) were noted. The twelve
values inside the table show the number of appointments that correspond to each pair (X, Y).
1) Give the meaning of the frequency "8" found inside the table.
During each of 8 appointments with prospects, the sales agent made a 10 to 20 min-long presentation and
then sold 2 units.
2) Calculate, manually, the average time spent per appointment.
Marginal frequencies of the three values of X: 7, 19 and 21. The corresponding lengths are 5, 15 and 25 (in
minutes). Total number of appointments: 47.
The average time is then (5×7 + 19×15 + 21×25)/47 = 17.98 minutes per appointment (about 18 minutes).
3) Give the covariance of the pair (X, Y).
Cov(X, Y) = 1595/47 − 17.9787 × 1.80851 = 1.422

Exercise 13.
The following table indicates the sales price (€) of a piece of equipment and the number of sold items, for 4 years.
year rank 1 2 3 4
sales price (€) X 300 210 270 375
# of sold items Y 198 240 222 160
1) Build the scatter plot with an orthogonal frame. The axes intersection must be the point (210, 160);
scales: 1 cm for €15 on the abscissas axis, 1 cm for 10 items on the ordinates axis.

2) Determine the coordinates of G, mean point of the cloud.


G(288.75 ; 205)
3) a. Determine the expression of the Y on X fitting line, following the least square method.
The coefficients will be expressed with 6 significant figures.
y’ = -0.498274 x + 348.876
b. Draw this regression line on the graph.
4) Which year saw the highest turnover? For which amount?
The turnover is X×Y. Its four values are: 59400, 50400, 59940 and 60000. The highest was in year #4, with €60,000.

going further:
5) Now, we assume that, each year, the number of sold items y and the sales price x are related this way:
y = – 0.498 x + 349. We denote S(x) the turnover achieved by selling y items, €x each.
a. Express S(x) with respect to x.
S(x) = xy = -0.498 x² + 349 x
b. Find the variations of the function S defined in [210 ; 375].
S’(x) = -0.996 x + 349 > 0 iff x < 350.4. S is increasing in [210 ; 350.4] and decreasing in [350.4 ; 375].
c. Deduce the sales price we would have to set for a fifth year if we want a maximum turnover. How many
items will be sold (round to one unit)? For what turnover?
We have to set the sales price at €350.4. number of sold items: y = – 0.498×350.4 + 349 = 174.5.
Considering 174 items, x = €350.4/unit and turnover = €60969.6;
considering 175 items, x = €350.4/unit and turnover = €61320.
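
The optimisation of question 5) can be checked numerically. The sketch below only restates the model y = –0.498 x + 349 given in the exercise; the function names are illustrative.

```python
a, b = -0.498, 349.0  # demand model y = a*x + b from question 5)


def items_sold(price):
    return a * price + b


def turnover(price):
    # S(x) = x * y = a*x**2 + b*x
    return price * items_sold(price)


# S is a downward parabola (a < 0): its maximum is where S'(x) = 2*a*x + b = 0
best_price = -b / (2 * a)
print(round(best_price, 1), round(items_sold(best_price), 1), round(turnover(best_price)))
```

Since the number of items must be an integer, the answer above then compares the turnovers obtained with 174 and 175 items.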

Exercise 14.
A survey compares people's expense in high-tech equipment with their income. Each column of
the table T below represents, in a given French region, the average monthly income of people (X) and the
average monthly expense (Y) in high-tech equipment.
region           A      B      C      D      E      F
income X (€)     1550   1620   1770   1850   1930   2000
expense Y (€)    57     61     66     73     76     82

1) Calculate the covariance and then the linear correlation coefficient of the pair (X, Y).
Give an interpretation of both parameters.
Cov(X, Y) = 749720/6 − 1786.66667 × 69.1666667 ≈ 1375.55556, positive, showing a global upward trend
of the expense, as the income increases.
r ≈ 1375.55556 / (160.2775 × 8.66827) ≈ 0.9901, very close to 1, hence an excellent linear correlation between X and Y.
2) a. Give, by means of your calculator, the expression of the Y on X regression line.
y’ = 0.05355 x – 26.50
b. Obtain the expression of the Mayer's line of the series, from the table T.
Let's part the table into two groups: {A, B, C} and {D, E, F} (indeed, the values of X have already been
sorted in an ascending order). The coordinates of both mean points are:
G1(1646.6667 ; 61.333333) and G2(1926.66667 ; 77)
The Mayer's fitting line, (G1G2), has an expression of the form y’ = ax + b, where
a = (yG2 − yG1) / (xG2 − xG1) ≈ 0.05595 and b = yG1 − a × xG1 ≈ −30.80 ; (DM) : y’ = 0.05595 x − 30.80

c. Both lines slightly differ. Find the income for which they both give the same expense. What makes this
common point special, inside the point cloud?
Let's act as if we didn't already know this common point.
We can find it by equating both expressions: 0.05595 x – 30.80 = 0.05355 x – 26.50.
That gives: 0.0024 x = 4.3 and then x = 1791.67. We can deduce the value of y: 69.44.
Both lines give an estimated average expense of € 69.44, for an average income of € 1791.67.
This common point is in fact the mean point G of the cloud: the actual average value of X in the table is
1786.67, and the average value of Y is 69.17 (the small gaps with 1791.67 and 69.44 are due to the
rounded coefficients used four lines above).
This property is general, as explained in the lessons of this chapter: a least squares fitting line, as well
as a Mayer's fitting line, meets Mayer's criterion, which is equivalent to "the line passes through G"!
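
The role played by rounding can be illustrated by redoing the intersection with unrounded coefficients. The following sketch recomputes both lines from table T and intersects them; the variable names are illustrative.

```python
incomes = [1550, 1620, 1770, 1850, 1930, 2000]   # X, already sorted
expenses = [57, 61, 66, 73, 76, 82]              # Y
n = len(incomes)
x_bar, y_bar = sum(incomes) / n, sum(expenses) / n

# Least squares line: a = Cov(X, Y) / V(X), b = y_bar - a * x_bar
cov_xy = sum(x * y for x, y in zip(incomes, expenses)) / n - x_bar * y_bar
var_x = sum(x * x for x in incomes) / n - x_bar ** 2
a_ls = cov_xy / var_x
b_ls = y_bar - a_ls * x_bar

# Mayer's line through the mean points of {A, B, C} and {D, E, F}
x1, y1 = sum(incomes[:3]) / 3, sum(expenses[:3]) / 3
x2, y2 = sum(incomes[3:]) / 3, sum(expenses[3:]) / 3
a_m = (y2 - y1) / (x2 - x1)
b_m = y1 - a_m * x1

# Common point: a_ls * x + b_ls = a_m * x + b_m
x_common = (b_ls - b_m) / (a_m - a_ls)
y_common = a_ls * x_common + b_ls
print(round(x_common, 2), round(y_common, 2), round(x_bar, 2), round(y_bar, 2))
```

With full precision, the common point coincides with the mean point G.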

Exercise 15. (Tutorial for lesson page 12)
Data about the fuel consumption of a motorcycle have been collected
(consumption: Y, in L/100km; speed: X, in km/h):
X 10 20 30 40 50 60 70 80 90
Y 15.2 11.6 9.3 7.8 7 6.6 6.9 8 9.6
The scatter plot, on the right, clearly shows us that a linear
regression would be inappropriate to describe the evolution of the
consumption with respect to the speed. Thus, we will propose a
variable change.
1) Let’s define the variable T by: T = (X – 60)².
Complete the following table:

T    2500   1600   900    400    100    0      100    400    900
Y    15.2   11.6   9.3    7.8    7      6.6    6.9    8      9.6
2) Perform a linear regression of Y on T.
Cov(T, Y) = 81280/9 – 766.66667×9.111111 = 2045.926 ; r = 2045.926/(780.3133 × 2.62782) = 0.997759
r is very close to 1, so a linear fitting between T and Y is appropriate.
Least squares regression line: y’ = 0.00336 t + 6.535
3) Thus, deduce the expression of the regression curve, for the initial scatter plot.
Regression curve of the pair (X, Y) : y’ = 0.00336 (x – 60)² + 6.535
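
The whole procedure (change of variable, linear regression on T, return to X) can be sketched as follows. The function names `transform` and `predict` are introduced for this illustration only.

```python
speeds = [10, 20, 30, 40, 50, 60, 70, 80, 90]               # X, km/h
consumption = [15.2, 11.6, 9.3, 7.8, 7, 6.6, 6.9, 8, 9.6]   # Y, L/100km


def transform(x):
    """Change of variable T = (X - 60)**2."""
    return (x - 60) ** 2


t_values = [transform(x) for x in speeds]
n = len(t_values)
t_bar, y_bar = sum(t_values) / n, sum(consumption) / n

# Least squares regression of Y on T
cov_ty = sum(t * y for t, y in zip(t_values, consumption)) / n - t_bar * y_bar
var_t = sum(t * t for t in t_values) / n - t_bar ** 2
a = cov_ty / var_t
b = y_bar - a * t_bar


def predict(x):
    """Regression curve for the initial pair (X, Y): y' = a*(x - 60)**2 + b."""
    return a * transform(x) + b


print(round(a, 5), round(b, 3))
```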

Exercise 16. quadratic fitting


A company took note of its profits Y with respect to X, produced and sold quantity:
X (tons) 2 3 5 7 11
Y (k€) 38 55 72 69 24
T -16 -9 -1 -1 -25
1) Thanks to your calculator, give the linear correlation coefficient between X and Y. Comment.
Cov(X, Y) = 1348/5 – 5.6×51.6 = -19.36 ; r = -19.36/(3.2 × 18.315) = -0.3303
r is far from both -1 and 1: the linear correlation between X and Y is very weak.
2) Let’s define the variable T = -(X - 6)².
a. Complete the table.
b. Calculate Cov(T, Y) and then the linear correlation coefficient between both variables.
Cov(T, Y) = -1844/5 - (-10.4)×51.6 = 167.84 ; r = 167.84/(9.2 × 18.315) = 0.9961
c. Is a linear fitting of Y on T appropriate?
r is very close to 1, so a linear fitting between T and Y is appropriate.
d. Determine the expression of the Y on T fitting line, following the least squares method.
y’ = 1.983 t + 72.22
e. Deduce an expression of the regression of Y on X.
y’ = -1.983(x - 6)² + 72.22

Exercise 17. quadratic fitting
A market study was conducted on a new type of product. The table below gives, for several proposed sales
price, the number of people willing to pay that price.

unit price (€)      X    2    3    4    5    6    7
number of people    Y   66   47   34   25   18   14

Working table (T = X(X - 20) ; Y’: number given by the model ; CA = XY and CA’ = XY’: turnovers):
X     Y     T      Y’       CA     CA’
2     66    -36    62.97    132    125.9
3     47    -51    48.88    141    146.6
4     34    -64    36.66    136    146.7
5     25    -75    26.33    125    131.7
6     18    -84    17.88    108    107.3
7     14    -91    11.30     98    79.13
1) Calculate the covariance of the variables X and Y, then comment on its sign.
Cov(X, Y) = 740/6 − 4.5 × 34 = −29.67, negative: Y tends to decrease as X increases.
2) We set T = X(X - 20)
a. Calculate the linear correlation coefficient between the variables T and Y.
Cov(T, Y) = −11610/6 − (−66.8333) × 34 = 337.33 ; r = 337.33 / (18.95096 × 17.93507) = 0.992487
b. Comment on its value.
This coefficient (0.992487) is very close to 1: the linear correlation between T and Y is excellent.
c. Determine the expression of the Y on T fitting line, following the least square method.
y’ = 0.9393 t + 96.78
d. Deduce an expanded expression of the regression of Y with respect to X.
y’ = 0.9393 (x² - 20x) + 96.78 = 0.9393 x² - 18.79 x + 96.78
3) Here we examine the expected turnover (unit selling price × number of sales), if the numbers of citations
obtained in the survey are considered to be the numbers of units sold.
a. Calculate the turnovers that can be extracted from the initial table.
See above: grey table (turnover = CA = XY)
b. Calculate, for the same values of X, the turnovers CA' that can be got thanks to the formula obtained in
question 2)d.
See above: grey table (turnover = CA’ = XY’)
c. What unit selling price should we fix, so that the best turnover would be reached?
According to the model, it seems that CA’ would be maximum when X is between €3 and €4.
Let’s be a little more accurate:
X 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4
CA' 146.6 147.4 148.1 148.5 148.8 148.8 148.7 148.4 148 147.4 146.6
We will recommend a selling price at about € 3.5 for an optimized turnover.
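
The refinement table above amounts to scanning prices in steps of €0.10. A minimal sketch of that scan, assuming the model of question 2)d with its rounded coefficients (function names are illustrative):

```python
def predicted_buyers(price):
    """Model of question 2)d: y' = 0.9393 (x^2 - 20x) + 96.78."""
    return 0.9393 * (price ** 2 - 20 * price) + 96.78


def predicted_turnover(price):
    """CA' = X * Y'."""
    return price * predicted_buyers(price)


# Prices from 3.0 to 4.0 in steps of 0.1 (built from integers to avoid drift)
prices = [p / 10 for p in range(30, 41)]
best_price = max(prices, key=predicted_turnover)
for p in prices:
    print(p, round(predicted_turnover(p), 1))
print("best price on this grid:", best_price)
```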

Exercise 18. inverse fitting


A perfumery, on analysing its turnover, connects the sales quantities (Y) to various perfume brands and
models prices (X). The results are gathered in the following table:
X, bottle’s price (€) 15 25 30 40 45 60 75 90
Y, # of sold bottles 202 117 107 82 78 60 55 48
Answer the questions beginning with "calculate" by using your calculator’s results.
calculator’s results:

1) a. Calculate the covariance of X and Y; comment on its sign.
Cov(X, Y) = 28000/8 − 47.5 × 93.625 = −947.2, negative. Y is globally a decreasing function of X.
b. Calculate the linear correlation coefficient of X and Y; comment on its value.
rXY = −947.2 / (24.109 × 46.843) = −0.8387, not very close to −1. The linear correlation between X and Y is not
excellent (the point cloud may be noisy or following a curve).
2) In order to have a more precise idea of how X and Y are related, we set the variable change: T = 850/X
a. After having calculated the list of values of T, in a third list (calculator), justify that the linear correlation
is excellent between T and Y.
The values of T have been shown above. The calculations, relative to the pair (T, Y), lead to r = 0.9971,
very close to 1. Their linear correlation is excellent.
b. Give the expression of the Y on T regression line, according to the least square method.
y’ = 3.215 t + 15.62
c. What is the least squares criterion?
The sum of the squared residuals must be minimum (which makes the fitting line unique).
d. Deduce from question 2)b a modelled expression of Y with respect to X.
y’ = at + b = 850a/x + b = 2733/x + 15.62
e. According to this model, how many bottles whose cost is €150 would the perfumery expect to sell?
If x = 150, the estimate of y is: 2733/150 + 15.62 ≈ 33.84 ≈ 34 : it can expect to sell 34 bottles.
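
The inverse fitting of this exercise can be reproduced without a calculator. The sketch below applies the change of variable T = 850/X, then the usual least squares formulas; the names `t_values` and `predict` are illustrative.

```python
from math import sqrt

prices = [15, 25, 30, 40, 45, 60, 75, 90]          # X, euros
bottles = [202, 117, 107, 82, 78, 60, 55, 48]      # Y, number of sold bottles

t_values = [850 / x for x in prices]               # change of variable T = 850 / X
n = len(prices)
t_bar, y_bar = sum(t_values) / n, sum(bottles) / n

cov_ty = sum(t * y for t, y in zip(t_values, bottles)) / n - t_bar * y_bar
var_t = sum(t * t for t in t_values) / n - t_bar ** 2
var_y = sum(y * y for y in bottles) / n - y_bar ** 2

r = cov_ty / sqrt(var_t * var_y)   # linear correlation coefficient of (T, Y)
a = cov_ty / var_t                 # least squares slope of Y on T
b = y_bar - a * t_bar


def predict(price):
    """Modelled number of bottles: y' = a * 850 / x + b."""
    return a * 850 / price + b


print(round(r, 4), round(a, 3), round(b, 2), round(predict(150)))
```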

Exercise 19. (Tutorial for lesson page 13)


Calculate the point estimates, in the given situations.
1) Returning to exercise 10, give an estimate of the expense in 2020.
y’ = 7.0629 x + 37.42 ; x0 = 14 ; hence y’0 = k€ 136.3
2) Returning to exercise 7, give an estimate of the quantity of fertilizer that would offer a harvest of 60 q/ha.
y’ = 0.07647 x + 34.35 ; y’0 = 60 q/ha ; hence x’0 = 335.4 kg/ha
3) Returning to exercise 15, give an estimate of the fuel consumption when the speed is 100 km/h.
y’ = 0.00336 (x – 60)² + 6.535 ; x0 = 100 ; hence y’0 = 11.91 L/100km

Exercise 20. (Tutorial for lesson page 13)


Let’s return to exercise 10. We want to estimate the expense, for the year 2020, by a 95% confidence interval.

1) a. Get the values of Y’, from the values of X and the expression of the fitting line;
b. Get the values of Z, by dividing Y by Y’;
c. Then, give the mean and standard deviation of Z.
z̄ = 1.000971 ; σZ = 0.125286
2) Give the point estimate of the expense in 2020.
see exercise 19-1: y’0 = k€ 136.3
3) Give the coefficient u corresponding to the confidence level.
u = 1.96
4) Then, give the confidence interval.
[136.3(1.000971 – 1.96×0.125286) ; 136.3(1.000971 + 1.96×0.125286)] =
[103.0 ; 169.9]

Exercise 21. (Tutorial for lesson page 13)


With the data of exercise 7, estimate by a 99% confidence interval the harvest obtained with 300 kg/ha of fertilizer.

1) a. Get the values of Y’, from the values of X and the expression of the fitting line;
b. Get the values of Z, by dividing Y by Y’;
c. Then, give the mean and standard deviation of Z.
z = 0.9991106 ; σ Z = 0.0472554
2) Give a point estimate of the harvest.
y’ = 0.07647 x + 34.35 ; x0 = 300 kg/ha ; hence y’0 = 57.29 q/ha
3) Give the coefficient u corresponding to the confidence level.
u = 2.58
4) Then, give the confidence interval.
[57.29(0.9991 – 2.58×0.047255) ; 57.29(0.9991 + 2.58×0.047255)] = [50.25 ; 64.22]
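
The four steps of this method (values of Y’, ratios Z = Y/Y’, mean and standard deviation of Z, interval) can be sketched as below. The function name `ratio_interval` is hypothetical, the line coefficients are the rounded ones of exercise 7, and σ is taken as the population standard deviation (division by n), which is an assumption consistent with the values quoted above.

```python
from math import sqrt


def ratio_interval(xs, ys, a, b, x0, u):
    """Interval y0 * (z_mean -/+ u * z_sigma), where Z = Y / Y' and y0 = a*x0 + b."""
    ratios = [y / (a * x + b) for x, y in zip(xs, ys)]
    z_mean = sum(ratios) / len(ratios)
    # Population standard deviation of Z (division by n)
    z_sigma = sqrt(sum((z - z_mean) ** 2 for z in ratios) / len(ratios))
    y0 = a * x0 + b
    return y0 * (z_mean - u * z_sigma), y0 * (z_mean + u * z_sigma), z_mean, z_sigma


fertilizer = [150, 80, 120, 220, 100]
harvest = [46, 37, 46, 51, 43]
low, high, z_mean, z_sigma = ratio_interval(
    fertilizer, harvest, a=0.07647, b=34.35, x0=300, u=2.58
)
print(round(z_mean, 6), round(z_sigma, 6), round(low, 2), round(high, 2))
```

Changing `u` to 1.96 gives the 95% interval used in the other exercises.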

Exercise 22. (Tutorial for lesson page 13)


For each person in a sample, a survey noted the age class (X) and the visual acuity (Y, 1/10 = 0.1):
                       X
Y       [5 ; 35[   [35 ; 45[   [45 ; 55[   [55 ; 65[
0.3        1           5          10          20
0.6        8          12          25          18
0.9       55          30          14           6
Estimate the visual acuity of an 80-year-old person, by a 99% confidence interval.

Variable Y, variable Z, results on Z:


z = 0.999266 ; σ Z = 0.298378

Point estimate:
y’ = -0.008422 x + 1.038 ; x0 = 80 ; hence y’0 = 0.3642

Coefficient u: u = 2.58

Confidence interval:
[0.3642(0.999266 – 2.58×0.298378) ;
0.3642(0.999266 + 2.58×0.298378)]
= [0.08358 ; 0.6444]

Exercise 23.
In a country, two variables are compared: the consumer force index and the turnover of its car industry:
consumer force (index) X 3.26 3.85 3.44 3.08 3.6
car industry turnover (G€) Y 9.3 9.56 9.36 9.24 9.47
1) Give the expression of the Y on X Mayer’s line.
There are two ways to split this data set once sorted by increasing X (3 points then 2, or 2 points then 3).
case 1: G1(3.26 ; 9.3) and G2(3.725 ; 9.515)    y’ = 0.4624 x + 7.793
case 2: G1(3.17 ; 9.27) and G2(3.63 ; 9.463)    y’ = 0.4203 x + 7.938
2) By means of a point estimate, give a value of the consumer force that would correspond to a G€ 10 car
industry turnover.
case 1: y’ = 10 iff x ≈ 4.77
case 2: y’ = 10 iff x ≈ 4.91
3) Is a strong correlation between two variables a sign of a cause and effect relationship between them?
Not necessarily. This numerical relationship may just be a coincidence.

Exercise 24. least square + confidence interval


Monthly revenues of a commercial website are listed below, from January to December 2018:
in k€ : 3 5 4 8 10 9 13 12 17 18 18 21
1) In a few words, describe the least square method.
This method consists in finding the line that minimizes the sum of the squared residuals (vertical gaps
between the points and the line).
2) Thanks to the global trend of the evolution of the monthly revenue, give the 95% confidence interval of the
predictable revenue in December 2019. (number the months from 1 for January 2018)
month, X     1      2      3      4      5      6      7      8      9      10     11     12
revenue, Y   3      5      4      8      10     9      13     12     17     18     18     21
Y’           2.5    4.136  5.773  7.409  9.045  10.68  12.32  13.95  15.59  17.23  18.86  20.5
Z            1.2    1.209  0.693  1.08   1.106  0.843  1.055  0.86   1.09   1.045  0.954  1.024
Expression of the Y on X regression line: y’ = 1.636 x + 0.8636
Point estimate of the revenue in December 2019 (x = 24): y’0 = k€ 40.14
Variable Z : z̄ = 1.0132222 and σZ = 0.14538387
Coefficient u for a 95 % confidence level: u = 1.96
Confidence interval: [29.23 ; 52.10]
3) Give the probability that, in December 2019, the revenue would be less than k€ 29.23.
There is a 95% chance that this revenue lies inside this interval. Moreover, the concept of confidence
interval relies on a symmetric probability distribution (the normal law); thus, there is a 2.5% chance that
the revenue falls below the interval, and a 2.5% chance that it falls above it. Answer: 2.5%.
4) Build the scatter plot (scale: 2 cm for one month), draw the regression line and finally represent the
confidence interval.

[Graph: revenue Y (k€) against month X]

Exercise 25. Mayer + confidence interval

The given table lists eight of the major cities of a country. The variable X gives, in thousands,
the number of city residents; the variable Y gives, in thousands, the number of students in this city.
city   X     Y
A      850   58
B      623   37
C      587   38
D      360   20
E      312   16
F      275   15
G      262   12
H      244   12
1) Build the scatter plot from this data series. see below
2) Give the coordinates of the mean point of the cloud. G(439.1 ; 26)
3) a. Using Mayer’s method, determine manually the expression of the Y on X regression line.
G1(273.3 ; 13.75) and G2(605 ; 38.25) slope: a = 0.07385
With G1: b = y – ax = -6.430 expression: y’ = 0.07385 x - 6.43
b. Draw this line. Does G belong to it? G always belongs to it
c. Give "Mayer’s principle". the sum of the residuals must be zero
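Mayer's method as applied above can be sketched as follows: sort the points by X, split them into two halves, and draw the line through the two partial mean points G1 and G2. The function names are hypothetical, and the handling of an odd number of points (the lower half gets the smaller share) is an assumption of this sketch, since the exercise only covers eight points.

```python
# (X, Y) in thousands: residents and students of the eight cities.
cities = {
    "A": (850, 58), "B": (623, 37), "C": (587, 38), "D": (360, 20),
    "E": (312, 16), "F": (275, 15), "G": (262, 12), "H": (244, 12),
}


def mean_point(points):
    """Mean point (mean of x, mean of y) of a list of (x, y) pairs."""
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n


def mayer_line(points):
    """Line through the mean points of the lower-X half and the upper-X half."""
    ordered = sorted(points)   # ascending X
    half = len(ordered) // 2   # assumption: lower half is the smaller one if n is odd
    g1 = mean_point(ordered[:half])
    g2 = mean_point(ordered[half:])
    slope = (g2[1] - g1[1]) / (g2[0] - g1[0])
    intercept = g1[1] - slope * g1[0]
    return g1, g2, slope, intercept


points = list(cities.values())
g1, g2, slope, intercept = mayer_line(points)

# Mayer's principle: the residuals y - y' add up to zero.
residual_sum = sum(y - (slope * x + intercept) for x, y in points)
print(g1, g2, round(slope, 5), round(intercept, 3), round(residual_sum, 9))
```

Because the line passes through G1 and G2, the residuals of each half cancel out, which is why the total is zero and why the overall mean point G lies on the line.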

[Figure: scatter plot of the number of students Y (thousands) against the number of residents X (thousands), with Mayer's line]

4) We will use here another fitting line, whose expression is: y' = 0.07 x – 6.
a. With this line, give the 95% confidence interval of the predicted number of students in a town of
two million inhabitants.

X             850    623    587    360    312    275    262    244
Y             58     37     38     20     16     15     12     12
Y’            53.5   37.61  35.09  19.2   15.84  13.25  12.34  11.08
Z = Y / Y’    1.084  0.984  1.083  1.042  1.01   1.132  0.972  1.083
Expression of the fitting line: y’ = 0.07 x - 6
Point estimate of the number of students, in thousands (x = 2000): y’0 = 134
Variable Z : z̄ = 1.04877 and σ Z = 0.052588
Coefficient u associated to a 95 % confidence level: u = 1.96
Confidence interval, in thousands of students: [126.7 ; 154.3]

b. What can we say about the chances that the number of students would exceed 155,000 in such a town?
The chance is a little less than 2.5 %, since 155 lies just above the upper bound of the interval (154.3).

Exercise 26. logarithmic fitting + confidence interval


The service life of some identical office equipment has been studied. In the following table, ti represents the
duration of use, expressed in thousands of hours, and R(ti) the proportion of equipment still in use at the time ti
(e.g. after 1,000 hours, ti = 1, 90 % of the equipment is still in use: R(ti) = 0.90).
ti 1 2 3 4 5 6 7 8 9
R(ti) 0.9 0.66 0.53 0.4 0.32 0.25 0.19 0.14 0.1

1) We set yi = ln[R(ti)] where ln is the natural logarithm. Fill the following table, then build the scatter plot,
using the points Mi (ti, yi), in an orthogonal coordinate system.
ti 1 2 3 4 5 6 7 8 9
yi -0.105 -0.416 -0.635 -0.916 -1.139 -1.386 -1.661 -1.966 -2.303

2) May a linear fitting be relevant for the previous scatter plot? Calculate the linear correlation coefficient
between T and Y.
These points are almost collinear; a linear fitting appears to be relevant.
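The linear correlation coefficient asked for in this question can be computed from the table as sketched below. The function name `pearson_r` is an assumption made for this example; the data are the ti and R(ti) values given in the exercise.

```python
from math import log, sqrt

t_values = list(range(1, 10))
rates = [0.90, 0.66, 0.53, 0.40, 0.32, 0.25, 0.19, 0.14, 0.10]
y_values = [log(r) for r in rates]  # yi = ln R(ti)


def pearson_r(xs, ys):
    """Linear correlation coefficient: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(t_values, y_values)
print(f"r = {r:.4f}")
```

A value of r close to -1 supports the observation that the points are almost collinear on a decreasing line.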
3) Using the least squares method, determine an expression of the Y on T regression line.
Deduce from this expression that there are two positive real numbers k and λ such that: R(t) = k e^(-λt).
y’ = -0.26604 t + 0.1605. y = ln R(t) implies R(t) = e^y = e^(-0.26604 t + 0.1605) = e^0.1605 × e^(-0.26604 t) = 1.174 e^(-0.26604 t).
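The passage from the regression line on y = ln R to the exponential model can be sketched as follows: fit the line by least squares, then take λ as the opposite of the slope and k as the exponential of the intercept. The variable names are chosen for this example only.

```python
from math import exp, log

t_values = list(range(1, 10))
rates = [0.90, 0.66, 0.53, 0.40, 0.32, 0.25, 0.19, 0.14, 0.10]
y_values = [log(r) for r in rates]  # yi = ln R(ti)

# Least squares regression of y on t.
n = len(t_values)
mean_t = sum(t_values) / n
mean_y = sum(y_values) / n
slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(t_values, y_values))
         / sum((t - mean_t) ** 2 for t in t_values))
intercept = mean_y - slope * mean_t

# y = slope * t + intercept  <=>  R(t) = exp(intercept) * exp(slope * t)
lam = -slope
k = exp(intercept)
print(f"y' = {slope:.5f} t + {intercept:.4f}, k = {k:.3f}, lambda = {lam:.5f}")
```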
4) In this question, we'll take k = 1.174 and λ = 0.266.
a. Determine the predicted proportion of equipment still in use after 10,000 hours.
After 10,000 hours, t = 10 ; hence R(t) = 1.174 e^(-2.66) = 0.0821 = 8.2 % rounded.
b. After how long is exactly 50 % of the equipment still in use?
R(t) = 0.5 implies 1.174 e^(-0.266 t) = 0.5 iff e^(-0.266 t) = 0.5/1.174 iff -0.266 t = ln(0.5/1.174)
iff t = ln(0.5/1.174) / (-0.266) = 3.209. Answer: after 3,209 hours.
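Both answers follow from the model R(t) = k e^(-λt) and its inverse. The sketch below uses the rounded values k = 1.174 and λ = 0.266 stated in the question; the function names are hypothetical.

```python
from math import exp, log

K = 1.174    # value of k given in question 4
LAM = 0.266  # value of λ given in question 4


def rate_in_use(t):
    """Proportion of equipment still in use after t thousand hours."""
    return K * exp(-LAM * t)


def time_to_reach(rate):
    """Duration (thousands of hours) after which the proportion in use equals `rate`."""
    return log(rate / K) / (-LAM)


r10 = rate_in_use(10)
t_half = time_to_reach(0.5)
print(f"R(10) = {r10:.4f}, t for 50 % = {t_half:.3f} thousand hours")
```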

5) Give a 99% confidence interval of the rate of equipment still in use after 10,000 hours of service.
T 1 2 3 4 5 6 7 8 9
Y -0.105 -0.416 -0.635 -0.916 -1.139 -1.386 -1.661 -1.966 -2.303
Y ’ -0.106 -0.372 -0.638 -0.904 -1.170 -1.436 -1.702 -1.968 -2.234
Z 0.998 1.118 0.996 1.014 0.974 0.966 0.976 0.999 1.031

Expression of the Y on T regression line: y’ = -0.26604 t + 0.1605


Point estimate of y = ln R (t = 10) : y’0 = -2.5
Variable Z : z̄ = 1.007964 and σ Z = 0.043476
Coefficient u associated to a 99 % confidence level: u = 2.58
Confidence interval on y : [-2.8003 ; -2.2395] and the interval on R is: [0.0608 ; 0.1065].
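The two intervals can be reproduced with the sketch below. It assumes, as the listed figures suggest, that Z is the ratio Y / Y’, that σ Z is a population standard deviation, and that the interval on y is y’0 × (z̄ ± u σ Z). Since y’0 is negative here, the two products are sorted before being exponentiated to obtain the interval on R.

```python
from math import exp, log
from statistics import fmean, pstdev

t_values = list(range(1, 10))
rates = [0.90, 0.66, 0.53, 0.40, 0.32, 0.25, 0.19, 0.14, 0.10]
slope, intercept = -0.26604, 0.1605  # regression line of y = ln R on t (from question 3)
u = 2.58                             # 99 % confidence level

fitted = [slope * t + intercept for t in t_values]
ratios = [log(r) / f for r, f in zip(rates, fitted)]  # Z = Y / Y'
z_mean, z_sd = fmean(ratios), pstdev(ratios)

y0 = slope * 10 + intercept
# y0 < 0, so multiplying by the larger factor gives the lower bound: sort the products.
y_low, y_high = sorted((y0 * (z_mean - u * z_sd), y0 * (z_mean + u * z_sd)))
# exp is increasing, so it maps the bounds on y to the bounds on R in the same order.
r_low, r_high = exp(y_low), exp(y_high)
print(f"y in [{y_low:.4f} ; {y_high:.4f}], R in [{r_low:.4f} ; {r_high:.4f}]")
```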

Exercise 27.
100 children have been classified by age (X, in years) and size (Y, in cm):
X \ Y        [95 ; 105[   [105 ; 125[   [125 ; 135[
[3 ; 5[          15           10             0
[5 ; 7[           8           32             5
[7 ; 9[           2           13            15

1) Enter this table in your calculator.


2) Give the means and standard deviations of X and Y, calculate their covariance.
x̄ = 6.1 years , V(X) = 3940/100 − 6.1² = 2.19 , σ(X) = 1.480 year ;
ȳ = 114.25 cm , V(Y) = 1315375/100 − 114.25² = 100.6875 , σ(Y) = 10.03 cm .
Cov(X , Y) = 70540/100 − 6.1 × 114.25 = 8.475 .
3) Calculate their linear correlation coefficient. Comment on this value.
r = 8.475 / (1.480 × 10.03) = 0.5709 , a very weak linear correlation (the cloud may be noisy and curved).
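The means, variances, covariance and correlation coefficient above can be recomputed from the contingency table as sketched below. Each class is represented by its centre (4, 6, 8 years and 100, 115, 130 cm), and every cell is weighted by its frequency. The variable names are chosen for this example.

```python
from math import sqrt

ages = [4, 6, 8]          # centres of the age classes (years)
sizes = [100, 115, 130]   # centres of the size classes (cm)
counts = [                # rows: age classes, columns: size classes
    [15, 10, 0],
    [8, 32, 5],
    [2, 13, 15],
]

# One (x, y, frequency) triple per cell of the table.
cells = [(x, y, counts[i][j]) for i, x in enumerate(ages) for j, y in enumerate(sizes)]
n = sum(c for _, _, c in cells)

mean_x = sum(c * x for x, _, c in cells) / n
mean_y = sum(c * y for _, y, c in cells) / n
var_x = sum(c * x * x for x, _, c in cells) / n - mean_x ** 2
var_y = sum(c * y * y for _, y, c in cells) / n - mean_y ** 2
cov_xy = sum(c * x * y for x, y, c in cells) / n - mean_x * mean_y
r = cov_xy / sqrt(var_x * var_y)

print(f"means: {mean_x:.2f}, {mean_y:.2f}; variances: {var_x:.2f}, {var_y:.4f}")
print(f"covariance: {cov_xy:.3f}; r = {r:.4f}")
```

The value of r obtained this way may differ from 0.5709 in the last decimal, because the solution divides by the rounded standard deviations 1.480 and 10.03.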
4) Nevertheless, does the table allow us to see some trend?
We see that from one age class to another, the size class with the greatest number of individuals is not
the same. But these largest frequencies do not represent, in their row, an overwhelming majority,
which reflects a high variability of sizes for children of the same age. Modelling the growth of a child by a
straight line, or even by a well-defined curve, is therefore difficult.
5) Assuming that the relationship between age and size is linear until the age of 12, give the 95% confidence
interval of the size of a 12-year-old child.
X 4 6 8 4 6 8 4 6 8
Y 100 100 100 115 115 115 130 130 130
n 15 8 2 10 32 13 0 5 15
Y’ 106.12 113.86 121.6 106.12 113.86 121.6 106.12 113.86 121.6
Z 0.94233 0.87827 0.82237 1.08368 1.01001 0.94572 1.22503 1.14175 1.06908
Expression of the Y on X regression line: y’ = 3.87 x + 90.64
Point estimate of the size of a 12 yo child (x = 12): y’0 = 137.08 cm
Variable Z : z̄ = 1.013138 and σ Z = 0.121881
Coefficient u corresponding to a 95 % confidence level: u = 1.96
Confidence interval on y : [106.1 ; 171.6].
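The sketch below reproduces these figures. It assumes that Z is the ratio Y / Y’ for each of the nine cells and, as the listed values of z̄ and σ Z indicate, that the nine ratios are given equal weight when the mean and the population standard deviation are computed. The helper `ratio_stats` and its optional `weights` parameter are hypothetical.

```python
from math import sqrt

ages = [4, 6, 8]
sizes = [100, 115, 130]
counts = [[15, 10, 0], [8, 32, 5], [2, 13, 15]]
slope, intercept = 3.87, 90.64  # regression line from the solution
u = 1.96                        # 95 % confidence level


def ratio_stats(ratios, weights=None):
    """Mean and population standard deviation of the ratios, optionally weighted."""
    if weights is None:
        weights = [1] * len(ratios)  # assumption: equal weight per cell
    total = sum(weights)
    mean = sum(w * z for w, z in zip(weights, ratios)) / total
    variance = sum(w * (z - mean) ** 2 for w, z in zip(weights, ratios)) / total
    return mean, sqrt(variance)


cells = [(x, y, counts[i][j]) for i, x in enumerate(ages) for j, y in enumerate(sizes)]
ratios = [y / (slope * x + intercept) for x, y, _ in cells]  # Z = Y / Y'

z_mean, z_sd = ratio_stats(ratios)
y0 = slope * 12 + intercept
low, high = y0 * (z_mean - u * z_sd), y0 * (z_mean + u * z_sd)
print(f"z mean = {z_mean:.6f}, sd = {z_sd:.6f}, interval: [{low:.1f} ; {high:.1f}]")
```

Calling `ratio_stats(ratios, weights=[c for _, _, c in cells])` weights each cell by its frequency n instead; that is an alternative supported by this sketch, not the computation the listed figures correspond to.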
