EM 521 APPLIED STATISTICS

15/03/2012

STUDY SET 1 SOLUTIONS

1. We have randomly selected 100 telephone numbers from METU Phone Book 1970 and
recorded the last digit of each. Suppose we are also given the last digits of 100 randomly
selected phone numbers from METU Phone Book 2000. Below are the box-plots for the two
data sets:

[Figure: side-by-side box-plots of the last digits (vertical axis: Data) for METU Phone Book 1970 and METU Phone Book 2000]

a. Analyzing the box-plots above, what comparison can you make on the mean, median, and
the variance of the two data sets?

Median x = 3
Median y = 4.5
The distribution of X is right-skewed, so Mean x > Median x.
The distribution of Y is approximately symmetric, so Mean y ≈ Median y.
Std. dev. x < Std. dev. y

b. From the three histograms given below, which two correspond to the data sets of METU
Phone Book 1970 and 2000 respectively?

[Figures 1, 2, and 3: histograms of the last digits; horizontal axis from 0 to 8, vertical axis Frequency]

METU Phone Book 2000 => Figure 1


METU Phone Book 1970 => Figure 3

c. Below are the descriptive statistics for the two data sets and the aggregate data obtained
by combining these two data sets. What can be the reason of the differences between the
descriptive statistics values given below?

Descriptive Statistics:
Variable                Mean    Variance
METU Phone Book 1970    3.640      7.647
METU Phone Book 2000    4.500     10.293
Aggregate Data          4.070      9.111

In 1970, last digits higher than 6 lie outside the interquartile range. A possible reason is that,
because the population of METU was smaller in 1970 than in 2000, telephone numbers ending in higher
digits were not used much. By 2000, as the population of METU grew, the last digits came to be used
roughly uniformly. Consequently, both the mean and the variance are higher in 2000 than in 1970.
The aggregate data Z combines one data set with a smaller sample mean and variance and another with
a larger sample mean and variance. Its mean therefore lies between the means of the two individual
data sets, and here its variance does as well.

2. Haris Corporation has a manufacturing process performed at a remote location. Test
devices (pilots) were set up at that location, and voltage readings on the process were
obtained. The table contains voltage readings for 30 production runs at that location.
REMOTE LOCATION
 9.98  10.26  10.05  10.29  10.03   8.05  10.55  10.26   9.97   9.87
10.12  10.05   9.80  10.15  10.00   9.87   9.55   9.95   9.70   8.72
 9.84  10.15  10.02   9.80   9.73  10.01   9.98   8.72   8.80   9.84
a. Draw a stem-and-leaf display with leaf unit 0.01.

Stem-and-Leaf Display: Remote Location


Stem-and-leaf of Remote Location N = 30
Leaf Unit = 0.010

1 80 5
1 81
1 82
1 83
1 84
1 85
1 86
3 87 22
4 88 0
4 89
4 90
4 91
4 92
4 93
4 94
5 95 5
5 96
7 97 03
13 98 004477
(4) 99 5788
13 100 012355
7 101 255
4 102 669
1 103
1 104
1 105 5

b. Find mean, median and mode.

Mean= 9.8037, Median= 9.9750, Mode(s)= 8.72, 9.8, 9.84, 9.87, 9.98, 10.05, 10.15, 10.26

c. What is the 40th percentile?

(40/100) × 30 = 12
The 40th percentile is the 12th observation in sorted order: 9.87
d. Draw box-and-whisker plot. Write down on the plot the numerical values of the lines. Are
there any outlier(s)?

[Figure: Boxplot of Remote Location; vertical axis: Remote Location, from 8.0 to 10.5]

First quartile: position 0.25 × 30 = 7.5, rounded up to the 8th observation: 9.8
Second quartile (median): 9.975
Third quartile: position 0.75 × 30 = 22.5, rounded up to the 23rd observation: 10.05
IQR: 10.05 − 9.8 = 0.25
1.5 × IQR = 0.375
Lower limit = 9.8 − 0.375 = 9.425
Upper limit = 10.05 + 0.375 = 10.425
Outliers: 10.55, 8.8, 8.72 (twice), 8.05
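
The summary statistics for question 2 can be cross-checked with a short script. This is a minimal sketch, not the tool used for the original solution: the `percentile` helper is a hypothetical name, and it assumes the convention the solution appears to follow, namely taking the observation at position p × n in the sorted data and rounding a fractional position up.

```python
import math
import statistics

# Voltage readings for the 30 production runs, copied from the table.
voltages = [
    9.98, 10.26, 10.05, 10.29, 10.03, 8.05, 10.55, 10.26, 9.97, 9.87,
    10.12, 10.05, 9.80, 10.15, 10.00, 9.87, 9.55, 9.95, 9.70, 8.72,
    9.84, 10.15, 10.02, 9.80, 9.73, 10.01, 9.98, 8.72, 8.80, 9.84,
]
ordered = sorted(voltages)
n = len(ordered)


def percentile(percent):
    """Hypothetical helper: the ceil(percent * n / 100)-th sorted value (1-based)."""
    position = math.ceil(percent * n / 100)
    return ordered[position - 1]


mean = statistics.mean(voltages)
median = statistics.median(voltages)
modes = sorted(statistics.multimode(voltages))  # every value tied for most frequent

p40 = percentile(40)
q1, q3 = percentile(25), percentile(75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in ordered if v < lower_limit or v > upper_limit]

print(f"mean = {mean:.4f}, median = {median:.4f}")
print(f"modes = {modes}")
print(f"40th percentile = {p40}, Q1 = {q1}, Q3 = {q3}")
print(f"limits = ({lower_limit:.3f}, {upper_limit:.3f}), outliers = {outliers}")
```

Other quartile conventions (for example, linear interpolation between neighbouring observations) can give slightly different values for Q1 and Q3, so the limits may shift a little under a different rule.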

3. The table contains 50 random samples of random digits, y = 0,1,2,3,....,9, where the
probabilities corresponding to the values of y are given by the formula p(y) = 1/10. Each
sample contains n = 6 measurements.

SAMPLE SAMPLE SAMPLE SAMPLE


8,1,8,0,6,6 7,6,7,0,4,3 4,4,5,2,6,6 0,8,4,7,6,9
7,2,1,7,2,9 1,0,5,9,9,6 2,9,3,7,1,3 5,6,9,4,4,2
7,4,5,7,7,1 2,4,4,7,5,6 5,1,9,6,9,2 4,2,3,7,6,3
8,3,6,1,8,1 4,6,6,5,5,6 8,5,1,2,3,4 1,2,0,6,3,3
0,9,8,6,2,9 1,5,0,6,6,5 2,4,5,3,4,8 1,1,9,0,3,2
0,6,8,8,3,5 3,3,0,4,9,6 1,5,6,7,8,2 7,8,9,2,7,0
7,9,5,7,7,9 9,3,0,7,4,1 3,3,8,6,0,1 1,1,5,0,5,1
7,7,6,4,4,7 5,3,6,4,2,0 3,1,4,4,9,0 7,7,8,7,7,6
1,6,5,6,4,2 7,1,5,0,5,8 9,7,7,9,8,1 4,9,3,7,3,9
9,8,6,8,6,0 4,4,6,2,6,2 6,9,2,9,8,7 5,5,1,1,4,0
3,1,6,0,0,9 3,1,8,8,2,1 6,6,8,9,6,0 4,2,5,7,7,9
0,6,8,5,2,8 8,9,0,6,1,7 3,3,4,6,7,0 8,3,0,6,9,7
8,2,4,9,4,6 1,3,7,3,4,3
a. Calculate the mean of the 300 digits. This will give an accurate estimate of μ (the mean of
the population) and should be very near to E(y), which is 4.5.

E(y) = (1/10) [0 + 1 + 2 + ... + 9] = 4.5

x̄ = 4.68, so the sample mean is close to E(y).

b. Calculate s² for the 300 digits. This should be close to the variance of y, σ² = 8.25.

s = 2.816, s² = 7.931 (σ² = 8.25 is given.)

Descriptive Statistics: C5

Variable    N  N*   Mean  SE Mean  StDev  Variance  Minimum     Q1  Median
C5        300   0  4.680    0.163  2.816     7.931    0.000  2.000   5.000

Variable     Q3  Maximum    IQR  Mode  N for Mode  Skewness  Kurtosis
C5        7.000    9.000  5.000     6          42     -0.12     -1.17

c. Calculate ȳ for each of the 50 samples. Construct a relative frequency distribution for the
sample means to see how close they lie to the mean of μ = 4.5. Calculate the mean and
standard deviation of the 50 means.

The mean of each sample is calculated and is available in the Excel file. The mean of the 50 means is
4.68 and the standard deviation of the 50 means is 1.173.

Variable   N  N*   Mean  SE Mean  StDev  Variance  Minimum     Q1  Median
C6        50   0  4.680    0.166  1.173     1.375    2.167  3.833   4.667

Variable     Q3  Maximum    IQR     Mode  N for Mode  Skewness  Kurtosis
C6        5.500    7.333  1.667  3.83333           4      0.09     -0.15

[Figure: Histogram of C6 (the 50 sample means); horizontal axis from 2 to 7, vertical axis Frequency]

To see the effect of sample size on the standard deviation of the sampling distribution of a
statistic, combine pairs of samples (moving down the columns of the table) to obtain 25
samples of n=12 measurements.
d. Calculate the mean for each sample.

The means and standard deviations of the paired samples are given in the Excel file.

e. Construct a relative frequency distribution for the 25 means. Compare this with the
distribution that is based on samples of n=6 digits.

[Figure: Histogram of C8 (the 25 paired-sample means); horizontal axis from 3.0 to 6.5, vertical axis Frequency]

f. Calculate the mean and standard deviation of the 25 means. Compare the standard
deviation of this sampling distribution with the standard deviation of the sampling
distribution in c. What relationship would you expect to exist between the two standard
deviations?

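The mean of the 25 sample means is 4.68 and their standard deviation is 0.785. The mean is the
same as in part c, as expected. The standard deviation is smaller than in the 50-sample case (1.173):
since the standard deviation of a sample mean is σ/√n, doubling n from 6 to 12 is expected to shrink
it by a factor of about 1/√2 ≈ 0.71.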

Variable   N  N*   Mean  SE Mean  StDev  Variance  Minimum     Q1  Median
C8        25   0  4.680    0.157  0.785     0.617    2.833  4.542   4.750

Variable     Q3  Maximum    IQR           Mode  N for Mode  Skewness  Kurtosis
C8        5.083    6.333  0.542  4.58333, 4.75           3     -0.56      0.92
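
The figures for question 3 can be recomputed directly from the table of samples. The sketch below is illustrative only. It assumes that the samples are read row by row from the printed table and that each n = 12 sample is formed from two consecutive samples in that reading order; the printed layout may differ from the table the pairing instruction refers to, but this pairing yields 25 samples. The variable names are hypothetical.

```python
import statistics

# The 50 samples of n = 6 digits, in the reading order of the table (row by row).
rows = [
    "818066 767043 445266 084769",
    "721729 105996 293713 569442",
    "745771 244756 519692 423763",
    "836181 466556 851234 120633",
    "098629 150665 245348 119032",
    "068835 330496 156782 789270",
    "795779 930741 338601 115051",
    "776447 536420 314490 778776",
    "165642 715058 977981 493739",
    "986860 446262 692987 551140",
    "316009 318821 668960 425779",
    "068528 890617 334670 830697",
    "824946 137343",
]
samples = [[int(ch) for ch in block] for row in rows for block in row.split()]
digits = [d for sample in samples for d in sample]

# Parts a and b: mean and sample variance (divisor n - 1) of the 300 digits.
mean_all = sum(digits) / len(digits)
var_all = statistics.variance(digits)

# Part c: means of the 50 samples of size 6.
means_6 = [sum(s) / len(s) for s in samples]
sd_6 = statistics.stdev(means_6)

# Parts d to f. Assumption: consecutive samples in reading order are paired.
pairs = [samples[i] + samples[i + 1] for i in range(0, len(samples), 2)]
means_12 = [sum(p) / len(p) for p in pairs]
sd_12 = statistics.stdev(means_12)

print(f"mean of 300 digits = {mean_all:.3f}, s^2 = {var_all:.3f}")
print(f"n = 6:  mean of means = {statistics.mean(means_6):.3f}, sd = {sd_6:.3f}")
print(f"n = 12: mean of means = {statistics.mean(means_12):.3f}, sd = {sd_12:.3f}")
print(f"ratio sd_12 / sd_6 = {sd_12 / sd_6:.3f} (theory suggests about 0.707)")
```

With only 25 and 50 sample means, the observed ratio of standard deviations is itself a noisy estimate, so it need not match 1/√2 closely.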

4. An industrial engineer is about to take on two projects, say A and B. He knows that project
A will end with success with probability 0.70. If he also knows that project B will end with
success with probability 0.60 and the probability of both projects ending in success is 0.50,

a) What is the probability that at least one of the projects will end in success?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.70 + 0.60 − 0.50 = 0.80

b) What is the probability that only project A will end in success?
P(A − B) = P(A ∪ B) − P(B) = 0.80 − 0.60 = 0.20

c) What is the probability that project B will not end in success?
P(B′) = 1 − P(B) = 1 − 0.60 = 0.40

d) What is the probability that exactly one project will end in success?
P[(A − B) ∪ (B − A)] = P(A ∪ B) − P(A ∩ B) = 0.80 − 0.50 = 0.30

e) What is the probability that none of the projects will end in success?
P[(A ∪ B)′] = 1 − P(A ∪ B) = 1 − 0.80 = 0.20
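
The set identities used in question 4 can be checked numerically. This is a small sketch with hypothetical variable names; it also verifies that the four disjoint outcomes (both, only A, only B, neither) add up to 1.

```python
# Given probabilities for projects A and B.
p_a, p_b, p_both = 0.70, 0.60, 0.50

p_at_least_one = p_a + p_b - p_both        # P(A or B)
p_only_a = p_at_least_one - p_b            # P(A and not B)
p_only_b = p_at_least_one - p_a            # P(B and not A)
p_not_b = 1 - p_b                          # P(not B)
p_exactly_one = p_at_least_one - p_both    # P(exactly one succeeds)
p_neither = 1 - p_at_least_one             # P(neither succeeds)

# The four disjoint outcomes must account for all of the probability.
total = p_both + p_only_a + p_only_b + p_neither

for name, value in [
    ("at least one", p_at_least_one),
    ("only A", p_only_a),
    ("not B", p_not_b),
    ("exactly one", p_exactly_one),
    ("neither", p_neither),
]:
    print(f"P({name}) = {value:.2f}")
```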
5. a) Find a formula for the probability distribution of the number of heads when a coin is
tossed four times.
P(X = i) = C(4,i) (1/2)^i (1/2)^(4−i) = C(4,i) (1/2)^4, for i = 0, 1, ..., 4
Then, the probability mass function can be obtained as:

\[
p(x) =
\begin{cases}
1/16, & x = 0 \\
4/16, & x = 1 \\
6/16, & x = 2 \\
4/16, & x = 3 \\
1/16, & x = 4 \\
0, & \text{otherwise}
\end{cases}
\]

b) Find the cumulative distribution of the random variable X in part (a).

\[
F(x) =
\begin{cases}
0, & x < 0 \\
1/16, & 0 \le x < 1 \\
5/16, & 1 \le x < 2 \\
11/16, & 2 \le x < 3 \\
15/16, & 3 \le x < 4 \\
1, & 4 \le x
\end{cases}
\]

c) Find P(2 ≤ X < 4) and P(X < 3) using F(x).

P(2 ≤ X < 4) = F(4⁻) − F(2⁻) = 15/16 − 5/16 = 10/16
P(X < 3) = F(3⁻) = 11/16
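
A short sketch with exact fractions reproduces the probability mass function, the cumulative distribution, and the two probabilities in part c of question 5. The helper names `cdf` and `cdf_left` are hypothetical; `cdf_left(x)` stands for the left limit F(x⁻) = P(X < x).

```python
from fractions import Fraction
from math import comb

n = 4  # number of tosses of a fair coin

# p(x) = C(4, x) / 2^4 for x = 0, ..., 4
pmf = {x: Fraction(comb(n, x), 2 ** n) for x in range(n + 1)}


def cdf(x):
    """F(x) = P(X <= x)."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))


def cdf_left(x):
    """F(x-) = P(X < x), the left limit of F at x."""
    return sum((p for k, p in pmf.items() if k < x), Fraction(0))


p_2_to_4 = cdf_left(4) - cdf_left(2)   # P(2 <= X < 4)
p_below_3 = cdf_left(3)                # P(X < 3)

# Fractions print in lowest terms, e.g. 4/16 is shown as 1/4.
for x in range(n + 1):
    print(f"p({x}) = {pmf[x]}, F({x}) = {cdf(x)}")
print(f"P(2 <= X < 4) = {p_2_to_4}, P(X < 3) = {p_below_3}")
```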

6. Recent research on concrete structures shows that the Poisson distribution can be used to
represent the occurrence of structural loads over time. Suppose that on the average, the time
between occurrences of loads is 0.5 year.
λ = 1/0.5 = 2 structural loads per year

a) How many loads can be expected to occur in a 2-year period?

Let X: number of structural loads occurring during a 2-year period
X ~ Poisson(2 × 2)
E(X) = λt = 2 × 2 = 4

b) What is the probability that more than 5 loads occur during a 2-year period?
\[
P(X > 5) = 1 - \left[ P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4) + P(X=5) \right]
= 1 - \sum_{k=0}^{5} \frac{e^{-4} 4^{k}}{k!} = 0.215
\]

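c) How long must a time period be so that the probability of no loads occurring in that period
is at most 0.1?
Let Y: number of structural loads occurring in a t-year period
Y ~ Poisson(2t)
We want P(Y = 0) ≤ 0.1.
\[
P(Y = 0) = \frac{e^{-2t} (2t)^{0}}{0!} \le 0.1
\;\Longrightarrow\; e^{-2t} \le 0.1
\;\Longrightarrow\; -2t \le \ln 0.1
\;\Longrightarrow\; t \ge \frac{-\ln 0.1}{2} = 1.151
\]
So t ≥ 1.151 years (t ≥ 2 if the period must be a whole number of years).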

d) If it is known that at least 1 structural load will occur in the coming year, what is the
probability that at least 3 structural loads will occur?
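Z: number of occurrences of structural loads in a year
Z ~ Poisson(2)
\[
P(Z \ge 3 \mid Z \ge 1) = \frac{P(Z \ge 3)}{P(Z \ge 1)}
= \frac{1 - P(Z=0) - P(Z=1) - P(Z=2)}{1 - P(Z=0)}
= \frac{1 - \left( \dfrac{e^{-2} 2^{0}}{0!} + \dfrac{e^{-2} 2^{1}}{1!} + \dfrac{e^{-2} 2^{2}}{2!} \right)}{1 - \dfrac{e^{-2} 2^{0}}{0!}}
= \frac{0.323}{0.865} = 0.374
\]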
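
The Poisson calculations in question 6 can be reproduced with the standard library alone. This is a minimal sketch; `poisson_pmf` is a hypothetical helper that evaluates e^(−μ) μ^k / k! directly.

```python
from math import exp, factorial, log


def poisson_pmf(k, mu):
    """P(N = k) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu ** k / factorial(k)


rate = 2.0  # structural loads per year (1 / 0.5)

# a) expected number of loads in a 2-year period
expected_2yr = rate * 2

# b) P(X > 5) for X ~ Poisson(4)
p_more_than_5 = 1 - sum(poisson_pmf(k, expected_2yr) for k in range(6))

# c) smallest t with P(no loads in t years) = exp(-rate * t) <= 0.1
t_min = -log(0.1) / rate

# d) P(Z >= 3 | Z >= 1) for Z ~ Poisson(2)
p_at_least_3 = 1 - sum(poisson_pmf(k, rate) for k in range(3))
p_at_least_1 = 1 - poisson_pmf(0, rate)
p_conditional = p_at_least_3 / p_at_least_1

print(f"E(X) = {expected_2yr:.0f}")
print(f"P(X > 5) = {p_more_than_5:.3f}")
print(f"t >= {t_min:.3f} years")
print(f"P(Z >= 3 | Z >= 1) = {p_conditional:.3f}")
```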

7. The time X (in minutes) for a lab assistant to prepare the equipment for a certain lab
experiment is assumed to have a uniform distribution with A = 25 and B = 35.

a) Write the probability density function of X and sketch its graph.


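\[
f(x) =
\begin{cases}
\dfrac{1}{35 - 25} = \dfrac{1}{10}, & 25 \le x \le 35 \\
0, & \text{otherwise}
\end{cases}
\]
[Figure: graph of f(x), constant at 0.10 for x between 25 and 35; vertical axis from 0.00 to 0.12]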
b) What is the probability that the preparation time exceeds 33 minutes?
\[
P(X > 33) = \int_{33}^{35} \frac{1}{10}\,dx = \frac{2}{10} = \frac{1}{5}
\]

c) Find the mean, variance and standard deviation of the distribution.


\[
\mu = \frac{a + b}{2} = \frac{25 + 35}{2} = 30
\]
\[
\sigma^2 = \frac{(b - a)^2}{12} = \frac{(35 - 25)^2}{12} = 8.333
\]
\[
\sigma = \sqrt{8.333} = 2.887
\]
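
The uniform-distribution results in question 7 follow from closed-form expressions, which the sketch below simply evaluates. The variable names are hypothetical.

```python
from math import sqrt

a, b = 25.0, 35.0  # endpoints of the uniform distribution, in minutes

density = 1 / (b - a)              # f(x) on [a, b]
p_exceeds_33 = (b - 33) * density  # area of the rectangle from 33 to 35

mean = (a + b) / 2
variance = (b - a) ** 2 / 12
std_dev = sqrt(variance)

print(f"f(x) = {density:.2f} on [{a:.0f}, {b:.0f}]")
print(f"P(X > 33) = {p_exceeds_33:.2f}")
print(f"mean = {mean:.0f}, variance = {variance:.3f}, sd = {std_dev:.3f}")
```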
8. A box of candy contains 24 bars. The time between demands for these candy bars is
exponentially distributed with a mean of 10 minutes. What is the probability that a box of
candy bars opened at 8:00 AM will be empty by noon?

X: time required to empty a box of 24 bars, in hours
X ~ Exponential(λ), with λ = 60/(24 × 10) = 1/4 per hour
\[
P(X > 4) = \int_{4}^{\infty} \frac{1}{4} e^{-x/4}\,dx = e^{-\frac{1}{4} \cdot 4} = e^{-1} = 0.368
\]

9. The Rockwell hardness of a metal is determined by impressing a hardened point into the
surface of the metal and then measuring the depth of penetration of the point. Suppose the
Rockwell hardness of a particular alloy is normally distributed with mean 70 and standard
deviation 3. (Rockwell hardness is measured on a continuous scale.)

a) If a specimen is acceptable only if its hardness is between 67 and 75, what is the
probability that a randomly chosen specimen has an acceptable hardness?
X: Rockwell hardness of the specimen
X ~ Normal(70, 9)
\[
P(67 \le X \le 75) = P\left( \frac{67 - 70}{3} \le Z \le \frac{75 - 70}{3} \right)
= F_Z\left(\tfrac{5}{3}\right) - F_Z(-1) = 0.95254 - 0.15866 = 0.79388
\]
(F_Z(5/3) is read from the table at z = 1.67.)

b) If the acceptable range is as in part (a) and the hardness of each of 10 randomly selected
specimens is independently determined, what is the expected number of acceptable
specimens among the 10?
Y: number of acceptable specimens out of 10
Y ~ Binomial(10, 0.7939)
E(Y) = n × p = 10 × 0.7939 = 7.939

c) What is the probability that at most 8 of 10 independently selected specimens have a
hardness of less than 73.84?
\[
P(X < 73.84) = P\left( Z < \frac{73.84 - 70}{3} \right) = F_Z(1.28) = 0.89973
\]
W: number of specimens out of 10 with a hardness of less than 73.84
W ~ Binomial(10, 0.89973)
\[
P(W \le 8) = 1 - P(W \ge 9)
= 1 - \left[ \binom{10}{9} (0.89973)^{9} (0.10027) + (0.89973)^{10} \right] = 0.265
\]
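
The normal and binomial steps in question 9 can be checked without a table, since Python's `statistics.NormalDist` evaluates the normal CDF directly. This is a sketch under stated assumptions: `binom_pmf` is a hypothetical helper, and `p_table` rounds the upper z-value to two decimals to mimic a lookup in a standard normal table, while `p_exact` uses the unrounded value.

```python
from math import comb
from statistics import NormalDist

hardness = NormalDist(mu=70, sigma=3)
std_normal = NormalDist()  # mean 0, standard deviation 1


def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# a) P(67 <= X <= 75), exactly and with z rounded as in a table lookup
p_exact = hardness.cdf(75) - hardness.cdf(67)
z_upper = round((75 - 70) / 3, 2)  # 1.67
p_table = std_normal.cdf(z_upper) - std_normal.cdf(-1.0)

# b) expected number of acceptable specimens among 10
expected_acceptable = 10 * p_table

# c) P(at most 8 of 10 specimens have hardness below 73.84)
p_below = hardness.cdf(73.84)
p_at_most_8 = 1 - binom_pmf(9, 10, p_below) - binom_pmf(10, 10, p_below)

print(f"F_Z(-1) = {std_normal.cdf(-1.0):.5f}, F_Z({z_upper}) = {std_normal.cdf(z_upper):.5f}")
print(f"P(67 <= X <= 75): table-style {p_table:.5f}, exact {p_exact:.5f}")
print(f"E(Y) = {expected_acceptable:.3f}")
print(f"P(X < 73.84) = {p_below:.5f}, P(at most 8) = {p_at_most_8:.3f}")
```

The small gap between the table-style and exact values comes only from rounding 5/3 to 1.67.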

10. Suppose that Bob can decide to go to work by one of three modes of transportation, car,
bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50%
chance he will be late. If he goes by bus, which has special reserved lanes but is
sometimes overcrowded, the probability of being late is only 20%. The commuter train is
almost never late, with a probability of only 1%, but is more expensive than the bus.
a) Suppose that Bob is late one day, and his boss wishes to estimate the probability that he
drove to work that day by car. Since he does not know which mode of transportation Bob
usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is
the boss's estimate of the probability that Bob drove to work?

Pr{ bus } = Pr{ car } = Pr{ train } = 1/3


Pr{ late | car } = 0.5
Pr{ late | train } = 0.01
Pr{ late | bus } = 0.2

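By Bayes' Theorem,
\[
\Pr\{\text{car} \mid \text{late}\}
= \frac{\Pr\{\text{late} \mid \text{car}\}\,\Pr\{\text{car}\}}
{\Pr\{\text{late} \mid \text{car}\}\,\Pr\{\text{car}\} + \Pr\{\text{late} \mid \text{bus}\}\,\Pr\{\text{bus}\} + \Pr\{\text{late} \mid \text{train}\}\,\Pr\{\text{train}\}}
= \frac{0.5 \times 1/3}{0.5 \times 1/3 + 0.2 \times 1/3 + 0.01 \times 1/3} = 0.7042
\]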

b) Suppose that a coworker of Bob's knows that he almost always takes the
commuter train to work, never takes the bus, but sometimes, 10% of the
time, takes the car. What is the coworker's probability that Bob drove to
work that day, given that he was late?

Use Pr{ bus} = 0, Pr{car} = 0.1, and Pr{ train } =0.9.

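By Bayes' Theorem,
\[
\Pr\{\text{car} \mid \text{late}\}
= \frac{0.5 \times 0.1}{0.5 \times 0.1 + 0.2 \times 0 + 0.01 \times 0.9}
= \frac{0.05}{0.059} = 0.8475
\]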
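
Both parts of question 10 apply the same Bayes update with different priors, which a small function makes explicit. This is an illustrative sketch; `posterior` and the dictionary names are hypothetical.

```python
def posterior(priors, likelihoods, hypothesis):
    """Pr{hypothesis | evidence} by Bayes' theorem over a finite set of hypotheses."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return priors[hypothesis] * likelihoods[hypothesis] / evidence


# Pr{late | mode of transportation}
p_late = {"car": 0.5, "bus": 0.2, "train": 0.01}

# a) the boss assigns equal prior probability to each mode
boss_prior = {"car": 1 / 3, "bus": 1 / 3, "train": 1 / 3}
# b) the coworker knows Bob's habits
coworker_prior = {"car": 0.1, "bus": 0.0, "train": 0.9}

boss_estimate = posterior(boss_prior, p_late, "car")
coworker_estimate = posterior(coworker_prior, p_late, "car")

print(f"boss:     Pr(car | late) = {boss_estimate:.4f}")
print(f"coworker: Pr(car | late) = {coworker_estimate:.4f}")
```

Although the coworker's prior for the car is lower (0.1 versus 1/3), the posterior is higher because the only remaining alternative, the train, is almost never late.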