Outline of Discussion: First Five Sessions

Tathagata Bandyopadhyay
August 22, 2016
1. Statistical Inquiry: A statistical inquiry always starts with the formulation of a
research question. A few examples are: What percentage of families in Ahmedabad
can afford to send their children to private schools? What is the average cell phone bill per
month for a boy/girl aged between 10 and 16 years? What is the average switching
time (in months) from one mobile phone to another for an adult aged between 30
and 40 years? How many hours a day, on average, does a college student spend
on online social networking? On average, how many social media accounts does a
college in Ahmedabad have? The interface of an app to book a movie ticket is designed by
two designers. Which interface is more user friendly? A bank is to decide whether
to offer a personal loan of Rs. 25,000 or Rs. 50,000, at a 15% rate of interest, to everyone
whose income is between Rs. 20,000 and Rs. 30,000 per month, on production of a UIDAI
card. The condition is to pay back the full amount in two years. Which format
will have more takers and which will have fewer defaulters (or more bad loans)? To
answer a research question one needs to design a study for the collection of appropriate
data, either by conducting a survey, from a secondary source, or by conducting an
experiment. This is the design stage of an inquiry. In the design stage you need to
decide on the kind of data to be collected, the method of collection, and also the design
of the questionnaire.
The next stage is the analysis of data. It comprises summarizing the data using tables,
graphs and numbers, modeling the data, and then drawing an inference or conclusion
from the data with a measure of uncertainty attached to it. An example is a statement
like: I am 95% confident that, on average, a college student in Ahmedabad spends
between 5 and 6 hours a day browsing social media sites.
Are all the research questions stated above unambiguous? If not, what alterations are needed for the ambiguous ones?
2. Nature of Data: Data may be collected as numbers, for example the average time a
student spends on a mobile phone per day, the number of junk mails received in a Gmail account,
or the average number of cigarettes smoked per day. These are called variables. The
other kind of data are called categorical. Categorical data are collected by classifying an individual or object into one of a set of mutually exclusive and exhaustive
categories. Examples are: male, female; smoker, non-smoker; illiterate, studied
up to grade twelve, graduate, postgraduate and above; different categories of professions; the taste of a food product rated in different categories; the experience of staying
in a hotel classified into different categories, etc.
Usually data are collected on individuals or objects, which are called units or subjects.
3. Some Technical Terms (Population, Sampling unit, Sampling frame, Sample, Random Sample, Parameter and Statistic): Usually a large population size
prohibits collecting data from each and every unit/subject, primarily because of time and cost constraints. Suppose the research question is: what
is the average number of hours per day a college student in Ahmedabad spends
browsing social media sites in 2016?
Let us define the above technical terms in this specific problem context:
population (all college students in Ahmedabad; in other words, the collection of all
students covered by our inquiry), sampling unit (each student could be a sampling
unit, when a sample of students is to be selected from the population. However, a
more efficient process may be to select a sample of colleges and then collect data from
each student of the selected colleges. In the latter case the sampling unit is a college.
On the other hand, if, after selecting a sample of colleges from all colleges in Ahmedabad, a sample
of students is selected from each selected college, then the colleges are called first stage
sampling units and the students are called second stage sampling units),
sampling frame (the list of all college students in Ahmedabad or the list of all colleges in
Ahmedabad; in other words, the list of all sampling units in the population).
A sample is a collection of sampling units drawn from a frame or frames. A sample
could be drawn according to convenience, or it could be drawn in such a way that every
possible sample has an equal chance of being selected. The former is called a convenience
sample and the latter is called a random sample. A random sample is often
preferred because it avoids any bias in selection and usually results in a representative
sample. (A convenience sample in our example could be a sample of students from
the nearby colleges. For drawing a random sample, on the other hand, a list of all
college students (a sampling frame) needs to be created, and then a sample of students
is selected from the list using some random mechanism.)
A parameter is a population characteristic of interest. In our example it is the
average based on data from all college students in Ahmedabad. It is usually a fixed
but unknown number. Other population characteristics that may be of interest
are the standard deviation of the times spent by the students in the population,
the percentage of students in the population spending more than 4 hours, etc.
A statistic is a sample analogue of a parameter, like the sample average or sample
standard deviation. A statistic is calculated on the basis of sample observations. In
the context of our example, the average and standard deviation of the times spent
browsing social media sites by the selected students, the percentage of selected students
spending more than four hours on social media sites, etc. are examples of statistics.
Notice that for estimating an unknown parameter the sample value of the corresponding statistic is usually used. For example, in estimating the population average, the
value of the sample average observed for the selected sample is used.
Group assignments:
1. Consider the problem of estimating (i) the average number of hours per day a
student in Section A spends preparing for PGP classes and (ii) the proportion of
female students in Section A.
Prepare the sampling frame. Draw two random samples of (i) size five and (ii) size
ten. Find the sample mean and sample proportion for each of the samples that you
have drawn. You will have two means and two proportions for the two samples.
2. Consider the problem of estimating (i) the average number of e-mails received by
the PGP I students, (ii) the proportion of PGP I students using an iPhone, (iii) the
proportion of PGP I students using a Samsung Galaxy phone, and (iv) the proportion of
PGP I students using more than one cell phone.
Prepare the sampling frame. Draw two random samples of (i) size twenty and (ii)
size forty. Find the sample mean (for (i)) and the sample proportion (for (ii)-(iv))
for each of the two samples.
You may collect your data through e-mail if you wish. But you need to explain
every step during your presentation.
4. Collection of data: Two methods are usually followed in the collection of data for
research: conducting experiments, or conducting surveys (often using data already collected by some agency, known as syndicated data). The former are often
called data collected through an experimental study and the latter data collected
through an observational study.
Data collected through experimental studies are more reliable than
data collected through observational studies. In an experimental study one
can control the factors that may confound the effect that we want
to study. (We will discuss this in class.) But an experimental study is often expensive
and may not be feasible at all. However, only by using data collected through carefully
designed experimental studies can one talk about some sort of causality.
Using data collected through observational studies one cannot prove causality
unless it is supported by a sound theory. Through observational studies one can talk
only about association.
Consider the following example. Suppose we observe that the incidence of lung
cancer among one million smokers is 20% while among one million nonsmokers it is only 4%. Am I in a position to conclude that smoking causes
lung cancer? Think about it. We will take it up during the classroom
discussion.
5. Observational Studies: We now consider the collection of data through surveys. If
the data are collected from each and every sampling unit of the population, the
method is known as complete enumeration or the census method of data collection.
Naturally, if the data are collected through the census method, theoretically one can find
the true value of the parameter. However, there is an implicit assumption in the
above statement which most of the time does not hold and is very far
from what happens in reality. The assumption is that the collected data are free
from errors, which is never (!!!) true.
On the other hand, in estimation through sample surveys we only have access
to the selected sample observations, which are a part of the whole population. The other
part of the population remains unobserved. Thus the value of the sample statistic,
which we use as an estimate of the unknown value of the parameter, could be near to or
far from the true value of the parameter depending on whether the sample is a good
or a bad representation of the population. So one of the important issues in
a sample survey is to select a sample which represents the population well
so that an estimate with good accuracy is obtained.
Better representation and hence more accuracy could be achieved by (i) increasing
the sample size or (ii) adopting a better method of selecting the sample,
i.e., using a better sampling design. The representation and hence the accuracy
also depend on (iii) how homogeneous the population is. But the
homogeneity of the population is beyond our control.
Questions: Do the above statements make sense intuitively? In fact, in the case of
probability sampling we can prove using rigorous theory that these statements are
indeed true. We will spend some time discussing it. But for the time being, let us
discuss whether intuition supports the above statements.
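One informal way to probe the intuition about sample size and homogeneity is a small simulation. The sketch below is only an illustration under stated assumptions: both populations (5000 made-up values of "hours per day") are hypothetical and generated by the code, not taken from any survey, and the function name `spread_of_sample_means` is introduced here for convenience.

```python
import random
from statistics import mean, pstdev

rng = random.Random(1)  # fixed seed only so that the run is reproducible

def spread_of_sample_means(population, n, repeats=2000):
    """SD of the sample mean over repeated simple random samples (without replacement)."""
    means = [mean(rng.sample(population, n)) for _ in range(repeats)]
    return pstdev(means)

# Hypothetical populations of 5000 values each (assumed, not real data)
heterogeneous = [rng.uniform(0, 10) for _ in range(5000)]  # widely varying values
homogeneous = [rng.uniform(4, 6) for _ in range(5000)]     # values close together

small = spread_of_sample_means(heterogeneous, 10)
large = spread_of_sample_means(heterogeneous, 100)
tight = spread_of_sample_means(homogeneous, 10)
print(small, large, tight)
```

In such a run the sample mean should vary less when the sample size goes from 10 to 100, and less again when the population itself is more homogeneous, which is what the statements above suggest.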
In any survey the estimates are subject to two kinds of errors, viz., sampling error
and non-sampling error. The sampling error arises because of sampling, in
other words, because of observing a part of the population instead of the whole population.
The more representative the sample is of the population, the smaller the sampling error.
On the other hand, non-sampling error arises from all factors other than sampling.
It could be due to errors of coverage. This error arises when the list (like the
telephone directory or a list of e-mail addresses) that is used for drawing the sample does not
match up perfectly with the sampling frame of the target population.
It could be due to non-response. This is considered to be one of the most serious and
most frequently occurring errors in any survey. In a personal interview it arises in one of
three ways: the inability to contact the sampled unit (a person, household,
organization, etc.; in actual surveys substituting a next-door neighbour is common
but is not a good idea), the inability of the respondent to come up with an answer
to the question of interest (for example, asking someone who may not have any clue about
the impact of a policy decision), or refusal to answer (could be because of fear
or an intention not to divulge). A good survey should attempt to obtain some
information about the group of non-respondents in order to understand
how different from or similar to the group of respondents they are as a group.
Besides these, there are errors of observation, which may be due to respondents' reporting errors: the respondent may simply not remember correctly, or the respondent may not
understand the question properly, as when the head of a household is asked the number
of literates in the household (the meaning of literate may not be clear to the respondent).
Besides the above, the errors could be due to the inability of the interviewers to elicit
honest responses, could be due to bad design of the instrument, viz. the questionnaire (it has
been observed that the ordering and wording of questions and the nature of the question (whether
the question is open ended or closed ended) lead to a lot of variation in responses), or could
be due to coding errors, etc. So any kind of error other than sampling error is known
as non-sampling error.
It is believed that for a moderately large sample survey the non-sampling
error constitutes around 70-80% of the total error. Finally, the non-sampling error increases with an increase in sample size.
In the light of the above discussion, why do you think a sample survey could
be a better choice? We will discuss this issue in class.
6. Drawing a random sample or Random sampling from a finite population:
Two kinds of random sampling are used for a finite population. These are simple
random sampling with replacement (SRSWR) and simple random sampling
without replacement (SRSWOR). For all practical purposes SRSWOR is preferred to SRSWR, but in some situations, like random number generation, we need
to use SRSWR. We will soon discuss it.
Let us discuss how to draw a random sample of size 10 from a class of 80 students.
Step 1. Assign serial numbers 01 to 80 to the students, like, roll numbers.
Step 2. Consider a random number generation mechanism that selects one of the
digits 0 to 9 with equal probability, i.e., 1/10. (What could be such a mechanism?
Think about it.)
Step 3. Select two digits using this mechanism to form a two-digit number. If it gives, say, 10,
select the student who is assigned the serial number 10. If it is either 00 or a number
between 81 and 99, reject the number and select another two-digit number, until a number
between 01 and 80 is obtained.
Question: Prove that by this method the chance of selecting a student in
the first drawing is 1/80
Step 4. Repeat step 3 until you have drawn nine more two-digit numbers between 01 and 80.
The students with the corresponding serial numbers are selected. Notice that in this way you
may select a student more than once. So if you are to draw a simple random sample
with replacement, a student could be selected more than once. However, to draw a
simple random sample without replacement you need to repeat step 3 until you have selected
9 more distinct two-digit numbers.
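Steps 1 to 4 can be sketched in code. This is a minimal illustration, not a prescribed implementation: Python's `random` module stands in for the digit mechanism of Step 2 (an assumption), and the function names `draw_serial`, `srswr` and `srswor` are introduced here only for the sketch.

```python
import random

def draw_serial(rng, low=1, high=80):
    """Step 3: form a two-digit number from two random digits; reject 00 and 81-99."""
    while True:
        tens = rng.randrange(10)   # each digit 0-9 has probability 1/10
        units = rng.randrange(10)
        number = 10 * tens + units
        if low <= number <= high:
            return number

def srswr(rng, size=10):
    """Simple random sample with replacement: a serial number may repeat."""
    return [draw_serial(rng) for _ in range(size)]

def srswor(rng, size=10):
    """Simple random sample without replacement: draw until `size` distinct numbers."""
    chosen = []
    while len(chosen) < size:
        number = draw_serial(rng)
        if number not in chosen:
            chosen.append(number)
    return chosen

rng = random.Random(2016)  # fixed seed only so that the run is reproducible
sample_wr = srswr(rng)
sample_wor = srswor(rng)
print(sample_wr)
print(sample_wor)
```

The only difference between the two functions is whether a repeated serial number is kept or discarded, which mirrors the distinction drawn in Step 4.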
Question: Prove that by the above described method the chance of selecting a particular ordered sample of 10 students in the case of SRSWR is $1/80^{10}$ and in the case of
SRSWOR it is $1/(80 \cdot 79 \cdot 78 \cdots 71)$. Also prove that in the fourth drawing the
chance of selecting any one of the 80 students is 1/80 for SRSWOR. Is it
true for any of the ten drawings?
Without access to such a mechanism one can use a random number
table or an Excel function like RANDBETWEEN(min, max) to generate random
numbers between min and max.
A random number table is a sequence of digits generated using a mechanism as
discussed above so that in the long run the table contains all the digits 0, 1, ..., 9 in
approximately equal proportions, with no trends in the pattern in which the digits
are generated. Thus if a digit is drawn at random from the random number table the
chance of getting any digit is 1/10.
7. Random sampling from an infinite population:
Strictly speaking there is no such thing as an infinite population, but often the population
is either very large or hypothetical (i.e., non-existent), so that a sampling frame is not
available. In such cases, the sampling units should be selected so as to fulfil the following
two conditions.
1. Each sampling unit should be a member of the target population.
2. The sampling units should be selected independently of each other.
For example, suppose the problem is to draw a random sample of customers of Flipkart from
its customer base. Flipkart's customer base is not only very large but dynamic too.
For all practical purposes the population could be considered approximately an
infinite population. If Flipkart selects a sample of customers by picking a purchaser
every 5 seconds during a grand sale of 36 hours, the customers in the sample could
be dependent, because there is a possibility that such customers exhibit similar
buying behaviour. One should always be careful to avoid dependence if domain
knowledge suggests it may be present.
Question 1. Devise a method to select a random sample of customers of
Flipkart.
Question 2. Does it intuitively make sense to assume that sampling from
an infinite population is equivalent to sampling from a finite population
with replacement? Explain.

8. Sampling distribution of sample mean and sample proportion


First, we need to understand that for SRSWR and SRSWOR the sample mean and
the sample proportion are random variables and hence they have probability
distributions.
Question 1. Are you convinced? Please explain.
Question 2. Suppose a circus owner had 5 crocodiles to ship from Chennai to
Mumbai. The shipping company agreed to ship them but would charge Rs. 20,000 per 100
kg. Naturally, they needed to know the total weight of all five crocodiles. Weighing a
crocodile is difficult and expensive too. Let us name the crocodiles
Jumbo (J), Kambo (K), Lambo (L), Mambo (M) and Shambo (S). They hired a
statistician to estimate the total weight by weighing two crocodiles only. The
statistician proposed the following procedure.
Step 1. Select two crocodiles at random without replacement.
Step 2. Weigh them, find the mean weight and multiply it by 5.
Following the statistician's procedure, the total weight came out as 1750 kg. The
manager of the shipping company was not happy with the estimate. After observing
the size of the crocodiles, and from his experience of shipping crocodiles, the manager
felt that the estimate was very low. The statistician, however, claimed that his
estimate was unbiased and that, if the distribution of weight could be assumed to be normal,
it was actually the best among all unbiased estimates.
There was a man who had helped the company in the past with weighing crocodiles. He
could estimate the weight of a crocodile by measuring its length and knowing its age.
His error in estimation was always within 10 kg. The manager called him. His
estimates of the weights (in kg) were: 1000 (J), 600 (K), 500 (L), 400 (M) and 300 (S).
Assuming these weights to be correct find the mean weight of the crocodiles
and hence the total weight. Find the standard deviation of weights. Write
down all possible samples of size two. For each sample find the sample
mean and hence the probability distribution of the sample mean. Find
the mean and variance of the distribution of sample mean. Match up the
values of mean and standard deviation of the probability distribution of
sample mean with the values that you get directly by using the following
formulas.
$E(\bar{X}) = \mu$ and $\mathrm{Std}(\bar{X}) = (\sigma/\sqrt{n})\sqrt{(N-n)/(N-1)}$,
where $\bar{X}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard
deviation, n is the sample size and N is the population size.
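The matching exercise can be checked by brute-force enumeration. The sketch below assumes the five weights quoted above are correct and that the population standard deviation uses divisor N (not N - 1); the variable names are introduced here only for illustration.

```python
from itertools import combinations
from math import sqrt

weights = {"J": 1000, "K": 600, "L": 500, "M": 400, "S": 300}  # weights in kg, from the text
N, n = len(weights), 2

values = list(weights.values())
mu = sum(values) / N
sigma = sqrt(sum((w - mu) ** 2 for w in values) / N)  # population SD, divisor N (assumed)

# All equally likely samples of size 2 under SRSWOR, with their sample means
sample_means = [(weights[a] + weights[b]) / n for a, b in combinations(weights, 2)]
mean_of_means = sum(sample_means) / len(sample_means)
sd_of_means = sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means))

# Value given by the formula with the finite population correction
formula_sd = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))
print(mu, N * mu)
print(mean_of_means, sd_of_means, formula_sd)
```

Under these assumptions the mean of the enumerated sample means should coincide with the population mean, and the two standard deviations printed on the last line should agree.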
Question: As an implication of the above formulas one could very nicely interpret
the impact of sample size, of population heterogeneity and the role of
sampling fraction f = n/N on the accuracy of the sample mean as an estimator
of the population mean. Please explain. Also, for SRSWR what alterations
would be required to the formula?
Do you notice that the accuracy of the sample mean does not depend on the
population size if the sampling fraction is very small? Usually if it is less
than 0.05 the finite population correction (fpc) $\sqrt{(N-n)/(N-1)}$ is taken
as 1. Do you think the above fact is counterintuitive? Explain.
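The remark about the finite population correction can be explored numerically. In this small sketch the sample size of 100 and the list of population sizes are arbitrary choices made for illustration, and `fpc` is a helper name introduced here.

```python
from math import sqrt

def fpc(N, n):
    """Finite population correction sqrt((N - n) / (N - 1))."""
    return sqrt((N - n) / (N - 1))

n = 100  # fixed sample size (hypothetical choice)
for N in (200, 2_000, 20_000, 2_000_000):
    # population size, sampling fraction f = n/N, and the correction factor
    print(N, n / N, round(fpc(N, n), 4))
```

With the sample size held fixed, the factor moves toward 1 as the population grows, so for a small sampling fraction the standard deviation of the sample mean is driven almost entirely by n and the population standard deviation.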
Question 3.
A statistician who belonged to a group of rebels was taken prisoner by the
army of King Juna and produced before the king. The king offered him a
game of chance. The game is as follows: Six identical bags of coins labeled B1 to B6
would be placed before him. Each bag contains either only gold coins or only silver
coins. The statistician would be allowed to pick two bags at random and
to observe their contents. Based on this information he has to predict
the number of bags containing only gold coins. Naturally, the predicted value would
be six times the proportion of bags containing gold in the sample. If the predicted
number is correct he will be freed; if he errs by 1 bag, he will be imprisoned for
5 years; and if he errs by more than 1 bag, he will be executed.
Suppose the king ordered two bags of gold coins and four bags of silver
coins to be placed.
Write down all possible choices of two bags (i.e. sample of size 2) that
the statistician could make. For each sample find the proportion of bags
having gold coin only. Hence find the probability distribution of sample
proportion of bags containing only gold coin. Find the estimate of the
number of bags out of the six containing only gold coin. Find also the
probabilities of the statistician getting free and getting executed. Find
the mean and standard deviation of the sampling distribution of sample
proportion. Match up your result with the results that could be directly
obtained from the following formulas. What could be the best strategy
for the king to maximize the chance of the statistician's execution?
$E(\hat{p}) = p$ and $\mathrm{Std}(\hat{p}) = \sqrt{p(1-p)/n}\,\sqrt{(N-n)/(N-1)}$,
where $\hat{p}$ is the sample proportion, p is the population proportion, n is the sample size
and N is the population size.
Ans: Best strategy: two gold and four silver or four gold and two silver.
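The king's options can be compared by enumerating every pair of bags for each possible number of gold bags. This is a sketch under the rules as stated in the question (freed on an exact prediction, imprisoned when off by 1, executed when off by more than 1); the function name `outcome_probabilities` is introduced here for illustration.

```python
from itertools import combinations
from fractions import Fraction

def outcome_probabilities(gold_bags, total_bags=6, sample_size=2):
    """Return (P(freed), P(imprisoned), P(executed)) when `gold_bags` bags hold gold."""
    bags = [1] * gold_bags + [0] * (total_bags - gold_bags)  # 1 = gold, 0 = silver
    samples = list(combinations(range(total_bags), sample_size))
    weight = Fraction(1, len(samples))  # every pair of bags is equally likely
    freed = imprisoned = executed = Fraction(0)
    for sample in samples:
        proportion = Fraction(sum(bags[i] for i in sample), sample_size)
        error = abs(total_bags * proportion - gold_bags)  # prediction minus truth
        if error == 0:
            freed += weight
        elif error == 1:
            imprisoned += weight
        else:
            executed += weight
    return freed, imprisoned, executed

for g in range(7):
    print(g, outcome_probabilities(g))
```

Scanning the printed rows for the largest execution probability gives a way to cross-check the stated best strategy.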
Question: As an implication of the above formulas one could very nicely
interpret the impact of sample size, of population heterogeneity and the
role of sampling fraction f = n/N on the accuracy of the sample proportion as
an estimator of the population proportion. Please explain. Also, for SRSWR
what alterations would be required to the formula?
Question 4. As a promotion strategy for its brand, a cell phone company decides
to offer a discount of either Rs. 5000 or Rs. 3000 or Rs. 2000 to the first 10000
customers e-ordering a particular model on its website. The price of the phone is Rs.
10000. As soon as a customer places an order the discount amount will be flashed
and will be deducted from the price. To decide on the discount to be offered to a
customer the company decides to use the following random mechanism. When an order
is placed, a digit from 0 to 9 will be selected at random. If the chosen
random digit is either 0 or 1, the offered discount will be Rs. 5000; if it is between 2
and 4, it will be Rs. 3000; and otherwise it will be Rs. 2000.
Suppose a customer (among the first 10000) places an order for a phone.
Find the probability distribution of the price of the phone for the customer
and also find its mean and standard deviation. Ans: Mean = 7100, SD =
1135.78 (approx).
Suppose a customer (among the first 9000) decides to place an order for
two such phones. Find the probability distribution of the total (average)
price of the two phones for the customer. Also find its mean and standard
deviation. Ans: For Average: Mean = 7100, SD = 803.12 (approx).
If a customer (a local shop owner, among the first 5000 customers) places
an order for 40 cell phones, find the mean and standard deviation
of the total (average) price of the phones. Find an approximation to
its probability distribution and then find the probability that the average
price is (i) less than or equal to Rs. 6000, (ii) more than Rs. 7000 and (iii) between
Rs. 6000 and Rs. 8000.
Ans: (i) 0 (approx) (ii) 0.71 (approx) (iii) 1 (approx) (Hint: Use the central limit
theorem to find the approximation to the distribution)
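The calculations in Question 4 can be sketched as follows. The sketch assumes that each phone in an order receives its own independently drawn random digit (the text does not say this explicitly) and uses the normal approximation suggested in the hint; `sd_of_average` is a helper name introduced here.

```python
from math import sqrt
from statistics import NormalDist

# Price = 10000 - discount: digits 0-1 give Rs. 5000 off, 2-4 give Rs. 3000 off, 5-9 give Rs. 2000 off
price_dist = {5000: 0.2, 7000: 0.3, 8000: 0.5}

mean = sum(price * p for price, p in price_dist.items())
sd = sqrt(sum((price - mean) ** 2 * p for price, p in price_dist.items()))

def sd_of_average(k):
    """SD of the average price of k phones, assuming independent discounts."""
    return sd / sqrt(k)

# Central limit theorem: the average of 40 prices is approximately normal
approx = NormalDist(mu=mean, sigma=sd_of_average(40))
p_at_most_6000 = approx.cdf(6000)
p_above_7000 = 1 - approx.cdf(7000)
p_between = approx.cdf(8000) - approx.cdf(6000)
print(mean, sd, sd_of_average(2))
print(p_at_most_6000, p_above_7000, p_between)
```

Under this approximation the chance that the average exceeds Rs. 7000 comes out near 0.71, and its complement, the chance of an average below Rs. 7000, near 0.29.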
Question 5. (Application in Statistical Quality Control) A manufacturing process is supposed to produce capsules containing 400 mg of a chemical, say, C. However,
variation in a manufacturing process is inherent, so the contents of different capsules
would vary. Suppose the regulatory authority makes it mandatory that the content
of every capsule should be between 398 mg and 402 mg. To ensure this, the mean and
standard deviation of the contents produced by the manufacturing process are set at
400 mg and 0.5 mg. The production supervisor knows from his experience that the
standard deviation of the process rarely changes. However, he feels that continuous monitoring of the process is necessary for checking the stability of the mean of
the process. A consultant suggested that he implement the following procedure.
Every hour during a shift a sample of 100 capsules is to be selected, and if the
average content of the sample falls below 399.90 or above 400.10, stop the process and
hunt for the trouble.
(i) What is the probability of a false alarm if this procedure is followed?
(ii) Assuming that the mean has actually shifted to 400.1, what is the
probability that the shift will be detected using such a sample? (iii) What
is the probability that the change in mean will remain undetected after
two such samples are inspected since the beginning of morning shift? (iv)
What is the probability that it remains undetected in the first two and
gets detected at the inspection of the third sample? (v) Suppose the
process produces 10000 capsules per hour. What is the expected number
of capsules produced that will violate the norm of the regulatory authority
till the change in mean is detected in case (iii) above?
Ans: (i) 0.05 (ii) 0.5 (iii) 0.25 (iv) 0.125 (v) 0 (approx)
(If instead of 100 the sample size is 25, what assumption would be necessary for the calculation of the above probabilities? Make the assumption
and solve it.)
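Parts (i) to (iv) can be sketched with a normal approximation for the sample mean. The sketch assumes that the hourly samples are independent of one another and that the standard deviation stays at 0.5 mg; `alarm_probability` is a helper name introduced here.

```python
from math import sqrt
from statistics import NormalDist

target_mean, process_sd, n = 400.0, 0.5, 100
lower, upper = 399.90, 400.10
se = process_sd / sqrt(n)  # standard error of the sample mean

def alarm_probability(true_mean):
    """P(sample mean falls outside [lower, upper]) under a normal approximation."""
    dist = NormalDist(mu=true_mean, sigma=se)
    return dist.cdf(lower) + (1 - dist.cdf(upper))

false_alarm = alarm_probability(target_mean)      # (i) process mean still at 400
detect = alarm_probability(400.1)                 # (ii) mean shifted to 400.1
undetected_two = (1 - detect) ** 2                # (iii) missed by two independent samples
detected_at_third = (1 - detect) ** 2 * detect    # (iv) missed twice, caught the third time
print(false_alarm, detect, undetected_two, detected_at_third)
```

The control limits sit two standard errors either side of 400 mg, which is why the false alarm probability is close to 0.05.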
Question 6. (Application in Statistical Quality Control) For assessing the
quality of lots sent by vendors, the quality control departments usually devise sampling inspection plans for taking a decision on whether to accept or reject a lot.
Suppose the lot size is 100 (N), then the sampling inspection plan specifies a sample
size, say, 10 (n) that needs to be selected from the lot without replacement and if
the sample contains more than, say, 1 (c) defective items (again the number specified
by the sampling plan) the decision would be to reject the lot, otherwise accept it.
Sampling is often the only option if the testing is destructive in nature.
For designing sampling inspection plans the interests of both the consumer and the
vendor are to be protected. Since the decision to accept or reject a lot is taken on
the basis of a sample, there is a chance that even if the lot quality is good (bad)
the lot may get rejected (accepted). The vendor, to protect himself from rejection of
good lots, imposes a condition like: if a lot has 5% (p1) defective items, the chance
of rejecting such a lot should not exceed 10% (VRisk). Let us call it the vendor's
risk. On the other hand, to reduce the chance of accepting a bad lot the consumer
imposes a condition like: the chance of accepting a lot with 10% (p2) defective items should
not exceed 10% (CRisk). Let us call this the consumer's risk.
Let us now consider a problem just to illustrate the above. Suppose N = 20, n =
5, c = 0, p1 = 5%, VRisk = 10%, p2 = 10%, CRisk = 10%. (i) Does this sampling plan
fulfill the vendor's risk? (ii) Does this sampling plan fulfill the consumer's risk? (iii) If
the actual number of defectives in the lot is 4, what is the chance of accepting such
a lot? Ans: (i) No (Probability = 0.25) (ii) No (Probability = 0.55) (iii) 0.28
Do (i) and (ii) with N = 1000, n = 20, c = 2, p1 = 5%, VRisk = 10%, p2 =
10%, CRisk = 10%. Do (iii) with the actual number of defectives equal to 40.
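Because the sample is drawn without replacement, the number of defectives in the sample follows a hypergeometric distribution, and the acceptance probabilities can be computed directly. The sketch below is one way to do that; `acceptance_probability` is a helper name introduced here.

```python
from math import comb

def acceptance_probability(N, n, c, defectives):
    """P(at most c defectives in an SRSWOR sample of n from a lot of N items)."""
    total = comb(N, n)
    favourable = sum(
        comb(defectives, k) * comb(N - defectives, n - k)
        for k in range(0, min(c, defectives, n) + 1)
    )
    return favourable / total

# Plan with N = 20, n = 5, c = 0
vendor_risk = 1 - acceptance_probability(20, 5, 0, defectives=1)   # lot with 5% defective
consumer_risk = acceptance_probability(20, 5, 0, defectives=2)     # lot with 10% defective
accept_with_four = acceptance_probability(20, 5, 0, defectives=4)
print(vendor_risk, consumer_risk, accept_with_four)
```

For the lot with 4 defectives the acceptance probability comes out near 0.28, so the chance of rejecting such a lot is near 0.72. The second plan can be explored the same way, for example with `acceptance_probability(1000, 20, 2, defectives=50)` for a lot with 5% defective items.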
Additional Problems for Practice
1. In a game of chance, each player is given an urn containing one five-rupee coin,
two two-rupee coins and three one-rupee coins; note that the urn contains six coins
altogether. The player draws three coins at random and without replacement. Let
X, Y and Z denote the values of these three coins in rupees.
Define M = median ( X, Y , Z), L = min (X, Y, Z), U = max (X, Y, Z), S = (L +
U)/2 Find (a) P(M = 1) , (b) P(M = 2), (c) P(S = 1), (d) P(S = 2), (e) P(S = 3),
(f) P(S < M) and (g) E(S). Ans: (a) 0.5, (b) 0.5, (c) 0.05, (d) 0, (e) 0.45, (f) 0.15,
(g) 2.25
2. An alchemist visited the court of a medieval warlord and said, "Your excellency,
here is my tribute to you. I have six envelopes. One of these contains a single copper
coin, another contains two copper coins, while a third one contains three copper
coins. The remaining three envelopes are empty. Kindly pick any three of these
six envelopes at random and without replacement. I shall convert all the coins in the
selected envelopes to gold coins dating from the period of King Solomon; you can
imagine their value as antiques!" "But what happens if I end up picking only the three
empty envelopes?" thundered the warlord. "I shall behead you then." "Take it easy, your
excellency," calmly replied the alchemist. "I am also a sorcerer; in that extreme case,
I shall make seven gold coins for you, again dating from King Solomon's era, simply
from the air." Assume that all the claims of the alchemist were true and that he kept
all his promises (the latter point is natural given the threat to his head!). Let X
be the number of gold coins that the warlord eventually ended up with. Obtain (a)
P(X = 3), (b) P(X = 4), (c) P(X = 5), (d) P(X = 6), (e) P(X = 7), (f) E(X) and (g)
Var(X). Ans: (a) 0.3, (b) 0.15, (c) 0.15, (d) 0.05, (e) 0.05, (f) 3.35, (g) 2.6275
3. A textbook on business statistics contains five chapters. A student, who is not
very serious, takes a simple random sample (without replacement) of three chapters.
He studies these three chapters with some seriousness and completely ignores the
remaining two chapters.
In the final examination, the question paper on this subject consists of five questions,
one from each chapter. The questions from Chapters 1 and 2 are compulsory and
carry 18 and 12 marks respectively. The questions from the other three chapters
carry 20 marks each and each student is supposed to answer any one of these three
questions (even if a student answers more than one of these three questions, he/she
gets credit for only one of them). Thus the maximum possible score for any student
is 50.
Obviously, the student under consideration gets zero in any question from a chapter
that he had ignored (so he does his best to avoid such a question, if possible). Furthermore, as he is not very serious with his studies, he gets only 50% of the marks
in any question from a chapter that he had included for study. Let T be his score in
the examination.
Obtain the probability distribution of T and hence the expectation and variance of
T. Ans: The possible values of T are 10, 16, 19 and 25 with respective probabilities
0.1, 0.3, 0.3 and 0.3; E(T) = 19, V(T) = 21.6
4. Five multinationals A, B, C, D and E offer scholarships to the students of a business
school. The scholarship amounts (in appropriate units) are as shown in the following
table:
Multinational        A  B  C  D  E
Scholarship Amount   4  2  4  5  2
A simple random sample of three (out of the five) multinationals is drawn without
replacement and the associated scholarship amounts are noted. Let X, Y and Z denote the
ordered values of these amounts in the sense that X ≤ Y ≤ Z. Also let T = (X + 3Y +
Z)/5. Obtain (a) the probability distributions of X, Y, Z and T, and (b) the joint probability
distribution of X and Z. Ans: P(X = 2) = 0.9, P(X = 4) = 0.1; P(Y = 2) = 0.3, P(Y =
4) = 0.7; P(Z = 4) = 0.4, P(Z = 5) = 0.6; T equals 12/5, 13/5, 18/5, 19/5 or 21/5 with
respective probabilities 0.2, 0.1, 0.2, 0.4 and 0.1; P(X = 2 & Z = 4) = 0.4, P(X = 2 & Z
= 5) = 0.5, P(X = 4 & Z = 4) = 0, P(X = 4 & Z = 5) = 0.1.
5. The manager of a casino plans the following game. Each player will be given a box
containing five coins of which one is gold (valued Rs 100), one is silver (valued Rs 25) and
three are ordinary (valued Re 1 each). The player draws two coins at random and without
replacement. Let A and B be the values (in rupees) of the two coins so drawn. Define X =
|A - B|. The player receives a payoff (X + 25) if neither of the two selected coins is ordinary;
the payoff is zero otherwise. Obtain (a) the probability distribution of X, (b) E(X) and (c)
the expected payoff. Ans: (a) Values 0, 24, 75, 99 with probabilities 0.3, 0.3, 0.1, 0.3, (b)
44.4, (c) 10.
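The casino game in Problem 5 can be checked the same way, by enumerating the 10 equally likely pairs of coins. This is only a sketch: X is taken to be the absolute difference of the two coin values, and coins are drawn by position so that the three ordinary coins count as distinct.

```python
from fractions import Fraction
from itertools import combinations

coins = [100, 25, 1, 1, 1]  # gold, silver, three ordinary coins

# Draw by index so the three Re 1 coins are treated as different coins.
pairs = list(combinations(range(len(coins)), 2))
weight = Fraction(1, len(pairs))

dist_X = {}
expected_payoff = Fraction(0)
for i, j in pairs:
    a, b = coins[i], coins[j]
    x = abs(a - b)
    dist_X[x] = dist_X.get(x, 0) + weight
    # Payoff is X + 25 only when neither drawn coin is ordinary.
    payoff = x + 25 if (a != 1 and b != 1) else 0
    expected_payoff += payoff * weight

expected_X = sum(x * p for x, p in dist_X.items())
print(sorted(dist_X.items()), float(expected_X), float(expected_payoff))
```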
6. On the desk of an executive, there are five letters with page lengths 2, 4, 6, 8 and 10.
Three letters are picked up at random and without replacement. Let A, B and C be the
page lengths of the selected letters, X = minimum(A, B, C) and Y = maximum(A, B, C).
Define M as the arithmetic mean of X and Y. Obtain (a) the probability distribution of M,
(b) E(M), (c) P(Y - X > 4) and (d) P(M - X > 2). Ans: (a) The possible values of M
are 4, 5, 6, 7, 8 with respective probabilities 0.1, 0.2, 0.4, 0.2, 0.1, (b) 6, (c) 0.7, (d) 0.7.
7. From a batch of five students, with scores 4, 6, 2, 8 and 10 in a quiz, two are selected
by simple random sampling without replacement. Let X and Y be the scores of these two
students. Find (a) P(|X - Y| > 2), (b) P((X+Y)/2 is even), and (c) E(X^2 + Y^2). Ans: (a)
0.6, (b) 0.4, (c) 88
8. A box contains five balls of weights (in kg) 3, 3, 7, 9 and 9. A simple random sample
of three balls is drawn without replacement. Let P, Q, R be the weights (in kg) of the three
balls in the sample, and Z = min(P, Q, R), Y = max(P, Q, R). Obtain (i) P(Y - Z > 2)
and (ii) E(Y - 2Z). Ans: (i) 0.9, (ii) 2
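Problems 6 to 8 all follow one pattern: list every equally likely sample, then average an indicator (for a probability) or a function (for an expectation) over the list. A small sketch of that pattern is given below; the helper names srswor, prob and expect are illustrative, and the lost minus signs in the problem statements are read as differences.

```python
from fractions import Fraction
from itertools import combinations

def srswor(population, k):
    """All equally likely samples of size k drawn without replacement."""
    return [tuple(population[i] for i in idx)
            for idx in combinations(range(len(population)), k)]

def prob(samples, event):
    """Probability of an event = fraction of samples in which it occurs."""
    return Fraction(sum(1 for s in samples if event(s)), len(samples))

def expect(samples, f):
    """Expectation of f = average of f over all samples."""
    return sum(Fraction(f(s)) for s in samples) / len(samples)

# Problem 6: letters with page lengths 2, 4, 6, 8, 10; three drawn.
letters = srswor([2, 4, 6, 8, 10], 3)
p6_range = prob(letters, lambda s: max(s) - min(s) > 4)
e6_mid = expect(letters, lambda s: Fraction(min(s) + max(s), 2))

# Problem 7: quiz scores; two drawn. The quantities are symmetric in X and Y,
# so unordered pairs are enough.
quiz = srswor([4, 6, 2, 8, 10], 2)
p7_diff = prob(quiz, lambda s: abs(s[0] - s[1]) > 2)
p7_even = prob(quiz, lambda s: (s[0] + s[1]) % 4 == 0)  # mean is an even integer
e7_squares = expect(quiz, lambda s: s[0] ** 2 + s[1] ** 2)

# Problem 8: ball weights; three drawn.
balls = srswor([3, 3, 7, 9, 9], 3)
p8_range = prob(balls, lambda s: max(s) - min(s) > 2)
e8 = expect(balls, lambda s: max(s) - 2 * min(s))

print(p6_range, e6_mid, p7_diff, p7_even, e7_squares, p8_range, e8)
```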

For solving the above problems, if the sample sizes are large
we will have to use large-sample approximations to the distributions of the sample mean $\bar{X}$ and the sample proportion $\hat{p}$, as discussed in
probability class.
The distribution of $\bar{X}$ can be approximated by a normal distribution with mean $\mu$ (population mean) and standard deviation $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation
and $n$ is the sample size, if the sample size is greater than or equal to 30 (a rule of thumb).
The distribution of $\hat{p}$ can be approximated by a normal
distribution with mean $p$ (population proportion) and standard deviation $\sqrt{p(1 - p)/n}$, where $n$ is the sample size,
if both $np$ and $n(1 - p)$ are greater than or equal to 5 or 10 (the rule of thumb depends on the
textbook that you are using).
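As a rough illustration of these two approximations, the sketch below builds the approximating normal distributions with Python's statistics.NormalDist. The numerical inputs (population mean 50, standard deviation 12, n = 36, and p = 0.4 with n = 100) are hypothetical and chosen only so that the rules of thumb are satisfied.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_approx(mu, sigma, n):
    """Normal approximation to the sample mean: N(mu, sigma / sqrt(n))."""
    return NormalDist(mu, sigma / sqrt(n))

def sample_proportion_approx(p, n):
    """Normal approximation to the sample proportion: N(p, sqrt(p(1-p)/n))."""
    return NormalDist(p, sqrt(p * (1 - p) / n))

# Hypothetical population: mean 50, sd 12, sample of n = 36 (n >= 30).
xbar = sample_mean_approx(mu=50, sigma=12, n=36)  # standard deviation 12/6 = 2
prob_mean_above_53 = 1 - xbar.cdf(53)             # P(Z > 1.5), about 0.067

# Hypothetical proportion: p = 0.4, n = 100, so np = 40 and n(1-p) = 60.
phat = sample_proportion_approx(p=0.4, n=100)
prob_phat_at_most_045 = phat.cdf(0.45)

print(round(prob_mean_above_53, 4), round(phat.stdev, 4))
```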

Useful Sampling Designs


Random sampling is the basic sampling design that is used in other designs too. We
discuss here three other important sampling designs that are often useful.

Stratified Random Sampling


A stratified random sample is obtained by dividing the population units into non-overlapping groups, called strata, and then selecting a random sample from each stratum.
The strata need to be chosen so that homogeneity within each stratum and heterogeneity
between strata are both as large as possible. Creating strata, of course, needs some
additional information. Also, the sampling frame of each stratum should be available.
For example, if a hotel (an airline) wants to estimate the proportion of customers
happy with the experience of staying (traveling), it is always a good idea to stratify the
customers into strata such as vacationers, business executives, etc. to enhance (reduce) the
accuracy (variance) of the estimate.
Good stratification leads to enhanced accuracy of the estimate of the population mean
(proportion). There are two standard rules to allocate the total sample size (n) to different
strata. We illustrate them with four strata and a total sample size n. Suppose the sizes
(standard deviations) of strata 1 to 4 are $N_1$ ($\sigma_1$), $N_2$ ($\sigma_2$), $N_3$ ($\sigma_3$) and $N_4$ ($\sigma_4$), and the population
size is $N = N_1 + N_2 + N_3 + N_4$.
The stratified sample mean $\bar{Y}_{St} = W_1 \bar{Y}_1 + W_2 \bar{Y}_2 + W_3 \bar{Y}_3 + W_4 \bar{Y}_4$ gives
an unbiased estimate of the population mean $\mu = W_1 \mu_1 + W_2 \mu_2 + W_3 \mu_3 + W_4 \mu_4$,
where $\mu_i$, $i = 1, 2, 3, 4$ are the strata means, $\bar{Y}_i$, $i = 1, 2, 3, 4$ are the sample means and
$W_i = N_i/N$, $i = 1, 2, 3, 4$ for the strata 1, 2, 3, 4 respectively.
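The stratified sample mean is simply a weighted average of the stratum sample means, with weights equal to the stratum shares of the population. A minimal sketch with hypothetical stratum sizes and sample means follows; the function name stratified_mean is illustrative.

```python
def stratified_mean(stratum_sizes, stratum_sample_means):
    """Weighted average of stratum sample means with weights N_i / N."""
    total = sum(stratum_sizes)
    weights = [size / total for size in stratum_sizes]
    return sum(w * ybar for w, ybar in zip(weights, stratum_sample_means))

# Hypothetical data for four strata.
sizes = [400, 300, 200, 100]            # N_1, ..., N_4 (so N = 1000)
sample_means = [10.0, 12.0, 15.0, 20.0]  # sample mean observed in each stratum

ybar_st = stratified_mean(sizes, sample_means)
print(ybar_st)  # 0.4*10 + 0.3*12 + 0.2*15 + 0.1*20
```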

Proportional Allocation: The sample sizes for the different strata are proportional to the
strata sizes.
The sample sizes for strata 1 to 4 are:
$n_1 = W_1 n$,
$n_2 = W_2 n$,
$n_3 = W_3 n$,
$n_4 = W_4 n$,
where $W_i = N_i/N$, $i = 1, 2, 3, 4$.
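Proportional allocation can be sketched in a few lines. The stratum sizes and total sample size below are hypothetical, and exact fractions are returned so that any rounding to whole units (which the rule itself does not specify) is left as a separate, explicit step.

```python
from fractions import Fraction

def proportional_allocation(stratum_sizes, n):
    """Allocate n in proportion to stratum sizes: n_i = (N_i / N) * n."""
    total = sum(stratum_sizes)
    return [Fraction(size, total) * n for size in stratum_sizes]

# Hypothetical strata with N = 1000 and a total sample of n = 50.
sizes = [400, 300, 200, 100]
alloc_prop = proportional_allocation(sizes, n=50)
print([int(a) for a in alloc_prop])  # exact here; in general rounding is needed
```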

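Neyman's Optimum Allocation:
The sample sizes for the different strata are allocated in proportion to $W_i \sigma_i$, $i = 1, 2, 3, 4$.
First calculate $\bar{\sigma} = W_1 \sigma_1 + W_2 \sigma_2 + W_3 \sigma_3 + W_4 \sigma_4$.
The sample sizes for strata 1 to 4 are then given by:
$n_1 = (W_1 \sigma_1 / \bar{\sigma})\, n$,
$n_2 = (W_2 \sigma_2 / \bar{\sigma})\, n$,
$n_3 = (W_3 \sigma_3 / \bar{\sigma})\, n$,
$n_4 = (W_4 \sigma_4 / \bar{\sigma})\, n$.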
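Neyman's rule differs from proportional allocation only in the weights: each stratum's share of the sample is its size weight times its standard deviation, divided by the sum of those products. The sketch below uses hypothetical stratum sizes and integer standard deviations; in practice the standard deviations would have to be known or estimated beforehand.

```python
from fractions import Fraction

def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n in proportion to W_i * sigma_i, where W_i = N_i / N."""
    total = sum(stratum_sizes)
    products = [Fraction(size, total) * sd
                for size, sd in zip(stratum_sizes, stratum_sds)]
    sigma_bar = sum(products)  # W_1*sigma_1 + ... + W_4*sigma_4
    return [prod / sigma_bar * n for prod in products]

# Hypothetical strata: smaller strata are assumed to be more variable.
sizes = [400, 300, 200, 100]
sds = [5, 10, 15, 20]
alloc_neyman = neyman_allocation(sizes, sds, n=50)
print([int(a) for a in alloc_neyman])
```

With these hypothetical numbers the smallest stratum receives 10 units instead of the 5 it would get under proportional allocation, because its larger standard deviation calls for more observations.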

Systematic Sampling
The idea of systematic sampling is simple. Suppose a sample of n shoppers visiting a mall
needs to be selected. The sampling frame is not known, but one may decide to sample
every 10th individual entering the mall. The first individual needs to be selected at random
from the first 10 shoppers entering the mall.
In other words, a 1-in-k systematic sample is obtained by randomly selecting one element
from the first k (10 in the shopper example) elements in the frame and every k-th element
thereafter until n elements are selected.
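The 1-in-k rule described above can be sketched as follows. The list of shoppers is hypothetical (in the mall example no frame exists and one would simply count arrivals), and the function name systematic_sample is illustrative.

```python
import random

def systematic_sample(frame, k, n, rng=random):
    """1-in-k systematic sample: random start in the first k, then every k-th."""
    start = rng.randrange(k)        # 0-based position among the first k elements
    picks = frame[start::k][:n]     # every k-th element, stop after n of them
    if len(picks) < n:
        raise ValueError("frame is too short to yield n elements at interval k")
    return picks

# Hypothetical arrival order of 100 shoppers, numbered 1 to 100.
shoppers = list(range(1, 101))
sample = systematic_sample(shoppers, k=10, n=5, rng=random.Random(0))
print(sample)
```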
A systematic sample is generally spread uniformly over the entire population, whereas a
random sample may not be. If there are pockets of heterogeneity in the population, it is better
to go for systematic sampling than random sampling. For example, in assessing the quality
of spices to be shipped for export, a systematic sample is usually taken from the container.
Auditors checking vouchers often use systematic sampling. For example, a 1-in-5
systematic sample of travel vouchers could be inspected to determine the proportion of
vouchers filled in incorrectly.
Question: Do the following statements make intuitive sense? Explain.
If the list of sampling units from which the systematic sample is to be drawn is in
random order, systematic sampling and random sampling would be equivalent. If
the values of the variable of interest in the list are in increasing or decreasing
order, systematic sampling would be better. If the values of the variable of interest
in the list show a periodic movement, the systematic sample could lead to under- or
overestimation unless the sampling interval is properly chosen.

Cluster Sampling
A cluster sample is a probability sample in which the sampling units are collections of
units, or clusters.
When a sampling frame is not available, cluster sampling can be useful. Collecting
data through cluster sampling is also convenient and economical.
Suppose we need to select a sample of college students in Ahmedabad. Each college
could be considered a cluster. So select a few colleges at random and get data from each
student in the selected colleges.
The other option is to divide Ahmedabad into blocks and treat each block
as a cluster. Select a number of blocks at random and then get data from each college
student in the selected blocks.
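Either choice of cluster (colleges or blocks) leads to the same mechanics: draw a simple random sample of clusters and then take every unit inside the selected ones. A minimal sketch with a hypothetical frame of four colleges is given below; the function name and the student identifiers are illustrative only.

```python
import random

def one_stage_cluster_sample(clusters, m, rng=random):
    """Pick m clusters by simple random sampling and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), m)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical frame: cluster (college) name -> list of student IDs.
colleges = {
    "College A": ["a1", "a2", "a3"],
    "College B": ["b1", "b2"],
    "College C": ["c1", "c2", "c3", "c4"],
    "College D": ["d1"],
}

sample = one_stage_cluster_sample(colleges, m=2, rng=random.Random(1))
for college, students in sample.items():
    print(college, students)
```

Note that only a list of clusters is needed to draw the sample; a list of individual students is required only for the clusters actually selected.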
Which one is the better option? It depends on the following considerations. The
objective of any survey design is to obtain a specified amount of information about the
population characteristic of interest (such as the population total, population mean or population
proportion) at minimum cost. The former would be a more economical option than the
latter. But at the same time, the measurements on students in the same college may
be highly correlated, and in such cases the amount of information may not increase much as new
measurements are taken within a cluster. From this latter point of view, the second option
could be better.
Notice that, from the latter consideration, the units within a cluster should be as heterogeneous as possible and the clusters themselves should be as similar to one another as
possible, which is exactly the opposite of stratified random sampling.
So in cluster sampling, to decide how many clusters and what sizes of clusters are
to be chosen, the above considerations should be carefully weighed.
