
Most Important Questions of QT

1. Explain Pareto law of income distribution


Solution,
In the course of studying income distribution macroeconomically, the Italian economist Vilfredo Pareto proposed a law of
income distribution based on an empirical study of income data from various countries of the world at different times, and
concluded that in all places and at all times, the distribution of income in a stable economy is given approximately by the
following empirical formula:

Y = A / (x − a)^v ..........................................(1)

Where;
Y = number of people having income x or greater
a = lowest income at which the curve begins
A and v = certain parameters
The Pareto law asserted above can also be explained graphically with the help of the sketch in fig. 1.

The Pareto curve sketched in fig. 1 is shaped like a hyperbola: as x → a, Y → ∞, and as x → ∞, Y → 0.

[Fig. 1. Pareto Curve: Y = A/(x − a)^v plotted against x, with the curve beginning at x = a]

Shifting the origin to (a, 0), i.e. setting a = 0, equation (1) above becomes:

Y = A x^(−v) ..........................................(2)

Taking logarithms on both sides of the above equation, we get:

ln Y = ln(A x^(−v))
     = ln A + ln x^(−v)     [since ln mn = ln m + ln n]
     = ln A + (−v) ln x     [since ln x^n = n ln x]
     = ln A − v ln x ..........................................(3)
The value of v gives a certain measure of the inequality of income distribution. Pareto's empirical study claims that the value of
v, i.e. the inequality coefficient of income distribution, ranges between 1.2 and 1.9 in many countries (i.e. 1.2 ≤ v ≤ 1.9), and the
average value of v can be considered approximately 1.5.
Meaning of v: Differentiating equation (3) with respect to ln x, we get:

d ln Y / d ln x = −v ......................................................(4)

Equation (4) implies that the percentage of people with higher and higher income falls in the economy as we move to higher and
higher income classes.
As stated above, in many countries the value of v varies from 1.2 to 1.9. As the value of v increases (above 1.2), the
Pareto curve becomes more and more convex to the origin. It implies that income inequality among the various classes
increases as the value of v increases, and vice versa.
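As a rough numerical illustration of equation (3), the inequality coefficient v can be estimated as minus the slope of the least-squares line of ln Y on ln x. The income levels and head counts below are made-up figures chosen to behave roughly like a Pareto law, not data from the text:

```python
import math

# Hypothetical data: x = income level, Y = number of people with income x or
# greater (illustrative figures only).
incomes = [10, 20, 40, 80, 160]
counts = [1000, 420, 180, 75, 32]

# Per equation (3), ln Y = ln A - v ln x, so v is minus the slope of the
# least-squares regression of ln Y on ln x.
lx = [math.log(x) for x in incomes]
ly = [math.log(y) for y in counts]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
v = -slope
print(round(v, 2))  # an inequality coefficient inside Pareto's 1.2-1.9 range
```

With these figures the fitted v comes out near 1.24, inside the 1.2–1.9 range the note cites.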
2. Write a note on Log-normal Distribution.
The lognormal distribution is derived from the normal distribution. If X is a random variable such that ln X (the natural log of X) is a
normally distributed variable with mean μ and variance σ², then X is said to follow the lognormal distribution.
Hence, the continuous random variable X is said to be lognormally distributed if its probability density function with
parameters μ and σ² is:
f(x; μ, σ²) = [1 / (xσ√(2π))] · e^(−(ln x − μ)² / (2σ²)) ; x > 0

Where, X is a positive random variable.
μ = mean (of ln X)
σ = standard deviation (of ln X)
π = a constant ≈ 22/7
e = a constant, e ≈ 2.71828

Features of log-normal distribution
i) It deals with the random variables which always take positive values only.
ii) Its curve is single-tailed and positively skewed, as shown in fig. 1.

iii) The mean and variance of the lognormal distribution are:

     Mean: E(X) = e^(μ + σ²/2) ; and Variance: Var(X) = e^(2μ + σ²)(e^(σ²) − 1)

iv) The distribution is unimodal, having only one mode, given by: Mode = e^(μ − σ²)
2

v) As the lognormal distribution is derived from the normal distribution, lognormal probabilities are evaluated using
   the table of the standard normal distribution. The probability of X lying between two values, say a and b, is given by:

   P(a ≤ X ≤ b) = ∫[a to b] [1 / (xσ√(2π))] · exp(−(ln x − μ)² / (2σ²)) dx
The lognormal distribution arises in problems of economics, biology, ecology, etc. In the field of economics, the lognormal
distribution is found mainly in the study of income size distribution, the distribution of inheritances, bank deposits, etc. It is
often used in financial and wealth distribution analysis.
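The moment formulas in feature (iii) can be checked by simulation: if ln X ~ N(μ, σ²), then X = e^Z for Z ~ N(μ, σ²). A minimal sketch with arbitrarily chosen parameters μ = 0.5 and σ = 0.4:

```python
import math
import random

random.seed(42)
mu, sigma = 0.5, 0.4  # hypothetical parameters of ln X

# Draw X = e^Z with Z ~ N(mu, sigma^2); X is then lognormally distributed.
xs = [math.exp(random.gauss(mu, sigma)) for _ in range(100_000)]

sim_mean = sum(xs) / len(xs)
sim_var = sum((x - sim_mean) ** 2 for x in xs) / len(xs)

# Closed-form moments from feature (iii) above.
theo_mean = math.exp(mu + sigma**2 / 2)
theo_var = math.exp(2 * mu + sigma**2) * (math.exp(sigma**2) - 1)

print(round(sim_mean, 3), round(theo_mean, 3))
print(round(sim_var, 3), round(theo_var, 3))
```

Note also that every simulated x is positive, matching feature (i).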
3. What are the features of Normal distribution?
Normal Distribution is the most important and frequently used continuous distribution. It is the limiting case of binomial
distribution when the number of trials (n) is very large and neither p nor q is very small.
A continuous random variable X is said to follow the normal distribution with parameters μ (mean) and σ (standard
deviation) if its probability density function (p.d.f.) is given by

f(x) = [1 / (σ√(2π))] · e^(−(x − μ)² / (2σ²)) ; −∞ < x < ∞
Where
X = continuous random variable
μ = mean
σ = standard deviation
π = a constant ≈ 22/7
e = a constant, e ≈ 2.71828
The normal distribution is denoted by X ~ N(μ, σ²) and is read as "X follows the normal distribution with mean μ and variance σ²".
If X is normally distributed with mean μ and variance σ², then the random variable Z defined by

Z = (X − μ) / σ

is called the standard normal variate, and Z is normally distributed with mean E(Z) = 0 and variance Var(Z) = 1.
The normal distribution has the following characteristics:
Characteristics of normal distribution
1. The graph of the function f(x) is the famous bell-shaped curve, as shown in fig. 1, and the curve is symmetrical about
   the line x = μ (z = 0).

   Fig. 1
2. Since the distribution is symmetrical, the mean, median and mode coincide, i.e.
   Mean = Median = Mode
   or, μ = Md = M0
3. Since μ = Md = M0, the line X = μ (Z = 0) divides the area under the curve into two equal parts (0.5 each), i.e.
   P(−∞ < X < μ) = 0.5 and P(μ < X < ∞) = 0.5
   P(−∞ < Z < 0) = 0.5 and P(0 < Z < ∞) = 0.5
4. The quartiles are equidistant from median i.e. Q3-Md=Md-Q1
5. The distribution is unimodal (i.e. having only one mode)
6. The curve is symmetrical and mesokurtic. Thus the coefficient of skewness (β1) = 0 and the coefficient of kurtosis (β2) = 3.
7. The maximum probability occurs at x = μ and its value is

   P(x)max = 1 / (σ√(2π))
8. The curve of the distribution is asymptotic to the X-axis, i.e. it comes closer and closer to the X-axis on both sides but never
   touches it.
9. The probability that X lies in an interval around the mean (μ) is given by
   P(μ − σ < X < μ + σ) = 0.683
   P(μ − 2σ < X < μ + 2σ) = 0.954
   P(μ − 3σ < X < μ + 3σ) = 0.997
   In other words, 68.3% of the observations fall within the mean plus or minus one standard deviation, 95.4% of the
   observations fall within the mean plus or minus two standard deviations, and so on.
10. A linear combination of independent normal variates is also a normal variate. If x1, x2, ..., xn are independent normal
    variates with means μ1, μ2, ..., μn and standard deviations σ1, σ2, ..., σn respectively, then their linear combination
    a1x1 + a2x2 + ... + anxn is also a normal variate with mean = a1μ1 + a2μ2 + ... + anμn and variance = a1²σ1² + a2²σ2² + ...
    + an²σn²,
    where a1, a2, ..., an are constants.
11. Quartile deviation Q.D. = (Q3 − Q1)/2 = 0.6745σ = (5/6) M.D. = (2/3)σ
    Where M.D. = mean deviation
12. If X is a binomial variable, then for large n the variable (X − np)/√(npq) is a standard normal variable, where
    np = mean of the binomial distribution
    √(npq) = standard deviation of the binomial distribution.
13. The location and shape of the curve of the normal distribution depend on the values of μ and σ. If μ increases, the
    curve shifts rightwards, and if σ increases, the curve becomes flatter.
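Characteristic 9 (the 68–95–99.7 rule) is easy to verify empirically. A small sketch with an arbitrarily chosen μ = 50 and σ = 10:

```python
import random

random.seed(0)
mu, sigma = 50, 10  # hypothetical mean and standard deviation
xs = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    """Fraction of draws falling inside mu ± k·sigma."""
    return sum(mu - k * sigma < x < mu + k * sigma for x in xs) / len(xs)

# Should come out near 0.683, 0.954 and 0.997 respectively.
print(round(within(1), 3), round(within(2), 3), round(within(3), 3))
```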

4. Find the mean and variance of a binomial distribution.


Solution:
A discrete random variable X is said to follow the binomial distribution with parameters n and p if its probability mass function
(pmf) is given by

f(x) = P(X = x) = nCx p^x q^(n−x) ; x = 0, 1, 2, ..., n

Where
n = number of trials
p = probability of success
q = probability of failure, and p + q = 1
x = number of successes
Thus the mean of the binomial distribution is

μx = E(X) = Σ x·f(x)                                          [X is a discrete random variable]

or, μx = Σ x · nCx p^x q^(n−x)

       = Σ x · [n! / ((n − x)! x!)] p^x q^(n−x)               [since nCr = n! / ((n − r)! r!)]

       = Σ x · [n(n − 1)! / ((n − x)! x(x − 1)!)] p^x q^(n−x)

       = np Σ [(n − 1)! / ((n − x)! (x − 1)!)] p^(x−1) q^(n−x)

       = np Σ [(n − 1)! / ({(n − 1) − (x − 1)}! (x − 1)!)] p^(x−1) q^{(n−1)−(x−1)}

       = np Σ (n−1)C(x−1) p^(x−1) q^{(n−1)−(x−1)}

       = np (q + p)^(n−1)                                     (using the binomial expansion)

       = np (1)^(n−1)                                         [since p + q = 1]

       = np

∴ μx = E(X) = np.
Again, we know that for a discrete random variable X,

Var(X) = E(X²) − [E(X)]²

       = Σ x² f(x) − [μx]²

       = Σ {x(x − 1) + x} nCx p^x q^(n−x) − [μx]²

       = Σ x(x − 1) nCx p^x q^(n−x) + Σ x nCx p^x q^(n−x) − [μx]²

       = Σ x(x − 1) · [n! / ((n − x)! x!)] p^x q^(n−x) + Σ x f(x) − [μx]²

       = Σ x(x − 1) · [n(n − 1)(n − 2)! / ((n − x)! x(x − 1)(x − 2)!)] p^x q^(n−x) + μx − [μx]²     [since Σ x·f(x) = μx]

       = n(n − 1)p² Σ [(n − 2)! / ((n − x)! (x − 2)!)] p^(x−2) q^(n−x) + μx − [μx]²

       = n(n − 1)p² Σ [(n − 2)! / ({(n − 2) − (x − 2)}! (x − 2)!)] p^(x−2) q^{(n−2)−(x−2)} + μx − [μx]²

       = n(n − 1)p² Σ (n−2)C(x−2) p^(x−2) q^{(n−2)−(x−2)} + np − [np]²                              [since μx = np]

       = n(n − 1)p² (q + p)^(n−2) + np − n²p²                 (using the binomial expansion)

       = n(n − 1)p² (1)^(n−2) + np − n²p²

       = n(n − 1)p² + np − n²p²

       = n²p² − np² + np − n²p²

       = np − np²

       = np(1 − p)

       = npq                                                  [since p + q = 1, so 1 − p = q]

∴ Var(X) = npq.

Thus the mean of the binomial distribution is np and the variance is npq.
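The result can be sanity-checked by simulation. The sketch below draws binomial samples (with arbitrarily chosen n = 20 and p = 0.3) and compares the sample mean and variance with np and npq:

```python
import random

random.seed(1)
n, p = 20, 0.3        # hypothetical parameters; np = 6, npq = 4.2
trials = 100_000

# Each observation is the number of successes in n Bernoulli(p) draws.
xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(round(mean, 2), round(var, 2))  # should land near 6 and 4.2
```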
5. Define Random Variable. Define the mathematical expectation of a random variable.
The set of all possible outcomes of a random experiment is called a sample space and a variable which takes on different
numerical values as a result of a random experiment is called a random variable. In other words, a random variable is a rule that
relates each outcome of the sample space to some real number.
The random variables are also called stochastic variables, chance variables, variates etc. They are denoted by capital letters X,
Y, Z etc and their particular values are denoted by lower case letters x,y,z etc.
For example, suppose that a coin is tossed twice so that the sample space is S = {HH, TH, HT, TT}. If we define the random
variable (X) as the number of heads that can come up, the values of the random variable X are given below:
Table 1
Sample point HH HT TH TT
Random variable (x) 2 1 1 0
Thus, the values of the random variable are the numerical values corresponding to each possible outcome of the experiment. A
random variable may be discrete or continuous. A random variable that takes on a finite or countably infinite number of values
is called a discrete random variable while one which takes on an uncountably infinite number of values is called a continuous
random variable.
Mathematical Expectation:
The mathematical expectation or expected value or mean value of a discrete random variable is the weighted average of all
possible values of the random variable, where the weights are the probabilities associated with the corresponding values. For a
discrete random variable X having the possible values x1, x2, ..., xn with corresponding probabilities p1, p2, ..., pn
respectively, the expectation of X is defined as:

E(X) = p1x1 + p2x2 + ... + pnxn = Σ[i = 1 to n] pi·xi

Or, E(X) = Σ x·p(x)


Thus, the expected value is obtained by multiplying each value of X by its associated probability and taking the sum of these
products.
For a continuous random variable X having probability density function f(x), the expectation of X is defined as:

E(X) = ∫[−∞ to ∞] x f(x) dx

The mathematical expectation of X is very often called the mean of X and is denoted by μx or simply μ. The mean or
expectation of X gives a single value that acts as a representative or average of the values of X.
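For the coin-tossing example of Table 1, X takes the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4, and the definition above gives E(X) = 1. A one-line check:

```python
# P(X = x) for X = number of heads in two fair coin tosses (Table 1 above).
probs = {0: 0.25, 1: 0.50, 2: 0.25}

# E(X) = sum of x * p(x) over all possible values of X.
expectation = sum(x * p for x, p in probs.items())
print(expectation)  # 1.0
```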
6. Prove that the mathematical expectation of the sum of two random variables is equal to the sum of their expectations.
The above statement can be written mathematically as follows:
E(X+Y) = E(X)+E(Y)
Where X and Y are two random variables.
For simplicity, we take the case of discrete random variables. Let X and Y be discrete random variables and f(x, y) be the joint
probability mass function of X and Y. Then,

E(X + Y) = Σx Σy (x + y) f(x, y)

         = Σx Σy x f(x, y) + Σx Σy y f(x, y)

         = Σx x [Σy f(x, y)] + Σy y [Σx f(x, y)]

         = Σx x·g(x) + Σy y·h(y)

where g(x) is the marginal probability function of X and h(y) is the marginal probability function of Y. Then,

∴ E(X + Y) = E(X) + E(Y)
Following similar logic, we can prove the above statement when X and Y are two continuous random variables.
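The identity holds even when X and Y are dependent, since the proof never uses independence. A quick numerical check on a small made-up joint pmf:

```python
# A made-up joint pmf for two dependent discrete variables X and Y.
# (P(X=0)·P(Y=0) = 0.5 * 0.4 = 0.2 != 0.3, so X and Y are not independent.)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

e_sum = sum((x + y) * p for (x, y), p in joint.items())
e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
print(round(e_sum, 6), round(e_x + e_y, 6))  # both 1.1
```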
7. Prepare a worksheet for Gini coefficient.
Solution:
The Gini coefficient, or Gini concentration ratio, is one of the most widely used methods of
measuring inequality in economics. The Gini coefficient is the ratio of the area between
the Lorenz curve and the 45° line (perfect equality line) to the area of the triangle below
the 45° line. Symbolically:

Gini coefficient (Gc) = (Area between the 45° line and the Lorenz curve) / (Area below the 45° line)

[Figure: Lorenz curve diagram in the triangle OXN, with the 45° line from O to N; A is the area between the 45° line and the Lorenz curve, and B is the area below the Lorenz curve]

If the area between the 45° line and the Lorenz curve is A, and the area below the Lorenz curve within the triangle OXN
is B (as shown in the figure), then the Gini coefficient is:

Gc = A / (A + B) ............................(1)
The farther the Lorenz curve lies from the 45° line, the larger the value of the Gini coefficient. That is, the bigger the value of A, the bigger
the value of the Gini coefficient, since (A + B) remains constant.
When there is perfect equality in the variable under consideration, the diagonal line itself is the Lorenz curve,
implying that the area A is equal to zero. Thus Gc is zero.
If there is perfect inequality in the variable under consideration, the line OXN is the Lorenz curve, in which
case the area B is equal to zero. Thus Gc is 1.
Therefore, 0 ≤ Gc ≤ 1, i.e. the value of Gc ranges from 0 to 1.
In the course of empirical analysis, researcher may encounter grouped or ungrouped data sets. The worksheets for both types of
data sets are discussed below.
i. For grouped data set case:
We know that the formula for the Gini coefficient (Gc) in the grouped data set case is:

Gc = [1 / (100)²] Σ [Xi·Yi+1 − Xi+1·Yi] ..........................(2)

Where,
Xi = cumulative percentage (%) of the frequency
Yi = cumulative percentage (%) of the variable
Here, if we calculate the percentages after cumulating the concerned variable, then the aggregate percentage will be exactly 100;
otherwise it may not be.
Worksheet to Calculate the Gini coefficient (Gc):
Suppose we have to calculate the Gini coefficient of the data set given as:

Variable:  v1  v2  ...  vn
Frequency: f1  f2  ...  fn

Arranging the data in ascending order as v1 ≤ v2 ≤ v3 ≤ ... ≤ vn:

| vi | fi | % of vi = pi      | % of fi = qi      | Cumulative pi      | Cumulative qi      | xi·yi+1        | xi+1·yi        |
|----|----|-------------------|-------------------|--------------------|--------------------|----------------|----------------|
| v1 | f1 | (v1/Σvi)·100 = p1 | (f1/Σfi)·100 = q1 | p1 = x1            | q1 = y1            |                |                |
| v2 | f2 | (v2/Σvi)·100 = p2 | (f2/Σfi)·100 = q2 | p1 + p2 = x2       | q1 + q2 = y2       | x1·y2          | x2·y1          |
| .  | .  | .                 | .                 | .                  | .                  | x2·y3          | x3·y2          |
| .  | .  | .                 | .                 | .                  | .                  | .              | .              |
| vn | fn | (vn/Σvi)·100 = pn | (fn/Σfi)·100 = qn | p1 + ... + pn = xn | q1 + ... + qn = yn | xn−1·yn        | xn·yn−1        |
|Σvi |Σfi | Σpi = 100         | Σqi = 100         |                    |                    | Σ xi·yi+1 = K1 | Σ xi+1·yi = K2 |

If we let Σ Xi·Yi+1 = K1 and Σ Xi+1·Yi = K2, and substitute these values in equation (2), we get the Gini coefficient:

Gc = [1 / (100)²] (K1 − K2)          [0 ≤ Gc ≤ 1]
ii. For ungrouped data set case:
Let y1, y2, ..., yn be the data set given. Then we use the following formula to calculate Gc:

Gc = 1 + 1/n − [2 / (n²·ȳ)] [n·y1 + (n − 1)·y2 + ... + yn] .................(3)

Where,
y1 ≤ y2 ≤ ... ≤ yn (data arranged in ascending order)
n = total number of observations
ȳ = mean value of the variable = Σy / n
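Formula (3) translates directly into code. A small sketch (the income figures below are invented for illustration):

```python
def gini(ys):
    """Gini coefficient of an ungrouped data set via formula (3):
    Gc = 1 + 1/n - [2 / (n^2 * ybar)] * [n*y1 + (n-1)*y2 + ... + yn],
    with the data arranged in ascending order (y1 <= y2 <= ... <= yn)."""
    ys = sorted(ys)
    n = len(ys)
    ybar = sum(ys) / n
    weighted = sum((n - i) * y for i, y in enumerate(ys))  # weights n, n-1, ..., 1
    return 1 + 1 / n - 2 * weighted / (n * n * ybar)

print(round(gini([50, 50, 50, 50]), 3))       # perfect equality -> 0.0
print(round(gini([10, 20, 30, 40, 100]), 3))  # unequal incomes -> 0.4
```

As expected, perfectly equal values give Gc = 0, and more dispersed values push Gc toward 1.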
8. Sampling is a necessity for any investigation. Justify and discuss this statement with the help of the suitable examples.
Solution:
For any research work, investigation is regarded as an important part. It explores new and hidden facts.
Generally, investigation is made from the observation of the population under study. In most of the cases, it is not possible to
take information from all of the units of the population as such few units from the population are selected as samples and
information is taken from those samples only. On the basis of that, conclusions are drawn about the characteristics of the
population.
If the population is infinite, then investigation of all the units of the population is almost impossible. For example, it is
difficult to study all the fish in the sea, or to observe all the blood in a human body, etc. In such cases, we
have to take samples from those populations to draw conclusions about the population under study. On the other hand, when
population is finite, it may not be feasible to study all the units of the population due to limitations regarding time and other
resources. Thus, sampling becomes a necessity for any investigation. The main reasons for making it such a necessary part of
the investigation can be enlisted as below:
A. Sampling consumes less time and money as compared to a complete study of all population units. Since it takes only a few
   elements from the population, the total cost of a sample survey is much less than the total cost of a census survey when
   the cost of enumerating a sampling unit is equal to the cost of enumerating a population unit. Similarly, the
   collection, editing, classification, analysis and interpretation of the data require much less time as compared to a census
   survey. This saving of time is of great importance when the information is urgently required.
B. Sampling remains the only way when population contains infinite members because the complete enumeration is
impossible in such cases. For instance, the number of fish in the sea or the number of wild elephants in a dense forest can
be estimated only by sampling method.
C. Sampling remains the only choice when a test involves the destruction of the item under study. For example:
i) To estimate the average life of the bulbs or tubes in a given consignment.
ii) To determine the composition of a chemical salt.
iii) To test the breaking strengths of chalks manufactured in a factory or to estimate the tensile strength of the steel rods.
iv) To test the quality of explosives, crackers, shells.
v) To test the blood of the human body, etc.
D. Complete enumeration of the population, or a census survey, is inappropriate for a hypothetical population. In such
   cases, a sample survey is the only alternative to get information about the population. For example:
   i) To study crashed aircraft, a sample survey is the only way.
   ii) In the problem of throwing a die or tossing a coin, where the process may continue a large number of times or
   indefinitely, a sample survey is the only way.
E. Highly skilled or qualified human resources can be used along with excellent instruments in the sample survey but it
may be impracticable in census survey. It means that sample survey is generally conducted by trained & experienced
investigators.
F. A carefully designed and scientifically executed sample survey gives results that are more reliable than those
   obtained from a complete census, because it is always possible to ascertain the extent of the sampling error and the degree of
   reliability of the results. Similarly, follow-up work in case of non-response or incomplete response can be more effectively
   undertaken in a sample survey than in a census. The effective reduction of the non-sampling errors in a sample survey
   more than compensates for the errors in the estimates due to the sampling procedure, and thus provides relatively more accurate
   and reliable results.
Conclusion:
Though in the case of a finite population of very small size it may seem relevant to take data from all the items of the population, sampling
becomes an essential part of the study from the standpoint of the time to be consumed, the budget and manpower to be used, and the
degree of accuracy of the results. All these considerations reinforce the conclusion that sampling is a necessity for any investigation.
9. Distinguish between sampling and nonsampling errors.
In sampling, we take a small part of the population as our study units and estimate the population parameters. The values
calculated from the sample may differ from the true values of the population parameters for many reasons, for example, a faulty
sampling technique, poor sample design, wrong responses, non-response, typing errors, etc. The errors arising from a faulty
sampling technique are called sampling errors, while the human mistakes made during the implementation of the sample design are
called non-sampling errors.
The differences between them can be listed in the table below:
Sampling error                                               | Non-sampling error
-------------------------------------------------------------|-------------------------------------------------------------
1. Involved in the process of sampling and attributed to     | 1. Arises mainly due to the human factor.
   chance or probability.                                    |
2. Completely absent in a census survey; occurs only in a    | 2. Occurs in both census and sample surveys; its magnitude
   sample survey.                                            |    is larger in a census survey than in a sample survey.
3. A measure of the sampling error is provided by the        | 3. There is no way to measure non-sampling errors.
   standard error of the estimate.                           |
4. Occurs mainly due to an improper choice of sample design; | 4. Arises mainly due to improper planning of the survey.
   for example, in the study of income distribution,         |
   purposive sampling is less appropriate than stratified    |
   random sampling.                                          |
5. Main sources: improper choice of sampling technique,      | 5. Main sources: faulty planning, response error,
   substitution of units, improper choice of statistic,      |    non-response bias, compiling errors, coverage errors,
   variability in the population.                            |    publication errors.
6. Decreases as the sample size increases and vice versa;    | 6. Increases with the number of units to be examined and
   thus it can be minimized by increasing the sample size.   |    enumerated; therefore it cannot be minimized by
                                                             |    increasing the sample size.

10. Discuss the Procedure of Simple Random Sampling.


It is the simplest and most common method of sampling. Simple random sampling (SRS) is the technique in which "samples are
so drawn that each and every unit in the population has an equal and independent chance of being included in the sample". In
other words, samples are drawn unit by unit with an equal probability of selection for each unit at each draw.
In SRS, we can select the samples out of a population of size N in two ways:
i) Simple random sampling without replacement (SRSWOR).
ii) Simple random sampling with replacement (SRSWR).
If a unit or element selected at a draw is removed from the population before the next draw, then the method is known as SRSWOR. At each
draw, each unit in the population has the same probability of being included in the sample.
If sampling is done without replacement, then there are NCn possible samples of size n.
On the other hand, if the unit selected at any draw is replaced, or returned to the population, before making the next draw, then
the method is known as simple random sampling with replacement (SRSWR). This always amounts to sampling from an infinite
population, even though the population is finite.
If a sample of size n is drawn with replacement from a population of size N, then there are N^n possible samples.
Procedure of Simple Random Sampling
The procedure for simple random sampling can be summarized as below:
Step 1 Identify the Population
First of all, the size of the population under study should be clearly identified.
Step 2 Determine the Sample Size.
The size of the samples to be taken should be determined from the standpoint of representativeness and minimum resource use.
If the sample size is too small, it may not represent the population under study and if it is large, it may demand a lot of resources
and manpower and in some cases, the size of the non-sampling errors becomes quite large.
Step 3 Select the Sample using the lottery method or random numbers table method.
The two methods used to draw samples are:
Lottery method.
Use of random number tables.
Lottery Method: It is one of the simplest methods of drawing/selecting the samples from the given population. It is mostly used
in the random draw of prizes and others.
In this method, each unit of the population is recorded on a slip, card or ticket, assigning a distinct number like 1, 2, 3,
4, ... etc. These slips should be as homogeneous as possible in terms of shape, size, color etc. so that no human bias may
arise. Then the slips are uniformly folded and thoroughly mixed up in a bag or container, depending upon the population size. The
required number of samples is then selected by choosing the slips randomly.
Use of Random Number Tables:
For an infinitely large population, the lottery method is not appropriate because it is quite time-consuming and cumbersome.
In such conditions, a random number table is generally used to avoid these difficulties. The most practical and inexpensive
method of selecting a random sample consists in the use of 'Random Number Tables', which have been so constructed
that each of the digits 0, 1, 2, ..., 9 appears with approximately the same frequency and independently of the others.
The method of drawing a random sample comprises the following steps:
i) Identify all the units of the population with the numbers 1 to N.
ii) Select at random any page of the random number table and pick up the numbers in any row, column or diagonal at
random.
iii) The units of the population corresponding to the selected numbers constitute the random sample.
Some random number tables in common use are:
Tippett's random number tables.
Fisher and Yates tables.
Kendall and Smith tables.
A million random digits.
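Both selection schemes can be sketched with Python's standard library; the population of N = 100 numbered units below is hypothetical:

```python
import random

random.seed(7)
population = list(range(1, 101))  # units numbered 1..N, with N = 100
n = 10

# SRSWOR: no unit can appear twice; there are NCn possible samples.
srswor = random.sample(population, n)

# SRSWR: a drawn unit is returned before the next draw; N^n possible samples.
srswr = random.choices(population, k=n)

print(sorted(srswor))
```

In practice, `random.sample` and `random.choices` play the role of the lottery method or the random number table.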

11. Discuss the Procedure of Systematic Sampling.
Solution:
In this type of sampling first of all, a list is prepared of all elements of the population under study on the basis of a selected
criterion e.g. alphabetical order. The first element is chosen by simple random sampling and the remaining units are selected at
a definite interval.
Let N be the size of the population and n the sample size to be drawn; then the sampling interval (k) is given by
k = N/n
Thus, if the ith unit is selected at random as the first unit, then the remaining units are selected as i, i+k, i+2k, ...,
i+(n−1)k.
For example, if the number of students in an MBA program is 100 and the sample size required is 20, the sampling interval is k = 100/20 = 5.
If the random start is 4, the samples are the 4th, 9th, 14th, 19th, ..., 99th units.
Procedure of Systematic Random Sampling
The procedure for systematic random sampling includes the following steps.
a) Identify and Define the Population:
   First of all, the population to be studied is precisely and clearly defined. It must be clear which objects or elements are to be included
   in the population and which elements are not to be included.
b) Determine the Desired sample size:
Next the size of the sample is determined. It depends on the level of precision required.
c) Obtain a list of the population:
   The main requirement of systematic random sampling is to have a complete and updated list of the population units, and
   they must be arranged in some systematic order such as alphabetical, chronological, geographical etc. The characteristic
   selected for this purpose must be relevant to the problem under study.
d) Determine the sampling fraction
In this stage, the sampling fraction (k) is determined by the formula.
K= N/n
Where K= sampling fraction/interval
N= size of population
n= size of the sample required.
For example if there are 400 units in the population and we want a sample of size 40, the sampling interval (k)=N/n
K= 400/40=10
e) Make a random start: Among the systematically ordered sampling units, the first unit is selected at random by using
   simple random sampling. Thus every sampling unit in the first interval has an equal probability of being selected as the random start.
f) Select every kth item/element: After making a random start, every kth unit is selected. If the random start is the ith unit, then the
   elements i+k, i+2k, i+3k, ..., i+(n−1)k are selected.
Merits of Systematic Random Sampling
Simple, direct and inexpensive: Systematic sampling saves time and labor because it can be operated in a very short time.
Efficient: Systematic sampling is an efficient approach if the frame is complete.
Checking can be done quickly.
It is a patterned/serial sampling.
Demerits
Systematic random sampling is inefficient when a complete and updated frame is not available.
If there is hidden periodicity in the list, the sample will be biased and prove to be inefficient.
It cannot be used when exploring an unfamiliar area, because listing of the elements is not possible.
If the elements are not arranged in an ordered manner, it won't be very reliable.
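The MBA example above (N = 100, n = 20, k = 5) can be sketched as:

```python
import random

random.seed(3)
N, n = 100, 20   # population and sample size from the example above
k = N // n       # sampling interval, k = 5

start = random.randint(1, k)                # random start in 1..k
sample = [start + j * k for j in range(n)]  # units i, i+k, i+2k, ..., i+(n-1)k

print(sample)
```

A random start of 4 would reproduce the 4th, 9th, 14th, ..., 99th units listed above.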
12. Describe the procedure for stratified random sampling. Under what condition is stratified random sampling reduced
to simple random sampling? Discuss.
A limitation of simple random sampling is that it may not represent the characteristics of the population if the population under study is heterogeneous
in nature. In this case, researchers sometimes prefer to increase the sample size to reduce the sample bias. Alternatively, instead
of increasing the sample size, one can use stratified random sampling to avoid the sample bias.
Stratified random sampling is thus used to obtain a more efficient estimator when the population elements to be sampled are
heterogeneous in nature. For example, suppose we are going to study the per capita income of people in different
parts of Nepal. Due to different constraints, it is not possible to enumerate all the units of the population. If we adopt the technique
of simple random sampling to draw the samples from this population, all samples may belong to only one group, which
cannot represent the total population. In such a scenario, stratified random sampling can be taken as the best way of representing
the total population, i.e. each income group (lower, middle and upper income groups in the above example).
Procedure: In stratified random sampling, the population is divided into different groups called strata. The process of dividing
the total population into different strata is known as stratification.
Stratified random sampling involves the following steps:
Step 1: Stratify the given population into different strata, such that:
a) The elements or units within each stratum (sub-population) are as homogeneous as possible.
b) The strata are as heterogeneous as possible from one another.
c) Each and every unit in the population belongs to one and only one stratum, i.e. the various
   strata are non-overlapping.
Mathematically;
Let N be the total population size under study and n be the total sample size. It means that sample size of n has to be drawn from
population size of N.
If we use stratified random sampling to select the sample from the given population, then the population
of size N is divided into different sub-groups/sub-populations N1, N2, ..., Nk:
N = N1 + N2 + N3 + ... + Nk
where k denotes the total number of strata. While dividing the total population into different strata, elements/units having similar
characteristics must be placed in the same stratum.
Step 2: Determine the Required Sample Size (n)
Step 3: Allocation of sample size in stratified sampling:
In this step, the sample size for a stratum (ni) is estimated into three ways which are given below;
a) Proportional allocation
b) Optimum allocation
c) Disproportionate Allocation
a) Proportional allocation: In this method, the elements are drawn from each stratum in the same proportion as they exist
   in the population. Mathematically:

   n1/N1 = n2/N2 = ... = nk/Nk = n/N

   Since the total sample size n and the population size N are fixed, then:

   n1 = N1·(n/N), n2 = N2·(n/N), ...

   Thus, ni = Ni·(n/N), i = 1, 2, ..., k.
b) Optimum Allocation: In this method, nk is drawn from the kth stratum under the condition that the cost C and Var(x̄st) are
   minimized, and the sample size of the kth stratum is given as:

   nk = n·(Wk·Sk/√Ck) / Σ(Wk·Sk/√Ck)

   Where C = C0 + Σ Ck·nk (cost function)
   C0 = overhead cost
   C = total cost
   Ck = cost per unit in the kth stratum
   Wk = Nk/N = weight of the kth stratum, and Sk = standard deviation of the kth stratum.
c) Disproportionate Allocation:
   Here, an equal number of elements is taken from every stratum regardless of how the stratum is represented in the population.
   Sometimes the proportion may also simply vary from stratum to stratum.
Step 4: Draw a sample of size ni through simple random sampling (without replacement) from each of the k strata.
Mathematically, let n be the total sample size drawn from a population of size N. After stratifying the total population into k sub-
groups, i.e. strata, the samples from the different strata are selected through SRSWOR. Suppose a sample of size n1 is drawn from
stratum N1, n2 from stratum N2, and so on. Therefore the total sample size n is equal to:
n = n1 + n2 + ... + nk ..........(B)
Stratified random sampling reduces to simple random sampling under the condition that the population consists of
homogeneous elements/units. In such a condition, there is no need to stratify the population into different strata or groups.
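Proportional allocation (Step 3a) is a one-liner per stratum. The stratum sizes below are made-up figures for the three income groups in the Nepal example:

```python
# Hypothetical stratum sizes N1, N2, N3 for the three income groups.
strata_sizes = {"lower": 500, "middle": 300, "upper": 200}
N = sum(strata_sizes.values())  # total population, N = 1000
n = 100                         # required total sample size

# ni = Ni * (n / N) for each stratum i (proportional allocation).
allocation = {name: round(Ni * n / N) for name, Ni in strata_sizes.items()}
print(allocation)  # {'lower': 50, 'middle': 30, 'upper': 20}
```

Rounding can make the allocations miss n by a unit or two in general; with these figures they sum to exactly 100.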

13. Point Estimation and Interval Estimation
Estimation is a process by which the population characteristics (mean, variance etc.) are estimated on the basis of calculations
based on samples. In other words, in estimation, we do statistical calculations on sample data and assign to the population
parameter a value which we guess to be very close to its true value.
Estimation can be done in two ways:
1. Point Estimation
2. Interval Estimation
1. Point Estimation
A point estimate is a single number or a single value of the statistic derived from sample observation that is used to estimate an
unknown population parameter. Thus, the procedure of assigning a numerical value to the population parameter on the basis of
sample statistics is called point estimation. The main aim of the point estimation is to find a single value which is the best guess
of the parameter. For example, if we find the sample mean as x̄ = 20 and estimate the population mean as μ = 20, it is called point
estimation.
Merits of point Estimation
1. A point estimate is exact and accurate as it provides a single value.
2. When economists want to forecast the single value of the economic variables like GDP growth rate, money supply,
investment etc, point estimation is the only alternative.
3. Easier to calculate than interval estimate
Demerits of point estimation
1. There is a large degree of error and risk in obtaining it.
2. Being a single value, it can be right or wrong.
3. Here level of significance is not taken into account.
4. It is less reliable.
2. Interval Estimation
An interval estimate is a range of values used to estimate a population parameter. It gives a range of values within which the
population parameter to be estimated is expected to lie with some given level of probability. For example; if an economist claims
that the economic growth of the next fiscal year will be between 4 to 6 percent, then it is interval estimation.
An interval estimate of the population parameter θ is an interval of the form C1 < θ < C2, where C1 and C2 are the lower and upper
limits of the interval estimate respectively. Then P(C1 < θ < C2) = 1 − α, where α is the level of significance. The above statement
says that the population parameter θ is expected to lie in the interval C1 to C2 with a confidence of (1 − α). 1 − α is called the
confidence coefficient.
In short, interval estimation refers to the estimation of a parameter by an interval, known as confidence interval. The confidence
limit is the range of values within which the population parameter is supposed to lie. The end points of the interval of C1 and C2
are called the lower and upper confidence limits respectively.
For example, the interval estimate for the population mean in case of a large sample for α = 5% can be written as:
x̄ ± 1.96(σ/√n)
Where, n = sample size, σ = population S.D. and x̄ = sample mean.
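As an illustration, the interval above can be evaluated for made-up sample figures (x̄ = 20, σ = 5, n = 100 are assumed values, not from the text):

```python
import math

# 95% confidence interval for the population mean (large sample),
# using x̄ ± 1.96·(σ/√n). The inputs are hypothetical.

def interval_estimate(x_bar, sigma, n, z=1.96):
    margin = z * sigma / math.sqrt(n)
    return (x_bar - margin, x_bar + margin)

lower, upper = interval_estimate(x_bar=20, sigma=5, n=100)
print(f"point estimate: 20, interval estimate: ({lower:.2f}, {upper:.2f})")
# point estimate: 20, interval estimate: (19.02, 20.98)
```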
Merits of interval estimation
1. The degree of risk of obtaining it is very low.
2. Here level of significance is taken into account.
3. It is more reliable because it indicates error in two ways
By the extent of its range
By the probability of true population parameter lying within this range.
Demerits of interval estimation
1. It can't provide exact and to the point estimate.
2. More difficult to calculate than point estimate.
3. It can't be used in the cases where we want a single value to be forecasted.
14. Characteristics of a Good Estimator:
Any sample statistic that is used to draw an inference (conclusion) regarding the population parameter is called an estimator. An
estimator which is close to the true value of the population parameter is called a good estimator.
The main features of a good estimator are:
1. Unbiasedness:
An estimator θ̂ which is used to estimate the population parameter θ is said to be an unbiased estimator of the population
parameter if the mean or expected value of the sampling distribution of the estimator equals the population parameter. i.e.
Mean of the statistic = parameter
Or, E(θ̂) = θ
The bias is given by bias (b) = E(θ̂) − θ.
If b = 0, the estimator is unbiased,
If b < 0, the estimator is negatively biased or biased downwards and
If b > 0, the estimator is positively biased or biased upwards.
In other words, if b > 0, θ̂ overestimates θ and if b < 0, θ̂ underestimates θ.
For example, the sample mean (x̄) is an unbiased estimate of the population mean (μ) whereas the sample variance (s², computed
with divisor n) is not an unbiased estimate of the true population variance, as E(x̄) = μ but E[s²] ≠ σ².
2. Efficiency ( Minimum Variance):
A sample statistic is called efficient if the variance of the sampling distribution of that statistic is smaller than any other statistic.
If there are two estimators and both are unbiased, then the one with a smaller variance is relatively efficient. If t1 and t2 are two
unbiased estimators of the parameter θ of a given population and the variance of t1 is less than the variance of t2, i.e. Var[t1] <
Var[t2], then the estimator t1 is a relatively more efficient estimator of θ than t2. For example, if a population is symmetrically
distributed, both the sample mean (x̄) and the sample median (md) are consistent estimators. However, the variance of the mean is
less than that of the median, i.e. Var(x̄) < Var[md]. Therefore, the sample mean is a more efficient estimator of the population
mean (μ) than the sample median (md).
The relative efficiency of the sample mean is given by:
Relative efficiency = Var(md)/Var(x̄)
The smaller variance always comes in the denominator so that the ratio is always greater than one.
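A simulation sketch suggests why the mean beats the median for normal data; the population parameters, sample size and trial count are arbitrary:

```python
import random, statistics

# Sampling variances of the mean and median of normal samples.
# The mean should show the smaller sampling variance.
random.seed(1)

means, medians = [], []
for _ in range(5000):
    sample = [random.gauss(50, 10) for _ in range(25)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
print(var_mean < var_median)            # True
print(round(var_median / var_mean, 2))  # relative efficiency, greater than 1
```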
3. Consistency:
A sample statistic is said to be a consistent estimator of the population parameter if it approaches the population parameter as
the sample size increases. That is, any statistic t is a consistent estimator of the parameter θ iff t → θ as n → ∞. Since the
sample mean and sample proportion approach the population mean and population proportion, the sample mean (x̄) and the sample
proportion (p) are consistent estimators of the population mean (μ) and the population proportion (P).
4. Sufficiency:
An estimator that utilizes all the information in the sample regarding the population parameter is called a sufficient estimator. That
is, an estimator t is said to be sufficient if it utilizes all the information in a sample relevant to the estimation of θ. For example,
the sample mean (x̄) is a sufficient estimator of the population mean (μ) because all the information in the sample is used while
computing it.
5. BLUE Property
An estimator is said to be BLUE if it is best, linear and unbiased. Here, the term best implies minimum variance, unbiased
implies that the mean value of the estimator equals the population parameter and linearity means the estimator is related to the
sample values in linear form. For example, the sample mean x̄ = (x1 + x2 + x3 + ...... + xn)/n is related to
the sample values in linear form. Since x̄ is unbiased and has the smallest variance, it is a BLUE estimator of the population
mean.
Besides these properties, we can have other properties like minimum MSE, robustness, etc.
15. Differences between Point Estimation and Interval Estimation
The main differences between point estimation and interval estimation can be enlisted as below:
1. Point estimation: A point estimate is a single value of the statistic that is used to estimate a population parameter. Here, we
calculate a numerical value of the statistic to assign it to the population parameter. E.g. if we calculate the sample mean (x̄) as 5
and conclude the population mean (μ) to be 5, it is point estimation.
Interval estimation: An interval estimate is a range of values within which the true population parameter is considered to lie with
a certain degree of confidence. E.g. if we find the range as [15, 25] and say that the true population mean lies in this interval
with 95% confidence, it is interval estimation.
2. Point estimation: It is not necessary to find an interval estimate to find a point estimate.
Interval estimation: It is necessary to find a point estimate to construct an interval estimate.
3. Point estimation: It is a single number on the real number scale, hence the name point estimate.
Interval estimation: It is a range of values/an interval, hence the name interval estimate.
4. Point estimation: It is an exact, precise, unique and accurate value.
Interval estimation: There is no precise and unique value but a range of values.
5. Point estimation: A point estimate is likely to differ from the true population parameter due to sampling fluctuations.
Interval estimation: An interval estimate is not likely to differ from the true population parameter due to sampling fluctuations.
6. Point estimation: It is less reliable because the level of significance is not taken into account. So it can be right or wrong and is
often insufficient.
Interval estimation: It is more reliable because here the level of significance is taken into account. So, in most of the cases, it
can't be wrong and is sufficient.
7. Point estimation: There is a very high degree of risk in obtaining it as it can be right or wrong.
Interval estimation: There is a low degree of risk in obtaining it.
8. Point estimation: It does not tell us how far the estimator is from the true parameter.
Interval estimation: It tells the magnitude of the distance of the estimator from the parameter being estimated.

Why interval estimation is more frequently used than point estimation?


A point estimate can be right or wrong as it is the single point. So in using a point estimate, we have to be very optimistic
that the sample statistic equals the population parameter which we want to estimate. So it is often insufficient. But, on the other
hand, interval estimate is sufficient as it takes the level of significance into account. So the degree of error in interval estimate
is very low. An interval estimate is more reliable as it indicates error in two ways:
- By the extent of its range
- The probability of the true population parameter lying within this range
Due to these merits of interval estimation over point estimation, interval estimation is more frequently used than point
estimation.
16. Hypothesis Testing
A hypothesis is a proposition whose validity is to be tested by drawing a sample from the population. Thus, it is a supposed
value of a parameter or a supposed relationship between two variables. For example, if we make a claim that the mean score
of the students is 75, it is a hypothesis.
The hypothesis is tested on the basis of the information drawn from the sample. The sample statistics may differ from the true
value of a population parameter due to sampling fluctuations or there may be a real difference between them. Hypothesis testing
is a procedure for testing whether there is a real difference between the statistic and the parameter or the difference is due to
sampling fluctuations only.
Procedure for Hypothesis Testing
Step 1: Formulation of a Hypothesis
Two types of hypotheses are made:
Null Hypothesis (H0): It states that there is no significant difference between the statistic and the parameter.
Alternative Hypothesis (H1): It is the counterpart of null hypothesis. It states that there is a significant difference between the
statistic and parameter.
Step 2: Level of Significance (α):
It shows the level of error we are going to tolerate while accepting or rejecting the null hypothesis. It is set at 5 percent unless
otherwise stated.
Step 3: Calculation of Test Statistic
In the third step, the test statistic is calculated. It may be z-statistic, t-statistic, F-statistic, chi-square statistic, DW statistic, etc.
The actual choice of the statistic depends on the nature of hypothesis being tested.
Step 4: Finding the Table Value/Critical Value
The table value of the test statistic is observed at α level of significance and with certain degrees of freedom (if any). This value
acts as a borderline between the acceptance region and the rejection region.
Step 5: Decision Making
If |calculated value of the statistic| < table value of the statistic, H0 is accepted.
If |calculated value of the statistic| > table value of the statistic, H0 is rejected and H1 is accepted.
(Note: the minus sign is ignored while comparing the calculated value to the table value)
17. Discuss the Procedure for Z-test
Z-test is a large sample test which can be used when the sample size is greater than 30 and the population under consideration
is normally distributed. It can be used for testing:
The significance of a single mean
The significance of difference between two means
The significance of single population proportion.
The significance of difference between two population proportions.
Below we illustrate the procedure of z-test for testing the significance of a single mean:
Step 1: Formulation of a Hypothesis
Null Hypothesis (H0): μ = μ0; i.e. the population mean is μ0.
Alternative Hypothesis (H1): μ ≠ μ0; i.e. the population mean is significantly different from μ0.
Step 2: Level of Significance (α):
It is set at 5 percent unless otherwise stated.
Step 3: Calculation of Test Statistic
Under H0, the Z-statistic is calculated as:
z = (x̄ − μ)/(σ/√n) ~ N(0, 1);
Where,
x̄ = sample mean
μ = population mean (claimed value)
σ = population standard deviation and
n = sample size
Step 4: Finding the Table Value/Critical Value
The table value of the z-statistic is observed at α level of significance from the normal probability table.
Step 5: Decision Making
If |z-calculated| < z-tabulated, H0 is accepted.
If |z-calculated| > z-tabulated, H0 is rejected and H1 is accepted.
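The five steps can be sketched with made-up numbers (μ0 = 75, x̄ = 77, σ = 8, n = 64 and the two-tailed critical value 1.96 are assumptions for illustration):

```python
import math

# Worked z-test for a single mean: H0: μ = 75, α = 5% (two-tailed).
mu0, x_bar, sigma, n = 75, 77, 8, 64

z_cal = (x_bar - mu0) / (sigma / math.sqrt(n))
z_tab = 1.96                      # critical value at α = 5%, two-tailed

print(round(z_cal, 2))            # 2.0
if abs(z_cal) > z_tab:
    print("Reject H0: the mean differs significantly from 75")
else:
    print("Accept H0")
```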
18. Discuss the Procedure for t-test
It is a small sample test which can be used when the sample size is less than 30 and the population under consideration is
normally distributed. It can be used for testing:
The significance of a single mean
The significance of difference between two means (Independent Population)
The significance of difference between two means (Dependent Case)
The significance of sample correlation coefficient.
The significance of regression coefficients.
Below we illustrate the procedure of t-test for testing the significance of a single mean:
Step 1: Formulation of a Hypothesis
Null Hypothesis (H0): μ = μ0; i.e. the population mean is μ0.
Alternative Hypothesis (H1): μ ≠ μ0; i.e. the population mean is significantly different from μ0.
Step 2: Level of Significance (α):
It is set at 5 percent unless otherwise stated.
Step 3: Calculation of Test Statistic
Under H0, the t-statistic is calculated as:
t = (x̄ − μ)/(S/√n) ~ t(n−1);
Where,
x̄ = sample mean
μ = population mean (claimed value)
S² = Σ(x − x̄)²/(n − 1) is an unbiased estimate of the population variance and
n = sample size
Step 4: Finding the Table Value/Critical Value
The table value of the t-statistic is observed at α level of significance and n−1 degrees of freedom from the t-table.
Step 5: Decision Making
If |t-calculated| < t-tabulated, H0 is accepted.
If |t-calculated| > t-tabulated, H0 is rejected and H1 is accepted.
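The same steps can be sketched on a small made-up sample (the data, the claimed mean μ0 = 50 and the table value 2.365 for 7 d.f. are illustrative assumptions):

```python
import math, statistics

# Worked t-test for a single mean: H0: μ = 50, α = 5% (two-tailed), n = 8.
data = [48, 52, 47, 51, 49, 46, 50, 48]
n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)                  # uses the unbiased (n-1) divisor

t_cal = (x_bar - 50) / (s / math.sqrt(n))
t_tab = 2.365                               # t-table value at α = 5%, 7 d.f.

print(round(t_cal, 3))
print("Reject H0" if abs(t_cal) > t_tab else "Accept H0")   # Accept H0
```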

19. Discuss the procedure of χ² test.
χ²-test was first developed by Karl Pearson in 1900. It explains the magnitude of discrepancy between expected frequency and
observed frequency. So, χ² is often used to test the differences between theory and observation. The test statistic is given by:
χ² = Σ(O − E)²/E
Where, O = observed frequency and E = expected frequency.
χ² is a non-negative quantity. Hence, its value ranges from zero to infinity. If χ² is zero, it signifies that the discrepancy
between observed and expected frequencies vanishes. In addition, if the χ² value increases, the discrepancy between the
expected and observed frequencies goes up. So, the chi-square test is performed to know whether the differences between the
observed and expected frequencies are significant or only due to sampling fluctuations.
Two uses of χ² test:
a) To test the Goodness of Fit
b) To test the Independence of Attributes
a) Procedure of χ² test when it is used to test the goodness of fit:
Step 1 Setting Hypothesis:
H0: There is no statistically significant difference between the observed and expected frequencies.
H1: There is a statistically significant difference between the observed and expected frequencies.
Step 2 Level of significance:
α is taken at 5% unless otherwise stated.
Step 3 Computation of Test statistic:
We use the formula as:
χ² = Σ(O − E)²/E
Step 4 Writing tabulated value of χ² from the given table
Step 5 Decision making:
If the calculated value of χ² is less than the tabulated value at a given level of significance and degrees of freedom,
we accept H0 and consider the fit as good. Hence, there is no statistically significant difference between the observed
and expected frequencies.
If the calculated value of χ² is greater than the tabulated value at a given level of significance and degrees of
freedom, we reject H0 and the fit is considered to be poor. Hence the discrepancy between observed and expected
frequencies is not due to sampling fluctuations but due to the inadequacy of the theory to fit the observed facts.
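A worked sketch of the goodness-of-fit procedure, using hypothetical die-roll counts (the data and the table value 11.07 for 5 d.f. are illustrative assumptions):

```python
# Goodness-of-fit test: H0: the die is fair, so each face is expected
# 60 times in 360 rolls. Observed counts are made up.

observed = [55, 65, 60, 70, 50, 60]
expected = [60] * 6

chi2_cal = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_tab = 11.07                  # χ² table value at α = 5%, 5 d.f.

print(round(chi2_cal, 2))         # 4.17
print("Good fit" if chi2_cal < chi2_tab else "Poor fit")   # Good fit
```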
b) Procedure of χ² test as a test of independence of attributes:
χ² test for independence is used to test whether attributes are related or independent. In this test, the attributes are classified
into a two way table or a contingency table. The observed frequency in each cell is known as cell frequency. The total frequency
in each row or column of the two way contingency table is known as marginal frequency. This test shows whether there is any
association or relationship between two or more attributes.
The procedure of testing independence of attributes is as below:
Step 1: Formation of Hypothesis:
H0: Two attributes say A and B are independent i.e. there is no relationship between A and B.
H1: Two attributes (i.e. two categorical variables) A and B are dependent i.e. an association exists between A and B.
Step 2: Level of Significance (α):
α is taken at 5% unless otherwise stated.
Step 3: Computation of the test statistic:
Under H0, χ² is computed as;
χ² = Σ(O − E)²/E
Where, O = Observed frequency (cell frequency), E = expected frequency in a cell.
The expected frequency in a cell is given by;
E = (RT × CT)/N
Where, RT=row total
CT=column total
N=grand total (total sample size)
Step 4: Writing the Table value
The degrees of freedom v for an r × c contingency table are (r−1)(c−1), i.e. v = (r−1)(c−1), where r × c = total number
of cell frequencies. Then the tabulated or critical value of χ² at α level of significance for (r−1)(c−1) d.f. is
obtained. The most commonly used level of significance α is 5%.
Step 5 Make decision:
The decision is made by comparing the calculated value of χ² (i.e. χ²cal) with the tabulated value of χ² (i.e. χ²tab):
If χ²cal < χ²tab; H0 is accepted.
If χ²cal > χ²tab; H0 is rejected (i.e. H1 is accepted).
20. Discuss the procedure of F-test.
F-test is applicable to test the difference between the variances of two independent normal populations. It can also be used to
test whether the two samples are taken from normal populations having the same variance or not.
F-test is especially used to test:
The equality of population variances
The equality of several population means (ANOVA test)
The significance of an observed simple or multiple correlation
The overall significance of the regression line
Assumptions of F-test are:
Normality: The values in each group are normally distributed.
Homogeneity: The variance within each group should be equal for all groups; i.e. σ1² = σ2² = ...... = σn²
All sample observations are independent and are based on random sampling technique
Independence of errors: The variation of each value around its group mean should be independent of the variation of
every other value.
Since the F distribution is always formed by a ratio of squared values, it is always positive
Procedure for F-test:
Step 1 Setting Hypothesis:
Null Hypothesis (H0): σ1² = σ2², i.e., the two population variances are the same.
Alternative Hypothesis (H1): σ1² ≠ σ2², i.e., the two population variances differ significantly.
Step 2 Level of significance:
We set α = 5% unless otherwise stated.
Step 3 Computation of the test statistic:
F = S1²/S2² ~ F(ν1, ν2) if S1² > S2²
F = S2²/S1² ~ F(ν1, ν2) if S2² > S1²
Where, ν1 = n1 − 1 and ν2 = n2 − 1 are the degrees of freedom and
Si² = [1/(ni − 1)] Σ(Xi − X̄i)² are the sample variances.
Step 4 Writing the critical or tabulated Value:
Determine the tabulated F value at α% level of significance for (ν1, ν2) = (n1 − 1, n2 − 1) degrees of freedom from the given table.
Step 5 Decision Making:
Compare the calculated and tabulated values of the F statistic. If the calculated F-value is smaller than the tabulated F(ν1, ν2),
accept H0, otherwise reject H0.
21. Discuss the procedure of DW Test. When do you use DW test?
DW test or Durbin-Watson test is used for detecting the presence of serial correlation (autocorrelation) in the error terms
of a regression model. In short, the DW test is used for testing autocorrelation or serial correlation. Autocorrelation refers to
the relationship, not between two (or more) different variables, but between the successive values of the same variable. In the
presence of autocorrelation, the ordinary least square (OLS) estimators are still linear and unbiased as well as consistent and
asymptotically normally distributed, but they are no longer efficient (i.e. they do not have minimum variance). Therefore,
detection of the presence of autocorrelation is an important task, and the DW test is used to fulfill this task.
This test is appropriate only for the first-order autoregressive scheme.
Procedure of DW Test:
The following steps are used for testing the autocorrelation or serial correlation:
Step 1: Hypothesis Formulation
H0: ρ = 0, i.e. the error term is not auto-correlated.
H1: ρ ≠ 0, i.e. the error term is auto-correlated.
Step 2 Level of significance
We use α = 5% unless otherwise stated.
Step 3 Computation of the test statistics
The Durbin-Watson d statistic is defined as:
d = Σ(i=2 to n) (ûi − ûi−1)² / Σ(i=1 to n) ûi²
In case of time series data, the formula can be expressed as:
d = Σ(t=2 to n) (ût − ût−1)² / Σ(t=1 to n) ût²
Where, ûi is the error term calculated from the regression as ûi = Yi − Ŷi (actual Y minus estimated Y).
Assumptions underlying the d statistic
The regression model includes the intercept term. If it is not present, as in the case of regression through the
origin, it is essential to rerun the regression including the intercept term.
The explanatory variables, the X's, are non-stochastic, or fixed in repeated sampling.
The disturbances ui are generated by the first-order autoregressive scheme: ui = ρui−1 + εi.
The error term ui is assumed to be normally distributed.
The regression model does not include the lagged value(s) of the dependent variable as one of the explanatory
variables.
There are no missing observations in the data.
Step 4 Writing the Table Value
For the given sample size and given number of explanatory variables, the critical dL and dU values at the appropriate
level of significance are written.
Step 5 Decision Making
If 0 < dcal < dL, H0 is rejected and it is concluded that there is positive autocorrelation.
If dL ≤ dcal ≤ dU, no decision.
If dU < dcal < 4 − dU, H0 is accepted, and the conclusion is that there is no autocorrelation.
If 4 − dU ≤ dcal ≤ 4 − dL, no decision, and
If 4 − dL < dcal < 4, H0 is rejected and the conclusion is that there is negative autocorrelation.

22. Write a note on Partial Correlation. List out its relative merits and demerits.
Partial correlation is also called net correlation and to some extent, it is the reverse of multiple correlation. It seeks to
measure the relationship between one dependent variable and one particular independent variable acting separately, keeping the
effect of all other independent variables theoretically eliminated or removed. In other words, it aims at measuring the degree of
association between dependent variable and a single independent variable in a universe unaffected by variations in other specified
independent variables.
A partial correlation coefficient seeks to answer the question: what is the relationship between, say, Y and X1 keeping the
other Xi's constant? It is denoted by r12.345...n and read as the partial correlation coefficient between X1 and X2 keeping the
effect of X3, X4, ..., Xn constant. For simplicity, if we suppose that there are only three variables: X1, X2 and X3, then the three
partial correlation coefficients are defined in terms of simple correlation coefficients and are given by:
r12.3 = partial correlation coefficient between X1 and X2 keeping the effect of X3 constant
      = (r12 − r13·r23)/√[(1 − r13²)(1 − r23²)],
r23.1 = partial correlation coefficient between X2 and X3 keeping the effect of X1 constant
      = (r23 − r12·r13)/√[(1 − r12²)(1 − r13²)], and
r13.2 = partial correlation coefficient between X1 and X3 keeping the effect of X2 constant
      = (r13 − r12·r23)/√[(1 − r12²)(1 − r23²)],

Where,
r12 = simple correlation coefficient between X1 and X2,
r23 = simple correlation coefficient between X2 and X3,
r13 = simple correlation coefficient between X1 and X3.
The values of partial correlation coefficients lie between −1 and +1, i.e. −1 ≤ r12.3, r23.1, r13.2 ≤ 1
Further, partial correlation coefficients are always interpreted through the coefficient of partial determination, which shows the
proportion of unexplained variance in one variable that is explained by the additional influence of the variable not being held
constant. For example:
r²12.3 = (Extra variation in X1 explained by the additional influence of X2)/(Variation in X1 unexplained by X3 alone)
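The formula for r12.3 can be evaluated numerically; the simple correlation coefficients below are hypothetical values chosen for illustration:

```python
import math

# Partial correlation between X1 and X2, holding X3 constant,
# computed from the three simple correlation coefficients.

def partial_r(r12, r13, r23):
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

r12, r13, r23 = 0.7, 0.6, 0.5      # hypothetical simple correlations
r12_3 = partial_r(r12, r13, r23)
print(round(r12_3, 3))             # 0.577
```

Note that removing the common influence of X3 shrinks the apparent association between X1 and X2 from 0.7 to about 0.58.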

Merits of Partial Correlation:


In reality any phenomenon is affected by a multiplicity of factors. For example, production of wheat is affected by
amount of fertilizer, amount of rainfall, quality of seeds, temperature etc. With the help of partial correlation, we can
determine the degree of relationship between one dependent variable and one particular independent variable keeping
the influence of other variables constant.
It is especially useful in the analysis of interrelated series. It is pertinent to uncontrolled experiments of various kinds
in which such relationship usually exists. Most economic data fall in this category.

Demerits
The zero-order correlations must be based on linear regression.
The effects of the independent variables must be additive and not jointly related.
It has laborious calculations and difficult interpretations even for the statisticians.
Its reliability decreases as the order (number of variables kept constant) increases.

23. Write a note on multiple correlation. List out its relative merits and demerits
Multiple correlation is an extension of partial correlation. It is a measure of relationship between dependent variable
and one another variable that is a combination of all other independent variables. It measures the degree to which variations in
dependent variable are related to the combined effect of all other independent variables. Thus, multiple correlation coefficient
measures the degree of association between dependent variable and one another variable that includes the combined effect of all
other independent variables. The multiple correlation coefficient is denoted by R1.23...n, which measures the degree of relationship
between the variable X1 and a combination of all the other independent variables X2, X3, ..., Xn. In case of three variables only:
X1, X2, and X3, the three multiple correlation coefficients are given by:
R1.23 = multiple correlation coefficient between X1 and a combination of X2 and X3
      = √[(r12² + r13² − 2·r12·r13·r23)/(1 − r23²)]
R2.31 = multiple correlation coefficient between X2 and a combination of X3 and X1
      = √[(r23² + r12² − 2·r12·r13·r23)/(1 − r13²)]
R3.12 = multiple correlation coefficient between X3 and a combination of X1 and X2
      = √[(r13² + r23² − 2·r12·r13·r23)/(1 − r12²)]
Where,
r12 = simple correlation coefficient between X1 and X2,
r23 = simple correlation coefficient between X2 and X3,
r13 = simple correlation coefficient between X1 and X3.
The values of multiple correlation coefficients always lie between zero and one, i.e. 0 ≤ R1.23, R2.31, R3.12 ≤ 1
Further, the values of multiple correlation coefficients are always interpreted through the coefficients of multiple determination
which show the proportion of explained variation in dependent variable that is explained by the joint effect of all independent
variables.
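Likewise, R1.23 and the coefficient of multiple determination can be evaluated for hypothetical simple correlations:

```python
import math

# Multiple correlation coefficient between X1 and a combination of X2 and X3,
# computed from the three simple correlation coefficients.

def multiple_R(r12, r13, r23):
    num = r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23
    return math.sqrt(num / (1 - r23 ** 2))

r12, r13, r23 = 0.7, 0.6, 0.5      # hypothetical simple correlations
R1_23 = multiple_R(r12, r13, r23)
print(round(R1_23, 3))             # 0.757
print(round(R1_23 ** 2, 3))        # 0.573: share of variation in X1 explained jointly
```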
Merits
It helps us to find the degree of association between a dependent variable and a group of variables taken as independent
variables. Many economic phenomena are affected by a multiplicity of factors, so multiple correlation helps to
determine the combined or joint effect of all cause variables on the dependent variable.
It serves as a measure of goodness of fit of the regression plane/line.
Demerits
It is based on linear relationship. So, linear regression coefficients are not accurately descriptive of curvilinear data.
There exists a possibility of misinterpretation.
It is limited by the assumption that the effects of the independent variables on the dependent variable are separate,
distinct and additive.

24. Distinguish between Partial and Multiple Correlation.


Correlation means simply the nature and degree of relationship among variables. When we measure the degree of relationship
between two variables only, it is called simple correlation and when we probe into the analysis of relationship among more than
two variables, we are led to multiple and partial correlations.
Partial Correlation
Partial correlation is the study of relationship between one dependent variable (say X1) and one independent variable (say
X2) while the influence of all other independent variables (X3, X4, ..., Xn) is theoretically held constant. It is denoted by
r12.345...n.
For example, when we study the relationship between wheat yield (X1) and rainfall (X2) while keeping the influence of
fertilizer (X3), quality of seed (X4) etc. constant, it is partial correlation.
Multiple Correlation
Multiple correlation is the study of relationship between one dependent variable (say X1) and one other variable which
is a combination of all other independent variables. So, in multiple correlation, we study the relationship between one dependent
variable and all other independent variables taken simultaneously. It is denoted by R1.234...n.
For example, if we study the relationship between wheat yield (X1) and another variable which is the combination of
fertilizer (X3), rainfall (X2), quality of seed (X4) etc., it is multiple correlation.
The main difference between partial correlation and multiple correlation can be summarized in the following table:

Partial Correlation Multiple Correlation


1. Partial correlation is the study of degree of 1. Multiple correlation is the study of degree of relationship
relationship between one dependent variable and one between one dependent variable and one another variable which is
independent variable, keeping the influence of all other the combination of all other independent variables.
independent variables constant.

2. The purpose of partial correlation is to show the relative importance of the individual independent variables on the dependent variable. The purpose of multiple correlation is to determine the efficiency with which two or more variables jointly predict performance in a particular setting.

3. In partial correlation, some variables are held constant. In multiple correlation, no variables are held constant.

4. The partial correlation coefficients are of zero order, first order, second order, etc., according to the number of variables held constant. Since no variables are kept constant, multiple correlation coefficients are always of zero order.

5. In the case of three variables x1, x2 and x3, the three partial correlation coefficients are:
   r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]  (between x1 and x2, keeping the influence of x3 constant)
   r23.1 = (r23 − r12 r13) / √[(1 − r12²)(1 − r13²)]  (between x2 and x3, keeping the influence of x1 constant)
   r13.2 = (r13 − r12 r23) / √[(1 − r12²)(1 − r23²)]  (between x1 and x3, keeping the influence of x2 constant)
   where r12 = simple correlation coefficient between x1 and x2, r23 = simple correlation coefficient between x2 and x3, and r31 (= r13) = simple correlation coefficient between x3 and x1.
   Similarly, the three multiple correlation coefficients are:
   R1.23 = √[(r12² + r13² − 2 r12 r23 r31) / (1 − r23²)]  (between x1 and a combination of x2 and x3)
   R2.31 = √[(r23² + r12² − 2 r12 r23 r31) / (1 − r31²)]  (between x2 and a combination of x1 and x3)
   R3.12 = √[(r13² + r23² − 2 r12 r23 r31) / (1 − r12²)]  (between x3 and a combination of x1 and x2)

6. The values of the partial correlation coefficients lie between −1 and +1, i.e. −1 ≤ r12.3 ≤ +1, −1 ≤ r23.1 ≤ +1, −1 ≤ r31.2 ≤ +1. The values of the multiple correlation coefficients lie between 0 and 1, i.e. 0 ≤ R1.23 ≤ 1, 0 ≤ R2.31 ≤ 1, 0 ≤ R3.12 ≤ 1.

7. The partial correlation coefficients are interpreted through the coefficient of partial determination, calculated by squaring the partial correlation coefficients. The multiple correlation coefficients are interpreted through the coefficient of multiple determination, calculated by squaring the multiple correlation coefficients.

8. The reliability of the partial correlation coefficients is given by (1 − r12.3²)/√(N − M), etc., and the reliability of the multiple correlation coefficients by (1 − R1.23²)/√(N − M), etc., where M is the number of variables and N is the number of observations.
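The three-variable formulas above can be checked numerically. The following is a minimal sketch; the values chosen for the simple correlation coefficients are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Assumed simple correlation coefficients among x1, x2, x3 (illustrative only).
r12, r13, r23 = 0.8, 0.6, 0.5
r31 = r13  # r31 and r13 denote the same simple correlation

# First-order partial correlation between x1 and x2, holding x3 constant.
r12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

# Multiple correlation between x1 and the combination of x2 and x3.
R1_23 = np.sqrt((r12**2 + r13**2 - 2 * r12 * r23 * r31) / (1 - r23**2))
```

Note that the multiple correlation R1.23 is never smaller than the simple correlation r12, since adding a regressor cannot reduce the correlation of the combination with x1.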
25. What do you mean by Dummy variables? What are the conditions under which they can be used in regression
analysis?
Solution:
In regression analysis the dependent variable is affected not only by quantitative variables but also by variables like sex, religion, party affiliation, etc., which cannot be measured quantitatively. These variables are often of the 'yes or no' type: the observation either possesses the characteristic or does not. One way to quantify such attributes is to construct artificial variables that take the value 1 if the characteristic is present and 0 if it is not. Variables that assume such 0 and 1 values are called dummy variables, qualitative variables, categorical variables, binary variables or switching variables. In short, a dummy variable is an artificial variable constructed so that it takes the value unity whenever the phenomenon it represents occurs and zero otherwise.
Dummy variables can be easily incorporated in the regression model. Let us consider the following model:
Y = a + bD + U ..........(i)
Where, Y = the wages of labor,
D = 1 if skilled
= 0 otherwise (i.e. unskilled)
Equation (i) is a simple regression equation with dummy variable (D) as an explanatory (independent) variable. U is the random
error term that satisfies all the assumptions of ordinary least squares (OLS).
Then, the regression line can be estimated by using the normal equations of the least squares method.
The regression results are interpreted as:
Mean wage of skilled workers = E(Y | D = 1) = a + b (putting D = 1)
Mean wage of unskilled workers = E(Y | D = 0) = a (putting D = 0)
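The interpretation above can be verified with a small numerical sketch, assuming hypothetical wage data (the figures are illustrative, not from the text) and using NumPy's least-squares routine:

```python
import numpy as np

# Hypothetical wages: first four workers skilled (D = 1), last four unskilled (D = 0).
wages = np.array([30.0, 32.0, 31.0, 33.0, 20.0, 22.0, 21.0, 23.0])
D     = np.array([1.0,  1.0,  1.0,  1.0,  0.0,  0.0,  0.0,  0.0])

# OLS fit of Y = a + b*D: a column of ones for the intercept, then the dummy.
X = np.column_stack([np.ones_like(wages), D])
a, b = np.linalg.lstsq(X, wages, rcond=None)[0]

mean_unskilled = a       # E(Y | D = 0) = a
mean_skilled   = a + b   # E(Y | D = 1) = a + b
```

With a single dummy regressor, the OLS estimates reproduce the two group means exactly, which is precisely the interpretation given above.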
Some points to Remember:
In general, the dummy explanatory variables are denoted by the symbol D rather than the usual symbol X to
emphasize that we are dealing with qualitative variables.
If a qualitative variable has m categories, introduce only (m − 1) dummy variables. In other words, for each
qualitative regressor the number of dummy variables introduced must be one less than the number of categories
of that variable.
The category for which no dummy variable is assigned (or which takes the value zero) is known as the base,
benchmark, control, comparison, reference or omitted category. And all comparisons are made in relation to the
benchmark category.
The intercept value a represents the mean value of the benchmark category.
It is possible to introduce as many dummy variables as the number of categories of the variable provided we do
not introduce the intercept term in such a model.
The coefficients attached to dummy variables are known as the differential intercept coefficients.
Conditions under which dummy variables can be used as regressors are as follows:
1. The first condition for using a dummy variable is that at least one variable in the model is qualitative, i.e. the
   variable either possesses or does not possess the characteristic. For example, suppose we have a sample of family
   budgets from all regions of the country, rural and urban, and we want to estimate the demand for tobacco
   manufactures as a function of income. It is known that town dwellers are heavier smokers than farmers, so
   'region' is an important explanatory factor in this case. The region may be represented by a dummy variable: we
   might assign, arbitrarily, the value 1 for a town dweller and 0 for a person living in a rural area.
2. Another condition for the use of a dummy variable regression model is that the model involves not only
   quantitative variables but also qualitative variables. When the independent variables are all qualitative in nature,
   such models are called analysis of variance (ANOVA) models. On the contrary, regression models containing a
   mixture of quantitative and qualitative variables are called analysis of covariance (ANCOVA) models.
3. Regression models with dummy variables are also used when there is a shift of a function over time, or a change
   in the parameters (slopes) of the function over time. A shift of a function implies that the constant intercept
   changes in different periods while the other coefficients remain constant; such a shift may be taken into account
   by introducing dummy variables into the function.
4. The dummy variable regression model is also used when one wants to remove seasonal variations in a series.
   The process of removing the seasonal component from a time series is known as deseasonalization or seasonal
   adjustment, and dummy variables provide one method of deseasonalizing a time series.
5. The technique of dummy variables is also used when there are pooled and panel data. Panel data consist of
   observations on the same cross-section, or individual, units over several time periods.
6. Dummy variables are also used as proxies for quantitative factors when no observations on these factors are
   available or when it is convenient to do so.
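Condition 4 (deseasonalization) can be sketched as follows. The quarterly series below is a hypothetical construction for illustration; quarter 4 serves as the benchmark (omitted) category, so only m − 1 = 3 dummies are introduced alongside the intercept and trend:

```python
import numpy as np

# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern.
n_years = 5
t = np.arange(4 * n_years)
season = np.tile([5.0, -2.0, 1.0, -4.0], n_years)  # one value per quarter
y = 0.5 * t + season

# Quarter of each observation; quarter 4 (index 3) is the benchmark category.
quarter = t % 4
D1 = (quarter == 0).astype(float)
D2 = (quarter == 1).astype(float)
D3 = (quarter == 2).astype(float)

# Regress y on intercept, trend and the three seasonal dummies.
X = np.column_stack([np.ones_like(y), t, D1, D2, D3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The dummy part of the fit is the seasonal component; subtracting it
# deseasonalizes the series, leaving only intercept plus trend.
fitted_seasonal = X[:, 2:] @ coef[2:]
y_deseasonalized = y - fitted_seasonal
```

The coefficients on D1, D2, D3 are the differential intercepts relative to the benchmark quarter, matching the point made above about the benchmark category.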

26. Distinguish between R² and R̄². Why do economists often report R² rather than R̄²?
Solution:
R-square (R²), known as the coefficient of multiple determination, shows the percentage of variation in the dependent variable that has been explained by the independent variables, i.e. the goodness of fit of the regression line/plane. It is defined as

R² = Explained variation / Total variation = ESS / TSS

where ESS = explained sum of squares and TSS = total sum of squares. Equivalently,

R² = 1 − RSS / TSS

where RSS = residual sum of squares (unexplained variation).

On the other hand, R-bar square (R̄²), known as the adjusted coefficient of multiple determination, is calculated by taking the degrees of freedom into consideration. Thus

R̄² = 1 − (RSS / d.f. for RSS) / (TSS / d.f. for TSS) = 1 − [RSS/(n − k)] / [TSS/(n − 1)]

where d.f. = degrees of freedom, n = number of observations and k = number of parameters.
Though both R² and R̄² are used for measuring the goodness of fit of the regression plane, there are some differences between them, listed below:

1. R² is known as the unadjusted coefficient of multiple determination; R̄² is known as the adjusted coefficient of multiple determination.

2. R² is defined as the ratio of explained variation (ESS) to total variation (TSS):
   R² = Explained variation / Total variation = ESS/TSS = Σ(Ŷ − Ȳ)² / Σ(Y − Ȳ)² = 1 − RSS/TSS = 1 − Σu²/Σy²
   R̄² is defined with degrees-of-freedom corrections:
   R̄² = 1 − (unexplained variation / d.f. for unexplained variation) / (total variation / d.f. for total variation)
      = 1 − [RSS/(n − k)] / [TSS/(n − 1)] = 1 − [Σu²/(n − k)] / [Σy²/(n − 1)]
   where n − k = degrees of freedom for unexplained variation, n − 1 = degrees of freedom for total variation, and k = number of parameters including the intercept term.

3. In the case of two independent variables X1 and X2, R² is calculated as
   (a) for non-deviated values: R² = (a ΣY + b1 ΣX1Y + b2 ΣX2Y − nȲ²) / (ΣY² − nȲ²)
   (b) for deviated values: R² = (b1 Σx1y + b2 Σx2y) / Σy²
   while R̄² is calculated as
   R̄² = 1 − [(ΣY² − a ΣY − b1 ΣX1Y − b2 ΣX2Y)/(n − 3)] / [(ΣY² − nȲ²)/(n − 1)]

4. The value of R² lies between 0 and 1, i.e. 0 ≤ R² ≤ 1. If R² = 0, the regression line does not explain any variation in Y; if R² = 1, it explains 100% of the variation in Y. The higher the value of R², the higher the percentage of explained variation and the better the goodness of fit of the regression line. The value of R̄² also lies between 0 and 1 (0 ≤ R̄² ≤ 1), but R̄² is less than R², i.e. R̄² ≤ R², and in some cases it can take negative values, in which case it is interpreted as being zero. This implies that the adjusted coefficient of multiple determination increases less than the unadjusted coefficient of multiple determination as the number of independent variables increases.

5. R² cannot be negative, i.e. 0 ≤ R² ≤ 1. R̄², however, can take negative values: from the relation
   R̄² = 1 − (1 − R²)(n − 1)/(n − k),
   when R² = 1, R̄² = 1, but when R² = 0, R̄² = 1 − (n − 1)/(n − k), which is negative for k > 1.

6. In R², degrees of freedom are not taken into account when additional explanatory variables are introduced, so R² almost invariably increases and never decreases as the number of independent variables is increased. In R̄², degrees of freedom are taken into account when additional explanatory/independent variables are introduced into the model.

The basic relation between R² and R̄² is

R̄² = 1 − (1 − R²) (n − 1)/(n − k)

where n = total number of observations and k = number of parameters including the intercept term.
Thus for fixed (small) k, as n → ∞, (n − 1)/(n − k) → 1, so that

R̄² = 1 − (1 − R²) · 1, i.e. R̄² = R²

This implies that R̄² tends to equal R² as the number of observations increases indefinitely. In other words, the difference between R² and R̄² gets smaller as the number of observations increases.
As economists are social scientists, they study human behaviour. But to forecast human behaviour, a large number of observations is required, because experiments on humans are not fully replicable due to human motives. Therefore, economists usually collect a large number of observations for use in the regression model. In such a case the difference between R² and R̄² is negligible, so they prefer to report R² rather than R̄²: the calculation of R̄² is somewhat more involved than that of R², while both give essentially the same result for large n.
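The shrinking gap between R² and R̄² as n grows can be illustrated with a small simulation. The data-generating process below is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def r2_and_adj(n, k=3):
    """Fit Y on an intercept plus (k-1) regressors; return (R2, adjusted R2)."""
    X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k - 1)])
    # True coefficients (length k = 3) plus unit-variance noise -- illustrative.
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - k)     # R-bar^2 = 1 - (1 - R^2)(n-1)/(n-k)
    return r2, adj

r2_small, adj_small = r2_and_adj(n=10)
r2_large, adj_large = r2_and_adj(n=10_000)

# Adjusted R^2 never exceeds R^2, and the gap shrinks as n grows.
gap_small = r2_small - adj_small
gap_large = r2_large - adj_large
```

The gap equals (1 − R²)(k − 1)/(n − k), which goes to zero as n increases, matching the relation derived above.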

27. What do you mean by the method of Least Squares? How do you estimate the regression coefficients by using this
method? Show it in the case of three independent variables.
Solution:
Regression analysis is a tool to establish the nature and structure of the relationship between the dependent variable and independent variable(s) and to provide a mechanism for prediction.
Least Square Method
According to the least squares method (OLS), the estimators of the population parameters are chosen in such a way that the sum of squared differences between the observed and estimated values of the dependent variable is minimized. That is, for the regression function

Ŷ = a + b1X1 + b2X2 + ... + bnXn

the sum Σu² = Σ(Y − Ŷ)² is to be a minimum, where
Y = observed value of the dependent variable
Ŷ = estimated value of the dependent variable

Estimation of Regression Coefficients in case of three independent variables
Let the regression function be

Y = a + b1X1 + b2X2 + b3X3 + U ..........(i)

where Y = dependent variable; X1, X2, X3 = independent variables; U = random disturbance term following the assumptions of the ordinary least squares method; and a, b1, b2 and b3 are parameters.
Our aim is to find the estimates of a, b1, b2 and b3 on the basis of the sample regression function

Y = a + b1X1 + b2X2 + b3X3 + u ..........(ii)

where a, b1, b2 and b3 now denote the sample estimates and u the residual.
The method of least squares requires that Σu² = Σ(Y − Ŷ)² be a minimum. Let S = Σu². Since Ŷ = a + b1X1 + b2X2 + b3X3, our problem reduces to minimizing

S = Σ(Y − Ŷ)² = Σ(Y − a − b1X1 − b2X2 − b3X3)²

Using differential calculus, the first-order conditions for a minimum are

∂S/∂a = 0, ∂S/∂b1 = 0, ∂S/∂b2 = 0, ∂S/∂b3 = 0

a) ∂S/∂a = ∂[Σ(Y − a − b1X1 − b2X2 − b3X3)²]/∂a = 0
or, −2 Σ(Y − a − b1X1 − b2X2 − b3X3)(1) = 0
or, Σ(Y − a − b1X1 − b2X2 − b3X3) = 0
or, ΣY − na − b1ΣX1 − b2ΣX2 − b3ΣX3 = 0
or, ΣY = na + b1ΣX1 + b2ΣX2 + b3ΣX3 ..........(iii)

b) ∂S/∂b1 = ∂[Σ(Y − a − b1X1 − b2X2 − b3X3)²]/∂b1 = 0
or, −2 Σ(Y − a − b1X1 − b2X2 − b3X3)(X1) = 0
or, ΣX1Y − aΣX1 − b1ΣX1² − b2ΣX1X2 − b3ΣX1X3 = 0
or, ΣX1Y = aΣX1 + b1ΣX1² + b2ΣX1X2 + b3ΣX1X3 ..........(iv)

c) ∂S/∂b2 = ∂[Σ(Y − a − b1X1 − b2X2 − b3X3)²]/∂b2 = 0
or, −2 Σ(Y − a − b1X1 − b2X2 − b3X3)(X2) = 0
or, ΣX2Y − aΣX2 − b1ΣX1X2 − b2ΣX2² − b3ΣX2X3 = 0
or, ΣX2Y = aΣX2 + b1ΣX1X2 + b2ΣX2² + b3ΣX2X3 ..........(v)

d) ∂S/∂b3 = ∂[Σ(Y − a − b1X1 − b2X2 − b3X3)²]/∂b3 = 0
or, −2 Σ(Y − a − b1X1 − b2X2 − b3X3)(X3) = 0
or, ΣX3Y − aΣX3 − b1ΣX1X3 − b2ΣX2X3 − b3ΣX3² = 0
or, ΣX3Y = aΣX3 + b1ΣX1X3 + b2ΣX2X3 + b3ΣX3² ..........(vi)

Equations (iii), (iv), (v) and (vi) are called the normal equations for finding the values of the estimators a, b1, b2 and b3. The normal equations are:

ΣY = na + b1ΣX1 + b2ΣX2 + b3ΣX3
ΣX1Y = aΣX1 + b1ΣX1² + b2ΣX1X2 + b3ΣX1X3
ΣX2Y = aΣX2 + b1ΣX1X2 + b2ΣX2² + b3ΣX2X3
ΣX3Y = aΣX3 + b1ΣX1X3 + b2ΣX2X3 + b3ΣX3²
Writing the equations in matrix form,

[ n      ΣX1     ΣX2     ΣX3   ] [ a  ]   [ ΣY   ]
[ ΣX1    ΣX1²    ΣX1X2   ΣX1X3 ] [ b1 ] = [ ΣX1Y ]
[ ΣX2    ΣX1X2   ΣX2²    ΣX2X3 ] [ b2 ]   [ ΣX2Y ]
[ ΣX3    ΣX1X3   ΣX2X3   ΣX3²  ] [ b3 ]   [ ΣX3Y ]

i.e. A·β = B, where A is the 4×4 matrix of sums, β = (a, b1, b2, b3)' and B is the right-hand vector. Hence

β = A⁻¹·B ..........(*)

Finding A⁻¹
Let the determinant of A be D and the cofactors be C11, C12, C13, C14, C21, ..., C44.

The cofactor matrix is:
[ C11 C12 C13 C14 ]
[ C21 C22 C23 C24 ]
[ C31 C32 C33 C34 ]
[ C41 C42 C43 C44 ]

and the adjoint matrix (its transpose) is:
[ C11 C21 C31 C41 ]
[ C12 C22 C32 C42 ]
[ C13 C23 C33 C43 ]
[ C14 C24 C34 C44 ]

Thus,
A⁻¹ = (1/D) × Adjoint of A
Substituting these values in equation (*), we have

[ a  ]         [ C11 C21 C31 C41 ] [ ΣY   ]
[ b1 ] = (1/D) [ C12 C22 C32 C42 ] [ ΣX1Y ]
[ b2 ]         [ C13 C23 C33 C43 ] [ ΣX2Y ]
[ b3 ]         [ C14 C24 C34 C44 ] [ ΣX3Y ]

so that

a  = (C11 ΣY + C21 ΣX1Y + C31 ΣX2Y + C41 ΣX3Y)/D
b1 = (C12 ΣY + C22 ΣX1Y + C32 ΣX2Y + C42 ΣX3Y)/D
b2 = (C13 ΣY + C23 ΣX1Y + C33 ΣX2Y + C43 ΣX3Y)/D
b3 = (C14 ΣY + C24 ΣX1Y + C34 ΣX2Y + C44 ΣX3Y)/D
The END
Good Luck
