
BUSINESS STATISTICS

PROJECT ON

CORRELATION

NAME OF THE SUPERVISOR: T.S.AGHILLA

SUBMITTED BY

N.BAVITHRAN

BC0140018

B.Com LL.B (Hons)
CORRELATION-AN OVERVIEW

KEYWORDS: correlation, average, interpretation, coefficient, regression

INTRODUCTION

Correlation is a statistical technique that can show whether and how strongly pairs of variables
are related. For example, height and weight are related; taller people tend to be heavier than
shorter people. The relationship isn't perfect. People of the same height vary in weight, and you
can easily think of two people you know where the shorter one is heavier than the taller one.
Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and
their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much
of the variation in people's weights is related to their heights. Although this correlation is fairly
obvious, your data may contain unsuspected correlations. You may also suspect there are
correlations, but don't know which are the strongest. An intelligent correlation analysis can lead
to a greater understanding of your data1.

TECHNIQUES IN DETERMINING CORRELATION

There are several different correlation techniques. The Survey System's optional Statistics
Module includes the most common type, called the Pearson or product-moment correlation. The
module also includes a variation on this type called partial correlation. The latter is useful when
you want to look at the relationship between two variables while removing the effect of one or
two other variables. Like all statistical techniques, correlation is only appropriate for certain
kinds of data. Correlation works for quantifiable data in which numbers are meaningful, usually
quantities of some sort. It cannot be used for purely categorical data, such as gender, brands
purchased, or favorite color.2
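The distinction between the Pearson correlation and a first-order partial correlation can be sketched in a few lines of Python. This is an illustrative implementation only, not the Survey System module mentioned above; the helper names `pearson_r` and `partial_r` are invented for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_r(x, y, z):
    """Correlation of x and y after removing the effect of a third variable z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

For example, with x = [1, 2, 3, 4, 5] and y = [2, 4, 6, 8, 10], `pearson_r` returns 1 because y is an exact linear function of x, and `partial_r` stays 1 for any non-constant control variable z.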

Positive Correlation
Positive correlation occurs when an increase in one variable is accompanied by an
increase in the other. The line corresponding to the scatter plot is an increasing line.

1
Armitage P, Berry G. Statistical Methods in Medical Research, 3rd edn. Oxford: Blackwell Scientific
Publications, 1994: 312-41
2
Paiva CE, Lima JP, Paiva BS: Articles with short titles describing the results are cited more
often. Clinics (Sao Paulo) 2012, 67: 509-513.
Negative Correlation
Negative correlation occurs when an increase in one variable decreases the
value of another. The line corresponding to the scatter plot is a decreasing line.
No Correlation:
No correlation occurs when there is no linear dependency between the variables.
Perfect Correlation
Perfect correlation occurs when there is a functional dependency between the
variables. In this case all the points are in a straight line.
Strong Correlation
A correlation is stronger the closer the points lie to the line.
Weak Correlation
A correlation is weaker the more widely the points are scattered about the line.
RESEARCH OBJECTIVE:

To get a clear-cut idea of the types of correlation and the methods of studying
correlation

To study the interpretation of the coefficient of correlation and rank correlation

To describe regression analysis

RESEARCH METHODOLOGY:

The researcher will be doing an analytical and qualitative study. Literature for this study has been
collected from various primary as well as secondary sources. Primary sources include judgments
of various High Courts and the Supreme Court and recommendations of various reports; secondary
sources include books of eminent authors.
CORRELATION

INTRODUCTION:

The term correlation is used by the common man without knowing that he is making use of it.
For example, when parents advise their children to work hard so that they may get good marks,
they are correlating good marks with hard work. The study of the characteristics of only one
variable, such as height, weight, age, marks or wages, is known as univariate analysis. The
statistical analysis of the relationship between two variables is known as bivariate analysis.
Sometimes the variables may be inter-related. In the health sciences we study the relationship
between blood pressure and age, consumption level of some nutrient and weight gain, total
income and medical expenditure, etc. The nature and strength of the relationship may be
examined by correlation and regression analysis. Thus correlation refers to the relationship of
two or more variables, e.g. the relation between height of father and son, yield and rainfall,
wage and price index, shares and debentures, etc. Correlation is a statistical analysis which
measures and analyses the degree or extent to which two variables fluctuate with reference to
each other.3 The word relationship is important: it indicates that there is some connection
between the variables, and correlation measures the closeness of that relationship. Correlation
does not by itself indicate a cause-and-effect relationship. Price and supply, income and
expenditure are correlated.

Definitions:

1. Correlation analysis attempts to determine the degree of relationship between variables. –
Ya-Kun-Chou

2. Correlation is an analysis of the covariation between two or more variables. – A.M. Tuttle

3
Stringer MJ, Sales-Pardo M, Nunes Amaral LA: Statistical validation of a global model for the
distribution of the ultimate number of citations accrued by papers published in a scientific journal. J
Am Soc Inf Sci Technol 2010, 61: 1377-1385
One variable is represented along the horizontal axis and the second variable along the vertical
axis.4 For each pair of observations of the two variables, we put a dot in the plane. There are as
many dots in the plane as the number of paired observations of the two variables. The direction
of the dots shows the scatter or concentration of the points. This will show the type of correlation.
1. If all the plotted points form a straight line from lower left hand corner to the upper right
hand corner then there is Perfect positive correlation. We denote this as r = +1
2. If all the plotted dots lie on a straight line falling from upper left hand corner to lower
right hand corner, there is a perfect negative correlation between the two variables. In this
case the coefficient of correlation takes the value r = -1.
3. If the plotted points in the plane form a band and they show a rising trend from the lower
left hand corner to the upper right hand corner the two variables are highly positively
correlated.5
4. If the points fall in a narrow band from the upper left hand corner to the lower right hand
corner, there will be a high degree of negative correlation.
5. If the plotted points in the plane are spread all over the diagram there is no correlation
between the two variables.
Types of Correlation:
Correlation is classified into various types. The most
important ones are
i) Positive and negative.
ii) Linear and non-linear.
iii) Partial and total.
iv) Simple and Multiple.
Positive and Negative Correlation:
It depends upon the direction of change of the variables. If the two variables tend to move
together in the same direction (ie) an increase in the value of one variable is accompanied by an

4
"Association Between Variables". N.p., 2016. 21 Sept. 2016.
5
P. Ahlgren, B. Jarneving and R. Rousseau (2003). Requirements for a cocitation similarity measure, with special
reference to Pearson’s correlation coefficient. Journal of the American Society for Information Science and
Technology 54(6), 550-560
increase in the value of the other, (or) a decrease in the value of one variable is accompanied by a
decrease in the value of other, then the correlation is called positive or direct correlation. Price
and supply, height and weight, yield and rainfall, are some examples of positive correlation.
If the two variables tend to move together in opposite directions so that increase (or) decrease in
the value of one variable is accompanied by a decrease or increase in the value of the other
variable, then the correlation is called negative (or) inverse correlation. Price and demand, yield
of crop and price, are examples of negative correlation.

Linear and Non-linear correlation:

If the ratio of change between the two variables is constant, the correlation between them is
linear.6

X 2 4 6 8 10 12
Y 3 6 9 12 15 18

Here the ratio of change between the two variables is the same. If we plot these points on a graph
we get a straight line. If the amount of change in one variable does not bear a constant ratio to
the amount of change in the other, the relation is called curvilinear (or non-linear) correlation,
and the graph will be a curve.
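The constant ratio in the table above can be checked with a trivial Python sketch (illustrative only):

```python
# X and Y values from the table above.
X = [2, 4, 6, 8, 10, 12]
Y = [3, 6, 9, 12, 15, 18]

# Every Y is 1.5 times its X, so the points lie on the straight
# line Y = 1.5X and the correlation is linear.
ratios = [y / x for x, y in zip(X, Y)]
print(ratios[0])  # 1.5
assert all(r == 1.5 for r in ratios)
```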

Simple and Multiple correlation:

When we study only two variables, the relationship is simple correlation. For example,
quantity of money and price level, demand and price. But in a multiple correlation we study
more than two variables simultaneously7. The relationship of price, demand and supply of a
commodity are an example for multiple correlation.

Partial and total correlation:

The study of two variables excluding some other variable is called Partial correlation.

6
H.D. White (2003). Author cocitation analysis and Pearson's r. Journal of the American Society for Information
Science and Technology 54(13), 1250-12
7
Aldrich, J. (1995). Correlations genuine and spurious in Pearson and Yule. Statistical Science, 10(4), 364–376.
For example, we study price and demand eliminating the supply side. In total correlation all the
facts are taken into account.8

Correlation Coefficient

Karl Pearson, a great biometrician and statistician, suggested a mathematical method for
measuring the magnitude of the linear relationship between two variables. It is the most widely
used method in practice and is known as the Pearsonian coefficient of correlation.9 It is denoted
by 'r'. The formulas for calculating 'r' are

(i) r = Cov(x, y) / (σx σy), where σx, σy are the S.D. of x and y respectively.

(ii) r = Σxy / (n σx σy), where x and y here denote deviations from the respective means.

(iii) r = ΣXY / √(ΣX² · ΣY²), where X = x − x̄ and Y = y − ȳ.

AD EXPENSES ANNUAL SALES

9 19
7 13
5 12
8 16
6 15
3 10
4 8

X Y XY X² Y²
9 19 171 81 361
7 13 91 49 169
5 12 60 25 144
8 16 128 64 256
6 15 90 36 225
3 10 30 9 100
4 8 32 16 64
42 93 602 280 1319

8
"Association Between Variables". N.p., 2016. 21 Sept. 2016.
9
Weldon, F.R.S. (1892). Certain correlated variations in Crangon vulgaris. Proceedings of the Royal Society of
London, 51, 1–21.

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

r = [7 × 602 − 42 × 93] / √{[7 × 280 − (42)²][7 × 1319 − (93)²]}

r = 308 / 338.33 = 0.910

P.E. = 0.6745 × (1 − r²) / √n

With n = 7 and r = 0.910, P.E. = ±0.04382 and 6 P.E. = ±0.2629. Since r far exceeds 6 P.E., the
correlation is significant.
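The computation above can be checked with a short Python sketch (an illustration, not part of the original solution):

```python
import math

# Advertisement expenses (X) and annual sales (Y) from the table above.
X = [9, 7, 5, 8, 6, 3, 4]
Y = [19, 13, 12, 16, 15, 10, 8]
n = len(X)

sx, sy = sum(X), sum(Y)
sxy = sum(a * b for a, b in zip(X, Y))
sxx = sum(a * a for a in X)
syy = sum(b * b for b in Y)

# r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# Probable error, P.E. = 0.6745 (1 − r²)/√n. The text rounds r to 0.910
# before computing P.E., which gives ±0.04382; full precision gives ≈0.0437.
pe = 0.6745 * (1 - r**2) / math.sqrt(n)

print(round(r, 3))  # about 0.910
```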

Spearman rank correlation

A plot of the data may reveal outlying points well away from the main body of the data, which
could unduly influence the calculation of the correlation coefficient.10 Alternatively the variables
may be quantitative discrete, such as a mole count, or ordered categorical, such as a pain score.
In such cases the observations can be replaced by their ranks, which results in a simple formula
for Spearman's rank correlation, rho:

ρ = 1 − 6Σd² / (n³ − n)

where d is the difference in the ranks of the two variables for a given individual and n is the
number of pairs. Thus we can derive table 11.2 from the data in the table.

Derivation of Spearman rank correlation from the data of the table

Child number   Rank height   Rank dead space   d      d²
1              1             3                 2      4
2              2             1                 -1     1
3              3             2                 -1     1
4              4             4                 0      0
5              5             5.5               0.5    0.25
6              6             11                5      25
7              7             7                 0      0
8              8             5.5               -2.5   6.25
9              9             8                 -1     1
10             10            13                3      9
11             11            10                -1     1
12             12            9                 -3     9
13             13            12                -1     1
14             14            15                1      1
15             15            14                -1     1
TOTAL                                                 60.5

10
Sedgwick, P. "Spearman's Rank Correlation Coefficient". BMJ 349 (2014): g7327.

From this we get

ρ = 1 − (6 × 60.5)/(15³ − 15) = 1 − 363/3360 = 0.89

In this case the value is very close to that of the Pearson correlation coefficient. For n > 10, the
Spearman rank correlation coefficient can be tested for significance using the t test given earlier.
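The rank calculation can be sketched in Python, taking the ranks directly from the table above (illustrative only):

```python
# Ranks of height and dead space for the 15 children in the table.
rank_height = list(range(1, 16))  # heights are already in rank order 1..15
rank_ds = [3, 1, 2, 4, 5.5, 11, 7, 5.5, 8, 13, 10, 9, 12, 15, 14]

n = len(rank_height)
# Sum of squared rank differences, Σd².
d2 = sum((a - b) ** 2 for a, b in zip(rank_height, rank_ds))

# Spearman's formula: ρ = 1 − 6Σd² / (n³ − n)
rho = 1 - 6 * d2 / (n**3 - n)
print(d2, round(rho, 3))  # Σd² = 60.5, ρ ≈ 0.892
```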

Regression Analysis

Definition:
11
Regression is the measure of the average relationship between two or more variables in terms
of the original units of the data.

It allows you to model, examine, and explore spatial relationships, and can help to explain the
factors behind observed spatial patterns. Regression analysis is also used for predicting. You
may want to understand why people are persistently dying young in certain regions, for example,
or may want to predict rainfall where there are no rain gauges.

11
"Regression Basics". Choice Reviews Online 39.02 (2001): 39-0989-39-0989. Web.
Linear relationships are either positive or negative. If you find that the number of search and
rescue events increases when daytime temperatures rise, the relationship is said to be positive;
there is a positive correlation. Another way to express this positive relationship is to say that
search and rescue events decrease as daytime temperatures decrease. Conversely, if you find that
the number of crimes goes down as the number of police officers patrolling an area goes up, the
relationship is said to be negative12.

12
"Regression Analysis Basics". N.p., 2016. 21 Sept. 2016.

Example: From the following data, obtain the two regression equations.

X 6 2 10 4 8
Y 9 11 5 8 7

X Y XY X² Y²
6 9 54 36 81
2 11 22 4 121
10 5 50 100 25
4 8 32 16 64
8 7 56 64 49

ΣX = 30, ΣY = 40, ΣXY = 214, ΣX² = 220, ΣY² = 340

Regression line of Y on X: Yc = a + bX

The two normal equations are:

ΣY = Na + bΣX
ΣXY = aΣX + bΣX²

Substituting the values:

40 = 5a + 30b        ...(I)
214 = 30a + 220b     ...(II)
240 = 30a + 180b     multiplying equation I by 6 ...(III)

Deducting equation III from II: 40b = −26, or b = −0.65

Substituting the value of b in equation I: 40 = 5a + 30(−0.65), so 5a = 40 + 19.5 = 59.5, or a = 11.9

Putting the values of a and b in the equation, the regression line of Y on X is Y = 11.9 − 0.65X

Regression line of X on Y: Xc = a + bY

The two normal equations are:

ΣX = Na + bΣY
ΣXY = aΣY + bΣY²

Substituting the values:

30 = 5a + 40b        ...(I)
214 = 40a + 340b     ...(II)
240 = 40a + 320b     multiplying equation I by 8 ...(III)

Deducting equation III from II: 20b = −26, or b = −1.3

Substituting the value of b in equation I: 30 = 5a + 40(−1.3), so 5a = 30 + 52 = 82, therefore a = 16.4

Putting the values of a and b in the equation, the regression line of X on Y is X = 16.4 − 1.3Y
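The two normal-equation solutions above can be verified with a small Python sketch (illustrative only):

```python
# Sums from the worked example: n = 5, ΣX = 30, ΣY = 40, ΣXY = 214,
# ΣX² = 220, ΣY² = 340.
n, sx, sy, sxy, sxx, syy = 5, 30, 40, 214, 220, 340

# Regression of Y on X: eliminating a from ΣY = na + bΣX and
# ΣXY = aΣX + bΣX² gives b = (nΣXY − ΣXΣY)/(nΣX² − (ΣX)²).
b_yx = (n * sxy - sx * sy) / (n * sxx - sx**2)
a_yx = (sy - b_yx * sx) / n

# Regression of X on Y: same elimination with the roles of X and Y swapped.
b_xy = (n * sxy - sx * sy) / (n * syy - sy**2)
a_xy = (sx - b_xy * sy) / n

print(round(a_yx, 2), round(b_yx, 2))  # 11.9 and -0.65
print(round(a_xy, 2), round(b_xy, 2))  # 16.4 and -1.3
```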
LEAST SQUARES METHOD:

The least squares method is the most popular technique in statistics, for several reasons.13
First, most common estimators can be cast within this framework. For example, the mean of a
distribution is the value that minimizes the sum of squared deviations of the scores. Second,
using squares makes the method mathematically very tractable, because when the error is
independent of an estimated quantity one can add the squared error and the squared estimated
quantity. Third, the mathematical tools and algorithms involved in LSM have been well studied
for a relatively long time. LSM is one of the oldest techniques of modern statistics; even though
ancestors of LSM can be traced back to Greek mathematics, the first modern precursor is
probably Galileo (see Harper, 1974, for a history and pre-history of LSM). The modern approach
was first expounded in 1805 by the French mathematician Legendre in a now classic memoir,
but the method is somewhat older because it turned out that, after the publication of Legendre's
memoir, Gauss (the famous German mathematician) contested Legendre's priority. Gauss often
did not publish ideas when he thought that they could be controversial or not yet ripe, but would
mention his discoveries when others published them (as he did, for example, for the discovery of
non-Euclidean geometry). In 1809, Gauss published another memoir in which he mentioned that
he had previously discovered LSM and had used it as early as 1795 in estimating the orbit of an
asteroid. A somewhat bitter priority dispute followed (a bit reminiscent of the Leibniz-Newton
controversy about the invention of calculus), which, however, did not diminish the popularity of
the technique.

Example

Based on the following data of number of units produced and the corresponding total cost,
estimate the total cost of producing 4,000 units. Use the least-squares linear regression method.

Month Units Cost


1 1,520 $36,375
2 1,250 38,000
3 1,750 41,750

13
Björck, Åke. "The Calculation Of Linear Least Squares Problems". Acta Numerica 13 (2004): 1-53. Web.
4 1,600 42,360
5 2,350 55,080
6 2,100 48,100
7 3,000 59,000
8 2,750 56,800
Solution:

x        y         x²           xy
1,520    $36,375   2,310,400    55,290,000
1,250    38,000    1,562,500    47,500,000
1,750    41,750    3,062,500    73,062,500
1,600    42,360    2,560,000    67,776,000
2,350    55,080    5,522,500    129,438,000
2,100    48,100    4,410,000    101,010,000
3,000    59,000    9,000,000    177,000,000
2,750    56,800    7,562,500    156,200,000
16,320   377,465   35,990,400   807,276,500
We have,

n = 8;

Σx = 16,320;

Σy = 377,465;

Σx² = 35,990,400; and

Σxy = 807,276,500

Calculating the average variable cost per unit (b):

b = [nΣxy − Σx Σy] / [nΣx² − (Σx)²]
  = [8 × 807,276,500 − 16,320 × 377,465] / [8 × 35,990,400 − (16,320)²]
  ≈ 13.8

Calculating the approximate total fixed cost (a):

a = [Σy − b Σx] / n = [377,465 − 13.8078 × 16,320] / 8 ≈ 19,015

The cost-volume formula now becomes y = 19,015 + 13.8x. At the 4,000-unit activity level, the
estimated total cost is $74,215 [= 19,015 + 13.8 × 4,000].
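The least-squares fit can be replicated in Python (a sketch; the small difference from the $74,215 in the text comes from the text rounding b to 13.8 before predicting):

```python
# Units produced (x) and total cost (y) for the eight months in the example.
x = [1520, 1250, 1750, 1600, 2350, 2100, 3000, 2750]
y = [36375, 38000, 41750, 42360, 55080, 48100, 59000, 56800]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

# b = (nΣxy − ΣxΣy)/(nΣx² − (Σx)²): average variable cost per unit.
b = (n * sxy - sx * sy) / (n * sxx - sx**2)
# a = (Σy − bΣx)/n: approximate total fixed cost.
a = (sy - b * sx) / n

# Estimated total cost of producing 4,000 units.
estimate = a + b * 4000
print(round(b, 2), round(a), round(estimate))  # ≈ 13.81, 19015, 74246
```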

CONCLUSION:

Correlation coefficients can be useful statistical tools. They can help us identify some kinds of
statistically significant relationships between pairs of variables, and they can tell us about the
sign (positive or negative) of these relationships. Comparative studies of alternative measures
can be very useful in choosing an appropriate bibliometric toolkit.
TENTATIVE CHAPTERISATION:

CHAPTER 1: Introduction

CHAPTER 2: Need for correlation

CHAPTER 3: Types of correlation

CHAPTER 4: Methods of studying correlation

CHAPTER 5: Interpretation of coefficient of correlation and rank correlation

CHAPTER 6: Regression analysis

CHAPTER 7: Conclusion

Roll No.: BC0140018
