
# APPROPRIATE STATISTICS FOR DATA ANALYSIS

Dr M. Kanmani
Department of Education
Manonmaniam Sundaranar University, Tirunelveli-12.
kan_mani_msc@yahoo.com

## Basic Concepts

- Population: the collection of all individuals, objects, or items under study; its size is denoted by N.
- Sample: a part of the population; its size is denoted by n.
- Variable: a characteristic of an individual or object. Variables may be qualitative or quantitative.
- Parameter: a characteristic of the population.
- Statistic: a characteristic of the sample.

## Characteristics of a Population versus a Sample

| Characteristic | Population | Sample |
|---|---|---|
| Size | $N$ | $n$ |
| Mean | $\mu = \dfrac{\sum x}{N}$ | $\bar{x} = \dfrac{\sum x}{n}$ |
| SD | $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ | $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ |
| Proportion | $P$ | $p$ |
| Correlation coefficient | $\rho = \dfrac{\mathrm{COV}(x, y)}{\sigma_x \sigma_y}$ | $r = \dfrac{\mathrm{COV}(x, y)}{s_x s_y}$ |
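As a quick illustration of the mean and SD formulas above, here is a minimal Python sketch (the data are hypothetical) showing the $N$ versus $n - 1$ divisors with NumPy:

```python
import numpy as np

# Hypothetical sample of n = 5 observations
x = np.array([110, 120, 123, 112, 125])

n = x.size
mean = x.sum() / n                 # mean: sum(x) / n
sd_sample = np.std(x, ddof=1)      # sample SD: divisor n - 1
sd_population = np.std(x, ddof=0)  # population SD: divisor N
print(mean, sd_sample, sd_population)
```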

## Chart: Population, Sample and Statistical Inference

When the population is too large, take a sample from the population, organise the data, analyse the organised data, and then draw an inference which is applicable to the whole population:

Population → Sample → Organise data → Analyse the organised data → Draw inference applicable to the population

## Organising a Raw Data Set

- Categorical or qualitative data set: make a frequency table.
- Quantitative data set: obtain the modified range, divide it into several classes, and then make a frequency table.

## Pictorial Representation of a Data Set

Graphical representation of a data set:

- Categorical or qualitative data set: bar diagram, pie diagram.
- Quantitative data set: histogram, stem-and-leaf display, pie diagram, time plot.

## Summarising a Raw Data Set on a Quantitative Variable

- Study the central tendency of the data set: locate a central value of the data set using the measures mode, median and mean.
- Study the variability or dispersion of the data set: quantify the disparity among the data entries using the measures range, inter-quartile range, variance and SD.

## Sampling Techniques

- Probability sampling
  - Simple random sample
  - Stratified random sample (proportionate or disproportionate)
  - Systematic random sample
  - Cluster sampling (one-stage, two-stage, multi-stage)
- Non-probability sampling
  - Convenience sampling
  - Quota sampling
  - Judgement sampling
  - Snowball sampling

## Stages in Data Analysis

Editing (error checking and verification) → Coding → Data entry (keyboarding) → Data analysis → Interpretation

Data analysis itself comprises descriptive analysis, univariate analysis, bivariate analysis and multivariate analysis.

## Statistical Inference

- Theory of estimation
  - Point estimation
  - Interval estimation
- Testing of hypothesis
  - Parametric tests
  - Non-parametric tests

## The Concept of P Value

Given the observed data set, the P value is the smallest significance level at which the null hypothesis can be rejected (and the alternative accepted).

If the P value ≤ α, then reject H0; otherwise accept H0.

- If the P value ≤ 0.01, then reject H0 at the 1% level of significance.
- If the P value lies between 0.01 and 0.05 (i.e. 0.01 < P value ≤ 0.05), then reject H0 at the 5% level of significance.
- If the P value > 0.05, then accept H0 at the 5% level of significance.
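As a minimal sketch of this decision rule (the data and the hypothesised mean are invented for illustration), a one-sample t test in Python returns a P value that can be compared against a chosen significance level α:

```python
from scipy import stats

# Hypothetical sample; test H0: population mean = 120
sample = [110, 120, 123, 112, 125]
alpha = 0.05

t_stat, p_value = stats.ttest_1samp(sample, popmean=120)

# Decision rule: reject H0 when the P value is at most alpha
if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} > {alpha}: accept H0")
```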

## Measurement Scales

The types of measurement scales are:

- Nominal scale
- Ordinal scale
- Interval scale
- Ratio scale

## Scales of Measurement

| Scale | Scale qualities | Example(s) |
|---|---|---|
| Ratio | Magnitude, equal intervals, absolute zero | Age, height, weight, percentage |
| Interval | Magnitude, equal intervals | Temperature |
| Ordinal | Magnitude | Likert scale, anything rank-ordered |
| Nominal | None | Names, lists of words |

## Appropriate Statistics

| Scale | Descriptive statistics | Appropriate tests |
|---|---|---|
| Nominal | Cross tabs, frequencies | Chi-square, phi, Cramer's V, contingency coefficient; nonparametric: chi-square, runs, binomial, McNemar, Cochran |
| Ordinal | Median, interquartile range | Nonparametric: Kolmogorov-Smirnov, sign, Wilcoxon, Kendall coefficient of concordance, Friedman two-way ANOVA, Mann-Whitney U, Wald-Wolfowitz, Kruskal-Wallis |
| Interval | Mean, standard deviation | Pearson's product-moment correlation, t test, analysis of variance, multivariate analysis of variance (MANOVA), factor analysis, regression, multiple correlation R |
| Ratio | Coefficient of variation (CV = SD / Mean) | |

## Hypothesis Testing: Choice of Test

| Hypotheses about | Number of samples | Measurement scale | Test |
|---|---|---|---|
| Distribution | One | Nominal | Chi-square |
| Distribution | Two or more | Nominal | Chi-square |
| Means | One (large sample) | Interval or ratio | Z test |
| Means | One (small sample) | Interval or ratio | t test |
| Means | Two (large samples) | Interval or ratio | Z test |
| Means | Two (small samples) | Interval or ratio | t test |
| Means | Three or more (small samples) | Interval or ratio | ANOVA |
| Proportions | One (large sample) | Interval or ratio | Z test |
| Proportions | One (small sample) | Interval or ratio | t test |
| Proportions | Two (large samples) | Interval or ratio | Z test |
| Proportions | Two (small samples) | Interval or ratio | t test |
| Variance | Two or more samples | Interval or ratio | ANOVA |

## Example for Tests of Hypotheses Concerning Two Population Means

Sample I (Class A): 110, 120, 123, 112, 125
Sample II (Class B): 120, 128, 133, 138, 129

Group Statistics

| Class | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| Class A | 5 | 118.00 | 6.671 | 2.983 |
| Class B | 5 | 129.60 | 6.656 | 2.977 |
## Independent Samples Test

Levene's test for equality of variances: F = .178, Sig. = .684 (equal variances may be assumed).

| | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| Equal variances assumed | -2.753 | 8 | .025 | -11.60 | 4.214 | -21.318 | -1.882 |
| Equal variances not assumed | -2.753 | 8.000 | .025 | -11.60 | 4.214 | -21.318 | -1.882 |

| Class | Mean | SD |
|---|---|---|
| Class A | 118.00 | 6.671 |
| Class B | 129.60 | 6.656 |

t value = 2.753, P value = 0.025. Since the P value (0.025) is less than 0.05, reject H0 at the 5% level of significance: the two class means differ.
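The SPSS output above can be checked with a short Python sketch; scipy's independent-samples t test on the same two classes should reproduce t ≈ −2.75 and p ≈ 0.025:

```python
from scipy import stats

class_a = [110, 120, 123, 112, 125]
class_b = [120, 128, 133, 138, 129]

# Independent samples t test, equal variances assumed (pooled SD)
t_stat, p_value = stats.ttest_ind(class_a, class_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # expect t ≈ -2.753, p ≈ 0.025
```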

## Comparing More Than Two Population Means

For more than two populations, it is assumed that the probability distribution (i.e. histogram) of each population is approximately normal.

H0: All the population means are equal
H1: At least two population means differ

This test is called Analysis of Variance (ANOVA).

- Data from unrestricted (independent) samples: one-way ANOVA
- Data from block-restricted samples: two-way ANOVA

Objective: To check whether the null hypothesis is to be accepted, based on the value of the F statistic, by placing the significance level α at the right tail of the Snedecor F distribution.

## Example for One-Way ANOVA

School I: 45, 54, 35, 43, 48
School II: 54, 65, 67, 55, 52
School III: 87, 65, 75, 79, 67

ANOVA (Mark)

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Between Groups | 2195.200 | 2 | 1097.600 | 18.646 | .000 |
| Within Groups | 706.400 | 12 | 58.867 | | |
| Total | 2901.600 | 14 | | | |

Duncan's multiple range test (Mark)

| School | N | Subset 1 | Subset 2 |
|---|---|---|---|
| School I | 5 | 45.00 | |
| School II | 5 | 58.60 | |
| School III | 5 | | 74.60 |

Means for groups in homogeneous subsets are displayed.

a. Uses Harmonic Mean Sample Size = 5.000.
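As a check on the ANOVA table above, scipy's one-way ANOVA on the three schools should reproduce F ≈ 18.65. This is a minimal sketch; the Duncan post-hoc test is not part of scipy and is omitted:

```python
from scipy import stats

school_1 = [45, 54, 35, 43, 48]
school_2 = [54, 65, 67, 55, 52]
school_3 = [87, 65, 75, 79, 67]

# One-way ANOVA for k independent samples
f_stat, p_value = stats.f_oneway(school_1, school_2, school_3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # expect F ≈ 18.646, p < 0.001
```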

## Non-Parametric Tests

In some situations the practical data may be non-normal, and/or it may not be possible to estimate the parameter(s) of the data.

- The tests used in such situations are called non-parametric tests.
- Since these tests are based on data which are free from distribution and parameter assumptions, they are known as non-parametric (NP) tests or distribution-free tests.
- NP tests can be used even for nominal data (qualitative data, like greater or less, etc.) and ordinal data, like ranked data.
- NP tests require less calculation, because there is no need to compute parameters.

## List of Non-Parametric Tests

1. One-sample tests
   - One sample sign test
   - Chi-square one sample test
   - Kolmogorov-Smirnov test
2. Two related samples tests
   - Two samples sign test
   - Wilcoxon matched-pairs signed rank test
3. Two independent samples tests
   - Chi-square test for two independent samples
   - Mann-Whitney U test
   - Kolmogorov-Smirnov two sample test

## List of Non-Parametric Tests (continued)

4. K related samples tests
   - Friedman two-way analysis of variance by ranks
   - The Cochran Q test
5. K independent samples tests
   - Chi-square test for k independent samples
   - The extension of the median test
   - Kruskal-Wallis one-way analysis of variance by ranks
## Chi-Square Test for Categorized Data

Consider two factors which may or may not have an influence on the observed frequencies formed with respect to combinations of the different levels of the two factors.

H0: Factor A and factor B are independent
H1: Factor A and factor B are not independent

Objective: To check whether the null hypothesis is to be accepted, based on the value of the chi-square statistic, by placing the significance level α at the right tail of the chi-square distribution.
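A minimal Python sketch of this independence test, using a hypothetical 2 × 2 table of observed frequencies (the counts are invented for illustration):

```python
from scipy import stats

# Hypothetical observed frequencies: rows = levels of factor A,
# columns = levels of factor B
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p_value:.4f}")
# Reject H0 (independence of A and B) when p <= alpha
```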

## Chi-Square Goodness of Fit Test

The aim is to fit the data to the nearest distribution which represents the data more meaningfully for future analysis. Such fitting of data to the nearest distribution is done using the goodness of fit test.

H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution

Objective: To check whether the null hypothesis is to be accepted, based on the value of the chi-square statistic, by placing the significance level α at the right tail of the chi-square distribution.

## Kolmogorov-Smirnov Test

- It is similar to the chi-square test, and is used to test the goodness of fit of a given set of data to an assumed distribution.
- This test is more powerful for small samples, whereas the chi-square test is suited to large samples.

H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution

The K-S test is a one-tailed test. Hence, if the calculated value of D is more than the theoretical value of D for a given significance level, then reject H0; otherwise accept H0.
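A minimal sketch of the K-S goodness-of-fit test in Python (the data are hypothetical, and the assumed distribution here is a normal with the sample's own mean and SD; strictly, estimating the parameters from the same sample calls for the Lilliefors correction, so treat this only as an illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data to be tested for normality
data = np.array([45, 54, 35, 43, 48, 52, 47, 50])

# K-S statistic D and its p value against N(mean, sd)
d_stat, p_value = stats.kstest(data, 'norm',
                               args=(data.mean(), data.std(ddof=1)))
print(f"D = {d_stat:.3f}, p = {p_value:.3f}")
# Reject H0 when D exceeds the critical value (equivalently, p <= alpha)
```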

## Two Samples Sign Test

The two samples sign test is applied to a situation where two samples are taken from two populations which have continuous symmetrical distributions and are known to be non-normal.

Modified sample value:
Zi = + if Xi > Yi
Zi = − if Xi < Yi
Zi = 0 if Xi = Yi

The test is classified into four categories (category 2 is sketched below):

1. One-tailed two-sample sign test with binomial distribution
2. Two-tailed two-sample sign test with binomial distribution
3. One-tailed two-sample sign test with normal distribution
4. Two-tailed two-sample sign test with normal distribution
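A minimal sketch of the two-tailed two-sample sign test with the binomial distribution, on hypothetical paired data; ties (Zi = 0) are dropped and the count of plus signs is tested against a binomial with p = 0.5 (assumes scipy ≥ 1.7 for `binomtest`):

```python
from scipy import stats

# Hypothetical paired observations from populations X and Y
x = [110, 120, 123, 112, 125, 118, 121, 116]
y = [115, 118, 130, 112, 130, 122, 119, 120]

diffs = [xi - yi for xi, yi in zip(x, y) if xi != yi]  # drop ties (Zi = 0)
n_plus = sum(1 for d in diffs if d > 0)                # count of + signs

# Two-tailed binomial sign test: under H0, P(+) = 0.5
result = stats.binomtest(n_plus, n=len(diffs), p=0.5,
                         alternative='two-sided')
print(f"plus signs = {n_plus} of {len(diffs)}, p = {result.pvalue:.3f}")
```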

## Wilcoxon Matched-Pairs Signed Rank Test

The Wilcoxon test is a most useful test for behavioural scientists.

- Let di = the difference score for any matched pair.
- Rank all the di without regard to sign.
- T = the sum of the ranks with the less frequent sign.
- Compute Z = [T − E(T)] / SD(T).
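scipy provides this test directly; a minimal sketch on the same hypothetical paired data (for small samples scipy uses the exact distribution of T rather than the Z approximation):

```python
from scipy import stats

x = [110, 120, 123, 112, 125, 118, 121, 116]
y = [115, 118, 130, 112, 130, 122, 119, 120]

# Wilcoxon matched-pairs signed rank test; T is the test statistic
t_stat, p_value = stats.wilcoxon(x, y)
print(f"T = {t_stat}, p = {p_value:.3f}")
```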

## Mann-Whitney U Test

- The Mann-Whitney U test is an alternative to the two sample t-test.
- This test is based on the ranks of the observations of the two samples put together.
- An alternate name for this test is the Rank-Sum Test.
- Let R1 = the sum of the ranks of the observations of the first sample.
- Let R2 = the sum of the ranks of the observations of the second sample.
- Objective: To check whether the two samples are drawn from populations having the same distribution.
- Compute Z = [U − E(U)] / SD(U),
  where U = n1n2 + [n1(n1 + 1)/2] − R1
  or U = n1n2 + [n2(n2 + 1)/2] − R2.
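A minimal sketch with scipy, reusing the two classes from the t-test example above (scipy reports the U statistic and applies the normal approximation for larger samples):

```python
from scipy import stats

class_a = [110, 120, 123, 112, 125]
class_b = [120, 128, 133, 138, 129]

# Mann-Whitney U (rank-sum) test, two-sided
u_stat, p_value = stats.mannwhitneyu(class_a, class_b,
                                     alternative='two-sided')
print(f"U = {u_stat}, p = {p_value:.3f}")
```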

The chi-square test measures the association between two or more variables. This test is applicable only when the data are on a nominal scale.

Correlation and regression analysis are used for measuring the relationship between two variables measured on an interval or ratio scale.

## Correlation Analysis

- Correlation analysis is a statistical technique used to measure the magnitude of the linear relationship between two variables.
- Correlation analysis cannot be used in isolation to describe the relationship between variables.
- It can be used along with regression analysis to determine the nature of the relationship between two variables; thus correlation analysis can serve as the basis for further analysis.
- Two prominent types of correlation coefficient are:
  - Pearson product moment correlation coefficient
  - Spearman's rank correlation coefficient
- Testing the significance of the correlation coefficient:
  - Type I: H0: ρ = 0 and H1: ρ ≠ 0
  - Type II: H0: ρ = r and H1: ρ ≠ r
  - Type III: H0: ρ1 = ρ2 and H1: ρ1 ≠ ρ2

## Correlation Analysis: Example

Marks in Mathematics: 89, 58, 78, 79, 86, 58
Marks in Statistics: 75, 79, 59, 78, 84, 65

Correlations

| | | MATHS | STATISTICS |
|---|---|---|---|
| MATHS | Pearson Correlation | 1 | .968** |
| | Sig. (2-tailed) | . | .002 |
| | N | 6 | 6 |
| STATISTICS | Pearson Correlation | .968** | 1 |
| | Sig. (2-tailed) | .002 | . |
| | N | 6 | 6 |

**. Correlation is significant at the 0.01 level (2-tailed).
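A minimal sketch of how such a correlation table is computed in Python; scipy's `pearsonr` returns the coefficient and its two-tailed p value for the two mark lists:

```python
from scipy import stats

maths = [89, 58, 78, 79, 86, 58]
statistics = [75, 79, 59, 78, 84, 65]

# Pearson product moment correlation and its two-tailed p value
r, p_value = stats.pearsonr(maths, statistics)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```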

## Regression Analysis

- Regression analysis is used to predict the nature and closeness of relationships between two or more variables.
- It evaluates the causal effect of one variable on another variable.
- It is used to predict the variability in the dependent (or criterion) variable based on information about one or more independent (or predictor) variables.
- Two variables: simple or linear regression analysis.
- More than two variables: multiple regression analysis.

## Linear Regression Analysis

Linear regression: Y = α + βX

where
- Y: dependent variable
- X: independent variable
- α and β: two constants, called the regression coefficients
- β: slope coefficient, i.e. the change in the value of Y corresponding to a change of one unit in X
- α: Y intercept, the value of Y when X = 0

R²: the strength of association, i.e. to what degree the variation in Y can be explained by X. For example, if R² = 0.10 then only 10% of the total variation in Y can be explained by the variation in X.

## Test of Significance of the Regression Equation

Linear regression: Y = α + βX

The F test is used to test the significance of the linear relationship between the two variables Y and X.

H0: β = 0 (there is no linear relationship between Y and X)

Objective: To check whether the estimates from the regression model represent the real-world data.

## Example for Regression Analysis

School Climate: 25, 34, 55, 45, 56, 49, 65
Academic Achievement: 58, 62, 80, 75, 84, 72, 89

Variables Entered/Removed (b)

| Model | Variables Entered | Variables Removed | Method |
|---|---|---|---|
| 1 | School Climate (a) | . | Enter |

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .978 (a) | .957 | .949 | 2.55330 |

ANOVA (b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 732.832 | 1 | 732.832 | 112.409 | .000 (a) |
| Residual | 32.597 | 5 | 6.519 | | |
| Total | 765.429 | 6 | | | |

Coefficients (a)

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 36.436 | 3.698 | | 9.853 | .000 |
| School Climate | .805 | .076 | .978 | 10.602 | .000 |

a. Dependent Variable: Academic Achievement
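The fitted line can be reproduced with a short Python sketch; scipy's `linregress` on the same data should give a slope near .805 and an intercept near 36.44, matching the coefficients table:

```python
from scipy import stats

school_climate = [25, 34, 55, 45, 56, 49, 65]
achievement = [58, 62, 80, 75, 84, 72, 89]

# Simple linear regression: achievement = alpha + beta * climate
result = stats.linregress(school_climate, achievement)
print(f"beta (slope) = {result.slope:.3f}")           # expect ≈ 0.805
print(f"alpha (intercept) = {result.intercept:.3f}")  # expect ≈ 36.436
print(f"R squared = {result.rvalue ** 2:.3f}")        # expect ≈ 0.957
```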

## Multivariate Analysis

- Multivariate analysis is defined as the set of statistical techniques which simultaneously analyse more than two variables on a sample of observations.
- Multivariate analysis helps the researcher to evaluate the relationship between multiple (more than two) variables simultaneously.
- Multivariate techniques are broadly classified into two categories:
  - Dependence techniques
  - Independence techniques

## A Classification of Multivariate Methods

For all multivariate methods, ask: are some of the variables dependent on others?

- Yes → dependence methods
- No → independence methods

## Multivariate Analysis: Classification of Dependence Methods

How many variables are dependent?

- One dependent variable
  - Metric (the scales are ratio or interval): multiple regression
  - Nonmetric (the scales are nominal or ordinal): multiple discriminant analysis
- Several dependent variables
  - Metric (the scales are ratio or interval): multivariate analysis of variance (MANOVA)
  - Nonmetric (the scales are nominal or ordinal): conjoint analysis
- Multiple independent and dependent variables: canonical analysis

## Multivariate Analysis: Classification of Independence Methods

Are the inputs metric?

- Metric (the scales are ratio or interval): factor analysis, cluster analysis, metric multidimensional scaling
- Nonmetric (the scales are nominal or ordinal): nonmetric multidimensional scaling

## Discriminant Analysis

- Discriminant analysis aims at studying the effect of two or more predictor variables (independent variables) on a certain evaluation criterion.
- The evaluation criterion may be two or more groups:
  - Two groups, such as good or bad, like or dislike, successful or unsuccessful, above expected level or below expected level.
  - Three groups, such as good, normal or poor.
- It checks whether the predictor variables discriminate among the groups.
- It identifies the predictor variable which is more important when compared to the other predictor variable(s).
- Such analysis is called discriminant analysis.

## Discriminant Analysis: Procedure

- Design a discriminant function: Y = aX1 + bX2, where Y is a linear composite representing the discriminant function, and X1 and X2 are the predictor variables (independent variables) which have an effect on the evaluation criterion of the problem of interest.
- Find the discriminant ratio (K) and determine the variables which account for the intergroup difference in terms of group means. This ratio is the maximum possible ratio between the variability between groups and the variability within groups.
- Find the critical value which can be used to assign a new data set (i.e. a new combination of instances of the predictor variables) to its appropriate group.
- Test
  H0: The group means are equal in importance
  H1: The group means are not equal in importance
  using the F test at a given significance level (a sketch in Python follows).
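A minimal sketch of a two-group discriminant analysis in Python using scikit-learn (the two predictor variables and the group labels are hypothetical; sklearn estimates the discriminant function and assigns new cases to a group):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical predictors X1, X2 and two groups (0 = bad, 1 = good)
X = np.array([[2.0, 3.1], [1.8, 2.9], [2.2, 3.4],
              [4.1, 5.0], [4.5, 5.2], [3.9, 4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Coefficients a, b of the discriminant function Y = a*X1 + b*X2 (+ const)
print("coefficients:", lda.coef_, "intercept:", lda.intercept_)

# Assign a new observation to its appropriate group
print("new case group:", lda.predict([[3.0, 4.0]]))
```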

## Factor Analysis

- Factor analysis can be defined as a set of methods in which the observable or manifest responses of individuals on a set of variables are represented as functions of a small number of latent variables called factors.
- Factor analysis helps the researcher to reduce the number of variables to be analyzed, thereby making the analysis easier.
- For example, consider a market researcher at a credit card company who wants to evaluate the credit card usage and behaviour of customers using various variables. The variables include age, gender, marital status, income level, education, employment status, credit history and family background. Analysis based on such a wide range of variables can be tedious and time consuming.
- Using factor analysis, the researcher can reduce the large number of variables into a few dimensions called factors that summarize the available data.
- It aims at grouping the original input variables into factors which underlie the input variables.
- For example, age, gender and marital status can be combined under a factor called demographic characteristics; income level, education and employment status can be combined under a factor called socio-economic status; and credit history and family background can be combined under a factor called background status.

- Factor analysis identifies the hidden dimensions or constructs which may not be apparent from direct analysis.
- It helps the researcher to cluster the products and populations being analyzed.

## Key Concepts in Factor Analysis

Factor: A factor is an underlying construct or dimension that represents a set of observed variables. In the credit card company example, the demographic characteristics, socio-economic status and background status each represent a set of variables.

Factor loading: The factor loading measures the association between a variable and the factors. It measures how closely the variables in the factor are associated. It is also called the factor-variable correlation.

Eigen values: Eigen values measure the variance in all the variables corresponding to the factor. Eigen values are calculated for each factor. They aid in explaining the importance of the factor with respect to the variables. Generally, factors with eigen values of more than 1.0 are considered stable. Factors that have low eigen values (< 1.0) may not explain the variance in the variables related to that factor.

Communalities: Communalities, denoted by h², measure the percentage of variance in each variable explained by the factors extracted. A communality ranges from 0 to 1. A high communality value indicates that the maximum amount of the variance in the variable is explained by the factors extracted in the factor analysis.

Total variance explained: The total variance explained is the percentage of the total variance of the variables that is explained. It is calculated by adding all the communality values of the variables and dividing by the number of variables.

Factor variance explained: The factor variance explained is the percentage of the total variance of the variables explained by the factors. It is calculated by adding the squared factor loadings of all the variables and dividing by the number of variables.

## Steps in Factor Analysis

A short sketch in Python follows this list.

1. Define the problem.
2. Construct the correlation matrix that measures the relationship between the factors and the variables.
3. Select an appropriate factor analysis method.
4. Determine the number of factors.
5. Rotate the factors.
6. Interpret the factors.
7. Determine the factor scores.
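A minimal sketch of these steps using scikit-learn's `FactorAnalysis` (the data matrix is hypothetical: rows are respondents, columns are observed variables; rotation and eigen-value-based factor selection are simplified away to keep the example short):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical standardized responses: 8 respondents x 4 variables
X = np.array([
    [ 1.2,  0.9,  0.3,  0.1],
    [-0.8, -1.1,  0.5,  0.6],
    [ 0.4,  0.6, -1.2, -0.9],
    [ 1.5,  1.3,  0.2,  0.4],
    [-1.0, -0.7, -0.3, -0.5],
    [ 0.2,  0.1,  1.4,  1.1],
    [-0.9, -1.2, -0.8, -1.0],
    [ 0.6,  0.8,  0.9,  0.7],
])

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)  # factor scores for each respondent

# Loading matrix: rows = factors, columns = observed variables
print("loadings:\n", fa.components_)
print("first respondent's factor scores:", scores[0])
```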

## Cluster Analysis

- Cluster analysis can be defined as a set of techniques used to classify objects into relatively homogeneous groups called clusters.
- It involves identifying similar objects and grouping them under homogeneous groups.
- A cluster is a group of objects that display high correlation with each other and low correlation with the variables in other clusters.

## Procedure in Cluster Analysis

1. Defining the problem: First define the problem and decide upon the variables based on which the objects are clustered.
2. Selection of similarity or distance measures: The similarity measure tries to examine the proximity between the objects. Closer or similar objects are grouped together and the farther objects are ignored. There are three major methods to measure the similarity between objects:
   1. Euclidean distance measures
   2. Correlation coefficients
   3. Association coefficients
3. Selection of clustering approach: Select the appropriate clustering approach. There are two types of clustering approaches:
   1. Hierarchical clustering approach
   2. Non-hierarchical clustering approach

The hierarchical clustering approach consists of either a top-down approach or a bottom-up approach. Prominent hierarchical clustering methods include the single linkage method (see the sketch below).

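A minimal sketch of a bottom-up (agglomerative) hierarchical clustering in Python with single linkage and Euclidean distance, on hypothetical two-variable objects:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical objects described by two variables
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [5.0, 6.2], [5.3, 5.9], [4.8, 6.0]])

# Bottom-up (agglomerative) clustering, single linkage, Euclidean distance
Z = linkage(X, method='single', metric='euclidean')

# Cut the tree into two clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print("cluster labels:", labels)  # similar objects share a label
```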