
Dr M. Kanmani
Head-in Charge
Department of Education
Manonmaniam Sundaranar
University, Tirunelveli-12.
kan_mani_msc@yahoo.com

APPROPRIATE STATISTICS
FOR DATA ANALYSIS

BASIC CONCEPTS
Population
Collection of all individuals, objects or items under study; its size is denoted by N
Sample
A part of a population; its size is denoted by n
Variable
Characteristic of an individual or object. Variables may be qualitative or quantitative

Parameter
Characteristic of the population
Statistic
Characteristic of the sample

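NOTATIONS OF POPULATION AND SAMPLE

Characteristic             Population                                  Sample
Size                       $N$                                         $n$
Mean                       $\mu$                                       $\bar{x} = \sum x / n$
SD                         $\sigma = \sqrt{\sum (x - \mu)^2 / N}$      $S = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$
Proportion                 $P$                                         $p = x / n$
Correlation coefficient    $\rho$                                      $r = \mathrm{COV}(x, y) / (\sigma_x \sigma_y)$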

Chart on population, sample and statistical inference

Population too large
Sample drawn from the population
Collect data from the sample
Organise data
Analyse the organised data
Draw inference which is applicable to the population

Organising a raw data set

Raw data set
Categorical or qualitative data set: make a frequency table
Quantitative data set: obtain the modified range, divide it into several classes, then make a frequency table

Pictorial representation of a data set

Graphical representation of a data set
Categorical or qualitative data set: bar diagram, pie diagram
Quantitative data set: histogram, stem-and-leaf display, pie diagram, time plot

Dec 6, 2015

Summarising a raw data set on a quantitative variable

Study the central tendency of the data set: locating a central value of the data set by the measures mode, median and mean
Study the variability or dispersion of the data set: quantifying the disparity among the data entries by the measures range, inter-quartile range, variance and SD


Sampling Techniques

Probability sampling
Simple random sample
Stratified random sample: proportionate, disproportionate
Systematic random sample
Cluster sampling: one stage, two stage, multi stage

Non-probability sampling
Convenience sampling
Quota sampling
Judgement sampling
Snowball sampling

Stages in Data Analysis

Editing
Error checking and verification
Coding
Data entry (keyboarding)
Data analysis: descriptive analysis, univariate analysis, bivariate analysis, multivariate analysis
Interpretation

Statistical Inference

Theory of estimation: point estimation, interval estimation
Testing of hypothesis: parametric test, non-parametric test

The Concept of P Value

Given the observed data set, the P value is the smallest significance level at which the null hypothesis is rejected (and the alternative is accepted)
If the P value ≤ α then reject H0; otherwise accept H0
If the P value ≤ 0.01 then reject H0 at 1% level of significance
If the P value lies between 0.01 and 0.05 (i.e. 0.01 < P value ≤ 0.05) then reject H0 at 5% level of significance
If the P value > 0.05 then accept H0 at 5% level of significance
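
The decision rule above can be written down as a short Python sketch. The function names `decide` and `conventional_level` are illustrative choices, not part of the original slides; the thresholds 0.01 and 0.05 are the ones stated in the rule.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule 'reject H0 if the P value <= alpha'."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    return "reject H0" if p_value <= alpha else "accept H0"


def conventional_level(p_value):
    """Report the conventional level (1% or 5%) at which H0 is rejected."""
    if decide(p_value, alpha=0.01) == "reject H0":
        return "reject H0 at 1% level"
    if decide(p_value, alpha=0.05) == "reject H0":
        return "reject H0 at 5% level"
    return "accept H0 at 5% level"


# A P value of 0.025 lies in the range 0.01 < P <= 0.05.
print(conventional_level(0.025))
```

Keeping α as a parameter of `decide` makes it easy to apply the same rule at any other significance level.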

Measurement Scales
Types of measurement scales are

Nominal Scale
Ordinal Scale
Interval scale
Ratio Scale

Scales of Measurement

Scale of Measurement    Scale Qualities                               Example(s)
Ratio                   Magnitude, equal intervals, absolute zero     Age, height, weight, percentage
Interval                Magnitude, equal intervals                    Temperature
Ordinal                 Magnitude                                     Likert scale, anything rank ordered
Nominal                 None                                          Names, lists of words

Appropriate Statistics

Nominal
[ Cross tabs ] Chi-square, Phi, Cramer's V, Contingency coefficient
[ Nonparametric ] Chi-square, Runs, Binomial, McNemar, Cochran

Ordinal
[ Frequencies ] Median, Interquartile range
[ Nonparametric ] Kolmogorov-Smirnov, Sign, Wilcoxon, Kendall coefficient of concordance, Friedman two-way ANOVA, Mann-Whitney U, Wald-Wolfowitz, Kruskal-Wallis

Interval
Mean, Standard Deviation, Pearson's product-moment correlation, t test, Analysis of variance, Multivariate analysis of variance (MANOVA), Factor analysis, Regression, Multiple correlation (R)

Ratio
Coefficient of Variation (CV = SD / M)

Types of Statistical Tests and Their Characteristics

Hypothesis Testing                         Number of Samples               Measurement Scale    Test
Hypotheses about frequency distribution    One                             Nominal              Chi-square
                                           Two or more                     Nominal              Chi-square
Hypothesis about means                     One (large sample)              Interval or ratio    Z test
                                           One (small sample)              Interval or ratio    t test
                                           Two (large sample)              Interval or ratio    Z test
                                           Two (small sample)              Interval or ratio    t test
                                           Three or more (small sample)    Interval or ratio    ANOVA
Hypothesis about proportions               One (large sample)              Interval or ratio    Z test
                                           One (small sample)              Interval or ratio    t test
                                           Two (large sample)              Interval or ratio    Z test
                                           Two (small sample)              Interval or ratio    t test
Variance                                   Two or more samples             Interval or ratio    ANOVA

Example for Tests of Hypotheses concerning Two population means

Sample I: 110, 120, 123, 112, 125
Sample II: 120, 128, 133, 138, 129

Group Statistics (Mark)
Class      N    Mean      Std. Deviation    Std. Error Mean
Class A    5    118.00    6.671             2.983
Class B    5    129.60    6.656             2.977

Independent Samples Test (Mark)
Levene's Test for Equality of Variances: F = .178, Sig. = .684

t-test for Equality of Means
                               t         df       Sig. (2-tailed)    Mean Difference    Std. Error Difference
Equal variances assumed        -2.753    8        .025               -11.60             4.214
Equal variances not assumed    -2.753    8.000    .025               -11.60             4.214

95% Confidence Interval of the Difference
                               Lower      Upper
Equal variances assumed        -21.318    -1.882
Equal variances not assumed    -21.318    -1.882

Summary
Class      Sample Mean    SD       t Value    P Value
Class A    118.00         6.671    2.753      0.025
Class B    129.60         6.656
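
As a cross-check on the output above, the equal-variance t statistic can be recomputed from the two samples with the Python standard library. This is a minimal sketch: the function name `pooled_t_test` is illustrative, and the critical value 2.306 (two-tailed, 5% level, 8 degrees of freedom) is taken from standard t tables rather than from the slides.

```python
import math
import statistics

sample_1 = [110, 120, 123, 112, 125]  # Sample I (Class A)
sample_2 = [120, 128, 133, 138, 129]  # Sample II (Class B)


def pooled_t_test(a, b):
    """Return (t, df, standard error) for the equal-variance two-sample t test."""
    n1, n2 = len(a), len(b)
    mean_diff = statistics.mean(a) - statistics.mean(b)
    # Pooled variance: weighted average of the two sample variances (n - 1 divisor).
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return mean_diff / std_error, n1 + n2 - 2, std_error


t_value, df, std_error = pooled_t_test(sample_1, sample_2)

# Assumption: two-tailed 5% critical value of t for df = 8, from standard tables.
T_CRITICAL_5PCT_DF8 = 2.306

print(round(t_value, 3), df, round(std_error, 3))
print("reject H0" if abs(t_value) > T_CRITICAL_5PCT_DF8 else "accept H0")
```

The statistic agrees with the reported t of -2.753 on 8 degrees of freedom, and its absolute value exceeds the tabulated critical value, which matches the reported P value of 0.025 being below 0.05.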

Comparing multiple population means

For more than two populations, it is assumed that the probability distribution (i.e. histogram) of each population is approximately normal.
H0: All the population means are equal
H1: At least two population means differ
This test is called Analysis Of Variance (ANOVA)
Data from unrestricted (independent) samples (One-way ANOVA)
Data from block restricted samples (Two-way ANOVA)
Objective: To check whether the null hypothesis is to be accepted based on the value of F by placing the significance level α at the right tail of the Snedecor F distribution.

Example for One Way ANOVA

School I : 45, 54, 35, 43, 48
School II : 54, 65, 67, 55, 52
School III : 87, 65, 75, 79, 67

ANOVA (Mark)
                  Sum of Squares    df    Mean Square    F         Sig.
Between Groups    2195.200          2     1097.600       18.646    .000
Within Groups     706.400           12    58.867
Total             2901.600          14

Mark: Duncan post hoc test

School        N    Subset for alpha = .05
School I      5    45.00
School II     5    58.60
School III    5    74.60

Means for groups in homogeneous subsets are displayed.


a. Uses Harmonic Mean Sample Size = 5.000.
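
The sums of squares and the F ratio in the ANOVA table for the three schools can be reproduced with a few lines of Python. This is a sketch using only built-in functions; the name `one_way_anova` is illustrative, and the critical value 3.89 for F with (2, 12) degrees of freedom at the 5% level is taken from standard F tables, not from the slides.

```python
schools = {
    "School I": [45, 54, 35, 43, 48],
    "School II": [54, 65, 67, 55, 52],
    "School III": [87, 65, 75, 79, 67],
}


def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F)."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between groups: squared distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within groups: squared distance of each value from its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio


ss_b, ss_w, df_b, df_w, f_ratio = one_way_anova(list(schools.values()))

# Assumption: 5% critical value of F with (2, 12) df, from standard tables.
F_CRITICAL_5PCT = 3.89

print(round(ss_b, 3), round(ss_w, 3), df_b, df_w, round(f_ratio, 3))
print("reject H0" if f_ratio > F_CRITICAL_5PCT else "accept H0")
```

The computed F lies far in the right tail of the F distribution, which is why the table reports a significance of .000.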

Non-Parametric Tests

In some situations, the practical data may be non-normal and/or it may not be possible to estimate the parameter(s) of the data
The tests which are used for such situations are called non-parametric tests
Since these tests do not depend on the distribution of the data or on its parameters, they are known as non-parametric (NP) tests or distribution-free tests
NP tests can be used even for nominal data (qualitative data like greater or less, etc.) and ordinal data, like ranked data.
NP tests require less calculation, because there is no need to compute parameters.

List of Non-Parametric Tests

1. One-sample tests
One sample sign test
Chi-square one sample test
Kolmogorov-Smirnov test

2. Two related samples tests
Two samples sign test
Wilcoxon Matched-pairs signed rank test

3. Two independent samples tests
Chi-Square test for two independent samples
Mann-Whitney U test
Kolmogorov-Smirnov two sample test

4. K related samples tests
Friedman Two-way Analysis of Variance by Ranks
The Cochran Q test

5. K independent samples tests
Chi-Square test for k independent samples
The extension of the Median test
Kruskal-Wallis one-way Analysis of Variance by Ranks

Chi-square test for checking independence of two categorized data

Let us consider two factors which may or may not have influence on the observed frequencies formed with respect to combinations of different levels of the two factors
H0: Factor A and factor B are independent
H1: Factor A and factor B are not independent
Objective: To check whether the null hypothesis is to be accepted based on the value of the chi-square by placing the significance level α at the right tail of the chi-square distribution.
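
The test of independence can be sketched in Python as follows. The 2 x 2 table of observed frequencies is hypothetical and chosen only for illustration, the function name is an assumption, and the critical value 3.841 (5% level, 1 degree of freedom) comes from standard chi-square tables.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under H0 (factor A and factor B independent).
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df


# Hypothetical observed frequencies: rows = levels of factor A,
# columns = levels of factor B.
observed = [[30, 20],
            [20, 30]]

chi_sq, df = chi_square_independence(observed)

# Assumption: 5% critical value of chi-square for df = 1, from standard tables.
CHI_SQ_CRITICAL_5PCT_DF1 = 3.841

print(chi_sq, df)
print("reject H0" if chi_sq > CHI_SQ_CRITICAL_5PCT_DF1 else "accept H0")
```

For tables with more rows or columns, the same function applies; only the critical value has to be looked up for the new degrees of freedom.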

Chi-square test for goodness of fit

The aim is to fit the data to the nearest distribution which represents the data more meaningfully for future analysis.
Such fitting of data to the nearest distribution is done using the goodness of fit test
H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution
Objective: To check whether the null hypothesis is to be accepted based on the value of the chi-square by placing the significance level α at the right tail of the chi-square distribution.
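
A minimal Python sketch of the goodness of fit statistic is given below. The observed counts (60 hypothetical rolls of a die tested against a uniform distribution) are invented for illustration, the function name is an assumption, and the critical value 11.070 (5% level, 5 degrees of freedom) is taken from standard chi-square tables.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Return (chi-square statistic, degrees of freedom).

    Degrees of freedom are taken as (number of classes - 1), which assumes
    that no parameter of the assumed distribution was estimated from the data.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_sq, len(observed) - 1


# Hypothetical data: 60 rolls of a die; H0 says each face is equally likely.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)

chi_sq, df = chi_square_goodness_of_fit(observed, expected)

# Assumption: 5% critical value of chi-square for df = 5, from standard tables.
CHI_SQ_CRITICAL_5PCT_DF5 = 11.070

print(chi_sq, df)
print("reject H0" if chi_sq > CHI_SQ_CRITICAL_5PCT_DF5 else "accept H0")
```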

Kolmogorov-Smirnov test

It is similar to the chi-square test in that it tests the goodness of fit of a given set of data to an assumed distribution
This test is more powerful for small samples whereas the chi-square test is suited for large samples
H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution
The K-S test is a one-tailed test. Hence if the calculated value of D is more than the theoretical value of D for a given significance level, then reject H0; otherwise accept H0
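
The calculated value of D is the largest vertical gap between the empirical distribution of the sample and the assumed distribution. The sketch below computes it in Python for a hypothetical sample tested against a uniform distribution on [0, 1]; the function names and the data are assumptions made for illustration, and the theoretical value of D still has to be read from a K-S table for the given sample size and significance level.

```python
def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the assumed `cdf`."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (the assumed distribution here)."""
    return min(1.0, max(0.0, x))


# Hypothetical sample of five observations.
sample = [0.1, 0.2, 0.4, 0.7, 0.9]
d_calculated = ks_statistic(sample, uniform_cdf)
print(round(d_calculated, 3))
```

Any other assumed distribution can be tested by passing its cumulative distribution function in place of `uniform_cdf`.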

Two samples sign test

The two samples sign test is applied to a situation where two samples are taken from two populations which have continuous symmetrical distributions and are known to be non-normal
Modified sample value, Zi = + if Xi > Yi
= - if Xi < Yi
= 0 if Xi = Yi
Classified into four categories
1. One-tailed two-sample sign tests with binomial distribution
2. Two-tailed two-sample sign tests with binomial distribution
3. One-tailed two-sample sign tests with normal distribution
4. Two-tailed two-sample sign tests with normal distribution
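
A minimal Python sketch of the two-tailed sign test with the binomial distribution follows. The paired observations are hypothetical, the function name is illustrative, and ties (Zi = 0) are dropped before counting, which is the usual convention but is not stated on the slide.

```python
from math import comb


def sign_test_two_tailed(x, y):
    """Return (number of + signs, number of - signs, exact two-tailed P value)."""
    plus = sum(1 for a, b in zip(x, y) if a > b)
    minus = sum(1 for a, b in zip(x, y) if a < b)
    n = plus + minus          # ties (Xi = Yi) are left out
    k = min(plus, minus)
    # Under H0 each sign is + or - with probability 1/2, so the count
    # follows a binomial distribution with parameters n and 0.5.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * lower_tail)


# Hypothetical paired observations.
x = [12, 15, 11, 18, 14, 16, 13, 17, 19, 10]
y = [10, 13, 12, 15, 14, 12, 11, 15, 16, 9]

plus, minus, p_value = sign_test_two_tailed(x, y)
print(plus, minus, p_value)
print("reject H0" if p_value <= 0.05 else "accept H0")
```

For large samples the binomial count is usually replaced by its normal approximation, which corresponds to categories 3 and 4 in the list.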

The Wilcoxon Matched-pairs signed-ranks test

The Wilcoxon test is a most useful test for behavioral scientists
Let di = the difference score for any matched pair
Rank all the di without regard to sign
T = Sum of the ranks with the less frequent sign
Compute Z = [T - E(T)]/SD(T)
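
The steps can be sketched in Python as below. The slide does not spell out E(T) and SD(T); the sketch uses the standard large-sample values E(T) = n(n + 1)/4 and SD(T) = sqrt(n(n + 1)(2n + 1)/24), where n is the number of non-zero differences. The before/after scores are hypothetical, the function names are illustrative, tied differences receive average ranks, and no tie correction is applied to SD(T).

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (T, Z) for matched pairs; zero differences are dropped."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    ranks = average_ranks([abs(v) for v in d])   # rank |di| without regard to sign
    pos_sum = sum(r for r, v in zip(ranks, d) if v > 0)
    neg_sum = sum(r for r, v in zip(ranks, d) if v < 0)
    n_pos = sum(1 for v in d if v > 0)
    n_neg = n - n_pos
    # T = sum of the ranks with the less frequent sign (smaller sum if counts tie).
    if n_pos < n_neg:
        t_stat = pos_sum
    elif n_neg < n_pos:
        t_stat = neg_sum
    else:
        t_stat = min(pos_sum, neg_sum)
    expected_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return t_stat, (t_stat - expected_t) / sd_t


# Hypothetical matched scores.
before = [72, 65, 80, 58, 77, 69, 74, 61]
after = [78, 70, 79, 66, 85, 72, 80, 63]

t_stat, z = wilcoxon_signed_rank(before, after)
print(t_stat, round(z, 3))
```

With only eight pairs the normal approximation is rough; for small samples the exact critical values of T from a Wilcoxon table are normally preferred.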

Mann-Whitney U Test

The Mann-Whitney U test is an alternative to the two sample t-test
This test is based on the ranks of the observations of the two samples put together
An alternative name for this test is the Rank-Sum Test
Let R1 = The sum of the ranks of the observations of the first sample
Let R2 = The sum of the ranks of the observations of the second sample
Objective: To check whether the two samples are drawn from different populations having the same distribution
Compute Z = [U - E(U)]/SD(U)
where U = n1n2 + [n1(n1 + 1)/2] - R1
or U = n1n2 + [n2(n2 + 1)/2] - R2
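
The computation can be sketched in Python as follows, reusing the two samples from the t-test example purely for illustration. The slide does not give E(U) and SD(U); the sketch uses the standard values E(U) = n1n2/2 and SD(U) = sqrt(n1n2(n1 + n2 + 1)/12). The function names are assumptions, tied observations receive average ranks, and no tie correction is applied to SD(U).

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def mann_whitney_u(sample_1, sample_2):
    """Return (R1, R2, U from R1, U from R2, Z based on the first U)."""
    n1, n2 = len(sample_1), len(sample_2)
    ranks = average_ranks(list(sample_1) + list(sample_2))  # samples put together
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    expected_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return r1, r2, u1, u2, (u1 - expected_u) / sd_u


sample_1 = [110, 120, 123, 112, 125]
sample_2 = [120, 128, 133, 138, 129]

r1, r2, u1, u2, z = mann_whitney_u(sample_1, sample_2)
print(r1, r2, u1, u2, round(z, 3))
```

The two versions of U always add up to n1n2, so using the second one only changes the sign of Z.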

Correlation and Regression Analysis

The Chi-square test measures the association between two or more variables. This test is applicable only when the data are on a nominal scale.
Correlation and Regression analysis is used for measuring the relationship between two variables measured on an interval or ratio scale.

Correlation Analysis

Correlation analysis is a statistical technique used to measure the magnitude of linear relationship between two variables.
Correlation analysis cannot be used in isolation to describe the relationship between variables.
It can be used along with regression analysis to determine the nature of the relationship between two variables.
Thus correlation analysis can be used for further analysis
Two prominent types of correlation coefficient are
Pearson Product Moment correlation coefficient
Spearman's Rank correlation coefficient
Testing the significance of correlation coefficient
Type I H0: ρ = 0 and H1: ρ ≠ 0
Type II H0: ρ = r and H1: ρ ≠ r
Type III H0: r1 = r2 and H1: r1 ≠ r2
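
As an illustration of the Pearson coefficient and the Type I test (no correlation in the population), the sketch below computes r and the usual test statistic t = r sqrt(n - 2) / sqrt(1 - r^2), which has n - 2 degrees of freedom. The formula for t is the standard one and is not given on the slide; the five data pairs are hypothetical, the function names are illustrative, and the critical value 3.182 (two-tailed, 5% level, 3 degrees of freedom) is from standard t tables.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_zero_correlation(r, n):
    """Test statistic for the Type I hypothesis (population correlation is zero)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical paired scores.
x = [2, 4, 6, 8, 10]
y = [1, 3, 7, 9, 10]

r = pearson_r(x, y)
t_value = t_for_zero_correlation(r, len(x))

# Assumption: two-tailed 5% critical value of t for df = 3, from standard tables.
T_CRITICAL_5PCT_DF3 = 3.182

print(round(r, 4), round(t_value, 4))
print("reject H0" if abs(t_value) > T_CRITICAL_5PCT_DF3 else "accept H0")
```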

Correlation Analysis
Example:

Marks in Mathematics: 89, 58, 78, 79, 86, 58
Marks in Statistics: 75, 79, 59, 78, 84, 65

Correlations
                                     MATHS     STATISTICS
MATHS         Pearson Correlation    1         .968**
              Sig. (2-tailed)        .         .002
              N                      6         6
STATISTICS    Pearson Correlation    .968**    1
              Sig. (2-tailed)        .002      .
              N                      6         6

**. Correlation is significant at the 0.01 level (2-tailed).
Regression Analysis

Regression analysis is used to predict the nature and closeness of relationships between two or more variables
It evaluates the causal effect of one variable on another variable
It is used to predict the variability in the dependent (or criterion) variable based on the information about one or more independent (or predictor) variables.
Two variables: Simple or Linear Regression Analysis
More than two variables: Multiple Regression Analysis

Linear Regression Analysis

Linear regression: Y = α + βX
Where Y: Dependent variable
X: Independent variable
α and β: Two constants called regression coefficients
β: Slope coefficient, i.e. the change in the value of Y with the corresponding change in one unit of X
α: Y intercept when X = 0

R²: The strength of association, i.e. the degree to which the variation in Y can be explained by X.
If R² = 0.10 then only 10% of the total variation in Y can be explained by the variation in the X variable

Test of significance of Regression Equation

Linear regression: Y = α + βX
The F test is used to test the significance of the linear relationship between two variables Y and X
H0: β = 0 (There is no linear relationship between Y and X)
H1: β ≠ 0 (There is a linear relationship between Y and X)
Objective: To check whether the estimates from the regression model represent the real world data.

Example for Regression Analysis

School Climate: 25, 34, 55, 45, 56, 49, 65
Academic Achievement: 58, 62, 80, 75, 84, 72, 89

Variables Entered/Removed (b)
Model    Variables Entered     Variables Removed    Method
1        School Climate (a)    .                    Enter
a. All requested variables entered.
b. Dependent Variable: Academic Achievement

Model Summary
Model    R           R Square    Adjusted R Square    Std. Error of the Estimate
1        .978 (a)    .957        .949                 2.55330
a. Predictors: (Constant), School Climate

ANOVA (b)
Model 1       Sum of Squares    df    Mean Square    F          Sig.
Regression    732.832           1     732.832        112.409    .000 (a)
Residual      32.597            5     6.519
Total         765.429           6
a. Predictors: (Constant), School Climate
b. Dependent Variable: Academic Achievement

Coefficients (a)
                  Unstandardized Coefficients    Standardized Coefficients
Model 1           B         Std. Error           Beta                         t         Sig.
(Constant)        36.436    3.698                                             9.853     .000
School Climate    .805      .076                 .978                         10.602    .000
a. Dependent Variable: Academic Achievement
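
The main figures in the regression output can be reproduced from the seven data pairs with ordinary least squares. The sketch below uses only built-in Python; the function name `simple_linear_regression` and the dictionary keys are illustrative choices.

```python
climate = [25, 34, 55, 45, 56, 49, 65]        # School Climate (X)
achievement = [58, 62, 80, 75, 84, 72, 89]    # Academic Achievement (Y)


def simple_linear_regression(x, y):
    """Least-squares fit of Y = intercept + slope * X with R-squared and F."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_regression = slope * sxy            # variation in Y explained by X
    ss_residual = syy - ss_regression      # variation left unexplained
    r_squared = ss_regression / syy
    f_ratio = ss_regression / (ss_residual / (n - 2))
    return {
        "intercept": intercept,
        "slope": slope,
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "r_squared": r_squared,
        "f_ratio": f_ratio,
    }


fit = simple_linear_regression(climate, achievement)
for name, value in fit.items():
    print(name, round(value, 3))
```

The intercept and slope correspond to the B column of the Coefficients table, and the sums of squares and F ratio to the ANOVA table.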

Multivariate Analysis

Multivariate analysis is defined as all statistical techniques which simultaneously analyse more than two variables on a sample of observations.

Multivariate analysis helps the researcher in evaluating the relationship between multiple (more than two) variables simultaneously.

Multivariate techniques are broadly classified into two categories:

Dependency Techniques
Independency Techniques

A Classification of Multivariate Methods

All multivariate methods
Are some of the variables dependent on others?
Yes: Dependence methods
No: Independence methods

Multivariate Analysis: Classification of Dependence Methods

Dependence methods
How many variables are dependent?

One dependent variable
Metric (the scales are ratio or interval): Multiple regression
Nonmetric (the scales are nominal or ordinal): Multiple discriminant analysis

Several dependent variables
Metric (the scales are ratio or interval): Multivariate analysis of variance (MANOVA)
Nonmetric (the scales are nominal or ordinal): Conjoint analysis

Multiple independent and dependent variables
Canonical analysis

Multivariate Analysis: Classification of Independence Methods

Independence methods
Are inputs metric?

Metric (the scales are ratio or interval): Factor analysis, Cluster analysis, Metric multidimensional scaling
Nonmetric (the scales are nominal or ordinal): Nonmetric multidimensional scaling

Discriminant Analysis

Discriminant analysis aims at studying the effect of two or more predictor variables (independent variables) on a certain evaluation criterion
The evaluation criterion may be two or more groups
Two groups such as good or bad, like or dislike, successful or unsuccessful, above expected level or below expected level
Three groups such as good, normal or poor
It checks whether the predictor variables discriminate among the groups
It identifies the predictor variable which is more important when compared to the other predictor variable(s)
Such analysis is called discriminant analysis

Discriminant Analysis

Designing a discriminant function: Y = aX1 + bX2
where Y is a linear composite representing the discriminant function, and X1 and X2 are the predictor variables (independent variables) which have an effect on the evaluation criterion of the problem of interest.
Finding the discriminant ratio (K) and determining the variables which account for intergroup difference in terms of group means
This ratio is the maximum possible ratio between the variability between groups and the variability within groups
Finding the critical value which can be used to include a new data set (i.e. a new combination of instances for the predictor variables) into its appropriate group
Testing
H0: The group means are equal in importance
H1: The group means are not equal in importance
using F test at a given significance level

Factor Analysis

Factor analysis can be defined as a set of methods in which the observable or manifest responses of individuals on a set of variables are represented as functions of a small number of latent variables called factors.
Factor analysis helps the researcher to reduce the number of variables to be analyzed, thereby making the analysis easier.
For example, consider a market researcher at a credit card company who wants to evaluate the credit card usage and behaviour of customers, using various variables. The variables include age, gender, marital status, income level, education, employment status, credit history and family background.
Analysis based on a wide range of variables can be tedious and time consuming.
Using Factor Analysis, the researcher can reduce the large number of variables into a few dimensions called factors that summarize the available data.
It aims at grouping the original input variables into the factors which underlie them.
For example, age, gender and marital status can be combined under a factor called demographic characteristics. The income level, education and employment status can be combined under a factor called socio-economic status. The credit history and family background can be combined under a factor called background status.

Benefits of Factor Analysis

To identify the hidden dimensions or constructs which may not be apparent from direct analysis
To identify relationships between variables
It helps in data reduction
It helps the researcher to cluster the product and population being analyzed.

Terminology in Factor Analysis

Factor: A factor is an underlying construct or dimension that represents a set of observed variables. In the credit card company example, the demographic characteristics, socio-economic status and background status each represent a set of variables.
Factor Loadings: Factor loadings help in interpreting and labeling the factors. They measure how closely the variables in the factor are associated. A factor loading is also called a factor-variable correlation: factor loadings are correlation coefficients between the variables and the factors.
Eigenvalues: Eigenvalues measure the variance in all the variables which is accounted for by the factor. An eigenvalue is calculated by adding the squares of the factor loadings of all the variables on the factor. It aids in explaining the importance of the factor with respect to the variables. Generally factors with eigenvalues more than 1.0 are considered stable. The factors that have low eigenvalues (<1.0) may not explain the variance in the variables related to that factor.

Terminology in Factor Analysis

Communalities: Communalities, denoted by h², measure the percentage of variance in each variable explained by the factors extracted. A communality ranges from 0 to 1. A high communality value indicates that most of the variance in the variable is explained by the factors extracted from the factor analysis.
Total Variance explained: The total variance explained is the percentage of total variance of the variables explained. This is calculated by adding the communality values of all the variables and dividing the sum by the number of variables.
Factor Variance explained: The factor variance explained is the percentage of total variance of the variables explained by a factor. This is calculated by adding the squared factor loadings of all the variables on that factor and dividing the sum by the number of variables.
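
The definitions of eigenvalue, communality and variance explained can be illustrated with a small Python sketch. The loading matrix below is hypothetical (four variables on two factors, values invented for illustration), and the variable names are assumptions; only the arithmetic follows the definitions above.

```python
# Hypothetical factor loadings: one row per variable, one column per factor.
loadings = {
    "age":       [0.8, 0.1],
    "gender":    [0.7, 0.2],
    "income":    [0.2, 0.9],
    "education": [0.1, 0.8],
}

n_variables = len(loadings)
n_factors = len(next(iter(loadings.values())))

# Communality of a variable: sum of its squared loadings across the factors.
communalities = {
    name: sum(value ** 2 for value in row) for name, row in loadings.items()
}

# Eigenvalue of a factor: sum of the squared loadings of all variables on it.
eigenvalues = [
    sum(row[j] ** 2 for row in loadings.values()) for j in range(n_factors)
]

# Share of total variance explained by each factor, and by all factors together.
factor_variance_explained = [value / n_variables for value in eigenvalues]
total_variance_explained = sum(communalities.values()) / n_variables

print({name: round(h2, 2) for name, h2 in communalities.items()})
print([round(value, 2) for value in eigenvalues])
print([round(share, 3) for share in factor_variance_explained])
print(round(total_variance_explained, 2))
```

Because both totals add up the same squared loadings, the factor shares always sum to the total variance explained.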

Procedure followed for Factor Analysis

Define the problem
Construct the correlation matrix that measures the relationships among the variables.
Select an appropriate factor analysis method
Determine the number of factors
Rotation of factors
Interpret the factors
Determine the factor scores

Cluster Analysis

Cluster analysis can be defined as a set of techniques used to classify the objects into relatively homogeneous groups called clusters
It involves identifying similar objects and grouping them under homogeneous groups
A cluster is a group of objects that display high correlation with each other and low correlation with objects in other clusters

Procedure in Cluster Analysis

1. Defining the problem: First define the problem and decide upon the variables based on which the objects are clustered.
2. Selection of similarity or distance measures: The similarity measure tries to examine the proximity between the objects. Closer or similar objects are grouped together and the farther objects are ignored. There are three major methods to measure the similarity between objects:
   1. Euclidean distance measures
   2. Correlation coefficient
   3. Association coefficients
3. Selection of clustering approach: Select the appropriate clustering approach. There are two types of clustering approaches:
   1. Hierarchical clustering approach
   2. Non-hierarchical clustering approach
   The hierarchical clustering approach consists of either a top-down approach or a bottom-up approach. Prominent hierarchical clustering methods are: Single linkage, Complete linkage, Average linkage, Ward's method and Centroid method.
   Non-hierarchical clustering approach: A cluster center is first determined and all the objects that are within the specified distance from the cluster center are included in the cluster
4. Deciding on the number of clusters to be selected
5. Interpreting the clusters
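
The procedure can be illustrated with a minimal Python sketch of bottom-up hierarchical clustering, using Euclidean distance as the similarity measure and single linkage as the clustering method. The six two-dimensional points and the choice of two clusters are hypothetical, and the function names are illustrative; this is one simple instance of the approach, not a full implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    The distance between two clusters is the smallest distance between any
    pair of their members (single linkage). Returns lists of point indices.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and the number of points")
    clusters = [[i] for i in range(len(points))]  # start with one object per cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical objects measured on two variables.
points = [(1, 1), (1.5, 1.2), (1.2, 0.8), (8, 8), (8.5, 7.9), (7.8, 8.3)]
clusters = single_linkage(points, n_clusters=2)
print(clusters)
```

Replacing `min` with `max` inside the loop would give complete linkage instead of single linkage.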