Análise de Dados |
Experimental Statistics and
Data Analysis
MEC | ISEP
Sandra Ramos | <sfr@isep.ipp.pt> | Office: H417
06/11/2015
Contents
ESEAD
06/11/2015
Evaluation Method
Teaching methodology
Bibliography
Software
Some definitions
Types of variables:
Qualitative variables: also known as categorical variables, these
describe categories rather than numeric quantities. They are
measured on a nominal scale (no natural ordering) or an ordinal
scale (ordered categories). For instance, hair
color (Black, Brown, Gray, Red, Yellow) is a qualitative
variable, as is name (Adam, Becky, Christina, Dave...).
Qualitative variables can be coded to appear numeric but
their numbers are meaningless, as in male=1, female=2.
Quantitative variables: variables that are not qualitative are
known as quantitative variables. They are measured on interval
or ratio scales. A country's population, a person's shoe size, or a
car's speed are all quantitative variables.
Qualitative variables:
Nominal variables are variables that have two or more
categories, but which do not have an intrinsic order.
Dichotomous variables are nominal variables which have
only two categories or levels. For example, if we were looking
at gender, we would most probably categorize somebody as
either "male" or "female".
Ordinal variables are variables that have two or more
categories, just like nominal variables, except that the categories
can also be ordered or ranked.
Quantitative variables:
Continuous variables take values in an interval or in a
collection of intervals. For example: age, weight, height.
Discrete variables take values in a finite or countably
infinite set.
Types of variables:
An independent variable, sometimes called an
experimental or predictor variable, is a variable that is
being manipulated in an experiment in order to observe the
effect on a dependent variable, sometimes called an
outcome variable.
Imagine that a tutor asks 100 students to complete a maths
test. The tutor wants to know why some students perform
better than others. While the tutor does not know the answer
to this, she thinks that it might be because of two reasons:
(1) some students spend more time revising for their test; and
(2) some students are naturally more intelligent than others.
As such, the tutor decides to investigate the effect of revision
time and intelligence on the test performance of the 100
students.
Dependent Variable: Test Mark (measured from 0 to 100).
Independent Variables: Revision time (measured in hours)
and Intelligence (measured using IQ score).
Sandra Ramos, DMA/LEMA - ISEP | CEAUL - FCUL
Descriptive Statistics versus Inferential Statistics
Descriptive statistics is the term given to the analysis of data
that helps describe, show or summarize data. Descriptive
statistics do not, however, allow us to make conclusions
beyond the data we have analysed or reach conclusions
regarding any hypotheses we might have made. They are
simply a way to describe our data.
Inferential statistics are techniques that allow us to use
samples to make generalizations about the populations from
which the samples were drawn. It is, therefore, important
that the sample accurately represents the population. The
principal methods of inferential statistics are (1) the
estimation of parameter(s) and (2) testing of statistical
hypotheses.
Pie graph
Make sure that what you are selecting is actually an interval- or ratio-level variable. Because all of the data are entered as
numbers, SPSS will actually calculate descriptive statistics on any
of these variables; but these results are only meaningful for
interval- or ratio-level variables (in this case, just the score
variable).
This shows the specific results for each variable that we entered
into the analysis. It is possible to get a table that gives us basic
descriptive statistics for many variables simultaneously, just by
moving them all at once from the left-hand to the right-hand list
in the main dialog box.
4. See the definition in the next slide.
5. We will see below how to interpret these charts; for now we
see only how to obtain them.
Number of classes for $n$ observations:
$\sqrt{n}$.
$1 + 3.3\log_{10}(n)$ (Sturges, 1926).
$1 + 2.3\log_{10}(n)$ (Larson, 1975).
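Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Median
$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} \ \text{for } n \text{ even and } \ \tilde{x} = x_{((n+1)/2)} \ \text{for } n \text{ odd,}$$
where $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ are the ordered observations.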
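As a rough illustration outside SPSS, the mean and the median can be computed directly from their definitions. This is only a sketch in Python: the function names and the `scores` data are hypothetical, chosen for the example.

```python
def mean(xs):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)


def median(xs):
    """Median computed from the ordered sample x_(1) <= ... <= x_(n)."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 0:
        # n even: average of x_(n/2) and x_(n/2 + 1) (1-based positions)
        return (s[n // 2 - 1] + s[n // 2]) / 2
    # n odd: the middle observation x_((n+1)/2)
    return s[(n + 1) // 2 - 1]


scores = [12, 15, 9, 18, 14, 11]  # hypothetical data
print(mean(scores), median(scores))
```

Note that the positions in the formulas are 1-based, while Python lists are 0-based, hence the `- 1` in the indices.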
Measures of location
Mode
The mode (Mo) is the value that appears most often in a set of
data. The mode is not necessarily unique, since the
probability mass function or probability density function may
take the same maximum value at several points. When a
data distribution exhibits several relative maxima of almost
equal value, we say that it is a multimodal distribution.
Quantiles
The quantile of order $\alpha$ ($0 < \alpha < 1$), $x_\alpha$, of a dataset is the
value of the data below which lie $100\alpha\%$ of the cases. The
median is therefore the 50% quantile, or $x_{0.5}$. Often used
quantiles are:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$
$$s = +\sqrt{s^2}$$
Coefficient of variation
The coefficient of variation, $CV$, of a sample is defined as:
$$CV = \frac{s}{|\bar{x}|} \times 100\%$$
CV has no units.
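The dispersion measures can be checked numerically with a few lines of Python. The sketch below assumes the sample variance uses the divisor n - 1, and it computes the variance both from the deviations and from the shortcut form with the sum of squares. The function names and the `data` values are hypothetical.

```python
import math


def sample_variance(xs):
    """s^2 from the squared deviations about the mean, divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Same s^2 written as (sum of x_i^2 - n * xbar^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)


def coefficient_of_variation(xs):
    """CV = s / |xbar| * 100, expressed as a percentage."""
    xbar = sum(xs) / len(xs)
    s = math.sqrt(sample_variance(xs))
    return s / abs(xbar) * 100


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(sample_variance(data), sample_variance_shortcut(data))
print(coefficient_of_variation(data))
```

Both variance functions should agree up to rounding error; the deviation form is usually the numerically safer one when the values are large.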
$$b_1 = \frac{n^2\, m_3}{(n-1)(n-2)\, s^3}.$$
Kurtosis
The kurtosis, $b_2$, of a sample is defined as:
$$b_2 = \frac{n^2 (n+1)\, m_4}{(n-1)(n-2)(n-3)\, s^4} - 3\,\frac{(n-1)^2}{(n-2)(n-3)},$$
where $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$ is the centred moment of order $k$.
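A small Python sketch can make the shape measures concrete. It assumes the bias-corrected forms: skewness $n^2 m_3 / ((n-1)(n-2)s^3)$ and kurtosis $n^2(n+1)m_4 / ((n-1)(n-2)(n-3)s^4) - 3(n-1)^2/((n-2)(n-3))$, with $s$ the sample standard deviation. The function names are hypothetical.

```python
import math


def central_moment(xs, k):
    """Centred moment of order k: (1/n) * sum of (x_i - xbar)^k."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** k for x in xs) / n


def sample_sd(xs):
    """Sample standard deviation with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


def skewness(xs):
    """Bias-corrected sample skewness (requires n >= 3)."""
    n = len(xs)
    s = sample_sd(xs)
    return n ** 2 * central_moment(xs, 3) / ((n - 1) * (n - 2) * s ** 3)


def kurtosis(xs):
    """Bias-corrected sample excess kurtosis (requires n >= 4)."""
    n = len(xs)
    s = sample_sd(xs)
    first = n ** 2 * (n + 1) * central_moment(xs, 4) / (
        (n - 1) * (n - 2) * (n - 3) * s ** 4
    )
    return first - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


symmetric = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetric data
print(skewness(symmetric), kurtosis(symmetric))
```

For symmetric data the skewness is zero, and a flat, evenly spread sample such as this one gives a negative kurtosis.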
Statistical Inference
Summary
"Special" continuous distributions
Sampling and sampling distributions
Estimation - point estimation and interval estimation
Hypothesis tests
1. Parametric hypothesis tests
Verification of the assumptions of the parametric tests
t-test for independent samples
t-test for paired samples
Analysis of variance (ANOVA)
2. Nonparametric hypothesis tests
Wilcoxon test
Wilcoxon-Mann-Whitney test
Kruskal-Wallis test
Chi-square test
Fisher test
McNemar test
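$$f_X(x) = \frac{1}{2^{n/2}\,\Gamma\!\left(\frac{n}{2}\right)}\, x^{\frac{n}{2}-1} e^{-\frac{x}{2}}, \quad x > 0$$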
Chi-square distribution
If $Z \sim N(0, 1)$, then $X = Z^2$ is chi-square with 1 degree of
freedom.
If $X_i \sim \chi^2(1)$ ($X_i$ independent r.v.'s), $i = 1, 2, ..., n$, then
$X = \sum_{i=1}^{n} X_i \sim \chi^2(n)$.
$P(X \le a) = \int_{0}^{a} f_X(x)\,dx$ (In SPSS
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{0}^{x_\alpha} f_X(x)\,dx = \alpha$ (In
SPSS
Transform > Compute > Function group > Inverse DF ... )
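For readers without SPSS at hand, the chi-square CDF and its inverse can be approximated with the Python standard library alone. This is a sketch under stated assumptions, not what SPSS does internally: the CDF is evaluated with the series for the regularized lower incomplete gamma function, and the quantile is found by bisection. The names `chi2_cdf` and `chi2_quantile` are hypothetical.

```python
import math


def chi2_cdf(a, n, terms=500):
    """P(X <= a) for X ~ chi-square(n).

    Uses P(s, x) = x^s e^(-x) * sum_k x^k / Gamma(s + k + 1) with
    s = n/2 and x = a/2. The fixed number of terms is adequate for
    moderate values of a (roughly a < 200).
    """
    if a <= 0:
        return 0.0
    s, x = n / 2.0, a / 2.0
    total = sum(
        math.exp((s + k) * math.log(x) - x - math.lgamma(s + k + 1))
        for k in range(terms)
    )
    return min(total, 1.0)


def chi2_quantile(alpha, n, tol=1e-10):
    """Quantile of order alpha (0 < alpha < 1), found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, n) < alpha:  # widen the bracket until it contains the quantile
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi2_cdf(mid, n) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Upper-tail probabilities P(X > 5) for several degrees of freedom
for df in (1, 3, 5, 10):
    print(df, 1 - chi2_cdf(5, df))
```

Two easy sanity checks: with 2 degrees of freedom the CDF is $1 - e^{-a/2}$, and with 1 degree of freedom it equals $2\Phi(\sqrt{a}) - 1$, which follows from $X = Z^2$.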
Student's t-distribution
If $Z \sim N(0, 1)$ and $X \sim \chi^2(n)$, then, for $Z$ and $X$ independent
r.v.'s, $Y = \frac{Z}{\sqrt{X/n}} \sim t(n)$.
If $T \sim t(n)$, $P(T \le a) = \int_{-\infty}^{a} f_T(t)\,dt$
(In SPSS
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $t_\alpha$: $P(T \le t_\alpha) = \int_{-\infty}^{t_\alpha} f_T(t)\,dt = \alpha$
(In SPSS
Transform > Compute > Function group > Inverse DF ... )
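$$f_X(x) = \frac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{x^{(1/2)(n_1-2)}}{\left(1+\frac{n_1}{n_2}x\right)^{(1/2)(n_1+n_2)}}, \quad 0 < x < \infty$$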
F-distribution
If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ are two independent r.v.'s, then
$Y = \frac{X_1/n_1}{X_2/n_2} \sim F(n_1, n_2)$.
If $X \sim F(n_1, n_2)$, $P(X \le a) = \int_{0}^{a} f_X(x)\,dx$.
(In SPSS
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{0}^{x_\alpha} f_X(x)\,dx = \alpha$
(In SPSS
Transform > Compute > Function group > Inverse DF ... )
Exercises
Let $X \sim N(0, 1)$. Find the values of $P(X < 1.5)$ and the
2.5% percentile of $X$.
Let $X \sim t(n)$. Find the values of $P(X < 1.5)$ and the 2.5%
percentile of $X$ for $n = 7$, $n = 30$ and $n = 100$.
Let $X \sim \chi^2(n)$. Find the value of $P(X > 5)$ for
$n = 1, 3, 5, 10$.
Let $X \sim F(2, 2)$. Find the value of $P(X > 5)$.
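The standard normal exercise can be cross-checked outside SPSS with Python's standard library, where `statistics.NormalDist` provides both the CDF and its inverse. This is only a sketch for checking answers; the t, chi-square and F exercises need other tools (for instance the SPSS CDF and Inverse DF function groups mentioned earlier).

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal N(0, 1)

p = Z.cdf(1.5)          # P(X < 1.5), about 0.9332
q = Z.inv_cdf(0.025)    # 2.5% percentile, about -1.96

print(round(p, 4), round(q, 4))
```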
Sampling
Sampling is the process of obtaining samples from
populations.
It is, therefore, important that the sample accurately
represents the population.
There are many methods of sampling when doing research.
"Simple" random sampling is the ideal (use it in simple
experiments that require a single sample to be taken from a
given population; the people in the sampling frame must all be
accessible and available).
In this course we will assume that all the samples used are
random.
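Parameter | (estimator) | (estimate)
$\mu$ - mean | $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ | $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
$\sigma^2$ - variance | $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$ | $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
$p$ - proportion of successes | $\hat{P} = \frac{\sum_{i=1}^{n} X_i}{n}$ | $\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where, for the proportion, $X_i \sim$ Bernoulli($p$), that is, $X_i = 1$ when a success was
observed and $X_i = 0$ otherwise.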
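To see the difference between an estimator (a rule) and an estimate (the number it produces for one observed sample), the point estimates of the mean, the variance and a proportion can be computed in Python. This is a minimal sketch; the function names and the sample values are hypothetical.

```python
def mean_and_variance_estimates(xs):
    """Point estimates of the mean and the variance from one observed sample."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)
    return xbar, s2


def proportion_estimate(outcomes):
    """Point estimate of p from Bernoulli outcomes coded 1 (success) / 0 (failure)."""
    return sum(outcomes) / len(outcomes)


sample = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical measurements
successes = [1, 0, 1, 1, 0]          # hypothetical 0/1 outcomes

print(mean_and_variance_estimates(sample))
print(proportion_estimate(successes))
```

A different sample from the same population would give different estimates, which is why the estimators are treated as random variables with their own sampling distributions.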
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Sampling Distributions
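$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right), \qquad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$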
3. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ unknown and
big samples ($n > 30$ in practice)
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim N(0, 1) \quad \text{(approximately)}$$
In practice, when $\sigma^2$ is unknown but $n > 30$ it is
possible to use the normal distribution or the t-distribution
(remember: when the degrees of freedom of the
t-distribution are $> 30$ the pdfs of these two distributions
are almost the same).
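Let $X_{11}, \dots, X_{n_1 1}$ be i.i.d. r.v.'s with $E(X_{i1}) = \mu_1$ and $Var(X_{i1}) = \sigma_1^2 > 0$
(both finite) and $X_{12}, \dots, X_{n_2 2}$ be i.i.d. r.v.'s with $E(X_{i2}) = \mu_2$ and
$Var(X_{i2}) = \sigma_2^2 > 0$ (both finite).
4. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both
known:
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$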
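5. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both
unknown and unequal:
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t(v) \quad \text{where} \quad v = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{S_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{S_2^2}{n_2}\right)^2}$$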
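The degrees of freedom $v$ for the unequal-variance case are tedious to compute by hand. The Python sketch below assumes the usual Welch-Satterthwaite form, with $n_1 - 1$ and $n_2 - 1$ in the denominator terms. The function name and the input values are hypothetical.

```python
def welch_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom v for the two-sample statistic
    with unknown and unequal variances (Welch-Satterthwaite form)."""
    a = s1_sq / n1  # estimated variance of the first sample mean
    b = s2_sq / n2  # estimated variance of the second sample mean
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


# Hypothetical sample variances and sizes
print(welch_df(4.0, 10, 4.0, 10))
print(welch_df(1.0, 5, 100.0, 50))
```

The result is generally not an integer and always lies between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$; with equal sample sizes and equal sample variances it reduces to $n_1 + n_2 - 2$.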
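6. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both
unknown but equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)
In this case the following estimator of $\sigma^2$ is considered:
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
and the distribution of the difference $\bar{X}_1 - \bar{X}_2$ is, for normal
populations,
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$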
Confidence intervals
A little review...
An unknown parameter $\theta$ (e.g., $\mu$) can be estimated by a point
estimator $\hat{\theta}$ chosen in such a way that it is "close" to $\theta$ (e.g., $E(\hat{\theta}) = \theta$
and $Var(\hat{\theta})$ is small). E.g., $\hat{\mu} = \bar{X}$, with estimate $\bar{x}$.
In this section we consider interval estimation;
that is, how
to estimate $\theta$ by an interval of values $(L, U)$ that has high
probability of including $\theta$ but also has small average length.
Let $X$ be a r.v. with d.f. $F(x|\theta)$ where $\theta$ is unknown. If
$P(L(X) < \theta < U(X)) = 1 - \alpha$ then the interval $(L(X), U(X))$ is
called a confidence interval (CI) for $\theta$ with confidence
coefficient $1 - \alpha$.
To find a CI for a parameter $\theta$ it is necessary to consider a
sufficient statistic (a function that depends solely on the random sample
$X_1, \dots, X_n$). E.g., when $\theta = \mu$ the sufficient statistic is
$\bar{X} = \frac{X_1 + \dots + X_n}{n}$.
Confidence intervals for one normal mean
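Let $X_1, \dots, X_n$ be independent $N(\mu, \sigma^2)$ r.v.'s where $\mu$ is unknown
and $\sigma^2 > 0$ is known. To obtain a CI for $\mu$, consider the sufficient
statistic
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \qquad (1)$$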
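The CI for $\mu$ with confidence $(1 - \alpha) \times 100\%$ is
$$\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$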
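The interval for a normal mean with known variance can be computed in a few lines of Python. This is a sketch assuming $\sigma$ is known, using `statistics.NormalDist` from the standard library for the quantile $z_{1-\alpha/2}$. The function name and the data are hypothetical.

```python
import math
from statistics import NormalDist


def mean_ci_known_sigma(xs, sigma, confidence=0.95):
    """CI for the mean of a normal population when sigma is known."""
    n = len(xs)
    xbar = sum(xs) / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile z_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


measurements = [9.0, 11.0, 10.5, 9.5]  # hypothetical data, sigma assumed known
print(mean_ci_known_sigma(measurements, sigma=2.0))
```

The half-width shrinks with $\sqrt{n}$, so quadrupling the sample size halves the length of the interval at the same confidence level.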
Confidence intervals for one normal mean with $\sigma^2$ unknown
In this case the sufficient statistic is
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1).$$
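Confidence intervals for the difference between two
population means. Normal populations with known
variances
$$\left(\bar{X}_1 - \bar{X}_2 - z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\; \bar{X}_1 - \bar{X}_2 + z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).$$
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use
this interval.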
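The two-sample interval with known variances follows the same pattern as the one-sample case and can be sketched in Python from summary statistics. The function name and the input values below are hypothetical, and both variances are assumed known.

```python
import math
from statistics import NormalDist


def diff_ci_known_variances(xbar1, xbar2, var1, var2, n1, n2, confidence=0.95):
    """CI for mu1 - mu2 when both population variances are known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile z_{1 - alpha/2}
    half_width = z * math.sqrt(var1 / n1 + var2 / n2)
    diff = xbar1 - xbar2
    return diff - half_width, diff + half_width


# Hypothetical summary statistics for two independent samples
print(diff_ci_known_variances(5.0, 3.0, 4.0, 9.0, 16, 36))
```

If the resulting interval does not contain 0, the data are not compatible, at the chosen confidence level, with equal population means.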
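Confidence intervals for the difference between two
population means. Normal populations with unknown and
unequal variances
$$\left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}},\; \bar{X}_1 - \bar{X}_2 + t_{\alpha/2,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right)$$
where
$$v = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{S_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{S_2^2}{n_2}\right)^2}.$$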
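Confidence intervals for the difference between two
population means. Normal populations with unknown but
equal variances
$$\left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2,\,n_1+n_2-2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\; \bar{X}_1 - \bar{X}_2 + t_{\alpha/2,\,n_1+n_2-2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right),$$
where $S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$.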
Hypothesis tests
A little review...
Hypotheses: $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ (or $\theta > \theta_0$, or $\theta < \theta_0$)
The null hypothesis ($H_0$) states that no effect or no
difference exists in the data.
The alternative or research hypothesis ($H_1$) states the
researcher's belief that some difference or effect exists.
Types of tests
Bilateral: $H_1: \theta \neq \theta_0$
Right unilateral: $H_1: \theta > \theta_0$
Left unilateral: $H_1: \theta < \theta_0$
Decision: There are several rules for deciding in a hypothesis test.
We decide whether or not to reject the null hypothesis.