Esead Slides

Estatística Experimental e Análise de Dados |
Experimental Statistics and Data Analysis
MEC | ISEP
Sandra Ramos | <sfr@isep.ipp.pt> | Office: H417
06/11/2015
ESEAD: Some Important Considerations
Contents
1. Exploratory data analysis (EDA) (2 weeks)
1.1 Data presentation: graphs, charts and tables
1.2 Numerical summaries
2. Statistical inference (8 weeks)
2.1 Confidence intervals
2.2 Parametric hypothesis tests
2.3 Nonparametric hypothesis tests
3. Simple and multiple linear regression (2 weeks)
Sandra Ramos, DMA/LEMA - ISEP | CEAUL - FCUL
ESEAD
06/11/2015
Evaluation Method
1. Practical group work (CWORK), with a maximum of 3 students per group. The work is carried out during the teaching period and consists of the analysis of a data set and the writing of a report describing the methodology used, the results and the main conclusions.
2. Two individual exams (CE1 and CE2).
3. The final classification (FC) of the course is given by the formula:
FC = (CWORK + 1.5 CE1 + 1.5 CE2)/4

Teaching methodology
This course is divided into three teaching environments: the theoretical classes (T), the laboratory classes (PL) and the tutorial guidance classes (OT).
1. The T classes are of an expository nature. All concepts and results will be accompanied by illustrative examples.
2. The PL classes seek to consolidate the knowledge acquired in the T classes. The statistical software SPSS will always be used to solve the proposed problems.
3. In the OT classes students can clarify any doubts and may also work on their practical assignments.

Bibliography
1. Ross, S. (2010). Introductory Statistics. Elsevier.
2. Murteira, B., Ribeiro, C.S., Andrade e Silva, J. and Pimenta, C. (2002). Introdução à Estatística. McGraw-Hill, Lisboa.
3. Hall, A., Neves, C., Pereira, A. (2011). Grande Maratona de Estatística no SPSS. Escolar Editora.
4. Marôco, J. (2011). Análise Estatística com o SPSS Statistics. ReportNumber.
5. Marques de Sá, Joaquim P. (2007). Applied Statistics Using SPSS, STATISTICA, MATLAB and R.

Software
SPSS (Statistical Package for the Social Sciences) Statistics is a

software package used for statistical analysis. Long produced by
SPSS Inc., it was acquired by IBM in 2009. The current versions
(2015) are officially named IBM SPSS Statistics.
Some definitions
Population: a population includes all of the elements whose characteristics we intend to study.
Sample: a sample is any subset of the population.
Parameter: a measurable characteristic of a population, such as a mean or standard deviation, is called a parameter.
Statistic: a measurable characteristic of a sample is called a statistic.

Types of variables:
Qualitative variables, also known as categorical variables, are variables whose values are categories rather than quantities. They are measured on a nominal scale (no natural ordering) or on an ordinal scale (ordered categories). For instance, hair color (Black, Brown, Gray, Red, Yellow) is a qualitative variable, as is name (Adam, Becky, Christina, Dave, ...). Qualitative variables can be coded to appear numeric, but their numbers are meaningless, as in male = 1, female = 2.
Quantitative variables: variables that are not qualitative are known as quantitative variables. They are measured on interval or ratio scales. A country's population, a person's shoe size, or a car's speed are all quantitative variables.

Qualitative variables:
Nominal variables are variables that have two or more
categories, but which do not have an intrinsic order.
Dichotomous variables are nominal variables which have
only two categories or levels. For example, if we were looking
at gender, we would most probably categorize somebody as
either "male" or "female".
Ordinal variables are variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked.

Quantitative variables:
Continuous variables take values in an interval or in a collection of intervals. For example: age, weight, height.
Discrete variables take values in a finite or countably infinite set.

Types of variables:
An independent variable, sometimes called an
experimental or predictor variable, is a variable that is
being manipulated in an experiment in order to observe the
effect on a dependent variable, sometimes called an
outcome variable.
Imagine that a tutor asks 100 students to complete a maths
test. The tutor wants to know why some students perform
better than others. While the tutor does not know the answer
to this, she thinks that it might be because of two reasons:
(1) some students spend more time revising for their test; and
(2) some students are naturally more intelligent than others.
As such, the tutor decides to investigate the effect of revision
time and intelligence on the test performance of the 100
students.
Dependent Variable: Test Mark (measured from 0 to 100).
Independent Variables: Revision time (measured in hours)
and Intelligence (measured using IQ score).
Descriptive Statistics versus Inferential Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data. Descriptive statistics do not, however, allow us to draw conclusions beyond the data we have analysed or to reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.
Inferential statistics are techniques that allow us to use samples to make generalizations about the populations from which the samples were drawn. It is, therefore, important that the sample accurately represents the population. The principal methods of inferential statistics are (1) the estimation of parameter(s) and (2) testing of statistical hypotheses.

SPSS: A very, very brief introduction...
The Data Editor window shows the contents of a data set, allows the creation of new data sets and allows an existing data set to be changed.
The Variable View tab (Data Editor window) is the place where the characteristics of the variables are defined:
Name: name of the variable.
Type: type of variable (numeric, date, string, ...).
Width: variable length, that is, the number of digits that the variable has.
Decimals: number of decimal places.
Label: description of the variable.
Values: labels for the values of qualitative variables (e.g., 1 = female and 2 = male).
Missing: in this field we can indicate the coding of missing values (non-existent values). These values are excluded from statistical calculations.
Columns: the number of characters that form a column, that is, the column width.
Align: data alignment.
Measure: the measurement scale of the variable (interval/ratio, ordinal or nominal).
Role: function of the variable. Input (predictor or independent variable); Target (outcome or dependent variable); Both (both functions); ...

Viewer window (Output)
The SPSS Viewer displays the results of the analyses carried out.

Exploratory data analysis | Descriptive statistics
When first looking at a dataset, it is wise to use descriptive
statistics to get some idea of what your data look like.
The following simple dataset contains three variables: the type of psychological problem someone was treated for (problem), the type of treatment approach used (treatment), and the symptom level at the end of treatment (symptom). One of these variables (symptom) is on an interval-level scale; the other two are grouping variables (i.e., they are measured on a nominal-level scale).
The type of descriptive statistics to use depends on the measurement scale of the data. Here are some options, each of which is shown in more detail below:
For a simple description of nominal-level variables (groups), use frequencies (frequency tables, bar charts and pie charts).
For a more complex description of nominal-level variables, use crosstabs.
For a simple description of interval- or ratio-level variables (items measured on a scale), use the descriptives command to obtain numerical characteristics of the data. It is also possible to obtain some graphs: boxplots, histograms, ...
For a more complex description of interval- or ratio-level variables, use the explore command.
Note that even though problem and treatment are nominal-level variables, they have to be coded as numbers (not as text) in order to use the following procedures.
Here are the various choices! All of them are found in the Analyze menu in SPSS, under the sub-menu for Descriptive Statistics.
To obtain frequencies for a nominal-level variable, like problem in this dataset, open the Frequencies dialog box, select the variable of interest from the left-hand list, and use the arrow button to move it into the right-hand list.
Use the Charts button to see a graphical output of the frequencies of the variable of interest. For a nominal variable, select Bar charts or Pie charts. If you aren't happy with the way your data are displayed (ascending vs. descending order, etc.), try some of the options found by clicking on the Format button!
After clicking OK in the main dialog box, the output for this variable looks like this:
The frequency table shows:
the number of people in each group (the frequency);
the percent of the total in each group (the percent);
the percent in each group as a proportion of just the people with complete data (the valid percent);
the cumulative percent. For nominal variables the cumulative percent is of no interest because it cannot be interpreted. For ordinal variables the cumulative percent makes sense.
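To make the columns of the frequency table concrete, here is a minimal Python sketch (not SPSS output) that computes the count, percent, valid percent and cumulative percent by hand. The function name `frequency_table`, the use of `None` as the missing-value code and the data are assumptions made for illustration only.

```python
from collections import Counter

def frequency_table(values, missing=None):
    """Return rows of (category, count, percent, valid percent, cumulative percent)."""
    total = len(values)                      # all cases, including missing ones
    valid = [v for v in values if v != missing]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        count = counts[category]
        percent = 100.0 * count / total              # relative to all cases
        valid_percent = 100.0 * count / len(valid)   # relative to complete cases only
        cumulative += valid_percent
        rows.append((category, count, percent, valid_percent, cumulative))
    return rows

# Hypothetical problem codes; None marks a missing value.
problem = [1, 2, 2, 3, 1, 2, None, 3, 2, 1]
for row in frequency_table(problem):
    print(row)
```

The percent and the valid percent differ only when there are missing values, which is why both columns appear in the table.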
Here is the graphical output: these graphs (a bar graph and a pie graph) show the percent in each category.
[Figures: bar graph; pie graph]
The analysis of two nominal variables in combination (for example, what percentage of the patients with problem 1 were treated with treatment CBT?) can be carried out with the crosstabs command in the Analyze/Descriptive Statistics sub-menu.
The crosstabs dialog box lets us enter one variable (or more) as rows in a frequency table, and another variable as the columns in the same table. Use the Cells button to get percentages for each row and column.
After clicking OK in the main dialog box, the output looks like this:
The central area of this table shows the number of patients who have each individual combination of the levels of the two variables.
The far-right column shows the total for each problem, as a percent of all patients.
The bottom row slices the data the other way, showing the total for each type of treatment, as a percent of all patients.
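The cross table can also be sketched in a few lines of Python, which may help in reading the SPSS output. This is an illustrative sketch only: the function name `crosstab`, the codes and the treatment label "Other" are hypothetical, and the data are made up.

```python
from collections import Counter

def crosstab(rows, cols):
    """Count each (row, column) combination and return the marginal totals too."""
    cells = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    return cells, row_totals, col_totals

# Hypothetical data: problem code and treatment for six patients.
problem = [1, 1, 2, 2, 2, 3]
treatment = ["CBT", "Other", "CBT", "CBT", "Other", "Other"]

cells, row_totals, col_totals = crosstab(problem, treatment)

# Row percentage: share of the patients with problem 1 who were treated with CBT.
pct = 100.0 * cells[(1, "CBT")] / row_totals[1]
print(pct)
```

Dividing a cell count by its row total gives a row percentage; dividing by the column total gives a column percentage.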
For interval- or ratio-level variables (i.e., scale variables), use the descriptives sub-command on the Analyze/Descriptive Statistics menu.
This opens a dialog box where we can select the variables that we want descriptive statistics on.
Make sure that what you are selecting is actually an interval- or ratio-level variable. Because all of the data are entered as numbers, SPSS will calculate descriptive statistics on any of these variables, but the results are only meaningful for interval- or ratio-level variables (in this case, just the symptom variable).
As usual, select the variable(s) that we are interested in from the left-hand column, and move them to the right-hand column. Use the Options button to select the specific descriptive statistics that we are interested in. Usually, good choices include the mean, standard deviation, maximum, and minimum. If you are concerned about the impact of outliers (defined below) on your data, a measure of skewness may also be appropriate.
The output shows the specific results for each variable that we entered into the analysis. It is possible to get a table that gives us basic descriptive statistics for many variables simultaneously, just by moving them all at once from the left-hand to the right-hand list in the main dialog box.
In statistics, an outlier is an observation that is distant from the other observations.
An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
We will see later how to identify whether a given observation is an outlier.
Finally, to get more complex descriptive results for an interval- or ratio-level variable, use the Explore command in the Analyze/Descriptive Statistics sub-menu.
The following dialog box will appear:
The variable that we are interested in goes into the Dependent List.
The explore command gives us descriptive statistics for each variable, with more extensive results (median, interquartile range, etc.) than the descriptives command.
The explore command also gives graphs for each variable by groups, for example a boxplot of the variable "Symptom" by "Treatment" or by "Problem". We will see below how to interpret these charts; for now we only see how to obtain them.
In the Graphs menu, under the sub-menu Legacy Dialogs, it is possible to obtain a large set of graphs, for example a histogram.
Histogram for the variable "Symptom".
Four bins (classes) were used in the histogram.
There are several rules for choosing a "reasonable" number of bins. The number of bins is an integer $k$ close to
$\sqrt{n}$,
$1 + 3.3 \log_{10}(n)$ (Sturges, 1926), which is the same as $1 + \log_2(n)$, or
$1 + 2.3 \log_{10}(n)$ (Larson, 1975),
where $n$ is the sample size.
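As a quick illustration of how the rules compare, the sketch below evaluates them for a given sample size. It assumes the logarithms are base 10 (with that reading the Sturges rule equals the usual $1 + \log_2 n$) and rounds to the nearest integer; the function name `suggested_bins` is hypothetical.

```python
import math

def suggested_bins(n):
    """Number of histogram bins suggested by each rule for a sample of size n."""
    return {
        "sqrt": round(math.sqrt(n)),
        "sturges": round(1 + 3.3 * math.log10(n)),   # assumes log base 10
        "larson": round(1 + 2.3 * math.log10(n)),    # assumes log base 10
    }

print(suggested_bins(100))
```

The rules only give a starting point; the final number of bins is usually adjusted after looking at the histogram.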
In the last lesson we saw how to present and summarise data, namely:
frequency tables, pie charts and bar charts to present and summarise nominal variables;
cross tables (contingency tables) to present the counts corresponding to the combinations of categories of two nominal variables;
numerical summaries (mean, standard deviation, variance, ranges, ...) to summarise scale (continuous) variables;
histograms and boxplots to present the data resulting from the observation of scale (continuous) variables.

Numerical summaries/measures: computation and interpretation
Measures of location
Measures of location are used to determine where the data distribution is concentrated. The most usual measures of location are: arithmetic mean; median; mode; quantiles.
Measures of dispersion (or scale)
Measures of dispersion give an indication of how spread out a data distribution is. The most usual measures of dispersion are: variance; standard deviation; total range; interquartile range; coefficient of variation.
Measures of shape
Measures of shape give an indication of the shape of the distribution of the data. The most usual measures of shape are: skewness and kurtosis.

Let $x_1, x_2, \ldots, x_n$ be the data, and $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ the ordered data.
Arithmetic mean (or simply mean)
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
If the dataset exhibits outliers or extreme cases that are suspected to result from gross measurement errors, one can use a trimmed mean, which neglects a certain percentage of the tail cases (e.g., 5%).
Median
The median of a dataset is the value below which lie 50% of the cases. The median, $\tilde{x}$, of a sample is defined as:
$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} \text{ for } n \text{ even, and } \tilde{x} = x_{((n+1)/2)} \text{ for } n \text{ odd.}$$
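The following sketch computes the mean, a trimmed mean and the median for a small data set with one extreme value. It is only an illustration: the data are made up, and the trimmed mean here drops the given proportion of cases from each tail, which is one possible convention (the text above does not fix it).

```python
import statistics

def trimmed_mean(data, proportion=0.05):
    """Mean after dropping `proportion` of the cases from EACH tail (assumed convention)."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)       # number of cases removed per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)

def median(data):
    """Median from the ordered data, following the even/odd definition."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 0:
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # x_(n/2) and x_(n/2+1)
    return ordered[(n + 1) // 2 - 1]                          # x_((n+1)/2)

data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 95]   # hypothetical, with one extreme value
print(statistics.fmean(data), trimmed_mean(data, 0.10), median(data))
```

The extreme value pulls the mean well above the bulk of the data, while the trimmed mean and the median are barely affected.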
Mode
The mode (Mo) is the value that appears most often in a set of data. The mode is not necessarily unique, since the probability mass function or probability density function may take the same maximum value at several points. When a data distribution exhibits several relative maxima of almost equal value, we say that it is a multimodal distribution.
Quantiles
The quantile of order $\alpha$ ($0 < \alpha < 1$), $x_\alpha$, of a dataset is the value below which lie $100\alpha\%$ of the cases. The median is therefore the 50% quantile, or $x_{0.5}$. Often used quantiles are:
Quartiles, corresponding to multiples of 25% of the cases. The boxplot mentioned before presents the quartiles and the interquartile range ($IQR = x_{0.75} - x_{0.25}$). These values are often used to determine the outliers of the dataset.
Deciles, corresponding to multiples of 10% of the cases.
Percentiles, corresponding to multiples of 1% of the cases. We will often use the percentile $p = 2.5\%$ and its complement $p = 97.5\%$.

Quantiles
Quantiles have been defined in multiple ways. The most popular formula gives the quantile of order $\alpha$ as
$$x_\alpha = (1-\gamma)\,x_{(j)} + \gamma\,x_{(j+1)},$$
where
$$m = 1-\alpha, \qquad j = \lfloor n\alpha + m \rfloor, \qquad \gamma = n\alpha + m - j.$$
Note: $\lfloor x \rfloor$ (floor) is the greatest integer less than or equal to $x$.
The following formula is also frequently used:
$x_\alpha = x_{(\lfloor k \rfloor + 1)}$ if $k = n\alpha$ is not an integer, and
$x_\alpha = (x_{(k)} + x_{(k+1)})/2$ if $k = n\alpha$ is an integer.
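A minimal Python sketch of the first quantile formula ($m = 1 - \alpha$, $j = \lfloor n\alpha + m \rfloor$, $\gamma = n\alpha + m - j$) is given below. The function name `quantile` is hypothetical. As far as I can tell, the standard library's `statistics.quantiles` with `method="inclusive"` follows the same interpolation rule, so it is used here as a cross-check rather than as a statement about what SPSS does.

```python
import math
import statistics

def quantile(data, alpha):
    """Quantile of order alpha: x_alpha = (1 - g) * x_(j) + g * x_(j+1)."""
    ordered = sorted(data)
    n = len(ordered)
    m = 1 - alpha
    position = n * alpha + m
    j = math.floor(position)          # 1-based index of the lower order statistic
    g = position - j                  # interpolation weight (gamma)
    lower = ordered[j - 1]            # x_(j)
    upper = ordered[min(j, n - 1)]    # x_(j+1), guarded at the top end
    return (1 - g) * lower + g * upper

data = list(range(1, 11))
print(quantile(data, 0.25))
print(statistics.quantiles(data, n=4, method="inclusive")[0])
```

Other software may use a different definition, so small differences between programs for the same data are expected.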
Measures of scale or dispersion
Let $x_1, x_2, \ldots, x_n$ be the data.
Range
$$\text{Range} = x_{(n)} - x_{(1)}.$$
Interquartile range (IQR)
$$IQR = x_{0.75} - x_{0.25} = Q_3 - Q_1.$$
Variance
The variance, $s^2$, of a sample is defined as:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$
Standard deviation
The standard deviation, $s$, of a sample is defined as:
$$s = +\sqrt{s^2}$$
Coefficient of variation
The coefficient of variation, $CV$, of a sample is defined as:
$$CV = \frac{s}{|\bar{x}|} \times 100\%$$
CV has no units.
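The dispersion measures can be checked numerically with Python's standard library. The sketch below is illustrative only: the function name `dispersion_summary` and the data are made up, and the quartiles use the interpolating ("inclusive") definition, which is an assumption since other definitions exist.

```python
import statistics

def dispersion_summary(data):
    """Range, IQR, variance (two equivalent forms), standard deviation and CV."""
    ordered = sorted(data)
    n = len(ordered)
    mean = statistics.fmean(ordered)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    s2 = statistics.variance(ordered)                            # divides by n - 1
    s2_shortcut = (sum(x * x for x in ordered) - n * mean ** 2) / (n - 1)
    s = statistics.stdev(ordered)
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "variance": s2,
        "variance_shortcut": s2_shortcut,
        "sd": s,
        "cv_percent": 100.0 * s / abs(mean),
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical
summary = dispersion_summary(data)
print(summary)
```

The two variance entries agree up to rounding, which illustrates the equivalence of the definition and the computational shortcut.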
Measures of shape
Skewness and kurtosis have been defined in multiple ways. The formulas below are the method used by Prism, called g1 (the most common method).
Skewness
The skewness, $b_1$, of a sample is defined as:
$$b_1 = \frac{n^2\, m_3}{(n-1)(n-2)\, s^3}.$$
Kurtosis
The kurtosis, $b_2$, of a sample is defined as:
$$b_2 = \frac{n^2 (n+1)\, m_4}{(n-1)(n-2)(n-3)\, s^4} - \frac{3(n-1)^2}{(n-2)(n-3)},$$
where $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$ is the centred moment of order $k$.
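The two shape measures can be computed directly from the centred moments. The sketch below assumes the usual form of these estimators, in which the correction term of the kurtosis carries a factor 3, so that a Gaussian sample has kurtosis close to 0. The function names are hypothetical.

```python
import statistics

def central_moment(data, k):
    """Centred moment of order k: (1/n) * sum((x_i - mean)^k)."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    n = len(data)
    s = statistics.stdev(data)   # sample standard deviation (n - 1 in the denominator)
    return n ** 2 * central_moment(data, 3) / ((n - 1) * (n - 2) * s ** 3)

def kurtosis(data):
    n = len(data)
    s = statistics.stdev(data)
    first = n ** 2 * (n + 1) * central_moment(data, 4) / (
        (n - 1) * (n - 2) * (n - 3) * s ** 4
    )
    # Assumed correction term 3(n-1)^2 / ((n-2)(n-3)): excess kurtosis.
    return first - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(skewness([1, 2, 3, 4, 5]), kurtosis([1, 2, 3, 4, 5]))
```

Skewness needs at least 3 observations and kurtosis at least 4, otherwise the denominators vanish.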
Measures of shape (interpretation)
Skewness quantifies how symmetrical the distribution is.
A symmetrical distribution has a skewness of zero.
An asymmetrical distribution with a long tail to the right
(higher values) has a positive skew.
An asymmetrical distribution with a long tail to the left
(lower values) has a negative skew.
The skewness is unitless.
Any threshold or rule of thumb is arbitrary, but here is
one: If the skewness is greater than 1.0 (or less than
-1.0), the skewness is substantial and the distribution is
far from symmetrical (George & Mallery, 2010).
Kurtosis quantifies whether the shape of the data
distribution matches the Gaussian distribution.
A Gaussian distribution has a kurtosis of 0.
A flatter distribution has a negative kurtosis.
A distribution more peaked than a Gaussian distribution has a positive kurtosis.
Kurtosis has no units.

Boxplot (going back...)
The boxplot is an exploratory graphic used to show the distribution of a dataset. With this graphic it is possible to evaluate the dispersion, location and symmetry of a data set. It is also possible to identify outliers.
There are several criteria to classify an observation as an outlier. We will consider the following criterion:
A value $x_i$ is an outlier (marked with circles) if $x_i < BI$ or $x_i > BS$, where $BI$ and $BS$ are defined as:
$$BI = q_1 - 1.5\, IQR \quad \text{and} \quad BS = q_3 + 1.5\, IQR.$$
A value $x_i$ is a severe outlier (marked with crosses) if $x_i < BI$ or $x_i > BS$, where $BI$ and $BS$ are defined as:
$$BI = q_1 - 3\, IQR \quad \text{and} \quad BS = q_3 + 3\, IQR.$$
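The outlier criterion can be sketched in Python as follows. This is an illustration under assumptions: the quartiles are computed with the interpolating ("inclusive") definition of the standard library, whereas SPSS may compute the box hinges differently, so borderline cases can be classified differently. The function name and the data are hypothetical.

```python
import statistics

def classify_outliers(data):
    """Split observations into mild outliers (1.5 IQR rule) and severe outliers (3 IQR rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild, severe = [], []
    for x in data:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            severe.append(x)          # beyond the outer fences
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append(x)            # beyond the inner fences only
    return mild, severe

data = [2, 4, 4, 4, 5, 5, 7, 9, 18, 30]   # hypothetical
mild, severe = classify_outliers(data)
print(mild, severe)
```

Here the severe outliers are reported separately from the mild ones, although every severe outlier also satisfies the 1.5 IQR criterion.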
Scatterplot
A simple scatterplot can be used to (a) determine whether a relationship is linear, (b) detect outliers and (c) graphically present a relationship. Determining whether a relationship is linear matters because linearity is an assumption when analysing data with Pearson's correlation, simple linear regression or multiple regression (we will see this later).

Statistical Inference
Summary
"Special" continuous distributions
Sampling and sampling distributions
Estimation - point estimation and interval estimation
Hypothesis tests
1. Parametric hypothesis tests
Verification of the assumptions of the parametric tests
t-test for independent samples
t-test for paired samples
Analysis of variance (ANOVA)
2. Nonparametric hypothesis tests
Wilcoxon test
Wilcoxon-Mann-Whitney test
Kruskal-Wallis test
Chi-square test
Fisher test
McNemar test

"Special" continuous distributions: review
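Normal distribution
A random variable (r.v.) $X$ has the normal distribution $N(\mu, \sigma^2)$ (for some $\sigma^2 > 0$ and $-\infty < \mu < +\infty$) if the density function of $X$ is
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], \quad -\infty < x < +\infty.$$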

The $N(0, 1)$ distribution is called the standard normal distribution.
The sum of independent normal r.v.'s is also a r.v. with normal distribution.
$P(X \le a) = \int_{-\infty}^{a} f_X(x)\,dx$ (in SPSS: Transform > Compute > Function group > CDF & ...).
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{-\infty}^{x_\alpha} f_X(x)\,dx = \alpha$ (in SPSS: Transform > Compute > Function group > Inverse DF ...).
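Outside SPSS, the same two quantities (a cumulative probability and a quantile) can be obtained with Python's standard library, which is convenient for checking results by hand. The sketch below uses `statistics.NormalDist`; the values 1.5 and 0.025 are just example inputs.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)   # standard normal N(0, 1)

p = z.cdf(1.5)          # P(X <= 1.5)
q = z.inv_cdf(0.025)    # quantile of order 0.025 (the 2.5% percentile)

print(round(p, 4), round(q, 4))
```

`cdf` plays the role of the CDF function group and `inv_cdf` the role of the Inverse DF function group.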
Chi-square distribution
A random variable (r.v.) $X$ has the chi-square distribution with $n$ degrees of freedom ($X \sim \chi^2(n)$, for some positive integer $n$) if the density function of $X$ is
$$f_X(x) = \frac{1}{2^{n/2}\,\Gamma\left(\frac{n}{2}\right)}\, x^{n/2-1} e^{-x/2}, \quad x > 0.$$
If $Z \sim N(0, 1)$, then $X = Z^2$ is chi-square with 1 degree of freedom.
If $X_i \sim \chi^2(1)$ ($X_i$ independent r.v.'s), $i = 1, 2, \ldots, n$, then $\sum_{i=1}^{n} X_i \sim \chi^2(n)$.
$P(X \le a) = \int_{0}^{a} f_X(x)\,dx$ (in SPSS ...).
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{0}^{x_\alpha} f_X(x)\,dx = \alpha$ (in SPSS ...).

Student's t-distribution
A r.v. $T$ has Student's t-distribution with $n$ degrees of freedom ($T \sim t(n)$, for some integer $n > 0$) if the probability density function of $T$ is
$$f_T(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}, \quad -\infty < t < +\infty,$$
where $\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\,dx$ is the gamma function; $\Gamma(n + 1) = n!$ for any positive integer $n$.
If $Z \sim N(0, 1)$ and $X \sim \chi^2(n)$ are independent r.v.'s, then $Y = \dfrac{Z}{\sqrt{X/n}} \sim t(n)$.
If $T \sim t(n)$, $P(T \le a) = \int_{-\infty}^{a} f_T(t)\,dt$ (in SPSS ...).
Quantile of order $\alpha$, $t_\alpha$: $P(T \le t_\alpha) = \int_{-\infty}^{t_\alpha} f_T(t)\,dt = \alpha$ (in SPSS ...).

F-distribution
A r.v. $X$ has the F-distribution with $n_1$ and $n_2$ degrees of freedom (positive integers) ($X \sim F(n_1, n_2)$) if the probability density function of $X$ is
$$f_X(x) = \frac{\Gamma\left(\frac{n_1+n_2}{2}\right)}{\Gamma\left(\frac{n_1}{2}\right)\Gamma\left(\frac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{x^{(n_1-2)/2}}{\left(1 + \frac{n_1}{n_2}x\right)^{(n_1+n_2)/2}}, \quad 0 < x < \infty.$$
If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ are two independent r.v.'s, then $Y = \dfrac{X_1/n_1}{X_2/n_2} \sim F(n_1, n_2)$.
If $X \sim F(n_1, n_2)$, $P(X \le a) = \int_{0}^{a} f_X(x)\,dx$ (in SPSS ...).
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{0}^{x_\alpha} f_X(x)\,dx = \alpha$ (in SPSS ...).

Exercises
Let $X \sim N(0, 1)$. Find the value of $P(X < 1.5)$ and the 2.5% percentile of $X$.
Let $X \sim t(n)$. Find the value of $P(X < 1.5)$ and the 2.5% percentile of $X$ for $n = 7$, $n = 30$ and $n = 100$.
Let $X \sim \chi^2(n)$. Find the value of $P(X > 5)$ for $n = 1, 3, 5, 10$.
Let $X \sim F(2, 2)$. Find the value of $P(X > 5)$.

Sampling and Sampling Distributions
Sampling
When we carry out a statistical study, we are usually interested in drawing conclusions about the populations under study.
When it is possible to observe all the elements of the populations under study, descriptive statistics techniques are used to describe/characterize the populations.
Frequently the populations are not completely accessible, for example because they are infinite or because the cost of research is too high (e.g., the time needed to obtain the data is unaffordable, and in some cases the analysis may be destructive). In these cases, sampling techniques are needed and inferential techniques are applied to draw conclusions about the populations.
Remember: "Inferential statistics are techniques that allow us to use samples to make generalizations about the populations from which the samples were drawn. The principal methods of inferential statistics are (1) the estimation of parameter(s) and (2) testing of statistical hypotheses."
Sampling is the process of obtaining these samples from the populations.
It is, therefore, important that the sample accurately represents the population.
There are many methods of sampling when doing research. Simple random sampling is the ideal: it is used in simple experiments that require a single sample to be taken from a given population, and the people in the sampling frame must all be accessible and available.
In this course we will assume that all the samples used are random.

Let the sequence of independent and identically distributed (i.i.d.) r.v.'s $X_1, X_2, \ldots, X_n$ be a random sample (r.s.).
Parameter $\theta$: numerical characteristic of a population (usually unknown).
Statistic or estimator $\hat{\theta}$: function of the r.s., $\hat{\theta} = g(X_1, X_2, \ldots, X_n)$. It is the basis of the inference on the parameter $\theta$.
Estimate: the value of $\hat{\theta}$ for a particular observed random sample (point estimate of $\theta$).

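Parameter, Estimator, Estimate
Mean $\mu$: estimator $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$; estimate $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Variance $\sigma^2$: estimator $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$; estimate $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$.
Proportion of successes $p$: estimator $\hat{P} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i \sim \text{Bernoulli}(p)$, that is, $X_i = 1$ when a success is observed and $X_i = 0$ otherwise; estimate $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
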
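Central Limit Theorem (CLT)
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) r.v.'s with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 > 0$ (both finite). Then, for $n \to +\infty$,
$$\frac{\sum_{i=1}^{n} X_i/n - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{a}{\sim} N(0, 1).$$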
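A small simulation can make the Central Limit Theorem tangible. The sketch below is only an illustration under chosen assumptions: samples are drawn from an exponential distribution with rate 1 (so $\mu = \sigma = 1$), the sample size is 50, and the seed and number of replications are arbitrary. The standardized sample means should have mean close to 0 and standard deviation close to 1, even though the parent distribution is strongly skewed.

```python
import math
import random
import statistics

def standardized_means(n, replications, rng):
    """Standardize the mean of `replications` samples of size n drawn from Exp(1)."""
    mu, sigma = 1.0, 1.0                     # mean and sd of the Exp(1) distribution
    out = []
    for _ in range(replications):
        xbar = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        out.append((xbar - mu) / (sigma / math.sqrt(n)))
    return out

rng = random.Random(2015)                    # fixed seed, so the run is repeatable
z = standardized_means(n=50, replications=4000, rng=rng)
print(statistics.fmean(z), statistics.stdev(z))
```

A histogram of `z` would look close to the standard normal density; increasing `n` improves the approximation.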
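Sampling Distributions
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed r.v.'s with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 > 0$ (both finite).
1. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ known:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \iff \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
This distribution is exact if $X_i \sim N(\mu, \sigma^2)$ and, under the CLT, approximate for large $n$ ($n > 30$ in practice).
2. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ unknown:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
This distribution is exact if $X_i \sim N(\mu, \sigma^2)$.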
3. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ unknown and big samples ($n > 30$ in practice):
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \overset{a}{\sim} N(0, 1)$$
This result can be shown using Slutsky's theorem.
In practice, when $\sigma^2$ is unknown but $n > 30$, it is possible to use either the normal distribution or the t-distribution (remember: when the degrees of freedom of the t-distribution exceed 30, the pdf's of these two distributions are almost the same).

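Let $X_{11}, \ldots, X_{n_1 1}$ be i.i.d. r.v.'s with $E(X_{i1}) = \mu_1$ and $Var(X_{i1}) = \sigma_1^2 > 0$ (both finite) and $X_{12}, \ldots, X_{n_2 2}$ be i.i.d. r.v.'s with $E(X_{i2}) = \mu_2$ and $Var(X_{i2}) = \sigma_2^2 > 0$ (both finite).
4. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both known:
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$
This distribution is exact for $X_{i1} \sim N(\mu_1, \sigma_1^2)$ and $X_{i2} \sim N(\mu_2, \sigma_2^2)$, and approximate for $n_1, n_2 > 30$ (CLT).
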
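5. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ unknown and unequal:
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim t(v), \quad \text{where} \quad v = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1-1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2-1}\left(\dfrac{S_2^2}{n_2}\right)^2}$$
$v$ is the Welch estimator for the degrees of freedom of Student's t-distribution when the two distributions have unequal variances. The approximate degrees of freedom are rounded down to the nearest integer.
This result holds, as an approximation, for $X_{i1} \sim N(\mu_1, \sigma_1^2)$ and $X_{i2} \sim N(\mu_2, \sigma_2^2)$. For big samples it is also possible to use the t distribution or the normal distribution.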
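The Welch degrees of freedom are easy to get wrong by hand, so a short sketch may help. It takes the two sample variances and sample sizes and returns both the raw value of $v$ and the value rounded down to the nearest integer. The function name `welch_df` and the example numbers are hypothetical.

```python
import math

def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch approximation to the degrees of freedom; returns (raw v, v rounded down)."""
    a = s1_sq / n1                       # S1^2 / n1
    b = s2_sq / n2                       # S2^2 / n2
    v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return v, math.floor(v)

# Hypothetical sample variances and sizes.
v, v_floor = welch_df(s1_sq=4.0, n1=10, s2_sq=16.0, n2=12)
print(v, v_floor)
```

The value of $v$ always lies between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$, which is a useful sanity check.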
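6. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ unknown but equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$):
In this case the following estimator of $\sigma^2$ is used:
$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
and the distribution of the difference $\bar{X}_1 - \bar{X}_2$ is, for normal populations,
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$
For non-normal populations and big samples it is also possible to use the t distribution or the normal distribution.
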
Confidence intervals
A little review...
An unknown parameter $\theta$ (e.g., $\mu$) can be estimated by a point estimator $\hat{\theta}$ (e.g., $\bar{X}$) chosen in such a way that $\hat{\theta}$ is "close" to $\theta$ (e.g., $E(\hat{\theta}) = \theta$ and $Var(\hat{\theta})$ is small). E.g., $\hat{\mu} = \bar{X}$, with estimate $\bar{x}$.
In this section we consider interval estimation; that is, how to estimate $\theta$ by an interval of values $(L, U)$ that has a high probability of including $\theta$ but also has a small average length.
Let $X$ be a r.v. with d.f. $F(x|\theta)$ where $\theta$ is unknown. If $P(L(X) < \theta < U(X)) = 1 - \alpha$, then the interval $(L(X), U(X))$ is called a confidence interval (CI) for $\theta$ with confidence coefficient $1 - \alpha$.
To find a CI for a parameter it is necessary to consider a sufficient statistic (a function that depends solely on the random sample $X_1, \ldots, X_n$). E.g., when $\theta = \mu$ the sufficient statistic is $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$.

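Confidence intervals for one normal mean
Let $X_1, \ldots, X_n$ be independent $N(\mu, \sigma^2)$ r.v.'s where $\mu$ is unknown and $\sigma^2 > 0$ is known. To obtain a CI for $\mu$, consider the sufficient statistic
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \iff Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \qquad (1)$$
There is a value $z_{1-\alpha/2}$ such that $P(-z_{1-\alpha/2} < Z < z_{1-\alpha/2}) = 1 - \alpha$. Using (1):
$$P\left(-z_{1-\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{1-\alpha/2}\right) = 1 - \alpha \iff P\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$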
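The CI for $\mu$ with confidence $(1 - \alpha) \times 100\%$ is
$$\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$
Confidence intervals for one normal mean with $\sigma^2$ unknown
In this case the sufficient statistic is
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1),$$
and the CI for $\mu$ with confidence $(1 - \alpha) \times 100\%$ is
$$\left(\bar{X} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}\right),$$
where $t_{\alpha/2,\,n-1}$ denotes the value exceeded with probability $\alpha/2$ by a $t(n-1)$ r.v., that is, its quantile of order $1 - \alpha/2$.
For non-normal populations with $n \gg 30$ it is usual to replace $t_{\alpha/2}$ by $z_{1-\alpha/2}$.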
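As a worked illustration of the interval with known variance, the sketch below computes $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ with Python's standard library. The function name `z_confidence_interval`, the sample and the value of $\sigma$ are hypothetical. The standard library has no t quantile, so for the unknown-variance case the quantile would have to come from a table or another package; that case is not shown here.

```python
import math
from statistics import NormalDist, fmean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """CI for the mean of a normal population with known standard deviation sigma."""
    n = len(sample)
    xbar = fmean(sample)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample of 16 observations with assumed known sigma = 2.
sample = [9, 11] * 8
low, high = z_confidence_interval(sample, sigma=2.0, confidence=0.95)
print(low, high)
```

Raising the confidence level or reducing the sample size widens the interval, as the half-width formula shows.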
Confidence intervals for the difference between two population means. Normal populations with known variances
If the populations have normal distributions with known variances, the sufficient statistic is
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$
and the CI for $\mu_1 - \mu_2$ with confidence $(1 - \alpha) \times 100\%$ is
$$\left(\bar{X}_1 - \bar{X}_2 - z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\ \bar{X}_1 - \bar{X}_2 + z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).$$
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use this interval.

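Confidence intervals for the difference between two population means. Normal populations with unknown and unequal variances
The CI for $\mu_1 - \mu_2$ with confidence $(1 - \alpha) \times 100\%$ is
$$\left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2,\,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}},\ \bar{X}_1 - \bar{X}_2 + t_{\alpha/2,\,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right),$$
where
$$v = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1-1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2-1}\left(\dfrac{S_2^2}{n_2}\right)^2}.$$
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use this interval or to replace the t quantile by the respective z quantile.
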
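Confidence intervals for the difference between two population means. Normal populations with unknown but equal variances
The CI for $\mu_1 - \mu_2$ with confidence $(1 - \alpha) \times 100\%$ is
$$\left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2,\,n_1+n_2-2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\ \bar{X}_1 - \bar{X}_2 + t_{\alpha/2,\,n_1+n_2-2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right),$$
where $S^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$.
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use this interval or to replace the t quantile by the respective z quantile.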
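The pooled-variance interval can be sketched as follows. Because Python's standard library does not provide t quantiles, the critical value is passed in by the caller (for example, read from a t table); the function names and the data are hypothetical, and 2.306 is the tabulated 97.5% quantile of the t distribution with 8 degrees of freedom.

```python
import math
import statistics

def pooled_variance(x1, x2):
    """Pooled estimator S^2 of the common variance of two samples."""
    n1, n2 = len(x1), len(x2)
    s1_sq = statistics.variance(x1)      # divides by n1 - 1
    s2_sq = statistics.variance(x2)      # divides by n2 - 1
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_ci(x1, x2, t_quantile):
    """CI for mu1 - mu2; t_quantile is the critical value of t(n1 + n2 - 2), supplied by the caller."""
    n1, n2 = len(x1), len(x2)
    s = math.sqrt(pooled_variance(x1, x2))
    diff = statistics.fmean(x1) - statistics.fmean(x2)
    half_width = t_quantile * s * math.sqrt(1 / n1 + 1 / n2)
    return diff - half_width, diff + half_width

# Hypothetical samples of size 5 each, so n1 + n2 - 2 = 8 degrees of freedom.
x1 = [5.1, 4.9, 5.6, 5.8, 6.0]
x2 = [4.1, 4.4, 4.0, 4.8, 4.7]
print(pooled_ci(x1, x2, t_quantile=2.306))
```

If the resulting interval does not contain 0, the data suggest a difference between the two population means at the chosen confidence level.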
Hypothesis tests
A little review...
Hypotheses: $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ (or $\theta > \theta_0$, or $\theta < \theta_0$).
The null hypothesis ($H_0$) states that no effect or no difference exists in the data.
The alternative or research hypothesis ($H_1$) states the researcher's belief that some difference or effect exists.
Types of tests
Two-sided (bilateral): $H_1: \theta \neq \theta_0$
Right one-sided: $H_1: \theta > \theta_0$
Left one-sided: $H_1: \theta < \theta_0$
Decision: there are several rules for deciding in a hypothesis test. We decide whether or not to reject the null hypothesis.
