
Estatística Experimental e Análise de Dados |
Experimental Statistics and Data Analysis
MEC | ISEP
Sandra Ramos | <sfr@isep.ipp.pt> | Office: H417

06/11/2015

ESEAD: Some Important Considerations

Contents

1. Exploratory data analysis (EDA) (2 weeks)


1.1 Data presentation: graphs, charts and tables
1.2 Numerical summaries
2. Statistical inference (8 weeks)
2.1 Confidence intervals.
2.2 Parametric hypothesis tests.
2.3 Nonparametric hypothesis tests.
3. Simple and multiple linear regression (2 weeks)

Sandra Ramos, DMA/LEMA - ISEP | CEAUL - FCUL

ESEAD


Evaluation Method

1. Practical group work (CWORK), in groups of at most 3 students,
carried out during the teaching period. The work consists of the
analysis of a data set and the writing of a report describing the
methodology used, the results and the main conclusions.
2. Two individual exams (CE1 and CE2).
3. The final classification (FC) of the course is given by the
formula:
FC = (CWORK + 1.5 CE1 + 1.5 CE2)/4
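As a quick illustration of the FC formula, the sketch below computes the weighted average for one student. The marks used are hypothetical, and the function name `final_classification` is only an illustrative choice.

```python
def final_classification(cwork: float, ce1: float, ce2: float) -> float:
    """FC = (CWORK + 1.5*CE1 + 1.5*CE2) / 4, as stated in the evaluation method."""
    return (cwork + 1.5 * ce1 + 1.5 * ce2) / 4


# Hypothetical marks, for illustration only.
fc_example = final_classification(cwork=16.0, ce1=12.0, ce2=14.0)
print(f"FC = {fc_example:.2f}")
```

The weights add up to 4, so a student with the same mark in all three components gets that mark as FC.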

Teaching methodology

This course is divided into three teaching environments: the
theoretical classes (T), the laboratory classes (PL) and the tutorial
guidance classes (OT).
1. The T classes are expository in nature. All concepts and
results are accompanied by illustrative examples.
2. The PL classes seek to consolidate the knowledge acquired in the T
classes. The statistical software SPSS is always used to
solve the proposed problems.
3. In the OT classes students can clear up their doubts and
may also work on their practical assignments for evaluation.


Bibliography

1. Ross, S. (2010). Introductory Statistics. Elsevier.
2. Murteira, B., Ribeiro, C.S., Andrade e Silva, J. and Pimenta, C.
(2002). Introdução à Estatística. McGraw-Hill, Lisboa.
3. Hall, A., Neves, C., Pereira, A. (2011). Grande Maratona de
Estatística no SPSS. Escolar Editora.
4. Marôco, J. (2011). Análise Estatística com o SPSS Statistics.
ReportNumber.
5. Marques de Sá, Joaquim P. (2007). Applied Statistics Using
SPSS, STATISTICA, MATLAB and R.


Software

SPSS (Statistical Package for the Social Sciences) Statistics is a


software package used for statistical analysis. Long produced by
SPSS Inc., it was acquired by IBM in 2009. The current versions
(2015) are officially named IBM SPSS Statistics.


Some definitions


Population: a population includes all the elements whose
characteristics we intend to study.
Sample: a sample is any subset of the population.
A measurable characteristic of a population, such as a
mean or standard deviation, is called a parameter.
A measurable characteristic of a sample is called a statistic.


Types of variables:
Qualitative variables, also known as categorical variables, take
values that are categories rather than quantities. They are
measured on a nominal scale (no natural ordering) or on an
ordinal scale (ordered categories). For instance, hair
color (Black, Brown, Gray, Red, Yellow) is a qualitative
variable, as is name (Adam, Becky, Christina, Dave...).
Qualitative variables can be coded to appear numeric but
their numbers are meaningless, as in male=1, female=2.
Quantitative variables: variables that are not qualitative are
known as quantitative variables. They are measured on interval or
ratio scales. A country's population, a person's shoe size, or a
car's speed are all quantitative variables.


Qualitative variables:
Nominal variables are variables that have two or more
categories, but which do not have an intrinsic order.
Dichotomous variables are nominal variables which have
only two categories or levels. For example, if we were looking
at gender, we would most probably categorize somebody as
either "male" or "female".
Ordinal variables are variables that have two or more
categories, just like nominal variables, except that the categories
can also be ordered or ranked.


Quantitative variables:
Continuous variables take values in an interval or in a
collection of intervals. For example: age, weight, height.
Discrete variables take values in a finite or countably
infinite set.

Types of variables:
An independent variable, sometimes called an
experimental or predictor variable, is a variable that is
being manipulated in an experiment in order to observe the
effect on a dependent variable, sometimes called an
outcome variable.
Imagine that a tutor asks 100 students to complete a maths
test. The tutor wants to know why some students perform
better than others. While the tutor does not know the answer
to this, she thinks that it might be because of two reasons:
(1) some students spend more time revising for their test; and
(2) some students are naturally more intelligent than others.
As such, the tutor decides to investigate the effect of revision
time and intelligence on the test performance of the 100
students.
Dependent Variable: Test Mark (measured from 0 to 100).
Independent Variables: Revision time (measured in hours)
and Intelligence (measured using IQ score).
Descriptive Statistics versus Inferential Statistics
Descriptive statistics is the term given to the analysis of data
that helps describe, show or summarize data. Descriptive
statistics do not, however, allow us to make conclusions
beyond the data we have analysed or reach conclusions
regarding any hypotheses we might have made. They are
simply a way to describe our data.
Inferential statistics are techniques that allow us to use
samples to make generalizations about the populations from
which the samples were drawn. It is, therefore, important
that the sample accurately represents the population. The
principal methods of inferential statistics are (1) the
estimation of parameter(s) and (2) testing of statistical
hypotheses.


SPSS: A very, very brief introduction...

This editing window (the Data Editor) shows the contents of a data
set, and allows new data sets to be created and existing data sets
to be changed.

The Variable View tab (Data Editor window) is the place
where the characteristics of the variables are defined.


Name: name of the variable.
Type: type of variable (numeric, date, string, ...).
Width: variable length, that is, the number of digits that the
variable has.
Decimals: number of decimal places.
Label: description of the variable.
Values: labels of the categories of qualitative variables (e.g., 1 = female
and 2 = male).
Missing: in this field we can indicate the coding of missing
values (non-existent values). These values are excluded from
statistical calculations.
Columns: the number of characters that form a
column, that is, the column width.
Align: data alignment.
Measure: the measurement scale of the variable (scale, that is,
interval/ratio; ordinal; or nominal).
Role: function of the variable. Input (predictor or independent
variable); Target (outcome or dependent variable); Both (both
functions); ....


Viewer window (Output)

The SPSS Viewer displays the results of the
analyses carried out.


Exploratory data analysis | Descriptive statistics


When first looking at a dataset, it is wise to use descriptive
statistics to get some idea of what your data look like.
The following simple dataset contains three variables:
the type of psychological problem someone was treated for (problem),
the type of treatment approach used (treatment), and the symptom level at
the end of treatment (symptom). One of these variables (symptom) is on
an interval-level scale; the other two are grouping variables
(i.e., they are measured on a nominal-level scale).



The type of descriptive statistics to use depends on the
measurement scale of the data. Here are some options (each of
them is shown in more detail below):
For a simple description of nominal-level variables
(groups), use frequencies (frequency tables, bar and
pie charts).
For a more complex description of nominal-level variables,
use crosstabs.
For a simple description of interval- or ratio-level variables
(items measured on a scale), use the descriptives
command to obtain numerical characteristics of the
data. It is also possible to obtain some graphs: boxplots,
histograms, ...
For a more complex description of interval- or ratio-level
variables, use the explore command.
Note that even though problem and treatment are
nominal-level variables, they have to be coded as numbers
(not as text) in order to use the following procedures.



Here are the various choices! All of them are found in the Analyze
menu in SPSS, under the sub-menu for Descriptive Statistics :

To obtain frequencies for a nominal-level variable, like problem
in this dataset, open the Frequencies dialog box, select the
variable of interest in the left-hand list, and use the arrow
button to move it from the left-hand list into the right-hand list.

Use the Charts button to see a graphical output of the
frequencies of the variable of interest. For a nominal variable, select
Bar charts or Pie charts. If you aren't happy with the way your
data are displayed (ascending vs. descending order, etc.), try some
of the options found by clicking on the Format button!


After clicking OK in the main dialog box, the output for
this variable looks like this:

The frequency table shows:
the number of people in each group (Frequency);
the percent of the total in each group (Percent);
the percent in each group as a proportion of just the people
with complete data (Valid Percent);
the cumulative percent (Cumulative Percent). For nominal
variables the cumulative percent is of no interest because it
cannot be interpreted. For ordinal variables the cumulative
percent makes sense.
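The columns of the frequency table described above can be reproduced by hand. The following is a minimal sketch, not the SPSS implementation: the data are hypothetical codes for the problem variable, `None` stands for a missing value, and the function name `frequency_table` is an illustrative choice.

```python
from collections import Counter


def frequency_table(values):
    """Return rows (category, count, percent, valid_percent, cumulative_percent).

    `None` marks a missing value: it counts towards the total (Percent)
    but not towards the valid cases (Valid Percent, Cumulative Percent).
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        count = counts[category]
        percent = 100.0 * count / total
        valid_percent = 100.0 * count / len(valid)
        cumulative += valid_percent
        rows.append((category, count, percent, valid_percent, cumulative))
    return rows


# Hypothetical coded data (1, 2, 3 = types of problem), one missing value.
problem = [1, 1, 2, 3, 2, 1, None, 3, 1, 2]
for category, count, percent, valid_percent, cumulative in frequency_table(problem):
    print(f"{category}: n={count}  {percent:.1f}%  valid {valid_percent:.1f}%  cum. {cumulative:.1f}%")
```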


Here is the graphical output: these graphs (a bar graph and a pie graph)
show the percent in each category.
[Figures: bar graph; pie graph]



The analysis of two nominal variables in combination (for
example, what % of the patients with problem 1 were treated
with treatment CBT?) can be carried out with the crosstabs
command in the Analyze/Descriptive Statistics sub-menu.



The crosstabs dialog box lets us enter one variable (or more) as
rows in a frequency table, and another variable as the columns in
the same table. Use the Cells command to get percentages for
each row and column:



After clicking OK in the main dialog box, the output for
these variables looks like this:

The central area of this table shows the number of
patients who have each individual combination of the
levels of the two variables.
The far-right column shows the total for each problem, as a
percent of all patients.
The bottom row slices the data the other way, showing
the total for each type of treatment, as a percent of all
patients.


For interval- or ratio-level variables (i.e., scale variables), use
the descriptives sub-command on the
Analyze/Descriptive Statistics menu:



This opens a dialog box where we can select the variable that we
want descriptive statistics on.

Make sure that what you are selecting is actually an interval- or
ratio-level variable. Because all of the data are entered as
numbers, SPSS will calculate descriptive statistics on any
of these variables, but the results are only meaningful for
interval- or ratio-level variables (in this case, just the symptom
variable).


As usual, select the variable(s) that we are interested in from the
left-hand column, and move them to the right-hand column. Use
the Options button to select the specific descriptive statistics
that we are interested in. Usually, good choices include the mean,
standard deviation, maximum, and minimum. If you are
concerned about the impact of outliers (defined below) on your data,
a measure of skewness may also be appropriate.

The output shows the specific results for each variable that we entered
into the analysis. It is possible to get a table that gives us basic
descriptive statistics for many variables simultaneously, just by
moving them all at once from the left-hand to the right-hand list
in the main dialog box.

In statistics, an outlier is an observation point that is distant


from other observations.
An outlier may be due to variability in the measurement or it
may indicate experimental error; the latter are sometimes
excluded from the data set.
We will see later how to identify whether a given observation
is an outlier.



Finally, to get more complex descriptive results for an interval- or
ratio-level variable, use the Explore command in the
Analyze/Descriptive Statistics sub-menu.



The following dialog box will appear:

The variable that we are interested in goes into the dependent


list.



The explore command gives us descriptive statistics for each
variable, with more extensive results (median, interquartile range,
etc.) than the descriptives command:



The explore command also gives graphs for each variable by
groups, for example a boxplot for the variable "Symptom" by
"Treatment" or by "Problem". We will see below how to interpret
these charts; for now we see only how to obtain them.


In the Graphs menu, under the sub-menu Legacy Dialogs, it is
possible to obtain a large set of graphs, for example a
histogram.



Histogram for the variable "Symptom".

Four bins (classes) were used in the histogram.
There are several rules for choosing a "reasonable" number of
bins. The number of bins is an integer k close to one of:
$\sqrt{n}$;
$1 + 3.3 \log_{10}(n)$ (Sturges, 1926);
$1 + 2.3 \log_{10}(n)$ (Larson, 1975);
where n is the sample size.
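To see how such rules behave, the sketch below evaluates two of them for a few sample sizes. It takes Sturges' rule in its usual form, $k = 1 + \log_2(n) \approx 1 + 3.3\log_{10}(n)$; the function name `bin_rules` is an illustrative choice.

```python
import math


def bin_rules(n: int) -> dict:
    """Suggested number of bins for a sample of size n (before rounding)."""
    return {
        "sqrt": math.sqrt(n),                 # square-root rule
        "sturges": 1 + 3.3 * math.log10(n),   # Sturges (1926), base-10 form
    }


for n in (20, 50, 100, 1000):
    rules = bin_rules(n)
    print(n, {name: round(k) for name, k in rules.items()})
```

The two rules give similar answers for small samples and drift apart as n grows, which is why the result is treated as a starting point rather than a fixed value.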


In the last lesson we saw how to present and summarise
data, namely:
frequency tables, pie charts and bar charts to present and
summarise nominal variables;
cross tables (contingency tables) to present the counts
corresponding to the combinations of categories of
two nominal variables;
numerical summaries (mean, standard deviation, variance,
range, ...) to summarise scale (continuous) variables;
histograms and boxplots to present the data resulting from
the observation of scale (continuous) variables.



Numerical summaries/ measures: Computation and
interpretation
Measures of location
Measures of location are used in order to determine where
the data distribution is concentrated. The most usual
measures of location are: Arithmetic Mean; Median; Mode;
Quantiles.
Measures of dispersion (or scale)
The measures of dispersion give an indication of how
concentrated a data distribution is. The most usual
measures of dispersion are: variance; standard deviation;
total range; interquartile range; coefficient of variation.
Measures of shape
The measures of shape give an indication of the shape of the
distribution of the data. The most usual measures of shape
are: skewness and kurtosis.



Measures of location
Let $x_1, x_2, \ldots, x_n$ be the data, and $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ the ordered data.
Arithmetic mean (or simply mean)
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
If the dataset exhibits outliers and extreme cases that can
be suspected to be the result of rough measurement errors,
one can use a trimmed mean by neglecting a certain
percentage of the tail cases (e.g., 5%).
Median
The median of a dataset is that value of the data below which
lie 50% of the cases. The median, $\tilde{x}$, of a sample is defined
as:
$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} \text{ for } n \text{ even, and } \tilde{x} = x_{((n+1)/2)} \text{ for } n \text{ odd.}$$
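The definitions above translate directly into code. The sketch below is illustrative only: the data are hypothetical, and the trimmed mean assumes that the stated proportion is dropped from each tail, which is one common convention but not the only one.

```python
import statistics


def trimmed_mean(data, proportion=0.05):
    """Mean after dropping `proportion` of the cases from EACH tail (assumption)."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # number of cases dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def median(data):
    """Median from the ordered data, following the even/odd definition."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 0:
        # x_(n/2) and x_(n/2 + 1) in 1-based notation
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    # x_((n+1)/2) in 1-based notation
    return ordered[(n + 1) // 2 - 1]


# Hypothetical data with one extreme value.
data = [2, 4, 4, 5, 7, 9, 100]
print("mean        :", sum(data) / len(data))
print("trimmed mean:", trimmed_mean(data, proportion=0.2))
print("median      :", median(data), "(library:", statistics.median(data), ")")
```

The extreme value pulls the mean far from the bulk of the data, while the trimmed mean and the median stay close to it.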

Mode
The mode (Mo) is the value that appears most often in a set of
data. The mode is not necessarily unique, since the
probability mass function or probability density function may
take the same maximum value at several points. When a
data distribution exhibits several relative maxima of almost
equal value, we say that it is a multimodal distribution.
Quantiles
The quantile of order $\alpha$ ($0 < \alpha < 1$), $x_\alpha$, of a dataset is that
value of the data below which lie $100\alpha\%$ of the cases. The
median is therefore the 50% quantile, or $x_{0.5}$. Often used
quantiles are:


Quartiles, corresponding to multiples of 25% of
the cases. The boxplot mentioned before presents
the quartiles and the interquartile range
($IQR = x_{0.75} - x_{0.25}$). These values are often used to
determine the outliers of the dataset.
Deciles, corresponding to multiples of 10% of the
cases.
Percentiles, corresponding to multiples of 1% of
the cases. We will often use the percentile
p = 2.5% and its complement p = 97.5%.

Quantiles have been defined in multiple ways. The most popular
formula for the quantile of order $\alpha$ is:
$$x_\alpha = (1-\gamma)\,x_{(j)} + \gamma\,x_{(j+1)},$$
where
$m = 1 - \alpha$,
$j = \lfloor n\alpha + m \rfloor$,
$\gamma = n\alpha + m - j$.
Note: $\lfloor x \rfloor$ (floor) is the greatest integer less than or equal to $x$.
The following formula is also frequently used:
$x_\alpha = x_{(\mathrm{int}(k+1))}$ if $k = n\alpha$ is not an integer, and
$x_\alpha = (x_{(k)} + x_{(k+1)})/2$ if $k = n\alpha$ is an integer.
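The first (interpolation) formula can be sketched as follows. The function name `quantile` and the data are illustrative; the comparison with `statistics.quantiles(..., method="inclusive")` from the Python standard library is included because that method uses the same interpolation rule.

```python
import math
import statistics


def quantile(data, alpha):
    """x_alpha = (1 - g) * x_(j) + g * x_(j+1), with
    m = 1 - alpha, j = floor(n*alpha + m), g = n*alpha + m - j."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    x = sorted(data)
    n = len(x)
    m = 1 - alpha
    h = n * alpha + m
    j = math.floor(h)
    g = h - j
    lower = x[j - 1]                # x_(j), converting from 1-based to 0-based
    upper = x[min(j, n - 1)]        # x_(j+1), guarded against rounding at the top end
    return (1 - g) * lower + g * upper


# Hypothetical data.
data = [3, 1, 4, 1, 5, 9, 2, 6]
quartiles = [quantile(data, a) for a in (0.25, 0.50, 0.75)]
print("formula :", quartiles)
print("library :", statistics.quantiles(data, n=4, method="inclusive"))
```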


Measures of scale or dispersion


Let $x_1, x_2, \ldots, x_n$ be the data.
Range
$\text{Range} = x_{(n)} - x_{(1)}$.
Interquartile range (IQR)
$IQR = x_{0.75} - x_{0.25} = Q_3 - Q_1$.
Variance
The variance, $s^2$, of a sample is defined as:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$

Standard deviation
The standard deviation, $s$, of a sample is defined as:
$$s = +\sqrt{s^2}$$
Coefficient of variation
The coefficient of variation, $CV$, of a sample is defined as:
$$CV = \frac{s}{|\bar{x}|} \times 100\%$$
CV has no units.
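The dispersion measures can be checked numerically. This is a minimal sketch with hypothetical data and illustrative function names; it computes the sample variance (divisor n - 1) both from the deviations and from the shortcut form with the sum of squares, then the standard deviation and the coefficient of variation.

```python
import math
import statistics


def sample_variance(data):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def sample_variance_shortcut(data):
    """s^2 = (sum(x_i^2) - n * mean^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x * x for x in data) - n * mean ** 2) / (n - 1)


def coefficient_of_variation(data):
    """CV = s / |mean| * 100 (in percent)."""
    s = math.sqrt(sample_variance(data))
    mean = sum(data) / len(data)
    return 100.0 * s / abs(mean)


# Hypothetical data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("range   :", max(data) - min(data))
print("variance:", sample_variance(data), sample_variance_shortcut(data))
print("std dev :", math.sqrt(sample_variance(data)))
print("CV (%)  :", coefficient_of_variation(data))
```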



Measures of shape
Let $x_1, x_2, \ldots, x_n$ be the data.
Skewness and kurtosis have been defined in multiple ways. The
steps below explain the method used by Prism, called g1 (the most
common method).
Skewness
The skewness, $b_1$, of a sample is defined as:
$$b_1 = \frac{n^2\, m_3}{(n-1)(n-2)\,s^3}.$$
Kurtosis
The kurtosis, $b_2$, of a sample is defined as:
$$b_2 = \frac{n^2 (n+1)\, m_4}{(n-1)(n-2)(n-3)\,s^4} - \frac{3(n-1)^2}{(n-2)(n-3)},$$
where $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$ is the centred moment of order $k$.
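A minimal sketch of the two shape measures follows, assuming the usual bias-adjusted sample forms: skewness $n^2 m_3 / ((n-1)(n-2)s^3)$ and excess kurtosis $n^2(n+1)m_4 / ((n-1)(n-2)(n-3)s^4) - 3(n-1)^2/((n-2)(n-3))$, with $m_k$ the centred moment of order k and s the sample standard deviation. Function names and data are illustrative.

```python
import math


def central_moment(data, k):
    """m_k = (1/n) * sum((x_i - mean)^k)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def sample_sd(data):
    """Sample standard deviation (divisor n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


def skewness(data):
    """Bias-adjusted sample skewness; needs n >= 3."""
    n = len(data)
    s = sample_sd(data)
    return n ** 2 * central_moment(data, 3) / ((n - 1) * (n - 2) * s ** 3)


def kurtosis(data):
    """Bias-adjusted sample excess kurtosis (0 for a Gaussian); needs n >= 4."""
    n = len(data)
    s = sample_sd(data)
    first = n ** 2 * (n + 1) * central_moment(data, 4) / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
    second = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - second


# Hypothetical data: one symmetric sample, one with a long right tail.
print(skewness([1, 2, 3, 4, 5]), kurtosis([1, 2, 3, 4, 5]))
print(skewness([1, 1, 2, 2, 3, 15]))
```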



Measures of shape (interpretation)
Skewness quantifies how symmetrical the distribution is.
A symmetrical distribution has a skewness of zero.
An asymmetrical distribution with a long tail to the right
(higher values) has a positive skew.
An asymmetrical distribution with a long tail to the left
(lower values) has a negative skew.
The skewness is unitless.
Any threshold or rule of thumb is arbitrary, but here is
one: If the skewness is greater than 1.0 (or less than
-1.0), the skewness is substantial and the distribution is
far from symmetrical (George & Mallery, 2010).
Kurtosis quantifies whether the shape of the data
distribution matches the Gaussian distribution.
A Gaussian distribution has a kurtosis of 0.
A flatter distribution has a negative kurtosis.
A distribution more peaked than a gaussian distribution
has a positive kurtosis.
Kurtosis has no units.


Boxplot (going back....)
The boxplot is an exploratory graphic used to show the
distribution of a dataset. With this graphic it is possible to evaluate
the dispersion, location and symmetry of a data set. It is also
possible to identify outliers.
There are several criteria for classifying an observation as an outlier.
We will consider the following criterion:
A value $x_i$ is an outlier (marked with circles) if $x_i < B_I$ or $x_i > B_S$,
where $B_I$ and $B_S$ are defined as:
$B_I = q_1 - 1.5\,IQR$ and $B_S = q_3 + 1.5\,IQR$.
A value $x_i$ is a severe outlier (marked with crosses) if $x_i < B_I$ or
$x_i > B_S$, where $B_I$ and $B_S$ are defined as:
$B_I = q_1 - 3\,IQR$ and $B_S = q_3 + 3\,IQR$.
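The outlier criterion can be sketched in a few lines. The data are hypothetical and the function name is illustrative. The quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption: SPSS may use a different quartile definition, so the limits it reports can differ slightly.

```python
import statistics


def classify_outliers(data):
    """Return (outliers, severe_outliers) using the 1.5*IQR and 3*IQR limits."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr                # outlier limits
    lower_severe, upper_severe = q1 - 3 * iqr, q3 + 3 * iqr      # severe outlier limits
    outliers = [x for x in data if x < lower or x > upper]
    severe = [x for x in data if x < lower_severe or x > upper_severe]
    return outliers, severe


# Hypothetical data with two unusually large values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 50]
outliers, severe = classify_outliers(data)
print("outliers       :", outliers)
print("severe outliers:", severe)
```

Every severe outlier also satisfies the first criterion, so it appears in both lists.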



Scatterplot
A simple scatterplot can be used to (a) determine whether a
relationship is linear, (b) detect outliers and (c) graphically present
a relationship. For example, linearity of the relationship is an
important assumption if you are analysing your data using
Pearson's correlation, simple linear regression
or multiple regression (we will see this later).


Statistical Inference

Summary
"Special" continuous distributions
Sampling and sampling distributions
Estimation - point estimation and interval estimation
Hypothesis tests
1. Parametric hypothesis tests
Verification of the assumptions of the parametric
tests
t-test for independent samples
t-test for paired samples
Analysis of variance (ANOVA)
2. Nonparametric hypothesis tests
Wilcoxon test
Wilcoxon-Mann-Whitney test
Kruskal-Wallis test
Chi-square test
Fisher test
McNemar test

"Special" continuous distributions: revisions


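Normal distribution
A random variable (r.v.) X has the normal distribution $N(\mu, \sigma^2)$ (for
some $\sigma^2 > 0$ and $-\infty < \mu < +\infty$) if the density function of X is
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], \quad -\infty < x < +\infty.$$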

The $N(0, 1)$ distribution is called the standard normal
distribution.
The sum of independent normal r.v.s is also a r.v. with
normal distribution.
$P(X \le a) = \int_{-\infty}^{a} f_X(x)\,dx$ (in SPSS:
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{-\infty}^{x_\alpha} f_X(x)\,dx = \alpha$ (in
SPSS:
Transform > Compute > Function group > Inverse DF ... )
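Outside SPSS, the same two quantities (a cumulative probability and a quantile) can be obtained for the normal distribution with the `NormalDist` class of the Python standard library. This is only a sketch for checking results; it is not part of the SPSS workflow described in the course.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probability P(X <= a) for a = 1.5
p = standard_normal.cdf(1.5)

# Quantile of order alpha = 0.025, i.e. the value x with P(X <= x) = 0.025
q = standard_normal.inv_cdf(0.025)

print(f"P(X <= 1.5)    = {p:.4f}")
print(f"2.5% quantile  = {q:.4f}")
```

Here `cdf` plays the role of the CDF function group and `inv_cdf` the role of the Inverse DF function group.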



Chi-square distribution
A random variable (r.v.) X has the chi-square distribution with n degrees of
freedom ($X \sim \chi^2(n)$, for some positive integer n) if the density
function of X is
$$f_X(x) = \frac{1}{2^{n/2}\,\Gamma\left(\frac{n}{2}\right)}\, x^{n/2-1} e^{-x/2}, \quad x > 0.$$

If $Z \sim N(0, 1)$, then $X = Z^2$ is chi-square with 1 degree of
freedom.
If $X_i \sim \chi^2(1)$, $i = 1, 2, \ldots, n$, are independent r.v.s, then
$\sum_{i=1}^{n} X_i \sim \chi^2(n)$.
$P(X \le a) = \int_{0}^{a} f_X(x)\,dx$ (in SPSS:
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{0}^{x_\alpha} f_X(x)\,dx = \alpha$ (in
SPSS:
Transform > Compute > Function group > Inverse DF ... )



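Student's t-distribution
A r.v. T has Student's t-distribution with n degrees of freedom
($T \sim t(n)$, for some integer $n > 0$) if the probability density
function of T is
$$f_T(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}, \quad -\infty < t < +\infty,$$
where $\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\,dx$ is the gamma function. $\Gamma(n + 1) = n!$ for
any positive integer n.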

If $Z \sim N(0, 1)$ and $X \sim \chi^2(n)$ are independent
r.v.s, then $Y = \dfrac{Z}{\sqrt{X/n}} \sim t(n)$.
If $T \sim t(n)$, $P(T \le a) = \int_{-\infty}^{a} f_T(t)\,dt$
(in SPSS:
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $t_\alpha$: $P(T \le t_\alpha) = \int_{-\infty}^{t_\alpha} f_T(t)\,dt = \alpha$
(in SPSS:
Transform > Compute > Function group > Inverse DF ... )



F-distribution
A r.v. X has the F-distribution with $n_1$ and $n_2$ degrees of freedom
(positive integers) ($X \sim F(n_1, n_2)$) if the
probability density function of X is
$$f_X(x) = \frac{\Gamma\left(\frac{n_1+n_2}{2}\right)}{\Gamma\left(\frac{n_1}{2}\right)\Gamma\left(\frac{n_2}{2}\right)}\left(\frac{n_1}{n_2}\right)^{n_1/2}\frac{x^{(n_1-2)/2}}{\left(1 + \frac{n_1}{n_2}x\right)^{(n_1+n_2)/2}}, \quad 0 < x < \infty.$$

If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ are two independent r.v.s, then
$Y = \dfrac{X_1/n_1}{X_2/n_2} \sim F(n_1, n_2)$.
If $X \sim F(n_1, n_2)$, $P(X \le a) = \int_{0}^{a} f_X(x)\,dx$
(in SPSS:
Transform > Compute > Function group > CDF & ... )
Quantile of order $\alpha$, $x_\alpha$: $P(X \le x_\alpha) = \int_{0}^{x_\alpha} f_X(x)\,dx = \alpha$
(in SPSS:
Transform > Compute > Function group > Inverse DF ... )


Exercises
Let $X \sim N(0, 1)$. Find the value of $P(X < 1.5)$ and the
2.5% percentile of X.
Let $X \sim t(n)$. Find the value of $P(X < 1.5)$ and the 2.5%
percentile of X for n = 7, n = 30 and n = 100.
Let $X \sim \chi^2(n)$. Find the value of $P(X > 5)$ for
n = 1, 3, 5, 10.
Let $X \sim F(2, 2)$. Find the value of $P(X > 5)$.


Sampling and Sampling Distributions


Sampling
When we carry out a statistical study, we are usually
interested in drawing conclusions about the populations under
study.
When it is possible to observe all the elements of the
populations under study, descriptive statistics
techniques are used to describe/characterize the populations.
Frequently the populations are not completely accessible, for
example because the population is infinite or the cost of the
research is too high (e.g. the time needed to obtain the data is
unaffordable, or the analysis may be destructive). In these cases,
sampling techniques are needed and inferential techniques are applied
to draw conclusions about the populations.
Remember: "Inferential statistics are techniques that allow
us to use samples to make generalizations about the
populations from which the samples were drawn. The
principal methods of inferential statistics are (1) the
estimation of parameter(s) and (2) testing of statistical
hypotheses."
Sampling is the process of obtaining these samples from the
populations.
It is, therefore, important that the sample accurately
represents the population.
There are many methods of sampling when doing research.
"Simple" random sampling is the ideal (used in simple
experiments that require a single sample to be taken from a
given population; the people in the sampling frame must all be
accessible and available).
In this course we will assume that all the samples used are
random.

Let the sequence of independent and identically distributed (i.i.d.) r.v.s
$X_1, X_2, \ldots, X_n$ be a random sample (r.s.).
Parameter $\theta$: numerical characteristic of a population
(usually unknown).
Statistic or estimator $\hat{\Theta}$: function of the r.s.,
$\hat{\Theta} = g(X_1, X_2, \ldots, X_n)$.
It is the basis of the inference on the parameter $\theta$.
Estimate $\hat{\theta}$: the value of $\hat{\Theta}$ for a particular observed random
sample (point estimate of $\theta$).


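Parameter, Estimator, Estimate

Parameter | Estimator | Estimate
$\mu$ (mean) | $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ | $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
$\sigma^2$ (variance) | $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$ | $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
$p$ (proportion of successes) | $\hat{P} = \frac{\sum_{i=1}^{n} X_i}{n}$ (*) | $\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$

(*) Where $X_i \sim \text{Bernoulli}(p)$, that is, $X_i = 1$ when a success was
observed and $X_i = 0$ otherwise.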

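Central Limit Theorem (CLT)

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.)
r.v.s with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 > 0$ (both finite). Then, for
$n \to +\infty$,
$$\frac{\sum_{i=1}^{n} X_i/n - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{a}{\sim} N(0, 1).$$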
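The theorem can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not a proof: it draws samples from an exponential distribution with rate 1 (mean 1, standard deviation 1, clearly non-normal), standardizes each sample mean, and checks that the standardized values have mean close to 0 and standard deviation close to 1. The sample size, number of replications and seed are arbitrary choices.

```python
import math
import random
import statistics


def standardized_means(n, replications, seed=0):
    """Standardized sample means (mean - mu) / (sigma / sqrt(n)) for
    samples of size n drawn from an Exponential(1) population."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    values = []
    for _ in range(replications):
        sample_mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        values.append((sample_mean - mu) / (sigma / math.sqrt(n)))
    return values


z = standardized_means(n=50, replications=5000)
print("mean of standardized means:", round(statistics.mean(z), 3))
print("sd of standardized means  :", round(statistics.stdev(z), 3))
```

A histogram of `z` would look approximately like the standard normal density, even though the individual observations are strongly skewed.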


Sampling Distributions

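Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed r.v.s
with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 > 0$ (both finite).

1. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ known
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \quad\text{that is,}\quad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
This distribution is exact if $X_i \sim N(\mu, \sigma^2)$ and, under the CLT,
approximate for $n \to \infty$ ($n > 30$ in practice).

2. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ unknown
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
This distribution is exact if $X_i \sim N(\mu, \sigma^2)$.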



3. Distribution of the sample mean $\bar{X}$ with $\sigma^2$ unknown and
big samples ($n > 30$ in practice)
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \overset{a}{\sim} N(0, 1)$$
This result can be shown by using Slutsky's theorem.
In practice, when $\sigma^2$ is unknown but $n > 30$, it is
possible to use either the normal distribution or the t-distribution
(remember: when the degrees of freedom of the
t-distribution exceed 30, the pdfs of these two distributions
are almost the same).


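Let $X_{11}, \ldots, X_{n_1 1}$ be i.i.d. r.v.s with $E(X_{i1}) = \mu_1$ and $Var(X_{i1}) = \sigma_1^2 > 0$
(both finite) and $X_{12}, \ldots, X_{n_2 2}$ be i.i.d. r.v.s with $E(X_{i2}) = \mu_2$ and
$Var(X_{i2}) = \sigma_2^2 > 0$ (both finite).
4. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both
known:
$$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$
This distribution is exact for $X_{i1} \sim N(\mu_1, \sigma_1^2)$ and $X_{i2} \sim N(\mu_2, \sigma_2^2)$
and approximate for $n_1, n_2 > 30$ (CLT).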


Sampling Distributions
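5. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both unknown and unequal:
\[
\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim t(v)
\quad \text{where} \quad
v = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}
\]
Here $v$ is the Welch estimator for the degrees of freedom of the Student t distribution when the two populations have unequal variances. The approximate degrees of freedom are rounded down to the nearest integer.
For normal populations, $X_{i1} \sim N(\mu_1, \sigma_1^2)$ and $X_{i2} \sim N(\mu_2, \sigma_2^2)$, the $t(v)$ distribution is a good approximation rather than an exact result, because $v$ is itself estimated from the data.
For large samples from non-normal populations, either the t distribution or the normal distribution can be used.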
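The Welch degrees of freedom are tedious to compute by hand. The following Python sketch evaluates the standard Welch-Satterthwaite formula with the $n_i - 1$ weights and rounds the result down; the function name and the numeric inputs are hypothetical and serve only as an illustration.

```python
import math


def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite degrees of freedom, rounded down.

    s1_sq, s2_sq: sample variances; n1, n2: sample sizes.
    """
    a = s1_sq / n1
    b = s2_sq / n2
    v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return math.floor(v)


# Hypothetical summary statistics, for illustration only.
v = welch_df(s1_sq=4.0, n1=10, s2_sq=9.0, n2=15)
print(v)
```

A useful sanity check is that the unrounded value always lies between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$.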


Sampling Distributions

6. Distribution of the difference $\bar{X}_1 - \bar{X}_2$ with $\sigma_1^2$ and $\sigma_2^2$ both unknown but equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$).
In this case the following (pooled) estimator of $\sigma^2$ is used:
\[
S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}
\]
and the distribution of the difference $\bar{X}_1 - \bar{X}_2$ is, for normal populations,
\[
\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t(n_1 + n_2 - 2)
\]
For non-normal populations and large samples, either the t distribution or the normal distribution can be used.
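The pooled estimator and the resulting t statistic can be sketched in a few lines of Python. The function names, the `delta` parameter (the hypothesized value of $\mu_1 - \mu_2$) and the summary statistics below are assumptions made for the example, not part of the course material.

```python
import math


def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled estimator of the common variance sigma^2."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def pooled_t_statistic(x1_bar, x2_bar, s1_sq, n1, s2_sq, n2, delta=0.0):
    """Statistic that follows t(n1 + n2 - 2) for normal populations
    with equal variances; delta is the assumed value of mu1 - mu2."""
    s = math.sqrt(pooled_variance(s1_sq, n1, s2_sq, n2))
    return (x1_bar - x2_bar - delta) / (s * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical summary statistics, for illustration only.
s2_pooled = pooled_variance(4.0, 10, 6.0, 12)
t_value = pooled_t_statistic(12.0, 10.0, 4.0, 10, 6.0, 12)
print(s2_pooled, round(t_value, 4))
```

The pooled variance is a weighted average of the two sample variances, so it always lies between them.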


Confidence intervals
A little review...
An unknown parameter $\theta$ (e.g., $\mu$) can be estimated by a point estimator $\hat{\theta}$ (e.g., $\bar{X}$) in such a way that $\hat{\theta}$ is "close" to $\theta$ (e.g., $E(\hat{\theta}) = \theta$ and $Var(\hat{\theta})$ is small). E.g., the estimator $\hat{\mu} = \bar{X}$ and the estimate $\hat{\mu} = \bar{x}$.
In this section we consider interval estimation; that is, how to estimate $\theta$ by an interval of values $(L, U)$ that has a high probability of including $\theta$ but also has a small average length.
Let $X$ be a r.v. with d.f. $F(x|\theta)$ where $\theta$ is unknown. If $P(L(X) < \theta < U(X)) = 1 - \alpha$, then the interval $(L(X), U(X))$ is called a confidence interval (CI) for $\theta$ with confidence coefficient $1 - \alpha$.
To find a CI for a parameter it is necessary to consider a sufficient statistic (a function that depends solely on the random sample $X_1, \ldots, X_n$). E.g., when $\theta = \mu$ the sufficient statistic is
\[
\bar{X} = \frac{X_1 + \cdots + X_n}{n}
\]


Confidence intervals
Confidence intervals for one normal mean
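Let $X_1, \ldots, X_n$ be independent $N(\mu, \sigma^2)$ r.v.s where $\mu$ is unknown and $\sigma^2 > 0$ is known. To obtain a CI for $\mu$, consider the sufficient statistic $\bar{X}$:
\[
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \Longrightarrow Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \qquad (1)
\]
There is a value $z_{1-\alpha/2}$ such that $P(-z_{1-\alpha/2} < Z < z_{1-\alpha/2}) = 1 - \alpha$.

Consider (1):
\[
P\left(-z_{1-\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{1-\alpha/2}\right) = 1 - \alpha
\]
\[
\iff P\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.
\]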
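The interval obtained from this derivation can be evaluated numerically. The sketch below uses `NormalDist.inv_cdf` from the Python standard library to obtain $z_{1-\alpha/2}$; the function name and the values of $\bar{x}$, $\sigma$ and $n$ are hypothetical.

```python
import math
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """CI for a normal mean with known sigma:
    x_bar -/+ z_{1-alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical values, for illustration only.
lower, upper = z_confidence_interval(x_bar=50.0, sigma=4.0, n=64)
print(round(lower, 2), round(upper, 2))
```

For a fixed confidence level, the half-width shrinks at the rate $1/\sqrt{n}$, so quadrupling the sample size halves the length of the interval.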

Confidence intervals
Confidence intervals for one normal mean
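The CI for $\mu$ with confidence $(1 - \alpha) \times 100\%$ is
\[
\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right).
\]

Confidence intervals for one normal mean with $\sigma^2$ unknown
In this case the statistic to use is
\[
T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1),
\]
and the CI for $\mu$ with confidence $(1 - \alpha) \times 100\%$ is
\[
\left(\bar{X} - t_{\alpha/2, n-1}\frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2, n-1}\frac{S}{\sqrt{n}}\right),
\]
where $t_{\alpha/2, n-1}$ is the value of the $t(n-1)$ distribution with upper-tail probability $\alpha/2$.
For non-normal populations with $n \gg 30$ it is usual to replace $t_{\alpha/2, n-1}$ by $z_{1-\alpha/2}$.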
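As a small illustration of the t interval, the sketch below computes it from raw data. The Python standard library has no t quantile function, so the critical value is passed in as an argument, as it would be read from a t table; the value 2.262 is the tabulated critical value for 9 degrees of freedom and 95% confidence. The function name and the sample are hypothetical.

```python
import math
from statistics import mean, stdev


def t_confidence_interval(sample, t_crit):
    """CI for a normal mean with unknown variance.

    t_crit is the t(n-1) critical value with upper-tail probability
    alpha/2, supplied by the caller (e.g. from a t table).
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical sample of n = 10 measurements.
sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.2, 9.4]
T_CRIT_95_DF9 = 2.262  # tabulated t value for 9 d.f., 95% confidence

lower_t, upper_t = t_confidence_interval(sample, T_CRIT_95_DF9)
print(round(lower_t, 3), round(upper_t, 3))
```

Because the t critical value is larger than the corresponding normal quantile (about 1.96 at 95%), this interval is wider than the one that would be obtained if $\sigma$ were known and equal to $s$.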

Confidence intervals
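Confidence intervals for the difference between two population means. Normal populations with known variances

If the populations have normal distributions with known variances, the statistic to use is
\[
\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)
\]
and the CI for $\mu_1 - \mu_2$ with confidence $(1 - \alpha) \times 100\%$ is
\[
\left(\bar{X}_1 - \bar{X}_2 - z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\; \bar{X}_1 - \bar{X}_2 + z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right).
\]
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use this interval.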
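A minimal Python sketch of this two-sample interval follows, again using `NormalDist.inv_cdf` from the standard library for the normal quantile. The function name and all numeric inputs are hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def diff_means_z_interval(x1_bar, x2_bar, var1, var2, n1, n2, confidence=0.95):
    """CI for mu1 - mu2 when both population variances are known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = math.sqrt(var1 / n1 + var2 / n2)
    diff = x1_bar - x2_bar
    return diff - z * std_error, diff + z * std_error


# Hypothetical values, for illustration only.
lower_d, upper_d = diff_means_z_interval(
    x1_bar=20.0, x2_bar=18.0, var1=9.0, var2=16.0, n1=36, n2=64
)
print(round(lower_d, 3), round(upper_d, 3))
```

In this example the interval does not contain 0, which would suggest that the two population means differ at the chosen confidence level.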

Confidence intervals
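Confidence intervals for the difference between two population means. Normal populations with unknown and unequal variances

The CI for $\mu_1 - \mu_2$ with confidence $(1 - \alpha) \times 100\%$ is
\[
\left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2, v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}},\; \bar{X}_1 - \bar{X}_2 + t_{\alpha/2, v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right)
\]
where
\[
v = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}.
\]
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use this interval or to replace the t quantile by the respective z quantile.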

Confidence intervals
Confidence intervals for the difference between two population means. Normal populations with unknown but equal variances

The CI for $\mu_1 - \mu_2$ with confidence $(1 - \alpha) \times 100\%$ is
\[
\left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2, n_1+n_2-2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\; \bar{X}_1 - \bar{X}_2 + t_{\alpha/2, n_1+n_2-2}\, S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right),
\]
where
\[
S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.
\]
For non-normal populations with $n_1, n_2 \gg 30$ it is possible to use this interval or to replace the t quantile by the respective z quantile.


Hypothesis tests

A little review...
Hypotheses: $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ (or $\theta > \theta_0$, or $\theta < \theta_0$).
The null hypothesis ($H_0$) states that no effect or no difference exists in the data.
The alternative or research hypothesis ($H_1$) states the researcher's belief that some difference or effect exists.
Types of tests:
Bilateral (two-sided): $H_1: \theta \neq \theta_0$
Right unilateral: $H_1: \theta > \theta_0$
Left unilateral: $H_1: \theta < \theta_0$
Decision: there are several rules for deciding in a hypothesis test. In each case we decide whether or not to reject the null hypothesis.
