
What is statistics?

Inference and uncertainty: this is what statistics is all about.
Statistics consists of a body of methods for
collecting and analyzing data. (Agresti & Finlay,
1997)
Developed for interpreting and drawing
conclusions from collected data
The major objective of statistics is to make inferences about a population from the analysis of sample data

What does statistics provide?


Design: Planning and carrying out research
studies
Description: Summarizing and exploring data
Inference: Making predictions and
generalizing about phenomena represented
by the data

Population vs. sample

Steps in Planning Statistical analysis

Terms and Terminologies


Population- The total group of individuals or units that the researcher is interested in studying.
Sample- A group of individuals selected from the
population
Parameter- is a characteristic of a population
Statistic- is a characteristic of a sample
Variable- characteristic or attribute that can
assume different values.
Variate- A random variable taken from a known
probability distribution

Terms and Terminologies


Descriptive statistics- summarize and describe the basic features of the data, e.g. frequencies, means, standard deviations
Exploratory statistics- usually presented in the form of graphs, to reveal patterns in the data
Inferential statistics- are used to draw
inferences about a population from a random
sample

Terms and Terminologies


Qualitative variable- Also known as categorical
variable. Usually measured on a nominal scale.
Quantitative variable- They are measured on a
numeric scale. Ordinal, interval and ratio scales
are quantitative
Discrete variable- countable in a finite amount of
time.
Continuous variable- can take any value in an interval; its possible values would (literally) take forever to count, because between any two values there are always more

Terms and Terminologies

Random sampling
Systematic sampling
Convenience sampling
Stratified sampling
Cluster sampling
Sampling Error- is the difference between the
sample measure and the corresponding
population measure

Descriptive vs. Inferential statistics


Descriptive statistics consist of methods for
organizing and summarizing information
(Weiss, 1999)
Inferential statistics consist of methods for
drawing and measuring the reliability of
conclusions about population based on
information obtained from a sample of the
population. (Weiss, 1999)

Types of Statistical Approaches


Descriptive Statistics- Describes your data
- How many?
- How much?

Exploratory Statistics- represented in the form of graphs
- Is there any pattern?
- Are data points clustered or stretched?

Types of Statistical Approaches


Inferential Statistics
- Are there any differences?
- What is the relationship?
- What is the effect?
- Model building: what determines what?

Distributions

Positively skewed

Symmetric

Negatively skewed

Distributions

Distributions

Normal Probability distribution

Mean, median and mode are the same
Bell-shaped curve, symmetrical around the mean
Total probability (area under the curve) is 1
Denoted by X ~ N(μ, σ²)

Normal Probability Distribution

Areas under a normal distribution curve

Common types of Probability distributions

Other important types of distribution:
1. Poisson
2. Binomial

Poisson Distribution
Used to represent the number of independent events of a specified type, each with a low probability of occurrence (< 10%), in some specified interval of time or space.
Example: cases of flu
Denoted by X ~ Poisson(λ), where λ is the mean number of events per interval
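As an illustrative sketch (not from the original slides), these Poisson probabilities can be computed with scipy.stats; the rate λ = 3 flu cases per week is an assumed value:

from scipy.stats import poisson

lam = 3.0                        # assumed mean number of flu cases per week
print(poisson.pmf(5, mu=lam))    # P(exactly 5 cases) ~ 0.1008
print(poisson.cdf(5, mu=lam))    # P(at most 5 cases)  ~ 0.9161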

Poisson Distribution

Binomial Distribution
An experiment that consists of n independent, repeated trials, each of which can end in only one of two ways, arbitrarily labeled "success" or "failure".
The probability that any trial ends in a success is p (and hence q = 1 − p for a failure).
Denoted by X ~ B(n, p), where n is the number of trials and p is the probability of success
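A minimal sketch of binomial probabilities with scipy.stats; n = 10 trials and p = 0.5 are illustrative values:

from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(6, n, p))        # P(exactly 6 successes) ~ 0.2051
print(binom.cdf(6, n, p))        # P(at most 6 successes) ~ 0.8281
print(binom.mean(n, p))          # np  = 5.0
print(binom.var(n, p))           # npq = 2.5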

Binomial Distribution

Central Limit Theorem


Sampling distribution of means
As the sample size n increases without limit, the shape of the distribution of the sample means taken with replacement from a population with mean μ and standard deviation σ will approach a normal distribution.
This distribution will have a mean μ and a standard deviation σ/√n
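A minimal simulation sketch of the theorem, assuming a skewed (exponential) population with mean 2 and standard deviation 2; the sample size n = 50 and the number of samples are illustrative:

import numpy as np

rng = np.random.default_rng(0)
n = 50                                  # sample size
# 10,000 samples of size n from a skewed population (mean 2, sd 2)
means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)
print(means.mean())                     # close to the population mean, 2.0
print(means.std(ddof=1))                # close to sigma/sqrt(n) = 2/sqrt(50) ~ 0.28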

Central Limit Theorem

Central Limit Theorem


Importance of the Central Limit Theorem
- We can describe the sampling distribution of any variable without actually having to sample the population of raw scores infinitely.

Types of Variables

Nominal
Ordinal
Interval
Ratio

Types of Variables

Sampling Techniques

Random sampling
Systematic sampling
Stratified sampling
Cluster sampling
Other sampling techniques- Convenience
sampling, Sequential sampling, Double
sampling and multi-stage sampling

Theory of Probability

Experiment
Outcome
Sample space
Event

Theory of Probability
P[A] = (number of possible outcomes in which event A occurs) / (total number of possible outcomes in the sample space)

where P[A] = probability that event A will occur

0 ≤ P(A) ≤ 1
P(A) + P(B) + … + P(N) = 1 (summed over all mutually exclusive, exhaustive events)
P(A or B) = P(A) + P(B) >> disjoint (mutually exclusive) events
P(A and B) = P(A) × P(B) >> joint probability of independent events

Theory of probability

P(A∪B) = P(A) + P(B) − P(A∩B) >> contingent joint event

P(A∩B) = P(A) + P(B) − P(A∪B) >> contingent joint event
P(A|B) = P(A∩B)/P(B) >> conditional probability for A
P(B|A) = P(A∩B)/P(A) >> conditional probability for B
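These rules can be checked by enumerating a simple sample space; the die events A ("even") and B ("greater than 3") below are illustrative:

from fractions import Fraction

space = set(range(1, 7))                # fair six-sided die
A = {2, 4, 6}                           # even
B = {4, 5, 6}                           # greater than 3

def P(event):
    return Fraction(len(event & space), len(space))

print(P(A | B))                         # P(A or B) = 2/3
print(P(A) + P(B) - P(A & B))           # same, by the addition rule
print(P(A & B) / P(B))                  # P(A|B) = 2/3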

Definition of Probability
A probability measure is a rule, say P, which associates
with each event contained in a sample space S a number such
that the following properties are satisfied:
1 For any event, A, P(A) ≥ 0.
2 P(S) = 1 (since S contains all the outcomes, S always
occurs).
3 P(not A)+P(A)=1.
4 If A and B are mutually exclusive events (events that cannot
occur simultaneously), then P(A or B) = P(A) + P(B) and
P(A and B) = 0

Note: Many elementary probability theorems (rules) follow directly from these definitions.

Confidence Intervals
The range around any hypothesized value of the mean (μ) within which 95% of the means of all samples of size n taken from that population will occur.
Denoted by x̄ ± 1.96 σ/√n

This is the 95% confidence interval for the mean, when the population variable X is normally distributed and σ is known.
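A minimal sketch of this formula; the sample mean, σ and n below are illustrative values:

import math

x_bar, sigma, n = 50.0, 10.0, 25                # illustrative statistics
sem = sigma / math.sqrt(n)                      # standard error of the mean = 2.0
print(x_bar - 1.96 * sem, x_bar + 1.96 * sem)   # 95% CI: 46.08 to 53.92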

Understanding Z-statistic

Confidence Intervals

Distribution of the Z statistic (the ratio of the difference between the population mean and the sample mean to the standard error of the mean (SEM)), obtained by taking the means of a large number of small samples from a normal distribution. The 95% confidence interval obtained by taking the means of a large number of small samples from a normally distributed population with known statistics is indicated by the black horizontal bar enclosed within ±1.96 SEM. By chance, 95% of the sample means will be within the range −1.96 to +1.96 SEM, with the remaining 5% outside this range.

Confidence Intervals
With larger sample sizes,
the 95% confidence
intervals get smaller

P-Value
It is defined as the probability of getting the observed result, or a more extreme result, if the null hypothesis is true. In other words, it is a measure of the likelihood of the result, given that the null hypothesis is true, i.e. the statistical significance of the claim.
Ranges from 0 to 1

P-Value
"P=0.030" is a shorthand way of saying "The
probability of getting 17 or fewer male
chickens out of 48 total chickens, IF the null
hypothesis is true that 50 percent of chickens
are male, is 0.030.
It is a usual convention in biology to use
a critical P-value of 0.05 (often called alpha, )
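The chicken example above is a one-tailed exact binomial probability, which can be reproduced with scipy.stats (a sketch, not part of the original slides):

from scipy.stats import binom

# P(17 or fewer males out of 48 | Ho: p = 0.5)
print(round(binom.cdf(17, 48, 0.5), 3))   # 0.030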

P-Value
This p-value measures how likely it was that
you would have gotten your sample results if
the null hypothesis were true.
The farther out your test statistic is on the tails
of the standard normal distribution, the
smaller the p-value will be, and the more
evidence you have against the null hypothesis
being true

Interpreting P-value
If the p-value is greater than or equal to α, you
fail to reject Ho.
If the p-value is less than α, reject Ho.
p-values on the borderline (very close to α)
are treated as marginal results

Interpreting P-value
- Here's how to interpret your results for any given alpha level
To make a proper decision about whether or not to reject Ho, you determine your cutoff probability for your p-value before doing a hypothesis test; this cutoff is called an alpha level (α).
Typical values for α are 0.05 or 0.01

Interpreting P-value
- How to interpret your results if you use an alpha level of 0.05
If the p-value is less than 0.01 (very small), the results are considered highly statistically significant >> reject Ho
If the p-value is between 0.05 and 0.01 (but not close to 0.05), the results are considered statistically significant >> reject Ho
If the p-value is close to 0.05, the results are considered marginally significant >> decision could go either way
If the p-value is greater than (but not close to) 0.05, the results are considered non-significant >> don't reject Ho

Biological vs statistical hypotheses


Biological null hypothesis- "Sexual selection by females has not caused male chickens to evolve bigger feet than females"
Statistical null hypothesis- "Male chickens don't have a different average foot size than females"

Statistical Hypothesis
Statistical Hypothesis- statement about the
probability distribution of populations using one
or more data samples
Hypothesis H0: All data samples originate from
the same population (or the single data sample is
consistent with a given theoretical distribution).
Hypothesis H1: Some data samples do not
originate from the same population (or the single
data sample is not consistent with the given
theoretical distribution).

Statistical Inference and Hypothesis


Testing
What do we mean by chance?
What do we mean by unlikely?
What do we mean by effect?

Hypothesis and Significance Testing


Hypothesis- is a statement about some
characteristic of a variable or a collection of
variables. (Agresti & Finlay, 1997).

Significance test- is a way of statistically


testing a hypothesis by comparing the data to
values predicted by the hypothesis

The Process of Hypothesis Testing

The Mechanism of Hypothesis Testing

Samples selected at random from very different populations may not necessarily be different. Simply by chance, the samples from populations 1 and 2 are similar, so you might mistakenly conclude the two populations are also similar.

The Mechanism of Hypothesis Testing

Even a random sample may not necessarily be a good representative of the population. Two samples have been
taken at random from the same population. By chance, sample 1 contains a group of relatively large fish, while
those in sample 2 are relatively small.

Type I & Type II errors

Test Statistics and your decision

Type I & Type II errors

Four possible results of hypothesis testing

Parametric statistics
Also known as classical statistics
Parametric tests are designed for analysing
data from a known distribution
ANOVA (1920s and 30s), Multiple Regression
(1800s), T-tests (1900s), Pearson Correlation
(1880s) are parametric statistical methods

Parametric statistics
General Assumptions of Parametric Statistical
Tests
1. The sample of n subjects is randomly selected
from the population.
2. The variables are continuous and from the
normal distribution
3. The measurement of each variable is based
on interval or ratio data

Non parametric Statistics


Sometimes called distribution free statistics
Do not require data to be normally distributed
In general, a less powerful test than the
analogous parametric test
No normality assumption
Uses less information
Spearman's Rho (1904), Kendall's Tau (1938), Kruskal-Wallis (1950s), Wilcoxon Signed-Ranks Matched Pairs (1940s)

Parametric vs Non Parametric


Parametric test >> Non-parametric analog
T-test (unpaired) >> Wilcoxon rank sum test
Paired t-test >> Wilcoxon signed rank test
ANOVA >> Kruskal-Wallis test
Repeated measures ANOVA >> Friedman test

The parametric tests are called parametric because, when we calculate the p-value, we use the parameters of the normal distribution: mean and standard deviation.
The non-parametric tests do not estimate these parameters, but instead are based on ranks.
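A minimal sketch pairing each test in the table above with its scipy.stats call; the three normal samples are illustrative data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(10, 2, 30)
g2 = rng.normal(11, 2, 30)
g3 = rng.normal(12, 2, 30)

print(stats.ttest_ind(g1, g2))          # unpaired t-test
print(stats.ranksums(g1, g2))           # Wilcoxon rank sum test
print(stats.ttest_rel(g1, g2))          # paired t-test
print(stats.wilcoxon(g1, g2))           # Wilcoxon signed rank test
print(stats.f_oneway(g1, g2, g3))       # one-way ANOVA
print(stats.kruskal(g1, g2, g3))        # Kruskal-Wallis test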

Hypothesis and Statistical Tests


The main goal of a statistical or hypothesis test: what is the probability of getting a result like my observed data, if the null hypothesis were true?
Evaluate and compare groups of data
To determine whether a hypothesis can be retained, or rejected and modified
Can refer to a single group
Can also refer to two groups

Steps for a hypothesis Test


1. Set up the null and alternative hypotheses:
Ho and Ha.
2. Take a random sample of individuals from the
population and calculate the sample statistics
(means and standard deviations).
3. Convert the sample statistic to a test statistic by
changing it to a standard score (all formulas for
test statistics are provided later in this chapter).
4. Find the p-value for your test statistic.
5. Examine your p-value and make your decision.
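A minimal sketch of these five steps as a one-sample t-test in scipy; the data, the hypothesized mean of 10 and α = 0.05 are illustrative:

import numpy as np
from scipy import stats

# Step 1: Ho: mu = 10 vs Ha: mu != 10
mu0, alpha = 10.0, 0.05
# Step 2: random sample and its statistics
sample = np.array([10.2, 9.8, 11.1, 10.5, 9.6, 10.9, 10.4, 10.8])
print(sample.mean(), sample.std(ddof=1))
# Steps 3-4: test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_stat, p_value)
# Step 5: decision
print("reject Ho" if p_value < alpha else "fail to reject Ho")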

Structure of Hypothesis Tests


1. Choose the appropriate test.
2. Establish the null and alternate hypotheses.
3. Decide on an acceptable error rate α.
4. Compute the test statistic from the data.
5. Compute the p-value.
6. Reject the null hypothesis if p ≤ α.

Sampling Distributions
Major parametric test statistics:
Z distribution
T distribution
Chi-square
F distribution

Sample size is the key

Sampling test Distributions

Four common probability distributions of sample statistics: z, t, chi-square, and F

Z distribution
Represents the probability distribution of a
random variable that is the ratio of the
difference between a sample statistic and its
population value to the standard deviation of
the population statistic

Student's t Distribution

Chi-square Distribution
represents the probability distribution of a
variable that is the square of values from a
standard normal distribution
bounded by 0 and infinity
used for interval estimation of population
variances
can also be used to determine the probability
of obtaining a sample difference (or one smaller
or larger) between observed values and those
predicted by a model

F Distribution
represents the probability distribution of a variable
that is the ratio of two independent chi-square
variables, each divided by its df (degrees of freedom)
(Hays 1994).
Because variances are distributed as χ², the F
distribution is used for testing hypotheses about ratios
of variances.
bounded by zero and infinity.
Used to determine the probability of obtaining a
sample variance ratio (or one larger) for a specified
value of the true ratio between variances

Hypothesis Testing
Null Hypothesis (Ho) & Alternate Hypothesis (Ha)
Ho: μ = μ₀ / Ha: μ ≠ μ₀ (two-tailed test)
Ho: μ = μ₀ / Ha: μ > μ₀ (or μ < μ₀) (one-tailed test)

Types of hypothesis tests

Associations and Differences


Relationship between variables Associations
and Differences
Association- The relationship between a wing
length and weight of a growing bird
Difference- The relationship between the
mean tail length of Gull-billed Tern and the
mean tail length of Common Tern.

Difference of mean tests


One sample t-test
Two independent samples t-test
t = (difference between means) / SE, where t is the test statistic and SE is the standard error of the difference (based on s/√n)
Paired samples t-test

K-independent samples (n>2)


- ANOVA (Analysis of Variance)
One way ANOVA
Two way ANOVA

Difference of mean tests (Non parametric)


- One sample
Runs test
- Two independent samples
Kolmogorov-Smirnov test
Mann Whitney U test

Difference of mean tests (Non parametric)

Paired samples
Wilcoxon signed Ranks test
McNemar's test
Marginal Homogeneity test

- K independent samples
Kruskal-Wallis test
Friedman's Rank test

Test of Proportions, ratios and indices


Chi-square test
Goodness of fit
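A minimal sketch of a chi-square goodness-of-fit test with scipy; the observed counts and the expected 1:1 ratio are illustrative:

from scipy.stats import chisquare

observed = [17, 31]                          # e.g. males, females
expected = [24, 24]                          # expected under a 1:1 ratio
print(chisquare(observed, f_exp=expected))   # chi-square statistic and p-value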

Correlation
Pearson's product moment correlation (r)
To investigate linear relationships between two continuous variables
r ranges from −1 to +1
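A minimal sketch of Pearson's r with scipy; the paired measurements are illustrative:

from scipy.stats import pearsonr

wing_length = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5]   # illustrative
weight      = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2]           # illustrative
r, p_value = pearsonr(wing_length, weight)
print(r, p_value)        # r near +1 indicates a strong positive linear relationship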

Correlation

Scatter plots with various correlations

Regression
Prediction is made on the assumption that the hypothesis is correct
Simple linear regression
Investigates relationships between a dependent and an independent variable
A best-fit straight line describes the relation between X and Y
Regression coefficient / coefficient of determination (R²)
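A minimal sketch of simple linear regression with scipy.stats.linregress; X (age in days) and Y (weight in kg) are illustrative values:

from scipy.stats import linregress

x = [0, 7, 14, 21, 28]                  # illustrative ages in days
y = [3.2, 3.5, 3.9, 4.1, 4.4]           # illustrative weights in kg
fit = linregress(x, y)
print(fit.slope, fit.intercept)         # best-fit line Y = intercept + slope * X
print(fit.rvalue ** 2)                  # coefficient of determination, R^2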

Regression

Regression lines by gender and parity status for predicting weight at 1 month of age in
term babies

Classification of some hypothesis tests

Summary of Statistical Tests

Common Errors of statistical analysis


Samples are not random
Sample size is too low for any meaningful
interpretation
Non-independence of sample data
Overuse of non-parametric statistics, even
with low sample size
Failure to do a graphical exploration

Common Errors of statistical analysis


Ignoring power analysis and effect size
Interpreting simple correlation as cause and
effect
Use of complex model and multivariate
statistics without verifying the merit of the
data

Power of a test
A measure of the likelihood of a test reaching a correct conclusion: the probability that the test rejects the null hypothesis when it is in fact false (power = 1 − β)
