2010
Tsagris Michail
mtsagris@yahoo.gr
Table of Contents

1.1 Introduction
2.1 Data Analysis toolpack
2.2 Descriptive Statistics
2.3 Z-test for two samples
2.4 t-test for two samples assuming unequal variances
2.5 t-test for two samples assuming equal variances
2.6 F-test for the equality of variances
2.7 Paired t-test for two samples
2.8 Ranks, Percentiles, Sampling, Random Numbers Generation
2.9 Covariance, Correlation, Linear Regression
2.10 One-way Analysis of Variance
2.11 Two-way Analysis of Variance with replication
2.12 Two-way Analysis of Variance without replication
2.13 Statistical functions
3.1 The Solver add-in
1.1 Introduction
One of the reasons these notes were written was to help students, and others, perform some statistical analyses without having to use statistical software such as R, SPSS or Minitab. Excel cannot reasonably be expected to offer the full range of analyses that statistical packages do, but it performs at a good level nonetheless.
The areas covered by these notes are: descriptive statistics, the z-test for two samples, the t-test for two samples assuming (un)equal variances, the paired t-test for two samples, the F-test for the equality of variances of two samples, ranks and percentiles, sampling (random and periodic, or systematic), random number generation, Pearson's correlation coefficient, covariance, linear regression, one-way ANOVA, two-way ANOVA with and without replication, and the moving average.
We will also demonstrate the use of non-parametric statistics in Excel for some of the previously mentioned techniques. Furthermore, informal comparisons between the results provided by Excel and the ones provided by SPSS and some other packages will be carried out, to check for any discrepancies. One thing worth mentioning before going through these notes is that they do not contain the theory underlying the techniques used. These notes show how to cope with statistics using Excel.
The first edition was written in May 2008. In the second edition (July 2012) we added the Solver add-in. This allows us to perform linear numerical optimization (maximization or minimization) with or without linear constraints. It also offers the possibility of solving a system of equations, again with or without linear constraints. I am grateful to Vassilis Vrysagotis (teaching fellow at the Technological Educational Institute of Chalkis) for his contribution. This third edition (November 2014) uses Excel 2010 (an upgrade for us, even in 2014).
If you find any mistakes, disagree with something stated here, or have anything else you want to ask, please send me an e-mail. For more statistical resources the reader is referred to statlink.tripod.com.
2.1 Data Analysis toolpack

Picture 1
Select Add-Ins from the list on the left and the window of Picture 2 will appear. In this window (Picture 2) press Go to move on to the window of Picture 3, where you select the two options as I did in Picture 3. If you go to the Data tab in Excel you will see the Data Analysis and Solver libraries added (Picture 4). We will need the Solver later. The good thing is that we only have to do this once, not every time we open Excel.
Picture 2
Picture 3
Picture 4
2.2 Descriptive Statistics

By pressing Data Analysis (see Picture 4) the window of Picture 5 will appear.
Picture 5
The default value for the confidence level is 95%. In other words, the confidence level is set to the usual 95%. The results produced by Excel are provided in Table 1.
Picture 6
Column1
Mean                     15.4
Standard Error           0.74778585
Median                   15
Mode                     20
Standard Deviation       5.28764444
Sample Variance          27.9591837
Kurtosis                 0.50899442
Skewness                 0.11750986
Range                    21
Minimum                  4
Maximum                  25
Sum                      770
Count                    50
Confidence Level(95.0%)  1.50273192

Table 1: Descriptive statistics.
than the number of rows we selected. The sample variances differ slightly, but this is not really a problem. SPSS calculates a 95% confidence interval for the true mean, whereas Excel provides only the quantity used to calculate that interval. The construction of the interval is straightforward: subtract this quantity from the mean to get the lower limit and add it to the mean to get the upper limit. So the interval is (mean - conf. level, mean + conf. level) = (15.4 - 1.50273192, 15.4 + 1.50273192) = (13.89727, 16.90273).
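This interval is easy to check outside Excel as well. The Python sketch below rebuilds it from the summary values of Table 1 alone; the t quantile for 49 degrees of freedom (about 2.0096) is taken from statistical tables, since the Python standard library has no t distribution.

```python
import math

# Summary values from Table 1
n = 50
mean = 15.4
sd = 5.28764444

se = sd / math.sqrt(n)        # standard error of the mean
t_crit = 2.0096               # 97.5% t quantile, df = 49, from tables (assumed)
half_width = t_crit * se      # Excel's "Confidence Level(95.0%)" quantity
lower, upper = mean - half_width, mean + half_width
```

The half-width comes out as roughly 1.5027, so the interval matches the one computed above.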
2.3 Z-test for two samples

Picture 7
We selected the hypothesized mean difference to be zero and filled the variance boxes with the sample variances. In order to perform the z-test we must know the variance of each population from which the samples came. Since we do not have this information, we used the sample variances for illustration purposes. The value of the z-statistic, the critical values and the p-values for the one-sided and two-sided tests are provided. The results, provided in Table 2, are the same as the ones generated by R. Both p-values are equal to zero, indicating that the mean difference of the two populations from which the data were drawn is statistically significant at an alpha of 0.05.
z-Test: Two Sample for Means

                              Variable 1   Variable 2
Mean                          10.25        18.83333333
Known Variance                8.618421     11.1092
Observations                  20           30
Hypothesized Mean Difference  0
z                             9.589103286
P(Z<=z) one-tail              0
z Critical one-tail           1.644853627
P(Z<=z) two-tail              0
z Critical two-tail           1.959963985

Table 2: Z-test.
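For readers who want to verify the output, the z statistic and its p-values can be recomputed from the summary values of Table 2 alone. The sketch below uses plain Python; NormalDist from the standard library supplies the normal cumulative distribution (at this magnitude of z the p-value underflows to exactly 0 in floating point, matching Excel's display).

```python
import math
from statistics import NormalDist

# Sample summaries from Table 2; the variances are treated as known
m1, v1, n1 = 10.25, 8.618421, 20
m2, v2, n2 = 18.83333333, 11.1092, 30

z = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
p_one = 1 - NormalDist().cdf(abs(z))   # one-tailed p-value
p_two = 2 * p_one                      # two-tailed p-value
```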
2.4 t-test for two samples assuming unequal variances

Picture 8
t-Test: Two-Sample Assuming Unequal Variances

                              Variable 1   Variable 2
Mean                          10.25        18.83333333
Variance                      8.618421053  11.1091954
Observations                  20           30
Hypothesized Mean Difference  0
df                            44
t Stat                        9.589104187
P(T<=t) one-tail              1.19886E-12
t Critical one-tail           1.680229977
P(T<=t) two-tail              2.39773E-12
t Critical two-tail           2.015367574
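The t statistic and the degrees of freedom in this table can be reproduced from the summaries alone via the Welch-Satterthwaite formula; Excel reports the rounded value, 44. A Python sketch (the p-values would additionally require a t distribution, which the standard library lacks):

```python
import math

# Sample summaries from the table above
m1, v1, n1 = 10.25, 8.618421053, 20
m2, v2, n2 = 18.83333333, 11.1091954, 30

se2 = v1 / n1 + v2 / n2
t = (m1 - m2) / math.sqrt(se2)
# Welch-Satterthwaite degrees of freedom (fractional before rounding)
df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
```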
2.5 t-test for two samples assuming equal variances

Picture 9
t-Test: Two-Sample Assuming Equal Variances

                              Variable 1   Variable 2
Mean                          10.25        18.83333333
Variance                      8.618421053  11.1091954
Observations                  20           30
Pooled Variance               10.12326389
Hypothesized Mean Difference  0
df                            48
t Stat                        -9.34515099
P(T<=t) one-tail              1.10835E-12
t Critical one-tail           1.677224196
P(T<=t) two-tail              2.2167E-12
t Critical two-tail           2.010634758
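The pooled variance and the t statistic can again be checked by hand from the summaries; a minimal Python sketch:

```python
import math

# Sample summaries from the table above
m1, v1, n1 = 10.25, 8.618421053, 20
m2, v2, n2 = 18.83333333, 11.1091954, 30

# The pooled variance weights each sample variance by its degrees of freedom
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
```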
2.6 F-test for the equality of variances

Picture 10
F-Test Two-Sample for Variances

                     Variable 1   Variable 2
Mean                 10.25        18.83333333
Variance             8.618421053  11.1091954
Observations         20           30
df                   19           29
F                    0.775791652
P(F<=f) one-tail     0.285474981
F Critical one-tail  0.481414106
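The F statistic is simply the ratio of the two sample variances, with Variable 1's variance in the numerator. A two-line check (the p-value itself needs an F distribution, which the standard library lacks, so only the statistic is reproduced):

```python
# Variances and degrees of freedom from the table above
v1, df1 = 8.618421053, 19
v2, df2 = 11.1091954, 29

F = v1 / v2   # ratio of the two sample variances
```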
2.7 Paired t-test for two samples

Picture 11
t-Test: Paired Two Sample for Means

                              Variable 1   Variable 2
Mean                          10.25        16.95
Variance                      8.618421053  3.944736842
Observations                  20           20
Pearson Correlation           0.941021597
Hypothesized Mean Difference  0
df                            19
t Stat                        -23.76638508
P(T<=t) one-tail              6.77131E-16
t Critical one-tail           1.729132812
P(T<=t) two-tail              1.35426E-15
t Critical two-tail           2.093024054
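Although the raw paired data are not shown here, the t statistic can still be recovered from the reported summaries, because the variance of the pairwise differences is determined by the two sample variances and the Pearson correlation. A Python sketch:

```python
import math

# Summaries from the paired t-test table above
m1, v1 = 10.25, 8.618421053
m2, v2 = 16.95, 3.944736842
n, r = 20, 0.941021597

# Variance of the differences: var(X - Y) = var(X) + var(Y) - 2*cov(X, Y)
var_d = v1 + v2 - 2 * r * math.sqrt(v1) * math.sqrt(v2)
t = (m1 - m2) / math.sqrt(var_d / n)
```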
2.8 Ranks, Percentiles, Sampling, Random Numbers Generation

Picture 12
The dialogue box of the sampling option is the one in Picture 13. Two sampling schemes are available: systematic (periodic) sampling and random sampling. In the first case you insert a number (the period); say it is 5, then the first value of the sample will be the number in the 5th row, and the rest of the values of the sample will be the ones in the 10th, the 15th, the 20th rows and so on. With the random sampling method, you state the sample size and Excel does the rest. If you specify a number in the second option of the sampling method, say 30, then a sample of size 30 will be selected from the column specified in the first box.
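The systematic scheme is easy to express in code. A small Python sketch (the helper name systematic_sample and the 20-row toy data are made up for illustration):

```python
def systematic_sample(values, period, start=None):
    """Every `period`-th element; by default starting at the `period`-th row,
    matching the behaviour described in the text."""
    if start is None:
        start = period - 1
    return values[start::period]

rows = list(range(1, 21))                 # 20 hypothetical data rows
sample = systematic_sample(rows, 5)       # rows 5, 10, 15, 20
```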
Picture 13
If you are interested in a random sample from a known distribution, then random number generation is the option you want to use. Unfortunately, not many distributions are offered. The dialogue box of this option is shown in Picture 14. In the number of variables box you can select how many samples you want drawn from the specific distribution. The white box below it is used to define the sample size. The distributions offered are Uniform, Normal, Bernoulli, Binomial and Poisson; two more options (Patterned and Discrete) are also available. Different distributions require different parameters to be defined.
Picture 14
The random seed is an option used to give the sampling algorithm a starting value, but it can also be left blank. If we specify a number, say 1234, and the next time we want to generate another sample we put the same random seed again, we will get the same sample. The number of variables option allows us to generate more than one sample.
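The same seeding behaviour can be demonstrated with Python's random module (a sketch; 1234 is just the example seed from the text):

```python
import random

# Two generators seeded with the same value produce identical draws
rng1 = random.Random(1234)
rng2 = random.Random(1234)
sample1 = [rng1.gauss(0, 1) for _ in range(5)]   # five draws from N(0, 1)
sample2 = [rng2.gauss(0, 1) for _ in range(5)]
```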
2.9 Covariance, Correlation, Linear Regression

Picture 15
          Column 1  Column 2
Column 1  1
Column 2  0.941022  1
Picture 16
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.941021597
R Square           0.885521645
Adjusted R Square  0.879161737
Standard Error     0.690416648
Observations       20

ANOVA
            df  SS           MS           F            Significance F
Regression  1   66.36984733  66.36984733  139.2349644  6.6133E-10
Residual    18  8.580152672  0.476675148
Total       19  74.95

              Coefficients  Standard Error  t Stat       P-value      Lower 95%    Upper 95%
Intercept     10.42442748   0.574168954     18.15567944  5.08222E-13  9.21814327   11.6307117
X Variable 1  0.636641221   0.053953621     11.79978663  6.6133E-10   0.523288869  0.74999357
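Several internal consistency checks tie this output together: R Square is the regression sum of squares over the total sum of squares, the square of Multiple R equals R Square, and with a single predictor the F statistic equals the squared t statistic of the slope. A Python sketch over the values above:

```python
# Quantities copied from the SUMMARY OUTPUT
ss_regression, ss_residual, ss_total = 66.36984733, 8.580152672, 74.95
multiple_r = 0.941021597
t_slope = 11.79978663
f_stat = 139.2349644

r_square = ss_regression / ss_total   # proportion of variance explained
```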
[Figure: scatter plot of the X values against the Y values and the predicted Y values]

[Figure: plot of the residuals against X Variable 1]
The first figure is a scatter plot of the data, the X values versus the Y values and the predicted Y values. The linear relation between the two variables is obvious from the graph. Do not forget that the correlation coefficient exhibited a high value.
Excel also produced the residuals and the predicted values in the same sheet. We shall construct a scatter plot of these two quantities in order to check (graphically) the assumption of homoscedasticity (i.e. constant variance of the residuals). If the assumption of homoscedasticity of the residuals holds true, then we should see all the values within a bandwidth. We see that almost all values fall within -1.5 and 1.5. Still, it seems that the variance is not constant, since there seems to be evidence of a pattern. This means that the residuals do not exhibit constant variance. But then again, we only have 20 points, so our eyes could be wrong. If we are not sure about the validity of the assumption, we can transform the Y values using a log transformation and run the regression using the transformed Y values.
The Normal Probability Plot is used to check the normality of the residuals graphically. Should the residuals follow the normal distribution, the graph should be a straight line. Unfortunately, many times the eye is not the best judge of such things. The Kolmogorov-Smirnov test conducted in SPSS provided evidence to support the normality hypothesis of the residuals.
[Figure: normal probability plot of Y against the sample percentiles]
2.10 One-way Analysis of Variance

Picture 17
Anova: Single Factor

SUMMARY
Groups    Count  Sum  Average   Variance
Column 1  15     139  9.266667  7.495238
Column 2  12     188  15.66667  1.878788
Column 3  23     443  19.26087  15.29249

ANOVA
Source of Variation  SS        df  MS        F        P-value   F crit
Between Groups       907.9652  2   453.9826  46.1809  8.07E-12  3.195056
Within Groups        462.0348  47  9.830527
Total                1370      49
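The whole ANOVA table can be rebuilt from the group counts, means and variances in the SUMMARY block, which is a useful sanity check when only summaries are available. A Python sketch:

```python
# Group summaries (count, mean, variance) from the SUMMARY block above
groups = [(15, 9.266667, 7.495238),
          (12, 15.66667, 1.878788),
          (23, 19.26087, 15.29249)]

N = sum(n for n, m, v in groups)
grand_mean = sum(n * m for n, m, v in groups) / N
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, v in groups)
ss_within = sum((n - 1) * v for n, m, v in groups)
df_between, df_within = len(groups) - 1, N - len(groups)
F = (ss_between / df_between) / (ss_within / df_within)
```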
2.11 Two-way Analysis of Variance with replication

First, we must enter the data in the correct way. The proper way of data entry follows (the data refer to the car measurements). As you can see, we have three columns of data representing the three levels of the one factor, headed by the three labels C1, C2 and C3. The first column states the two levels of the second factor; we used the labels R1 and R2 to mark the rows belonging to each level, i.e. the sample sizes of each combination of the two factors. In other words, the first combination of the two factors occupies cells B2 to B6. This means that each combination of factors has 5 measurements.
Picture 18
From the dialogue box of Picture 4, we select Anova: Two-Factor With Replication and the dialogue box shown in Picture 19 will appear.
Picture 19
We filled the two blank white boxes with the input range and the rows per sample. The alpha is at its usual value of 0.05. By pressing OK the results presented overleaf are produced. The results generated by SPSS are the same. At the bottom of Table 10 there are three p-values: two p-values for the two factors and one p-value for the interaction. The row factor is denoted as Sample in Excel.

A limitation of this analysis when performed in Excel is that the sample sizes in each combination of columns and rows (the two factors) must be equal. In other words, the design has to be balanced, with the same number of observations in every cell.
Anova: Two-Factor With Replication

SUMMARY   C1        C2        C3        Total
S1
Count     5         5         5         15
Sum       31        48        58        137
Average   6.2       9.6       11.6      9.133333
Variance  4.7       60.8      0.3       24.12381

S2
Count     5         5         5         15
Sum       75        130       73        278
Average   15        26        14.6      18.53333
Variance  62        34        9.3       59.98095

Total
Count     10        10        10
Sum       106       178       131
Average   10.6      17.8      13.1
Variance  51.15556  116.8444  6.766667

ANOVA
Source of Variation  SS        df  MS        F         P-value   F crit
Sample               662.7     1   662.7     23.23904  6.55E-05  4.259677
Columns              267.2667  2   133.6333  4.686148  0.019138  3.402826
Interaction          225.8     2   112.9     3.959088  0.032665  3.402826
Within               684.4     24  28.51667
Total                1840.167  29

Table 10: The table of the two-way analysis of variance with replication.
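As a quick check, each F statistic in Table 10 is the corresponding mean square divided by the within-groups mean square:

```python
# Mean squares from Table 10
ms_within = 28.51667
effects = {"Sample": 662.7, "Columns": 133.6333, "Interaction": 112.9}

# F = MS(effect) / MS(within) for each of the three tests
f_ratios = {name: ms / ms_within for name, ms in effects.items()}
```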
2.12 Two-way Analysis of Variance without replication

Picture 20
Anova: Two-Factor Without Replication

SUMMARY   Count  Sum  Average   Variance
Row 1     3      17   5.666667  22.33333
Row 2     3      25   8.333333  14.33333
Column 1  2      8    4         0
Column 2  2      12   6         32
Column 3  2      22   11        0

ANOVA
Source of Variation  SS        df  MS        F       P-value   F crit
Rows                 10.66667  1   10.66667  1       0.42265   18.51282
Columns              52        2   26        2.4375  0.290909  19
Error                21.33333  2   10.66667
Total                84        5

Table 11: The table of the two-way analysis of variance without replication.
2.13 Statistical functions

AVEDEV calculates the average of the absolute deviations of the data from their mean.
AVERAGE is the mean value of all data points.
AVERAGEA calculates the mean, allowing for logical values: FALSE is evaluated as 0 and TRUE as 1.
BETADIST calculates the cumulative beta probability density function.
BETAINV calculates the inverse of the cumulative beta probability density
function.
BINOMDIST determines the probability that a set number of true/false trials,
where each trial has a consistent chance of generating a true or false result, will
result in exactly a specified number of successes (for example, the probability that
exactly four out of eight coin flips will end up heads).
CHIDIST calculates the one-tailed probability of the chi-squared distribution.
CHIINV calculates the inverse of the one-tailed probability of the chi-squared distribution.
CHITEST calculates the result of the test for independence: the value from the
chi-square distribution for the statistics and the appropriate degrees of freedom.
CONFIDENCE returns a value you can use to construct a confidence interval for
a population mean.
CORREL returns the correlation coefficient between two data sets.
COVAR calculates the covariance of two data sets. Mathematically, it is the product of the correlation coefficient and the standard deviations of the two data sets.
CRITBINOM determines when the number of failures in a series of true/false
trials exceeds a criterion (for example, more than 5 percent of light bulbs in a
production run fail to light).
DEVSQ calculates the sum of squares of deviations of the data points from their sample mean. Deriving the variance from it is straightforward: simply divide by the sample size, or by the sample size decreased by one to get the unbiased estimator of the true variance.
EXPONDIST returns the exponential distribution.
FDIST calculates the F probability distribution (degree of diversity) for two data
sets.
FINV returns the inverse of the F probability distribution.
FISHER calculates the Fisher transformation.
FISHERINV returns the inverse of the Fisher transformation.
FORECAST calculates a future value along a linear trend based on an existing
time series of values.
FREQUENCY calculates how often values occur within a range of values and then returns a vertical array of numbers having one more element than Bins_array.
FTEST returns the two-tailed probability that the variances of two data sets are not significantly different.
GAMMADIST calculates the gamma distribution.
GAMMAINV returns the inverse of the gamma distribution.
GAMMALN calculates the natural logarithm of the gamma function.
GEOMEAN calculates the geometric mean.
GROWTH predicts the exponential growth of a data series.
HARMEAN calculates the harmonic mean.
HYPGEOMDIST returns the probability of selecting an exact number of a single
type of item from a mixed set of objects. For example, a jar holds 20 marbles, 6 of
which are red. If you choose three marbles, what is the probability you will pick
exactly one red marble?
INTERCEPT calculates the point at which a line will intersect the y-axis.
KURT calculates the kurtosis of a data set.
LARGE returns the k-th largest value in a data set.
LINEST generates a line that best fits a data set by generating a two-dimensional array of values to describe the line.
LOGEST generates a curve that best fits a data set by generating a two-dimensional array of values to describe the curve.
LOGINV returns the inverse of the lognormal cumulative distribution function.
LOGNORMDIST returns the cumulative lognormal distribution of a value.
MAX returns the largest value in a data set (ignores logical values and text).
MAXA returns the largest value in a data set (does not ignore logical values and text).
MEDIAN returns the median of a data set.
MIN returns the smallest value in a data set (ignores logical values and text).
MINA returns the smallest value in a data set (does not ignore logical values and text).
MODE returns the most frequently occurring value in an array or range of data.
NEGBINOMDIST returns the probability that there will be a given number of
failures before a given number of successes in a binomial distribution.
NORMDIST returns the normal distribution (density or cumulative probability) of a value, for a given mean and standard deviation.
NORMINV returns the inverse of the normal cumulative distribution for a given probability, mean and standard deviation.
NORMSDIST returns the cumulative standard normal distribution (mean 0 and standard deviation 1) of a value.
NORMSINV returns the inverse of the cumulative standard normal distribution for a given probability.
PEARSON returns a value that reflects the strength of the linear relationship
between two data sets.
PERCENTILE returns the k-th percentile of values in a range.
PERCENTRANK returns the rank of a value in a data set as a percentage of the
data set.
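Many of these functions have close equivalents in other environments. As a point of reference, a few of them expressed with Python's standard statistics module (the data vector is an arbitrary example; geometric_mean and harmonic_mean need Python 3.8 or later):

```python
import statistics as st

data = [4, 8, 15, 16, 23, 42]

avedev = sum(abs(x - st.mean(data)) for x in data) / len(data)  # AVEDEV
geomean = st.geometric_mean(data)                               # GEOMEAN
harmean = st.harmonic_mean(data)                                # HARMEAN
med = st.median(data)                                           # MEDIAN
second_largest = sorted(data, reverse=True)[1]                  # LARGE(range, 2)
```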
3.1 The Solver add-in

        Column B  Column C
Row 1             =A1*400+300*B1
Row 2   300       =4*A1+2*B1
Row 3   70        =a1
Row 4   240       =2*A1+4*B1
Picture 21
By pressing the Add button in the dialogue box of Picture 21, the dialogue box of Picture 22 will appear. We enter the cell that describes the first constraint and the cell that holds its maximum value, and we repeat this task until all constraints are entered; in case we have no constraints, we do not have to come here at all. After entering each constraint we press Add. When we have entered the final constraint we can either press OK directly or press Add first and then OK. In the second case a message will appear (Picture 23) preventing us from continuing; we press Cancel and move to Picture 24, which is the same as Picture 21 but with the constraints now added.
Picture 22
Picture 23
Picture 24
Then we select Simplex LP as the solving method and the message of Picture 25 will appear. We press OK and the message disappears. The solution will also appear in Excel.
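Excel's Simplex LP engine is the natural tool here, but the optimum of this small problem can also be found by brute force, which helps build intuition. The sketch below is specific to this two-variable problem, maximising 400*A1 + 300*B1 subject to 4*A1 + 2*B1 <= 300, A1 <= 70, 2*A1 + 4*B1 <= 240 and non-negativity; it enumerates intersections of constraint boundaries rather than implementing a general simplex method.

```python
from itertools import combinations

# Each constraint a*x + b*y <= r, including x >= 0 and y >= 0
cons = [(4, 2, 300), (1, 0, 70), (2, 4, 240), (-1, 0, 0), (0, -1, 0)]

def intersect(c1, c2):
    """Intersection point of two constraint boundary lines, or None if parallel."""
    (a1, b1, r1), (a2, b2, r2) = c1, c2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(p):
    return all(a * p[0] + b * p[1] <= r + 1e-9 for a, b, r in cons)

# A linear objective attains its maximum at a vertex of the feasible region
vertices = []
for c1, c2 in combinations(cons, 2):
    p = intersect(c1, c2)
    if p is not None and feasible(p):
        vertices.append(p)

best = max(vertices, key=lambda p: 400 * p[0] + 300 * p[1])
best_value = 400 * best[0] + 300 * best[1]
```

The optimum found this way is A1 = 60, B1 = 30 with objective value 33000, matching the Solver output shown below.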
Picture 25
        Column A  Column B  Column C
Row 1   60        30        33000
Row 2             300       300
Row 3             70        60
Row 4             240       240