
Statistical Inference

David Casado

∟ Faculty of Economic and Business Sciences, 5 June 2015
∟ Department of Statistics and Operational Research II
∟ David Casado de Lucas

You can decide not to print this file and consult it in digital format – paper and ink will be saved. Otherwise, print it on recycled paper, double-sided and with less ink. Be ecological. Thank you very much.

Contents

Links, Keywords and Descriptions
Framework and Scope of the Methods
Some Remarks
Sampling Probability Distribution
Methods for Estimating
Properties of Estimators
Methods and Properties
Methods for Estimating
Minimum Sample Size
Methods and Sample Size
Parametric
    Based on T
    Based on Λ
    Analysis of Variance (ANOVA)
Nonparametric
Parametric and Nonparametric
PE – CI – HT
Additional Exercises
Appendixes
    Probability Theory
        Some Reminders
        Markov's Inequality. Chebyshev's Inequality
        Probability and Moments Generating Functions. Characteristic Function.
    Mathematics
        Some Reminders
        Limits
References
Tables of Statistics
Probability Tables
Index

Prologue

These exercises and problems are a necessary complement to the theory included in Notes of Statistical

Inference, available at http://www.casado-d.org/edu/NotesStatisticalInference-Slides.pdf. Nevertheless, some

important theoretical details are also included in the remarks at the beginning of each chapter. Those Notes are

thought for teaching purposes, and they do not include the advanced mathematical justifications and

calculations included in this document.

Although we can only study linearly, step by step, it is worth noticing that in Statistical Inference methods are usually related, as tasks are in the real world. Thus, in most exercises and problems we have made clear which suppositions are made and how they should be properly verified. In some cases, several statistical methods have been "naturally" combined in the statement. Many steps and even sentences are repeated in most exercises of the same type, both to insist on them and to make the exercises easier to read individually. The advanced exercises have been marked with the symbol (*).

The code with which we have done some calculations, using the programming language R, is written in Courier New font; you can copy and paste this code from the file. I include some notes to help, to the best of my knowledge, students whose mother tongue is not English.

Acknowledgements

This document has been created with Linux, LibreOffice, OpenOffice.org, GIMP and R. I thank those who make this software available for free. I donate funds to these kinds of projects from time to time.

Links, Keywords and Descriptions

Inference Theory (IT)

Framework and Scope of the Methods

> [Keywords] infinite populations, independent populations, normality, asymptoticness, descriptive statistics.

> [Description] The conditions under which the Statistics considered here can be applied are listed.

Some Remarks

> [Keywords] partial knowledge, randomness, certainty, dimensional analysis, validity, use of the samples, calculations.

> [Description] The partial knowledge justifies both the random character of the mathematical variables used to explain the variables of the real-world problems and the impossibility of reaching maximum certainty when samples are used instead of the whole population. The validity of the results must be understood within the scenario made up of the assumptions, the methods, the certainty and the data.

Sampling Probability Distribution

Exercise 1it-spd

> [Keywords] inference theory, joint distribution, sampling distribution, sample mean, probability function.

> [Description] From a simple probability distribution for X, the joint distribution of a sample (X1,X2) and the sampling distribution of the sample mean X̄ are determined.

Methods for Estimating

Exercise 1pe-m

> [Keywords] point estimations, binomial distribution, Bernoulli distribution, method of the moments, maximum likelihood method,

plug-in principle.

> [Description] For the binomial distribution, the two methods are applied to estimate the second parameter (probability), when the

first (number of trials) is known. In the second method, the maximum can be found by looking at the derivatives. Both methods

provide the same estimator. The plug-in principle allows using the previous estimator to obtain others for the mean and the variance.

Exercise 2pe-m

> [Keywords] point estimations, geometric distribution, method of the moments, maximum likelihood method, plug-in principle.

> [Description] For the geometric distribution, the two methods are applied to estimate the parameter. In the second method, the

maximum can be found by looking at the derivatives. Both methods provide the same estimator. The plug-in principle is applied to use

the previous estimator to obtain others for the mean and the variance.

Exercise 3pe-m

> [Keywords] point estimations, Poisson distribution, method of the moments, maximum likelihood method, plug-in principle.

> [Description] For the Poisson distribution, the two methods are applied to estimate the parameter. In the second method, the

maximum can be found by looking at the derivatives. The two methods provide the same estimator. The plug-in principle is applied to

use the previous estimator to obtain others for the mean and the variance.

Exercise 4pe-m

> [Keywords] point estimations, normal distribution, method of the moments, maximum likelihood method.

> [Description] For the normal distribution, the two methods are applied to estimate at the same time the two parameters of this

distribution. In the second method, the maximum can be found by looking at the derivatives. The two methods provide the same

estimator.

Exercise 5pe-m

> [Keywords] point estimations, (continuous) uniform distribution, method of the moments, maximum likelihood method, plug-in

principle, integrals.

> [Description] For the continuous uniform distribution, the two methods are applied to estimate the parameter. In the second method,

the maximum cannot be found by looking at the derivatives and this task is done by applying simple qualitative reasoning. The two

methods provide different estimators. The plug-in principle allows using the previous estimator to obtain others for the mean and the

variance. As a mathematical exercise, the theoretical expression of the mean and the variance are calculated.

Exercise 6pe-m

> [Keywords] point estimations, (translated) exponential distribution, method of the moments, maximum likelihood method, plug-in

principle, integrals.

> [Description] For a translation of the exponential distribution, the two methods are applied to estimate the parameter. In the second

method, the maximum can be found by looking at the derivatives. The two methods provide the same estimator. The plug-in principle

is applied to use the previous estimator to obtain others for the mean. As a mathematical exercise, the theoretical expression of the

mean and the variance of the distribution are calculated.

Exercise 7pe-m

> [Keywords] point estimations, method of the moments, maximum likelihood method, plug-in principle, integrals.

> [Description] For a distribution given through its density function, the two methods are applied to estimate the parameter. In the

second method, the maximum cannot be found by looking at the derivatives and this task is done by applying simple qualitative

reasoning. The two methods provide different estimators. The plug-in principle is applied to obtain other estimators for the mean and

the variance. Additionally, the theoretical expression of the mean and the variance of this distribution are calculated.

Properties of Estimators

Exercise 1pe-p

> [Keywords] point estimations, probability, normal distribution, sample mean, completion (standardization).

> [Description] For a normal distribution with known parameters, the probability that the sample mean is larger than a given value is

calculated.

Exercise 2pe-p

> [Keywords] point estimations, probability, normal distribution, sample quasivariance, completion.

> [Description] For a normal distribution with known standard deviation, the probability that the sample quasivariance is larger than a

given value is calculated.

Exercise 3pe-p

> [Keywords] point estimations, probability, Bernoulli distribution, sample proportion, completion (standardization), asymptoticness.

> [Description] For a Bernoulli distribution with known parameter, the probability that the sample proportion is between two given

values is calculated.

Exercise 4pe-p

> [Keywords] point estimations, probability and quantile, normal distribution, sample mean, sample quasivariance, completion.

> [Description] For two (independent) normal distributions with known parameters, probabilities and quantiles of several events

involving the sample mean or the sample quasivariance are calculated or found out, respectively.

Exercise 5pe-p

> [Keywords] point estimations, probability, normal distribution, total sum, completion, bound.

> [Description] For two (independent) normal distributions with known parameters, the probabilities of several events involving the

total sum are calculated.

Exercise 6pe-p

> [Keywords] point estimations, trimmed sample mean, mean square error, consistency, sample mean, rate of convergence.

> [Description] To study the population mean, the mean square error and the consistency are studied for the trimmed sample mean.

The speed in converging is analysed through a comparison with that of the (ordinary) sample mean.

Exercise 7pe-p

> [Keywords] point estimations, chi-square distribution, mean square error, consistency.

> [Description] To study twice the mean of a chi-square population, the mean square error and the consistency are studied for a given

estimator.

Exercise 8pe-p

> [Keywords] point estimations, mean square error, relative efficiency.

> [Description] For a sample of size two, the mean square errors of two given estimators are calculated and compared by using the

relative efficiency.

Exercise 9pe-p

> [Keywords] point estimations, sample mean, mean square error, consistency, efficiency (under normality), Cramér-Rao's lower

bound.

> [Description] That the sample mean is always a consistent estimator of the population mean is proved. When the population is

normally distributed, this estimator is also efficient.

Exercise 10pe-p

> [Keywords] point estimations, (continuous) uniform distribution, probability function, sample mean, consistency, efficiency,

unbiasedness.

> [Description] For a population variable following the continuous uniform distribution, the density function is plotted. The

consistency and the efficiency of the sample mean, as an estimator of the population mean, are studied. Looking at the bias obtained, a

new unbiased estimator of the population mean is built, and its consistency is proved.

Exercise 11pe-p

> [Keywords] point estimations, geometric distribution, sufficiency, likelihood function, factorization theorem.

> [Description] When a population variable follows the geometric distribution, a (minimum-dimension) sufficient statistic for studying

the parameter is found by applying the factorization theorem.

Exercise 12pe-p (*)

> [Keywords] point estimations, basic estimators, population mean, Bernoulli distribution, population proportion, normality,

population variance, mean square error, consistency, rate of convergence.

> [Description] The mean square error is calculated for all basic estimators of the mean, the proportion (for Bernoulli populations) and

the variance (for normal populations). Then, their consistencies in mean of order two and in probability are studied. For two

populations, the two-variable limits that appear are studied by splitting them into two one-variable limits or by bounding them.

Exercise 13pe-p (*)

> [Keywords] point estimations, basic estimators, normality, population variance, mean square error, consistency, rate of convergence.

> [Description] For the basic estimators of the variance of normal populations, the mean square errors are compared for one and two

populations. The computer is used to compare graphically the coefficients that appear in the expression of the mean square errors.

Besides, the consistency is also graphically studied.

Exercise 14pe-p (*)

> [Keywords] point estimations, Bernoulli distribution, normal distribution, mean square error, consistency, pooled sample proportion,

pooled sample variance, rate of convergence.

> [Description] The mean square error is calculated for some pooled estimators of the proportion (for Bernoulli populations) and the

variance (for normal populations). Then, their consistencies in mean of order two and in probability are studied. For pooled estimators,

one sample size tending to infinity suffices, that is, one sample can “do the whole work”. Each pooled estimator—for the proportion of

a Bernoulli population and for the variance of a normal population—is compared with the “natural” estimator consisting in the

semisum of the estimators of the two populations. The computer is also used to compare graphically the coefficients that appear in the

expression of the mean square errors. The consistency can be studied graphically.

Methods and Properties

Exercise 1pe

> [Keywords] point estimations, method of the moments, mean square error, consistency, maximum likelihood method.

> [Description] Given the density function of a population variable, the method of the moments is applied to find an estimator of the

parameter; the mean square error of this estimator is calculated; finally, its consistency is studied. On the other hand, the maximum

likelihood method is applied too; the maximum cannot be found by using the derivatives and some qualitative reasoning is necessary.

A simple analytical calculation suffices to see how the likelihood function depends upon the parameter. The two methods provide

different estimators.

Exercise 2pe

> [Keywords] point estimations, Rayleigh distribution, method of the moments, mean square error, consistency, maximum likelihood

method.

> [Description] Supposing a population variable that follows the Rayleigh distribution, the method of the moments is applied to build an

estimator of the parameter; the mean square error of this estimator is calculated and its consistency is studied. The maximum likelihood

method is also applied to build an estimator of the parameter. For this population distribution, both methods provide different

estimators. As a mathematical exercise, the expressions of the mean and the variance are calculated.

Exercise 3pe

> [Keywords] point estimations, exponential distribution, method of the moments, maximum likelihood method, sufficiency,

likelihood function, factorization theorem, sample mean, efficiency, consistency, plug-in principle.

> [Description] A deep statistical study of the exponential distribution is carried out. To estimate the parameter, two estimators are

obtained by applying both the method of the moments and the maximum likelihood method. For this population distribution, both

methods provide the same estimator. A sufficient statistic is found. The sample mean is studied as an estimator of the parameter and the

inverse of the parameter. In this exercise, it is highlighted how important the mathematical notation may be in doing calculations.

Methods for Estimating

Exercise 1ci-m

> [Keywords] confidence intervals, method of the pivot, asymptoticness, normal distribution, margin of error.

> [Description] The method of the pivot is applied twice to construct asymptotic confidence intervals for the mean and the standard

deviation of a normally distributed population variable with unknown mean and variance. For the first interval, the expression of the

margin of error is used to obtain the confidence when the length of the interval is one unit.

Exercise 2ci-m

> [Keywords] confidence intervals, method of the pivot, asymptoticness, normal distribution, margin of error.

> [Description] The method of the pivot is applied to construct an asymptotic confidence interval for the mean of a population variable

with unknown variance. There was a previous estimate of the mean that is inside the interval obtained. The value of the margin of error

is explicitly given.

Exercise 3ci-m

> [Keywords] confidence intervals, method of the pivot, Bernoulli distribution, asymptoticness.

> [Description] The method of the pivot is applied to construct an asymptotic confidence interval for the proportion of a population

variable following the Bernoulli distribution.

Exercise 4ci-m

> [Keywords] confidence intervals, asymptoticness, method of the pivot, Bernoulli distribution, pooled sample proportion.

> [Description] A confidence interval for the difference between two proportions is constructed by applying the method of the pivot.

The interval allows us to make a decision about the equality of the proportions, which is equivalent to applying a two-tailed hypothesis

test. As an advanced task, the exercise is repeated with the pooled sample proportion in the denominator of the statistic (estimation of

the variances of the populations), not in the numerator (estimation of the difference between the means).

Minimum Sample Size

Exercise 1ci-s

> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's

inequality.

> [Description] To find the minimum number of data necessary to guarantee theoretically the precision desired, two methods are

applied: one based on the expression of the margin of error and the other based on Chebyshev's inequality.

Methods and Sample Size

Exercise 1ci

> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's

inequality.

> [Description] A confidence interval for the mean of a normal population is built by applying the method of the pivotal quantity. The

dependence of the length of the interval on the confidence is analysed qualitatively. Given all the other quantities, the minimum

sample size is calculated in two different ways: with the method based on the expression of the margin of error and the method based

on Chebyshev's inequality.

Exercise 2ci

> [Keywords] confidence intervals, minimum sample size, asymptoticness, normal distribution, method of the pivot, margin of error,

Chebyshev's inequality.

> [Description] An asymptotic confidence interval for the mean of a population random variable is constructed by applying the method

of the pivotal quantity. The equivalent exact confidence interval can be obtained under the supposition that the variable is normally

distributed. Given all the other quantities, the minimum sample size is calculated in two different ways: with the method based on the

expression of the margin of error and the method based on Chebyshev's inequality.

Exercise 3ci

> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's

inequality.

> [Description] A confidence interval for the mean of a normal population is built by applying the method of the pivotal quantity.

Given all the other quantities, the minimum sample size is calculated in two different ways: with the method based on the expression of

the margin of error and the method based on Chebyshev's inequality. The dependence of the length of the interval upon the

confidence is analysed qualitatively.

Exercise 4ci

> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's

inequality.

> [Description] The method of the pivot allows us to construct a confidence interval for the difference between the means of two

(independent) normal populations. Given the other quantities and supposing equal sample sizes, the minimum value is calculated by

applying two different methods: one based on the expression of the margin of error and the other based on Chebyshev's inequality.

Parametric

Based on T

Exercise 1ht-T

> [Keywords] hypothesis tests, normal distribution, two-tailed test, population mean, critical region, p-value, type I error, type II

error, power function.

> [Description] A decision on the equality of the population mean (of a variable) to a given number is made by applying a two-sided test and looking at both the critical values and the p-value. The two types of error are determined. With the help of a computer, the power function is plotted.

Exercise 2ht-T

> [Keywords] hypothesis tests, normal population, one-tailed test, population standard deviation, critical region, p-value, type I

error, type II error, power function.

> [Description] A decision on whether the population standard deviation (of a variable) is smaller than a given number is made by

applying a one-tailed test and looking at both the critical values and the p-value. The expression of the type II error is found. With

the help of a computer, the power function is plotted. Qualitative analysis on the form of the alternative hypothesis is done. The

assumption that the population variable follows the normal distribution is necessary to apply the results for studying the variance.

Exercise 3ht-T

> [Keywords] hypothesis tests, normal population, one- and two-tailed tests, population variance, critical region, p-value, type I

error, type II error, power function.

> [Description] The equality of the population variance (of a variable) to a given number is tested by considering both one- and

two-tailed alternative hypotheses. Decisions are made after looking at both the critical values and the p-value. In the two cases, the

expression of the type II error is found and the power function is plotted with the help of a computer. The power functions are

graphically compared, and the figure shows that the one-sided test is uniformly more powerful than the two-sided test.

Exercise 4ht-T

> [Keywords] hypothesis tests, normal population, one- and two-tailed tests, population variance, critical region, p-value, type I

error, type II error, power function, statistical cook.

> [Description] From the hypotheses of a one-sided test on the population variance (of a variable), different ways are qualitatively

and quantitatively considered for the opposite decision to be made.

Exercise 5ht-T

> [Keywords] hypothesis tests, normal populations, one- and two-tailed tests, population standard deviation, critical region,

p-value, type I error, type II error, power function.

> [Description] A decision on whether the population standard deviation (of a variable) is equal to a given value is made by

applying three possible alternative hypotheses and looking at both the critical values and the p-value. The type II error is calculated

and the power function is plotted. The power functions are graphically compared: the figure shows that the one-sided tests are

uniformly more powerful than the two-sided test.

Exercise 6ht-T

> [Keywords] hypothesis tests, Bernoulli populations, one-tailed tests, population proportion, critical region, p-value, type I error,

type II error, power function.

> [Description] A decision on whether the population proportion is higher in one population is made after allocating this inequality

in the null hypothesis, firstly, and the alternative hypothesis, secondly. Two methodologies are considered, one based on the critical

values and the other based on the p-value. In both tests, the type II error is calculated and the power function is plotted. The

symmetry of the power functions of the two cases is highlighted. As an advanced section, the pooled sample proportion is used to

estimate the variance of the populations (in the denominator of the statistic), but not to estimate the difference between the

population proportions (in the numerator of the statistic).

Based on Λ

Exercise 1ht-Λ

> [Keywords] hypothesis tests, Neyman-Pearson's lemma, likelihood ratio test, critical region, Poisson distribution, exponential

distribution, Bernoulli distribution, normal distribution.

> [Description] The critical region is theoretically studied for the null hypothesis that a parameter of the distribution equals a given value against four different alternative hypotheses. The form of the region is related to the maximum likelihood estimator.

Analysis of Variance (ANOVA)

Exercise 1ht-av

> [Keywords] hypothesis tests, normal populations, analysis of variance, critical region, p-value, type I error, type II error.

> [Description] The analysis of variance is applied to test whether the means of three independent normal populations—whose

variances are supposed to be equal—are the same. Calculations are repeated three times with different levels of “manual work”.

Nonparametric

Exercise 1ht-np

> [Keywords] hypothesis tests, chi-square tests, independence tests, critical region, p-value, type I error, table of frequencies.

> [Description] The independence between two qualitative variables or factors is tested by applying the chi-square statistic.

Exercise 2ht-np

> [Keywords] hypothesis tests, chi-square tests, goodness-of-fit tests, critical region, p-value, type I error, table of frequencies.

> [Description] The goodness-of-fit to the whole Poisson family, firstly, and to a member of the Poisson distribution family, secondly,

is tested by applying the chi-square statistic. The importance of using the sample information, instead of poorly justified assumptions,

is highlighted when the results of both sections are compared.

Exercise 3ht-np

> [Keywords] hypothesis tests, chi-square tests, goodness-of-fit tests, independence tests, homogeneity tests, critical region, p-value,

type I error, table of frequencies.

> [Description] Just the same table of frequencies is looked at as coming from three different scenarios. Chi-square goodness-of-fit,

independence and homogeneity tests are respectively applied.

Parametric and Nonparametric

Exercise 1ht

> [Keywords] hypothesis tests, Bernoulli distribution, goodness-of-fit chi-square test, position signs test, critical region, p-value, type I

error, type II error, power function, table of frequencies.

> [Description] Just the same problem is dealt with by considering three different approaches: one parametric test and two kinds of

nonparametric test. In this case, the same decision is made.

PE – CI – HT

Exercise 1pe-ci-ht

> [Keywords] point estimations, confidence intervals, method of the pivot, normal distribution, t distribution, pooled sample variance.

> [Description] The probability of an event involving the difference between the means of two independent normal populations is

calculated with and without the supposition that the variances of the populations are the same. The method of the pivot is applied to

construct a confidence interval for the quotient of the standard deviations.

Exercise 2pe-ci-ht

> [Keywords] confidence intervals, point estimations, normal distribution, method of the pivot, probability, pooled sample variance.

> [Description] For the difference of the means of two (independent) normally distributed variables, a confidence interval is

constructed by applying the method of the pivotal quantity. Since the equality of the means is included in a high-confidence interval,

the pooled sample variance is considered in calculating a probability involving the difference of the sample means.

Exercise 3pe-ci-ht

> [Keywords] hypothesis tests, confidence intervals, Bernoulli populations, one-tailed tests, population proportion, critical region,

p-value, type I error, type II error, power function, method of the pivot.

> [Description] A decision on whether the population proportion in one population is smaller than or equal to that in the other is made by looking at both the critical values and the p-value. The type II error is calculated and the power function is plotted. By applying the method of

the pivot, a confidence interval for the difference of the population proportions is built. This interval can be seen as the acceptance

region of the equivalent two-sided hypothesis test. In this case, the same decision is made with the test and with the interval.

Exercise 4pe-ci-ht

> [Keywords] point estimations, hypothesis tests, standard power function density, method of the moments, maximum likelihood

method, plug-in principle, Neyman-Pearson's lemma, likelihood ratio tests, critical region.

> [Description] Given the probability function of a population random variable, estimators are built by applying both the method of

the moments and the maximum likelihood method. Then, the plug-in principle allows us to obtain estimators for the mean and the

variance of the distribution of the variable. In testing the equality of the parameter to a given value, the form of the critical region is

theoretically studied when four different types of alternative hypothesis are considered.

Additional Exercises (Solved, but neither ordered by difficulty, described, nor referred to in the final index.)

Appendixes

Probability Theory (PT)

Some Reminders

Markov's Inequality. Chebyshev's Inequality

Probability and Moments Generating Functions. Characteristic Function.

Exercise 1pt

> [Keywords] probability, quantile, probability tables, probability function, binomial distribution, Poisson distribution, uniform

distribution, normal distribution, chi-square distribution, t distribution, F distribution.

> [Description] For each of these distributions, the probability of a simple event is calculated both by using probability tables and by

using the mass function, or, on the contrary, a quantile is found by using the probability tables or a statistical software program.

Exercise 2pt

> [Keywords] probability, normal distribution, total sum, sample mean, completion (standardization).

> [Description] For a quantity that follows the normal distribution with known parameters, the probability of an event involving the

quantity is calculated after properly completing the two sides of the inequality, that is, after properly rewriting the event.

Exercise 3pt (*)

> [Keywords] probability, Bernoulli distribution, binomial distribution, geometric distribution, Poisson distribution, exponential

distribution, normal distribution, raw or crude population moments, series, integral, probability generating function, moment generating

function, characteristic function, differential equation, integral equation, complex analysis.

> [Description] For the distributions mentioned, the first two raw or crude population moments are calculated by using as many ways

as possible. Their level of difficulty is different, but the aim is to practice. Some calculations require strong mathematical justifications.

Several interesting analytical techniques are used: changing the order of summation in series, using Taylor series, characterizing a

function through a differential or integral equation, et cetera.

Mathematics (M)

Some Reminders

Limits

Exercise 1m (*)

> [Keywords] real analysis, integral, exponential function, bound, Fubini's theorem, integration by substitution, multiple integrals, polar coordinates.

> [Description] It is well known that the function exp(−x²) has no elementary antiderivative. The definite integral is calculated in three cases that appear frequently, e.g. when working with the density function of the normal or the Rayleigh distributions. By applying Fubini's theorem for improper integrals, calculations are translated to the two-dimensional real space, where polar coordinates are used to solve the multiple integral easily.

Exercise 2m

> [Keywords] real analysis, limits, sequence, indeterminate forms

> [Description] Several limits of one-variable sequences, similar to those necessary for other exercises, are calculated.

Exercise 3m (*)

> [Keywords] real analysis, limits, sequence, indeterminate forms, polar coordinates.

> [Description] Several limits of two-variable sequences, similar to those necessary for other exercises, are calculated.

Exercise 4m (*)

> [Keywords] algebra, geometry, real analysis, linear transformation, rotation, movement, frontier, rectangular coordinates.

> [Description] Several approaches are used to find the frontier and the regions determined by a discrete relation in the plane.

References

Tables of Statistics

> [Keywords] estimators, statistics T, parametric tests, likelihood ratio, analysis of variance (ANOVA), nonparametric tests, chi-square tests, Kolmogorov-Smirnov tests, runs test (of randomness), signs test (of position), Wilcoxon signed-rank test (of position).

> [Description] The statistics applied in the exercises are tabulated in this appendix. Some theoretical remarks are included.

Probability Tables

> [Keywords] normal distribution, t distribution, chi-square distribution, F distribution.

> [Description] A probability table with the most frequently used values is included for each of the distributions abovementioned.

Index

Inference Theory

[IT] Framework and Scope of the Methods

Populations

[Ap1] When the entire populations can be studied, no inference is needed. Thus, here we suppose that we do not have such total knowledge.

[Ap2] Populations will be supposed to be independent—matched or paired data must be treated in a

slightly different way.

Samples

[As1] Sample sizes are supposed to be much smaller than population sizes—a correction factor is not necessary for these (effectively) infinite populations.

[As2] At the same time, we consider either any amount of normally distributed data or many data

(large samples) from any distribution.

[As3] Data will be supposed to have been selected randomly, with the same probability and

independently; that is, by applying simple random sampling.

Methods

[Am1] Before applying inferential methods, data should be analysed to guarantee that nothing strange

will spoil the inference—we suppose that such descriptive analysis and data treatment have been done.

[Am2] We are able to learn only linearly, but in practice methods need not be applied in the order in

which they are presented here—e.g. nonparametric hypothesis tests to check assumptions before

applying parametric methods.

Partial Knowledge and Randomness

The partial knowledge mentioned in the previous section has crucial consequences. The use of only some elements of the population (we can only hypothesize about the other elements) implies that variables must be assigned a random character, on the one hand, and that results will not have total certainty, in the sense that statements will be set with some probability, on the other hand. For example: a 95% confidence in applying a method must be interpreted as any other probability: the results are true with probability 0.95 and false with probability 1−0.95 (frequently, we will never know whether the method has failed or not). See remark 1pt, in the appendix of Probability Theory, on the interpretation of the concept of probability.

In Probability Theory, random variables are dimensionless quantities; in real-life problems, variables almost always are not. Since this fact does not usually cause trouble in Statistics, we do not pay much attention to the units of measurement, and we can understand that the magnitude of the real-life variable, with no unit of measurement, is the part that is being modeled by using the proper probability distribution with the proper parameter values (of course, units of measurement are not random). To get used to paying attention to the units of measurement and to managing them, they have been written in most numerical expressions.

Regarding the interpretation of the whole statistical processes that we will apply, either to practice their use or to solve particular real-world problems, we highlight the main points on which results are usually based:

(i) Assumptions.

(ii) The method applied, including particular details of its steps, mathematical theorems, statistic T, etc.

(iii) Certainty with which the method is applied: probability, confidence or significance.

(iv) The data available.

In Statistics, results may change severely when assumptions are actually false, another method is applied, a different certainty is considered, or the data carry no proper information (quantity, quality, representativity, etc.). Throughout this document, we do insist on the cautions that statisticians and readers of statistical works must take in interpreting results. Even if you are not interested in “statistically cooking” data, you had better know the recipes... (Some of them have been included in the notes mentioned in the prologue.)

Let X = (X1,...,Xn) be the data from a population. The information they contain is extracted and used

through appropriate mathematical functions: estimators and statistics. When applying the methods,

since we usually need to calculate a probability or to find a quantile, expressions must be written in

terms of those appropriate quantities whose sampling distribution is known.

In trying to make estimators or statistics appear, some Mathematics is needed. We do not repeat it whenever it is applied in this document. For example, standardization is a strictly increasing transformation that does not change inequalities when it is applied to both sides, or the positive branch of the square root must be considered to work with population or sample variances and standard deviations (these concepts are nonnegative by definition, while the square root is a general mathematical tool applied to this particular situation). As an example of those mathematical explanations not repeated again and again, we include the following:

Remark: Since variances are nonnegative by definition and the positive branch of the square root function is strictly increasing, it holds that σ_X² = σ_Y² ↔ σ_X = σ_Y (similarly for inequalities). For general numbers a and b, it holds only that a² = b² ↔ |a| = |b|. From a strict mathematical point of view, for the standard deviation we should write σ = +√(σ²) = |√(σ²)|.

Finally, at the end of the theoretical part of the exercises, we do not insist that, in practice, a sample (X1,...,Xn) would be used by entering its values in the theoretical expressions obtained as a solution. Estimators and statistics are random quantities until specific data are used.

Useful Questions

When working out the answer, users may find it useful to ask themselves:

On the Populations

● Are their probability distributions known?

On the Samples

● If populations are not normally distributed, are the sample sizes large enough to apply asymptotic

results?

● Do we know the data themselves, or only some quantities calculated from them?

On the Assumptions

● Should it be checked for the populations: the random character, the independence of the populations,

the goodness-of-fit to the supposed models, the homogeneity between the populations, et cetera?

● Should it be checked for the samples: the within-sample randomness and independence, the between-samples independence, et cetera?

● Are there other assumptions (neither mathematical nor statistical)?

● Concretely, what is the statistical problem: point estimation, confidence interval, hypothesis test, etc?

● Which are the estimators, the statistics and the methods that will be applied?

On the Quantities

● Which are the units of measurement? Are all the units equal?

● How large are the magnitudes? Do they seem reasonable? Are all of them coherent (variability is

positive, probabilities and relative frequencies are between 0 and 1, etc)?

On the Interpretation

● How is the statistical solution interpreted in the framework of the problem we are working on?

● Do the qualitative results seem reasonable (as expected)?

● Do the quantities seem reasonable (signs, order of magnitude, etc)?

They may want to consult some other pieces of advice that we have written in Guide for Students of Statistics,

available at http://www.casado-d.org/edu/GuideForStudentsOfStatistics-Slides.pdf.

Remark 1it: The notation and the expression of the most basic estimators, for one population, are

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \hat{\eta} = \frac{\sum_{i=1}^{n} X_i}{n}$$

$$V^2 = \frac{1}{n}\sum_{j=1}^{n}(X_j-\mu)^2 \qquad s^2 = \frac{1}{n}\sum_{j=1}^{n}(X_j-\bar{X})^2 \qquad S^2 = \frac{1}{n-1}\sum_{j=1}^{n}(X_j-\bar{X})^2$$

For two populations, other basic estimators are made with these:

$$\bar{X}-\bar{Y} \qquad \hat{\eta}_X-\hat{\eta}_Y \qquad \frac{V_X^2}{V_Y^2} \qquad \frac{s_X^2}{s_Y^2} \qquad \frac{S_X^2}{S_Y^2}$$

Finally, all these estimators are used to make statistics whose sampling distribution is known.
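As a computational note, here is a minimal R sketch of the one-population estimators above; the data vector and the value of μ are hypothetical (V² requires the population mean to be known):

# Hypothetical data; mu = 2 is assumed known only for V^2
x = c(1, 3, 2, 2, 3)
n = length(x)
mu = 2
Xbar = mean(x) # sample mean
V2 = sum((x - mu)^2)/n # uses the known population mean
s2 = sum((x - Xbar)^2)/n # sample variance (denominator n)
S2 = var(x) # sample quasivariance (denominator n - 1)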

Exercise 1it-spd

Given a population (variable) X following the probability distribution determined by the following values and probabilities

Value x:        1    2    3
Probability p:  3/9  1/9  5/9

Determine:

(a) The joint probability distribution of the sample X = (X1,X2)

(b) The sampling probability distribution of the sample mean X̄

(Based on an exercise of the materials in Spanish prepared by my workmates.)

Discussion: The distribution of X is totally determined, since we know all the information necessary to calculate any quantity—e.g. the mean:

$$\mu_X = E(X) = \sum_{\Omega} x_j \cdot P_X(x_j) = \sum_{\{1,2,3\}} x_j \cdot p_j = 1\cdot\frac{3}{9} + 2\cdot\frac{1}{9} + 3\cdot\frac{5}{9} = \frac{20}{9} = 2.222222$$

Instead of a table, a function is sometimes used to provide the values and the probabilities—the mass or density function. We can represent this function with the computer:

values = c(1, 2, 3)

probabilities = c(3/9, 1/9, 5/9)

plot(values, probabilities, type='h', xlab='Value', ylab='Probability', ylim=c(0,1), main= 'Mass Function', lwd=7)

The sampling probability distribution of X̄ is determined once we give the possible values and the probabilities with which they can be taken. Before doing that, we describe the probability distribution of the random vector X = (X1,X2).

Since the Xj are independent in any simple random sample, the probability that X = (X1,X2) takes the value x1 = (1,1), for example, is calculated as follows (note the intersection):

$$f_X(1,1) = P_X(1,1) = P_X(\{X_1=1\}\cap\{X_2=1\}) = P_X(X_1=1)\cdot P_X(X_2=1) = \frac{3}{9}\cdot\frac{3}{9} = \frac{1}{9}$$

To fill in the following table, the other probabilities are calculated in the same way.

Joint Probability Distribution of (X1,X2)

Value (x1,x2):           (1,1)  (1,2)  (1,3)  (2,1)  (2,2)  (2,3)  (3,1)  (3,2)  (3,3)
Probability of (x1,x2):  1/9    1/27   5/27   1/27   1/81   5/81   5/27   5/81   25/81

(Each probability is the product of the two marginal probabilities; e.g. P((1,2)) = (3/9)·(1/9) = 1/27.)

Notice that (1,3) and (3,1), for example, contain the same information. The values and their probabilities can be given by extension (table or figure) or by comprehension (function).

## Install this package if you don't have it (run the following line without #)
# install.packages('scatterplot3d')
valuesX1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
valuesX2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
probabilities = c(1/9, 1/27, 5/27, 1/27, 1/81, 5/81, 5/27, 5/81, 25/81)
library('scatterplot3d') # To load the package
scatterplot3d(valuesX1, valuesX2, probabilities, type='h', xlab='Value X1', ylab='Value X2', zlab='Probability', xlim=c(0, 4), ylim=c(0, 4), zlim=c(0, 1), main='Mass Function', lwd=7)

We can check that the probabilities add up to one:

$$\sum_{\Omega} f_X(x_j) = \frac{1}{9}+\frac{1}{27}+\frac{5}{27}+\frac{1}{27}+\frac{1}{81}+\frac{5}{81}+\frac{5}{27}+\frac{5}{81}+\frac{25}{81} = \frac{9+3+15+3+1+5+15+5+25}{81} = \frac{81}{81} = 1$$

From the information in the table it is possible to calculate any quantity—e.g. the first-order joint moment:

$$\mu_{1,1} = E(X_1\cdot X_2) = \sum_{\Omega} x_1 x_2 \cdot f_X(x_1,x_2) = 1\cdot 1\cdot\frac{1}{9} + 1\cdot 2\cdot\frac{1}{27} + \cdots + 3\cdot 2\cdot\frac{5}{81} + 3\cdot 3\cdot\frac{25}{81} = 4.938272$$

(b) Sampling probability distribution of the sample mean

The sample mean X̄(X) = X̄(X1,X2) is a random quantity, since so are X1 and X2. Each pair of values (x1,x2) of (X1,X2) gives a value x̄ for X̄; on the contrary, a value x̄ of X̄ can correspond to different pairs of values (x1,x2). Then, we will fill in a table with all the values and merge those that are equal. For example:

$$\bar{X}(1,1) = \frac{1+1}{2} = \frac{2}{2} = 1$$

The other values x̄ of X̄ are calculated in the same way to fill in the following table:

Value (x1,x2):           (1,1)  (1,2)  (1,3)  (2,1)  (2,2)  (2,3)  (3,1)  (3,2)  (3,3)
Probability of (x1,x2):  1/9    1/27   5/27   1/27   1/81   5/81   5/27   5/81   25/81
Value x̄ of X̄:            1      3/2    2      3/2    2      5/2    2      5/2    3

The sample mean X̄ can take five different values, while (X1,X2) could take nine different possible values (x1,x2). Thus, the probability for X̄ to take the value 2, for example, is calculated as follows (note the union):

$$P_{\bar{X}}(2) = P_X(\{(1,3)\}\cup\{(2,2)\}\cup\{(3,1)\}) = P_X((1,3)) + P_X((2,2)) + P_X((3,1)) = \frac{5}{27}+\frac{1}{81}+\frac{5}{27} = \frac{31}{81}$$

In the same way,

$$P_{\bar{X}}(1) = P_X(\{(1,1)\}) = \frac{1}{9}$$

$$P_{\bar{X}}\!\left(\tfrac{3}{2}\right) = P_X(\{(1,2)\}\cup\{(2,1)\}) = P_X(\{(1,2)\}) + P_X(\{(2,1)\}) = \frac{1}{27}+\frac{1}{27} = \frac{2}{27}$$

$$P_{\bar{X}}\!\left(\tfrac{5}{2}\right) = P_X(\{(2,3)\}\cup\{(3,2)\}) = P_X(\{(2,3)\}) + P_X(\{(3,2)\}) = \frac{5}{81}+\frac{5}{81} = \frac{10}{81}$$

$$P_{\bar{X}}(3) = P_X(\{(3,3)\}) = \frac{25}{81}$$

Then, the sampling probability distribution of the sample mean X̄ is determined, in this case, by

Probability Distribution of X̄

Value x̄:            1     3/2   2      5/2    3
Probability of x̄:   1/9   2/27  31/81  10/81  25/81

We can check that the total sum of probabilities is equal to one:

$$\sum_{\Omega} P_{\bar{X}}(\bar{x}_j) = \frac{1}{9}+\frac{2}{27}+\frac{31}{81}+\frac{10}{81}+\frac{25}{81} = \frac{9+6+31+10+25}{81} = \frac{81}{81} = 1$$

From the information in the table above it is possible to calculate any quantity—e.g. the mean:

$$\mu_{\bar{X}} = E(\bar{X}) = \sum_{\Omega} \bar{x}_j \cdot P_{\bar{X}}(\bar{x}_j) = 1\cdot\frac{1}{9} + \frac{3}{2}\cdot\frac{2}{27} + 2\cdot\frac{31}{81} + \frac{5}{2}\cdot\frac{10}{81} + 3\cdot\frac{25}{81} = \frac{9+9+62+25+75}{81} = \frac{180}{81} = 2.222222$$

It is worth noticing that this value is equal to the value that we obtained at the beginning, which agrees with the well-known theoretical property:

$$\mu_{\bar{X}} = E(\bar{X}) = E(X) = \mu_X$$

Values and probabilities can also be provided by using a function—the mass or density function, which can be represented with the help of a computer:

values = c(1, 3/2, 2, 5/2, 3)

probabilities = c(1/9, 2/27, 31/81, 10/81, 25/81)

plot(values, probabilities, type='h', xlab='Value', ylab='Probability', ylim=c(0,1), main= 'Mass Function', lwd=7)
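As a numerical check, a minimal R sketch that simply automates the enumeration done above (same distribution, sample size two):

# Enumerate all pairs (x1,x2), use independence for the joint probabilities,
# and add up the probabilities of the pairs sharing the same sample mean
values = c(1, 2, 3)
probabilities = c(3/9, 1/9, 5/9)
pairs = expand.grid(x1=values, x2=values)
jointProb = probabilities[pairs$x1] * probabilities[pairs$x2] # indexing by value works because the values are 1, 2, 3
tapply(jointProb, (pairs$x1 + pairs$x2)/2, sum) # sampling distribution of the sample mean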

Conclusion: For a simple distribution for X and a small sample X = (X1,X2), we have written both the joint probability distribution of the sample X and the sampling distribution of X̄. This helps us to understand the concept of the sampling distribution of any random quantity (not only the sample mean), whether or not we are able to write it down or even to know it (e.g. thanks to a theorem).

My notes:

Point Estimations

[PE] Methods for Estimating

Remark 1pe: When necessary, the expectations E(X) and E(X²) are usually given in the statement; once E(X) is given, either Var(X) or E(X²) can equivalently be given, since Var(X) = E(X²) − E(X)². If not given, these expectations can be calculated from their definitions by summing or integrating, for discrete and continuous variables, respectively (this is sometimes an advanced mathematical exercise).
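For instance, a minimal R sketch checking this identity numerically for the discrete distribution of Exercise 1it-spd:

# Check Var(X) = E(X^2) - E(X)^2 for a discrete distribution
values = c(1, 2, 3)
probabilities = c(3/9, 1/9, 5/9)
EX = sum(values * probabilities) # E(X)
EX2 = sum(values^2 * probabilities) # E(X^2)
EX2 - EX^2 # Var(X)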

Remark 2pe: If the method of the moments is used to estimate m parameters (frequently 1 or 2), the first m equations of the system usually suffice; nevertheless, if not all the parameters appear in the first-order moments of X, the smallest m moments—and equations—for which the parameters appear must be considered. For example, if μ₁ = 0 or if the interest relies directly on σ² because μ is known, the first-order equation μ₁ = μ = E(X) = m₁ does not involve σ and hence the second-order equation μ₂ = E(X²) = Var(X) + E(X)² = σ² + μ² = m₂ must be considered instead.

Remark 3pe: When looking for local maxima or minima of differentiable functions, the first-order derivatives are set equal to zero. After that, to discriminate between maxima and minima, the second-order derivatives are studied. For most of the functions we will work with, this second step can be solved by applying some qualitative reasoning on the sign of the quantities involved and the possible values of the data xᵢ. When this does not suffice, the values found in the first step, say θ₀, must be substituted in the expression of the second step. On the other hand, global maxima and minima cannot in general be found by using the derivatives, and some qualitative reasoning must be applied. It is important to highlight that, in applying the maximum likelihood method, the purpose is to find the maximum, whichever the mathematical way.

Exercise 1pe-m

If X is a population variable that follows a binomial distribution of parameters κ and η, and X = (X1,...,Xn) is a simple random sample:

(a) Apply the method of the moments to obtain an estimator of the parameter η.

(b) Apply the maximum likelihood method to obtain an estimator of the parameter η.

(c) When κ = 10 and x = (x1,...,x5) = (4, 4, 3, 5, 6), use the estimators obtained in the two previous sections to construct final estimates of the parameter η and the measures μ and σ².

Hint: (i) In the first two sections, treat the parameter κ as if it were known. (ii) In the likelihood function, join the combinatorial terms into a product; this product does not depend on the parameter η and hence its derivative will be zero.

Discussion: This statement is mathematical, although in the last section we are given some data to be substituted. In practice, the use of the binomial to explain X should be supported. The variable X is dimensionless. For the binomial distribution, E(X) = κ·η and Var(X) = κ·η·(1−η). (See the appendixes for how the mean and the variance of this distribution can be calculated.) In particular, the results obtained here can be applied to the Bernoulli distribution with κ = 1.

(a1) Population and sample moments: The probability distribution has two parameters originally, but we have to study only one. The first-order moments are

$$\mu_1(\eta) = E(X) = \kappa\cdot\eta \qquad\text{and}\qquad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}$$

(a2) System of equations: Since the parameter of interest η appears in the first-order population moment of X, the first equation is enough to apply the method:

$$\mu_1(\eta) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \kappa\cdot\eta = \frac{1}{n}\sum_{j=1}^{n}x_j = \bar{x} \;\rightarrow\; \eta = \frac{\bar{x}}{\kappa}$$

$$\hat{\eta}_M = \frac{\bar{X}}{\kappa}$$

(b1) Likelihood function: For the binomial distribution, the mass function is

$$f(x;\kappa,\eta) = \binom{\kappa}{x}\eta^{x}(1-\eta)^{\kappa-x}$$

We are interested only in η, so

$$L(x_1,\ldots,x_n;\eta) = \prod_{j=1}^{n} f(x_j;\eta) = \binom{\kappa}{x_1}\eta^{x_1}(1-\eta)^{\kappa-x_1}\cdots\binom{\kappa}{x_n}\eta^{x_n}(1-\eta)^{\kappa-x_n} = \left[\prod_{j=1}^{n}\binom{\kappa}{x_j}\right]\eta^{\sum_{j=1}^{n}x_j}(1-\eta)^{n\kappa-\sum_{j=1}^{n}x_j}$$

(b2) Optimization problem: The logarithm function is applied to facilitate the calculations,

$$\log[L(x_1,\ldots,x_n;\eta)] = \log\left[\prod_{j=1}^{n}\binom{\kappa}{x_j}\right] + \left(\sum_{j=1}^{n}x_j\right)\log(\eta) + \left(n\kappa-\sum_{j=1}^{n}x_j\right)\log(1-\eta)$$

To discover the local or relative extreme values, the necessary condition is

$$0 = \frac{d}{d\eta}\log[L] = 0 + \left(\sum_{j=1}^{n}x_j\right)\frac{1}{\eta} + \left(n\kappa-\sum_{j=1}^{n}x_j\right)\frac{-1}{1-\eta} \;\rightarrow\; \frac{n\kappa-\sum_{j=1}^{n}x_j}{1-\eta} = \frac{\sum_{j=1}^{n}x_j}{\eta}$$

$$\rightarrow\; \eta\, n\kappa - \eta\sum_{j=1}^{n}x_j = \sum_{j=1}^{n}x_j - \eta\sum_{j=1}^{n}x_j \;\rightarrow\; \eta\, n\kappa = \sum_{j=1}^{n}x_j \;\rightarrow\; \eta_0 = \frac{\sum_{j=1}^{n}x_j}{n\kappa} = \frac{1}{\kappa}\,\frac{1}{n}\sum_{j=1}^{n}x_j = \frac{\bar{x}}{\kappa}$$

To verify that the only candidate is a local or relative maximum, the sufficient condition is

$$\frac{d^2}{d\eta^2}\log[L] = \left(\sum_{j=1}^{n}x_j\right)\frac{-1}{\eta^2} - \left(n\kappa-\sum_{j=1}^{n}x_j\right)\frac{-1}{(1-\eta)^2}(-1) = -\frac{\sum_{j=1}^{n}x_j}{\eta^2} - \frac{n\kappa-\sum_{j=1}^{n}x_j}{(1-\eta)^2} < 0$$

since κ ≥ xⱼ and therefore $n\kappa \geq \sum_{j=1}^{n}x_j$, that is, $n\kappa - \sum_{j=1}^{n}x_j \geq 0$. This holds for any value, including η₀.

$$\hat{\eta}_{ML} = \frac{\bar{X}}{\kappa}$$

(c) Estimation of η, μ and σ²

From the method of the moments: η̂_M = x̄/κ = (1/10)·((4+4+3+5+6)/5) = 0.44.

From the maximum likelihood method, as the same estimator was obtained: η̂_ML = 0.44.

Since μ = E(X) = κ·η, an estimator of η induces an estimator of μ by applying the plug-in principle:

From the method of the moments: μ̂_M = κ·η̂_M = 4.4.

From the maximum likelihood method: μ̂_ML = 4.4.

Since σ² = Var(X) = κ·η·(1−η), similarly:

From the method of the moments: σ̂²_M = κ·η̂_M·(1−η̂_M) = 10·0.44·(1−0.44) = 2.464.

From the maximum likelihood method: σ̂²_ML = κ·η̂_ML·(1−η̂_ML) = 2.464.
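As a numerical check, a minimal R sketch that maximizes the log-likelihood numerically over the interval (0, 1), using the data of this section; the maximizer should be close to 0.44:

# Numerical maximization of the binomial log-likelihood (kappa = 10)
x = c(4, 4, 3, 5, 6)
kappa = 10
loglik = function(eta) sum(dbinom(x, size=kappa, prob=eta, log=TRUE))
optimize(loglik, interval=c(0.001, 0.999), maximum=TRUE)$maximum # approximately 0.44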

Conclusion: We can see that for the binomial population the two methods provide the same estimator for η. The value of κ must be known to use the expression obtained. In this particular case, the value 0.44 indicates that, for each underlying trial (a Bernoulli variable), one value seems more probable than the other. On the other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided different estimators. As a particular case, κ = 1 for the Bernoulli distribution.

My notes:

Exercise 2pe-m

A random quantity X is supposed to follow a geometric distribution. Let X be a simple random sample.

A) Apply the method of the moments to find an estimator of the parameter η.

B) Apply the maximum likelihood method to find an estimator of the parameter η.

27

C) Given a sample such that ∑ j=1 x j = 134 , apply the formulas obtained in the two previous sections

to give final estimates of η. Finally, give estimates of the mean and the variance of X.

Discussion: This statement is mathematical, although we are given some data in the last section. The random variable X is dimensionless. For the geometric distribution, E(X) = 1/η and Var(X) = (1−η)/η². (See the appendixes for how the mean and the variance of this distribution can be calculated.)

A) Method of the moments

a1) Population and sample moments: The population distribution has only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\eta) = E(X) = \frac{1}{\eta} \qquad\text{and}\qquad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^{n}x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest η appears in the first-order moment of X, the first equation suffices:

$$\mu_1(\eta) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \frac{1}{\eta} = \frac{1}{n}\sum_{j=1}^{n}x_j = \bar{x} \;\rightarrow\; \eta = \left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)^{-1} = \frac{1}{\bar{x}}$$

a3) The estimator:

$$\hat{\eta}_M = \left(\frac{1}{n}\sum_{j=1}^{n}X_j\right)^{-1} = \frac{1}{\bar{X}}$$

B) Maximum likelihood method

b1) Likelihood function: For the geometric distribution, the mass function is $f(x;\eta) = \eta\,(1-\eta)^{x-1}$, so

$$L(x_1,\ldots,x_n;\eta) = \prod_{j=1}^{n} f(x_j;\eta) = \eta(1-\eta)^{x_1-1}\cdots\eta(1-\eta)^{x_n-1} = \eta^{n}(1-\eta)^{\left(\sum_{j=1}^{n}x_j\right)-n}$$

b2) Optimization problem: The logarithm function is applied to make calculations easier,

$$\log[L(x_1,\ldots,x_n;\eta)] = \log[\eta^n] + \log\left[(1-\eta)^{\left(\sum_{j=1}^{n}x_j\right)-n}\right] = n\log(\eta) + \left[\left(\sum_{j=1}^{n}x_j\right)-n\right]\log(1-\eta)$$

The population distribution has only one parameter, so a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\eta}\log[L] = \frac{n}{\eta} + \left[\left(\sum_{j=1}^{n}x_j\right)-n\right]\frac{-1}{1-\eta} \;\rightarrow\; \frac{n}{\eta} = \frac{\left(\sum_{j=1}^{n}x_j\right)-n}{1-\eta}$$

$$\rightarrow\; n - n\eta = \eta\sum_{j=1}^{n}x_j - \eta n \;\rightarrow\; n = \eta\sum_{j=1}^{n}x_j \;\rightarrow\; \eta_0 = \frac{n}{\sum_{j=1}^{n}x_j} = \frac{1}{\bar{x}}$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\eta^2}\log[L] = -\frac{n}{\eta^2} - \left[\left(\sum_{j=1}^{n}x_j\right)-n\right]\frac{1}{(1-\eta)^2} < 0$$

since $\left(\sum_{j=1}^{n}x_j\right)-n \geq 0$ (note that xⱼ ≥ 1) and the first term is strictly negative. This holds for any value, including η₀ = 1/x̄.

b3) The estimator:

$$\hat{\eta}_{ML} = \left(\frac{1}{n}\sum_{j=1}^{n}X_j\right)^{-1} = \frac{1}{\bar{X}}$$

C) Estimation of η, μ, and σ²

Since n = 27 and $\sum_{j=1}^{27} x_j = 134$,

From the method of the moments: η̂_M = 1/x̄ = 27/134 = 0.201.

From the maximum likelihood method, as the same estimator was obtained: η̂_ML = 0.201.

Since μ = E(X) = 1/η, an estimator of η induces an estimator of μ:

From the method of the moments: μ̂_M = 1/η̂_M = 134/27 = 4.96.

From the maximum likelihood method, since the same estimator was obtained: μ̂_ML = 4.96.

Note: From the numerical point of view, calculating 134/27 is expected to have smaller error than calculating 1/0.201.

Finally, since σ² = Var(X) = (1−η)/η²,

From the method of the moments: σ̂²_M = (1−η̂_M)/η̂_M² = (1 − 27/134)/(27/134)² = (134−27)·134/27² = 19.67.

From the maximum likelihood method: σ̂²_ML = 19.67.
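As a numerical check, a minimal R sketch; note that this log-likelihood depends on the sample only through n and the sum of the data, so the individual values are not needed:

# Numerical maximization of the geometric log-likelihood, written from f(x; eta) = eta*(1-eta)^(x-1)
# (R's dgeom counts the failures before the first success, i.e. x - 1 here, so we write the function directly)
n = 27
sx = 134 # sum of the data
loglik = function(eta) n*log(eta) + (sx - n)*log(1 - eta)
optimize(loglik, interval=c(0.001, 0.999), maximum=TRUE)$maximum # approximately 27/134 = 0.201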

Conclusion: For the geometric model, the two methods provide the same estimator for η. We have used the

estimator of η to obtain an estimator of μ. On the other hand, the quality of the estimator obtained should be

studied, especially if the two methods had provided different estimators.
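A minimal R sketch reproducing these numbers (the sample size 27 and the total 134 come from the statement; the formulas are the ones derived above):

# Data from the statement
n = 27
total = 134
xbar = total / n                      # sample mean
eta.hat = 1 / xbar                    # MoM = ML estimate of eta
var.hat = (1 - eta.hat) / eta.hat^2   # plug-in estimate of the variance
c(eta = eta.hat, mu = xbar, var = var.hat)
# eta ~ 0.2015, mu ~ 4.963, var ~ 19.67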

My notes:

Exercise 3pe-m

A real-world variable is modeled by using a random variable X that follows a Poisson distribution. Given a

simple random sample of size n,

A) Apply the method of the moments to obtain an estimator of the parameter λ.

B) Apply the maximum likelihood method to obtain an estimator of the parameter λ.

C) Use these estimators to build estimators of the mean μ and the variance σ² of the distribution.

Discussion: This statement is mathematical; it is assumed that the Poisson model is appropriate to study that variable (which can be supposed to be dimensionless). In a statistical study, this supposition should be evaluated, e.g. by applying a hypothesis test, before looking for an estimator of the population parameter. For a Poisson random variable, $E(X)=\lambda$ and $Var(X)=\lambda$. (See the appendixes for how the mean and the variance of this distribution can be calculated.)

A) Method of the moments

a1) Population and sample moments: The population distribution has only one parameter, so one equation

suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\lambda)=E(X)=\lambda\quad\text{and}\quad m_1(x_1,x_2,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$

a2) System of equations: Since the parameter of interest λ appears in the first-order moment of X, the first

equation suffices. The system has only one trivial equation:

$$\mu_1(\lambda)=m_1(x_1,x_2,\dots,x_n)\;\rightarrow\;\lambda=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$
a3) The estimator:
$$\hat{\lambda}_M=\frac{1}{n}\sum_{j=1}^n X_j=\bar{X}$$

B) Maximum likelihood method

b1) Likelihood function: We write the product and reorder the terms that are similar:
$$L(x_1,x_2,\dots,x_n;\lambda)=\prod_{j=1}^n f(x_j;\lambda)=\prod_{j=1}^n\frac{\lambda^{x_j}e^{-\lambda}}{x_j!}=\frac{\lambda^{x_1}e^{-\lambda}}{x_1!}\cdot\frac{\lambda^{x_2}e^{-\lambda}}{x_2!}\cdots\frac{\lambda^{x_n}e^{-\lambda}}{x_n!}=\frac{\lambda^{\sum_{j=1}^n x_j}\,e^{-n\lambda}}{\prod_{j=1}^n x_j!}$$

b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log L(x_1,\dots,x_n;\lambda)=\Big(\sum_{j=1}^n x_j\Big)\log\lambda-n\lambda-\log\Big(\prod_{j=1}^n x_j!\Big)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find the local extreme values, the necessary condition is:
$$0=\frac{d}{d\lambda}\log L=\Big(\sum_{j=1}^n x_j\Big)\frac{1}{\lambda}-n\;\rightarrow\;\lambda_0=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$
To verify that the only candidate is a (local) maximum, the sufficient condition is:
$$\frac{d^2}{d\lambda^2}\log L=-\Big(\sum_{j=1}^n x_j\Big)\frac{1}{\lambda^2}<0$$
since $x_j\in\{0,1,2,\dots\}$ implies $\sum_{j=1}^n x_j\ge 0$; the second derivative is negative (strictly, whenever some $x_j>0$), in particular at $\lambda_0$.

b3) The estimator: For λ, it is obtained after substituting the lower-case letters xj (numbers representing THE

sample we have) by upper-case letters Xj (random variables representing ANY possible sample we may have):

$$\hat{\lambda}_{ML}=\frac{1}{n}\sum_{j=1}^n X_j=\bar{X}$$

C) Estimation of μ and σ²

To obtain estimators of the mean and the variance, we take into account that for this model $\mu=E(X)=\lambda$ and $\sigma^2=Var(X)=\lambda$, so by applying the plug-in principle:
$$\hat{\mu}=\hat{\lambda}=\bar{X},\qquad \hat{\sigma}^2=\hat{\lambda}=\bar{X}$$
Conclusion: For the Poisson model, the two methods provide the same estimator for λ, and therefore for μ and σ² (when the plug-in principle is applied). On the other hand, the quality of the estimator obtained should be studied (though the sample mean is a well-known estimator).
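As a check, a small R sketch with simulated data (the true λ = 3, the sample size and the seed are arbitrary choices for illustration):

set.seed(1)                  # arbitrary seed, illustration only
x = rpois(200, lambda = 3)   # simulated Poisson sample
lambda.hat = mean(x)         # MoM = ML estimate; it also estimates mu and sigma^2
lambda.hat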

My notes:

Exercise 4pe-m

A random variable X follows the normal distribution. Let X = (X1,...,Xn) be a simple random sample of X (seen

as the population). To obtain an estimator of the parameters θ = (μ,σ), apply:

(A) The method of the moments (B) The maximum likelihood method

(For this distribution, the mean and the variance are directly μ and σ²; this is proved in the appendixes.)

A) Method of the moments

The population distribution has two parameters, so two equations are considered. The first-order moments are
$$\mu_1(\mu,\sigma)=E(X)=\mu\quad\text{and}\quad m_1(x_1,x_2,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$
while the second-order moments are
$$\mu_2(\mu,\sigma)=E(X^2)=Var(X)+E(X)^2=\sigma^2+\mu^2\quad\text{and}\quad m_2(x_1,x_2,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j^2$$
The system of equations is
$$\begin{cases}\mu_1(\mu,\sigma)=m_1(x_1,\dots,x_n)\\ \mu_2(\mu,\sigma)=m_2(x_1,\dots,x_n)\end{cases}\;\rightarrow\;\begin{cases}\mu=\bar{x}\\[4pt] \sigma^2+\mu^2=\dfrac{1}{n}\sum_{j=1}^n x_j^2\end{cases}\;\rightarrow\;\begin{cases}\mu=\bar{x}\\[4pt] \sigma^2=\dfrac{1}{n}\sum_{j=1}^n x_j^2-\bar{x}^2=s_x^2\end{cases}$$
where $Var(X)=E(X^2)-E(X)^2$ and $s_x^2=\frac{1}{n}\sum_{j=1}^n x_j^2-\big(\frac{1}{n}\sum_{j=1}^n x_j\big)^2=\overline{x^2}-\bar{x}^2$ have been used. The estimator is
$$\hat{\theta}_M=\begin{cases}\hat{\mu}_M=\bar{X}\\ \hat{\sigma}_M=s_X\end{cases}$$

B) Maximum likelihood method

b1) Likelihood function: The density function is
$$f(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
so
$$L(x_1,\dots,x_n;\mu,\sigma)=\prod_{j=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x_j-\mu)^2}{2\sigma^2}}=\Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n e^{-\frac{1}{2\sigma^2}\sum_{j=1}^n(x_j-\mu)^2}$$
Logarithm: The logarithm function is applied to make calculations easier:
$$\log L(x_1,\dots,x_n;\mu,\sigma)=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^n(x_j-\mu)^2$$

Maximum: The population distribution has two parameters, so a two-dimensional function must be maximized. To discover the local extreme values, the necessary conditions are:
$$\begin{cases}\dfrac{\partial}{\partial\mu}\log L(x_1,\dots,x_n;\mu,\sigma)=0\\[8pt]\dfrac{\partial}{\partial\sigma}\log L(x_1,\dots,x_n;\mu,\sigma)=0\end{cases}\;\rightarrow\;\begin{cases}\dfrac{1}{\sigma^2}\displaystyle\sum_{j=1}^n(x_j-\mu)=0\\[8pt]-\dfrac{n}{\sigma}+\dfrac{1}{\sigma^3}\displaystyle\sum_{j=1}^n(x_j-\mu)^2=0\end{cases}\;\rightarrow\;\begin{cases}\displaystyle\sum_{j=1}^n x_j=n\mu\\[6pt]\displaystyle\sum_{j=1}^n(x_j-\mu)^2=n\sigma^2\end{cases}\;\rightarrow\;\begin{cases}\mu=\bar{x}\\ \sigma=s_x\end{cases}$$

To verify that the only candidate is a (local) maximum, the sufficient conditions on the partial derivatives of

second order are:

$$A=\frac{\partial^2}{\partial\mu^2}\log L=\frac{\partial}{\partial\mu}\Big[\frac{1}{\sigma^2}\sum_{j=1}^n(x_j-\mu)\Big]=\frac{1}{\sigma^2}\sum_{j=1}^n(-1)=-\frac{n}{\sigma^2}$$
$$B=\frac{\partial^2}{\partial\mu\,\partial\sigma}\log L=\frac{\partial}{\partial\sigma}\Big[\frac{1}{\sigma^2}\sum_{j=1}^n(x_j-\mu)\Big]=-\frac{2}{\sigma^3}\sum_{j=1}^n(x_j-\mu)$$
$$C=\frac{\partial^2}{\partial\sigma^2}\log L=\frac{\partial}{\partial\sigma}\Big[-\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{j=1}^n(x_j-\mu)^2\Big]=\frac{n}{\sigma^2}-\frac{3}{\sigma^4}\sum_{j=1}^n(x_j-\mu)^2$$
To calculate $D=B^2-AC$, substituting the pair $(\mu,\sigma)=(\bar{x},s_x)$ in A, B and C simplifies the work:
$$A\big|_{(\bar{x},s_x)}=-\frac{n}{s_x^2}<0,\qquad B\big|_{(\bar{x},s_x)}=-\frac{2}{s_x^3}\sum_{j=1}^n(x_j-\bar{x})=0,\qquad C\big|_{(\bar{x},s_x)}=\frac{n}{s_x^2}-\frac{3}{s_x^4}\sum_{j=1}^n(x_j-\bar{x})^2=-\frac{2n}{s_x^2}$$
$$D\big|_{(\bar{x},s_x)}=0-\Big(-\frac{n}{s_x^2}\Big)\Big(-\frac{2n}{s_x^2}\Big)=-\frac{2n^2}{s_x^4}<0$$
since $\sum_{j=1}^n(x_j-\bar{x})=(\sum_{j=1}^n x_j)-n\bar{x}=0$ and $\sum_{j=1}^n(x_j-\bar{x})^2=ns_x^2$. Then $\log L(x;\mu,\sigma)$ has a maximum at $(\mu,\sigma)=(\bar{x},s_x)$, since it is a local extreme value with $D<0$ and $A<0$.

$$\hat{\theta}_{ML}=\begin{cases}\hat{\mu}_{ML}=\bar{X}\\ \hat{\sigma}_{ML}=s_X\end{cases}$$
Conclusion: Since in this case there are two parameters, both the parameter and its estimator can be thought of as two-dimensional quantities: $\theta=(\mu,\sigma)$ and $\hat{\theta}=(\hat{\mu},\hat{\sigma})$. On the other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided different estimators.
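A minimal R sketch with simulated data (the true values 5 and 2 and the seed are arbitrary choices); note that R's sd() uses the n−1 denominator, so the ML estimate $s_x$ is computed explicitly:

set.seed(2)                             # arbitrary seed, illustration only
x = rnorm(100, mean = 5, sd = 2)
mu.hat = mean(x)                        # MoM = ML estimate of mu
sigma.hat = sqrt(mean((x - mu.hat)^2))  # s_x, with denominator n (not n-1)
c(mu = mu.hat, sigma = sigma.hat)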

My notes:

Exercise 5pe-m

The uniform distribution U[0,θ] has
$$f(x;\theta)=\begin{cases}\dfrac{1}{\theta} & \text{if } x\in[0,\theta]\\[4pt] 0 & \text{otherwise}\end{cases}$$
as a density function. Let X = (X1,...,Xn) be a simple random sample of a population X following this probability distribution.
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use the estimators obtained to build estimators of the mean and the variance of X.

Discussion: This statement is mathematical, and there is no supposition that would require justification. The random variable X is dimensionless. We are given the density function of the distribution of X, though for this distribution it could be deduced from the fact that all values have the same probability. For the general continuous uniform distribution U[a,b], $E(X)=\frac{a+b}{2}$ and $Var(X)=\frac{(b-a)^2}{12}$; here, $E(X)=\frac{\theta}{2}$ and $Var(X)=\frac{\theta^2}{12}$.

Note: If we had not remembered the first population moments, with the notation of this exercise we could do
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_0^{\theta}\frac{x}{\theta}\,dx=\frac{1}{\theta}\Big[\frac{x^2}{2}\Big]_0^{\theta}=\frac{1}{\theta}\Big(\frac{\theta^2}{2}-0\Big)=\frac{\theta}{2}$$
$$E(X^2)=\int_{-\infty}^{+\infty}x^2 f(x;\theta)\,dx=\int_0^{\theta}\frac{x^2}{\theta}\,dx=\frac{1}{\theta}\Big[\frac{x^3}{3}\Big]_0^{\theta}=\frac{1}{\theta}\Big(\frac{\theta^3}{3}-0\Big)=\frac{\theta^2}{3}$$
so
$$\mu=E(X)=\frac{\theta}{2}\quad\text{and}\quad\sigma^2=Var(X)=E(X^2)-E(X)^2=\frac{\theta^2}{3}-\frac{\theta^2}{4}=\frac{\theta^2}{12}$$

A) Method of the moments

a1) Population and sample moments: For uniform distributions, discrete or continuous, the mean is the middle value. Then, the first-order moments of the distribution and of the sample are
$$\mu_1(\theta)=\frac{0+\theta}{2}=\frac{\theta}{2}\quad\text{and}\quad m_1(x_1,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$
a2) System of equations:
$$\mu_1(\theta)=m_1(x_1,x_2,\dots,x_n)\;\rightarrow\;\frac{\theta}{2}=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}\;\rightarrow\;\theta_0=\frac{2}{n}\sum_{j=1}^n x_j=2\bar{x}$$
a3) The estimator:
$$\hat{\theta}_M=\frac{2}{n}\sum_{j=1}^n X_j=2\bar{X}$$

B) Maximum likelihood method

b1) Likelihood function: The density function is $f(x;\theta)=\frac{1}{\theta}$ for $0\le x\le\theta$, so
$$L(x_1,x_2,\dots,x_n;\theta)=\prod_{j=1}^n f(x_j;\theta)=\prod_{j=1}^n\frac{1}{\theta}=\frac{1}{\theta^n}$$

b2) Optimization problem: First, we try to discover the maximum by applying the technique based on the derivatives. The logarithm function is applied,
$$\log L(x_1,x_2,\dots,x_n;\theta)=\log(\theta^{-n})=-n\log\theta,$$
and the first condition leads to a useless equation:
$$0=\frac{d}{d\theta}\log L(x_1,x_2,\dots,x_n;\theta)=-n\frac{1}{\theta}\;\rightarrow\;?$$

Then, we realize that global minima and maxima cannot always be found through the derivatives (only if they

are also local extremes). In fact, it is easy to see that the function L monotonically decreases with θ and

therefore monotonically increases when θ decreases (this pattern or just the opposite tend to happen when the

probability function changes monotonically with the parameter, e.g. when the parameter appears only once in

the expression). As a consequence, it has no local extreme values. Since, on the other hand, 0≤x j≤θ , ∀ j ,

$$\begin{cases}x_j\le\theta,\ \forall j\\ L\text{ increases when }\theta\text{ decreases}\end{cases}\;\rightarrow\;\theta_0=\max_j\{x_j\}$$

b3) The estimator: It is obtained after substituting the lower-case letters xj (numbers representing THE

sample we have) by upper-case letters Xj (random variables representing ANY possible sample we may have):

$$\hat{\theta}_{ML}=\max_j\{X_j\}$$

C) Estimation of μ and σ²

To obtain estimators of the mean, we take into account that $\mu=E(X)=\frac{\theta}{2}$ and apply the plug-in principle:
$$\hat{\mu}_M=\frac{\hat{\theta}_M}{2}=\frac{2\bar{X}}{2}=\bar{X}\qquad\qquad\hat{\mu}_{ML}=\frac{\hat{\theta}_{ML}}{2}=\frac{\max_j\{X_j\}}{2}$$
To obtain estimators of the variance, since $\sigma^2=Var(X)=\frac{\theta^2}{12}$:
$$\hat{\sigma}^2_M=\frac{\hat{\theta}_M^2}{12}=\frac{(2\bar{X})^2}{12}=\frac{\bar{X}^2}{3}\qquad\qquad\hat{\sigma}^2_{ML}=\frac{\hat{\theta}_{ML}^2}{12}=\frac{(\max_j\{X_j\})^2}{12}$$

Conclusion: For the uniform distribution, both methods provide different estimators of the parameter and

hence of the mean. The quality of the estimators obtained should be studied.
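A quick R comparison of the two estimators with simulated data (θ = 10, n = 50 and the seed are arbitrary choices for illustration):

set.seed(3)                  # arbitrary seed, illustration only
theta = 10
x = runif(50, min = 0, max = theta)
c(mom = 2 * mean(x), ml = max(x))   # two estimates of theta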

My notes:

Exercise 6pe-m

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

$$f(x;\theta)=\begin{cases}0 & \text{if } x<3\\[4pt] \dfrac{1}{\theta}\,e^{-\frac{x-3}{\theta}} & \text{if } x\ge 3\end{cases}$$
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use the estimators obtained to build estimators of the mean μ and the variance σ².
Hint: Use that E(X) = θ + 3 and E(X²) = 2θ² + 6θ + 9.

Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The

probability function and the first two moments are given, which is enough to apply the two methods. In the

last step, the plug-in principle will be applied.

Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions “of different type”):
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_3^{\infty}x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}dx=\Big[-x\,e^{-\frac{x-3}{\theta}}\Big]_3^{\infty}-\int_3^{\infty}1\cdot\big(-e^{-\frac{x-3}{\theta}}\big)dx=\Big[-(x+\theta)\,e^{-\frac{x-3}{\theta}}\Big]_3^{\infty}=3+\theta.$$
That $\int u(x)\,v'(x)\,dx=u(x)\,v(x)-\int u'(x)\,v(x)\,dx$ has been used with
• $u=x\;\rightarrow\;u'=1$
• $v'=\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\;\rightarrow\;v=\int\frac{1}{\theta}e^{-\frac{x-3}{\theta}}dx=-e^{-\frac{x-3}{\theta}}$
On the other hand, $e^x$ changes faster than $x^k$ for any k. To calculate E(X²):
$$E(X^2)=\int_{-\infty}^{+\infty}x^2 f(x;\theta)\,dx=\int_3^{\infty}x^2\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}dx=\Big[-x^2 e^{-\frac{x-3}{\theta}}\Big]_3^{\infty}+2\int_3^{\infty}x\,e^{-\frac{x-3}{\theta}}dx=(3^2-0)+2\theta\,\mu=9+2\theta(3+\theta)=2\theta^2+6\theta+9.$$
Integration by parts has been applied again, with
• $u=x^2\;\rightarrow\;u'=2x$
• $v'=\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\;\rightarrow\;v=-e^{-\frac{x-3}{\theta}}$
Again, $e^x$ changes faster than $x^k$ for any k.

A) Method of the moments

a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,
$$\mu_1(\theta)=E(X)=\theta+3\quad\text{and}\quad m_1(x_1,x_2,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$
a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first equation suffices:
$$\mu_1(\theta)=m_1(x_1,x_2,\dots,x_n)\;\rightarrow\;\theta+3=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}\;\rightarrow\;\theta=\bar{x}-3$$
a3) The estimator:
$$\hat{\theta}_M=\bar{X}-3$$

B) Maximum likelihood method

b1) Likelihood function: For this probability distribution, the density function is $f(x;\theta)=\frac{1}{\theta}e^{-\frac{x-3}{\theta}}$ for $x\ge 3$, so
$$L(x_1,x_2,\dots,x_n;\theta)=\prod_{j=1}^n f(x_j;\theta)=\prod_{j=1}^n\frac{1}{\theta}e^{-\frac{x_j-3}{\theta}}=\frac{1}{\theta^n}\,e^{-\frac{1}{\theta}\sum_{j=1}^n(x_j-3)}$$

b2) Optimization problem: The logarithm function is applied to make calculations easier

$$\log L(x_1,x_2,\dots,x_n;\theta)=\log(\theta^{-n})-\frac{1}{\theta}\sum_{j=1}^n(x_j-3)=-n\log\theta-\frac{1}{\theta}\sum_{j=1}^n(x_j-3)$$
The population distribution has only one parameter, so a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:
$$0=\frac{d}{d\theta}\log L=-\frac{n}{\theta}+\frac{1}{\theta^2}\sum_{j=1}^n(x_j-3)\;\rightarrow\;\frac{n}{\theta}=\frac{1}{\theta^2}\sum_{j=1}^n(x_j-3)$$
$$\rightarrow\;\theta=\frac{1}{n}\sum_{j=1}^n(x_j-3)=\frac{1}{n}\sum_{j=1}^n x_j-\frac{1}{n}\sum_{j=1}^n 3=\bar{x}-3\;\rightarrow\;\theta_0=\bar{x}-3$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\theta^2}\log L=\frac{d}{d\theta}\Big[-\frac{n}{\theta}+\frac{1}{\theta^2}\sum_{j=1}^n(x_j-3)\Big]=\frac{n}{\theta^2}-\frac{2}{\theta^3}\sum_{j=1}^n(x_j-3)\;\overset{?}{<}\;0$$
The first term is always positive but the second is always negative, so we had better substitute the candidate $\theta_0=\bar{x}-3$, for which $\sum_{j=1}^n(x_j-3)=n(\bar{x}-3)=n\theta_0$:
$$\frac{d^2}{d\theta^2}\log L\Big|_{\theta_0}=\frac{n}{\theta_0^2}-\frac{2}{\theta_0^3}\,n\theta_0=\frac{n}{\theta_0^2}-\frac{2n}{\theta_0^2}=-\frac{n}{\theta_0^2}<0$$

b3) The estimator:

$$\hat{\theta}_{ML}=\bar{X}-3$$

C) Estimation of μ and σ²

c1) For the mean: By using the hint and the plug-in principle,
From the method of the moments: $\hat{\mu}_M=\hat{\theta}_M+3=\bar{X}-3+3=\bar{X}$.
From the maximum likelihood method, as the same estimator was obtained: $\hat{\mu}_{ML}=\bar{X}$.
c2) For the variance: We must write it in terms of the first two moments of X,
$$\sigma^2=Var(X)=E(X^2)-E(X)^2=2\theta^2+6\theta+9-(\theta+3)^2=2\theta^2+6\theta+9-\theta^2-6\theta-9=\theta^2$$
Then,
From the method of the moments: $\hat{\sigma}^2_M=\hat{\theta}_M^2=(\bar{X}-3)^2=\bar{X}^2-6\bar{X}+9$.
From the maximum likelihood method: $\hat{\sigma}^2_{ML}=\hat{\theta}_{ML}^2=(\bar{X}-3)^2=\bar{X}^2-6\bar{X}+9$.

Conclusion: For this model, the two methods provide the same estimator. We have used the estimator of θ to obtain estimators of μ and σ². The quality of the estimator obtained should be studied, especially if the two methods had provided different estimators. Regarding the original probability distribution: (i) the expression reminds us of the exponential distribution; (ii) the term x–3 suggests a translation; and (iii) the variance θ² is the same as the variance of the exponential distribution. After translating all possible values x, the mean is also translated but the variance is not. Thus, the distribution of the statement is a translation of the exponential distribution. In fact, the distribution with probability function
$$f(x;\theta)=\frac{1}{\theta}\,e^{-\frac{x-\delta}{\theta}},\quad x>\delta\ \text{(and zero elsewhere)}$$
is termed the two-parameter exponential distribution. It is a translation of size δ of the usual exponential distribution. A particular, simple case is obtained for θ = 1 and δ = 0, since $f(x)=e^{-x}$, $x>0$.
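A minimal R sketch exploiting the translation structure just described (θ = 2, n = 100 and the seed are arbitrary choices for illustration):

set.seed(4)                         # arbitrary seed, illustration only
theta = 2
x = 3 + rexp(100, rate = 1/theta)   # translation of size 3 of an exponential
theta.hat = mean(x) - 3             # MoM = ML estimate
c(theta = theta.hat, mu = theta.hat + 3, var = theta.hat^2)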

My notes:

Exercise 7pe-m

A random quantity X is supposed to follow a distribution whose probability function is, for θ>0,

$$f(x;\theta)=\begin{cases}\dfrac{3x^2}{\theta^3} & \text{if } 0\le x\le\theta\\[6pt] 0 & \text{otherwise}\end{cases}$$
A) Apply the method of the moments to find an estimator of the parameter θ.
B) Apply the maximum likelihood method to find an estimator of the parameter θ.
C) Use the estimators obtained to build estimators of the mean μ and the variance σ².
Hint: Use that E(X) = 3θ/4 and Var(X) = 3θ²/80.

Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The

probability function and the first two moments are given, which is enough to apply the two methods. In the

last step, the plug-in principle will be applied.

Note: If E(X) had not been given in the statement, it could have been calculated by integrating:
$$E(X)=\int_{-\infty}^{+\infty}x\,f(x;\theta)\,dx=\int_0^{\theta}x\,\frac{3x^2}{\theta^3}\,dx=\frac{3}{\theta^3}\Big[\frac{x^4}{4}\Big]_0^{\theta}=\frac{3}{4}\theta$$
On the other hand, if Var(X) had not been given in the statement, it could have been calculated by using a property and integrating:
$$E(X^2)=\int_{-\infty}^{+\infty}x^2 f(x;\theta)\,dx=\int_0^{\theta}x^2\,\frac{3x^2}{\theta^3}\,dx=\frac{3}{\theta^3}\Big[\frac{x^5}{5}\Big]_0^{\theta}=\frac{3}{5}\theta^2.$$
Now,
$$\mu=E(X)=\frac{3}{4}\theta\quad\text{and}\quad\sigma^2=Var(X)=E(X^2)-E(X)^2=\frac{3}{5}\theta^2-\Big(\frac{3}{4}\theta\Big)^2=\Big(\frac{3}{5}-\frac{9}{16}\Big)\theta^2=\frac{3}{80}\theta^2.$$

A) Method of the moments

a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order

moments of the model X and the sample x are, respectively,

$$\mu_1(\theta)=E(X)=\frac{3}{4}\theta\quad\text{and}\quad m_1(x_1,x_2,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}$$
a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first equation suffices:
$$\mu_1(\theta)=m_1(x_1,x_2,\dots,x_n)\;\rightarrow\;\frac{3}{4}\theta=\frac{1}{n}\sum_{j=1}^n x_j=\bar{x}\;\rightarrow\;\theta_0=\frac{4}{3}\bar{x}$$
a3) The estimator:
$$\hat{\theta}_M=\frac{4}{3}\bar{X}$$

B) Maximum likelihood method

b1) Likelihood function: For this probability distribution, the density function is $f(x;\theta)=\frac{3x^2}{\theta^3}$ for $0\le x\le\theta$, so
$$L(x_1,x_2,\dots,x_n;\theta)=\prod_{j=1}^n f(x_j;\theta)=\prod_{j=1}^n\frac{3x_j^2}{\theta^3}=\frac{3^n}{\theta^{3n}}\prod_{j=1}^n x_j^2$$

b2) Optimization problem: The logarithm function is applied to make calculations easier

$$\log L(x_1,x_2,\dots,x_n;\theta)=n\log 3-3n\log\theta+\log\Big(\prod_{j=1}^n x_j^2\Big)$$
Now, if we try to find the maximum by looking at the first-order derivative, a useless equation is obtained:
$$0=\frac{d}{d\theta}\log L=-3n\frac{1}{\theta}\;\rightarrow\;?$$

Then, we realize that global minima and maxima cannot in general be found through the derivatives (only if

they are also local). It is easy to see that the function L monotonically increases when θ decreases (this pattern

or just the opposite tend to happen when the probability function changes monotonically with the parameter,

e.g. when the parameter appears only once in the expression). As a consequence, it has no local extreme

values. On the other hand, 0≤x j≤θ , ∀ j , so

$$\begin{cases}x_j\le\theta,\ \forall j\\ L\text{ increases when }\theta\text{ decreases}\end{cases}\;\rightarrow\;\theta_0=\max_j\{x_j\}$$
b3) The estimator:
$$\hat{\theta}_{ML}=\max_j\{X_j\}$$

C) Estimation of μ and σ²

c1) For the mean: By using the hint and the plug-in principle,
From the method of the moments: $\hat{\mu}_M=\frac{3}{4}\hat{\theta}_M=\frac{3}{4}\cdot\frac{4}{3}\bar{X}=\bar{X}$.
From the maximum likelihood method: $\hat{\mu}_{ML}=\frac{3}{4}\hat{\theta}_{ML}=\frac{3}{4}\max_j\{X_j\}$.
c2) For the variance: By using that principle again,
From the method of the moments: $\hat{\sigma}^2_M=\frac{3}{80}\hat{\theta}_M^2=\frac{3}{80}\Big(\frac{4}{3}\bar{X}\Big)^2=\frac{\bar{X}^2}{15}$.
From the maximum likelihood method: $\hat{\sigma}^2_{ML}=\frac{3}{80}\hat{\theta}_{ML}^2=\frac{3}{80}\big(\max_j\{X_j\}\big)^2$.

Conclusion: For this model, the two methods provide different estimators. The quality of the estimators obtained should be studied. We have used the estimator of θ to obtain estimators of μ and σ².
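Since the distribution function here is $F(x)=(x/\theta)^3$ on $[0,\theta]$, samples can be drawn by the inverse-transform method; a minimal R sketch (θ = 4 and the settings are arbitrary choices for illustration):

set.seed(5)                    # arbitrary seed, illustration only
theta = 4
x = theta * runif(200)^(1/3)   # inverse transform: F(x) = (x/theta)^3
c(mom = (4/3) * mean(x), ml = max(x))   # the two estimates of theta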

My notes:

Remark 4pe: As regards the sample sizes, we can talk about static situations where we study the dependence of the concepts on the

sizes, or the possible relation between the sizes, say nX = c·nY. On the other hand, we can talk about dynamic situations where the

same dependences are studied asymptotically while the sample sizes are always increasing, say nX(k)= c(k)·nY(k), where k is the

index of a sequence of statistical schemes with those sample sizes. (Statistically, we are interested in sequences with nondecreasing

sample sizes; mathematically, all possible sequences should be taken into account.) The static and the dynamic situations can respectively be represented in two figures: one with the sample sizes fixed and one with the sample sizes growing along the sequence (figures omitted).

Remark 5pe: We do not usually use the definition of the mean square error but the result at the end of the following equalities:
$$MSE(\hat{\theta})=E\big([\hat{\theta}-\theta]^2\big)=E\big([\hat{\theta}-E(\hat{\theta})+E(\hat{\theta})-\theta]^2\big)=E\big([\hat{\theta}-E(\hat{\theta})]^2\big)+[E(\hat{\theta})-\theta]^2+2\,[E(\hat{\theta})-\theta]\,E\big(\hat{\theta}-E(\hat{\theta})\big)=Var(\hat{\theta})+b(\hat{\theta})^2$$
where the cross term vanishes because $E\big(\hat{\theta}-E(\hat{\theta})\big)=E(\hat{\theta})-E(\hat{\theta})=0$.

Remark 6pe: To study the consistency in probability we have been taught a sufficient—but not necessary—condition that is equivalent to the consistency in mean of order two (managing the definition is quite complex). Thus, this type of consistency is proved when the condition is fulfilled, which is sufficient—but not necessary—for the consistency in probability. By using Chebyshev's inequality:
$$P(|\hat{\theta}-\theta|\ge\epsilon)\le\frac{E\big((\hat{\theta}-\theta)^2\big)}{\epsilon^2}=\frac{MSE(\hat{\theta})}{\epsilon^2}\;\rightarrow\;\lim_{n\to\infty}P(|\hat{\theta}-\theta|\ge\epsilon)\le\frac{\lim_{n\to\infty}MSE(\hat{\theta})}{\epsilon^2}$$
If the sufficient condition is not fulfilled, the estimator under study is not consistent in mean of order two, but it can still be consistent in probability—that type of consistency should then be studied in a different way. Additionally, since $MSE(\hat{\theta})$, $b(\hat{\theta})^2$ and $Var(\hat{\theta})$ are nonnegative, the mean square error is zero if and only if the other two are zero at the same time, and vice versa. The same happens for their limits. That is why we are allowed to split the limit of the mean square error into two limits.

Exercise 1pe-p

The efficiency (in lumens per watt, u) of light bulbs of a certain type has a population mean of 9.5u and

standard deviation of 0.5u, according to production specifications. The specifications for a room in which

eight of these bulbs (the simple random sample) are to be installed call for the average efficiency of the eight

bulbs to exceed 10u. Find the probability that this specification for the room will be met, assuming that

efficiency measurements are normally distributed.

(From Mathematical Statistics with Applications, Mendenhall, W., D.D. Wackerly and R.L. Scheaffer, Duxbury Press.)

Discussion: The supposition that efficiency measurements follow the distribution N(μ = 9.5u, σ² = 0.5²u²) should be tested by applying an appropriate statistical technique. The event is defined in terms of $\bar{X}$. We think about making the proper statistic appear, and hence being allowed to use its sampling distribution.

Identification of the variable and selection of the statistic : The variable is the efficiency of the

light bulbs, while the estimator is the sample mean of eight elements. Since the population is normal and the

two population parameters are known, we will consider the (dimensionless) statistic:
$$T(X;\mu)=\frac{\bar{X}-\mu}{\sqrt{\dfrac{\sigma^2}{n}}}\sim N(0,1)$$
Rewriting the event: Although in this case the sampling distribution of $\bar{X}$ is known, as $\bar{X}\sim N\big(\mu,\frac{\sigma^2}{n}\big)$, we need to standardize before consulting the table of the standard normal distribution:
$$P(\bar{X}>10)=P\Bigg(\frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}}>\frac{10-\mu}{\sqrt{\sigma^2/n}}\Bigg)=P\Bigg(T>\frac{10-9.5}{\sqrt{0.5^2/8}}\Bigg)=P\Big(T>\frac{0.5\sqrt{8}}{0.5}\Big)=P(T>\sqrt{8})=0.0023$$

where in this case the language R has been used:
> 1 - pnorm(sqrt(8),0,1)
[1] 0.002338867

Conclusion: The production specifications will be met, for the room mentioned, with a probability of

0.0023, that is, they will hardly be met.

My notes:

Exercise 2pe-p

When a production process is working properly, the resistance of the components follows a normal

distribution with standard deviation 4.68u. A simple random sample with four components is taken. What is

the probability that the sample quasivariance will be bigger than 30u2?

Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable resistance should be evaluated by using proper statistical techniques. The question involves S². Again, it is necessary to make the proper statistic appear, in order to use its sampling distribution.

R ≡ Resistance (of one component), with R ~ N(μ, σ² = 4.68²u²)
R1, R2, R3, R4 (the resistance of four components is measured) → n = 4
$$S^2=\frac{1}{4-1}\sum_{j=1}^4 (R_j-\bar{R})^2\quad\text{(sample quasivariance)}$$
Search for a known distribution: The quantity required is P(S² > 30). To calculate the probability of an event, we need to know the distribution of the random quantity involved. In this case, we do not know the sampling distribution of S², but since R follows a normal distribution we are allowed to use
$$T=\frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}$$

Then, by completing the inequality with the necessary constants (until making T appear):

$$P(S^2>30)=P\Big(\frac{(n-1)S^2}{\sigma^2}>\frac{(n-1)\,30}{\sigma^2}\Big)=P\Big(T>\frac{(4-1)\,30}{4.68^2}\Big)=P(T>4.11)$$
where $T\sim\chi^2_3$. Multiplying and dividing by positive quantities does not change the inequality.

Table of the χ² distribution: Since n–1 = 4–1 = 3, it is necessary to look at the third row. The probabilities in the table are given for events of the form $P(T<x)$ (or $P(T\le x)$, as the distribution is continuous), and therefore the complementary event must be considered:
$$P(T>4.11)=1-P(T\le 4.11)=1-0.75=0.25$$

Conclusion: The probability of the event is 0.25. This means that S² will sometimes take a value larger than 30u², when evaluated at specific data x coming from the mentioned distribution.
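The same probability can be computed exactly in R, where pchisq gives lower-tail probabilities of the χ² distribution:

1 - pchisq((4 - 1) * 30 / 4.68^2, df = 3)   # P(S^2 > 30), approximately 0.25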

My notes:

Exercise 3pe-p

A simple random sample of 270 homes was taken from a large population of older homes to estimate the

proportion of homes with unsafe wiring. If, in fact, 20% of homes have unsafe wiring, what is the probability

that the sample proportion will be between 16% and 24%?

Hint: Since probabilities and proportions are measured in a 0-to-1 scale, write all quantities in this scale.

(From Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)

home, house. It is a tribute to the unquenchable sentimentalism of users of English that one of the matters of usage that seem to agitate

them the most is the use of home to designate a structure designed for residential purposes. Their contention is that what the builder erects

is a house and that the occupants then fashion it into a home.

That is, or at least was, basically true, but the distinction has become blurred. Nor is this solely the doing of the real estate

operators. They do, indeed, lure prospective buyers not with the thought of mere masonry but with glowing picture of comfort,

congeniality, and family collectivity that make a house into a home. But the prospective buyers are their co-conspirators; they, too, view

the premises not as a heap of stone and wood but as a potential abode.

There may be areas in which the words are not used interchangeably. In legal or quasi-legal terminology we speak of a “house and

lot,” not a “home and lot.” The police and fire departments usually speak of a robbery or a fire in a house, not a home, at Main Street and

First Avenue. And the individual most often buys a home, but sells his house (there, apparently, speaks sentiment again). But in most

areas the distinction between the words has become obfuscated. When a flood or a fire destroys a community, it wipes out not merely

houses but homes as well, and homes has come to be accepted in this sense. No one would discourage the sentimentalists from trying to

pry the two words apart, but it would be rash to predict much success for them.

Discussion: The information of this “real-world study” must be translated into the mathematical language.

Since there are two possible situations, each home can be “modeled” by using a Bernoulli variable. Although

given in a 0-to-100 scale, the population and sample proportions—always in a 0-to-1 scale—are involved. The

dimensionless character of a proportion is due to its definition. Note that if the data (x1,...,xn) are taken and we

have access to them, there is nothing random any longer. The lack of knowledge, as if we had to select n

elements to build (X1,...,Xn), justifies the use of Probability Theory.

Identification of the variable and selection of the statistic: The variable having unsafe wiring can take two possible values: 0 (not having unsafe wiring) and 1 (having it, if one wants to register or count this fact). The theoretical proportion of older homes with unsafe wiring is known: η = 0.20 (20%). For this framework—a large sample from a Bernoulli population with parameter η—we select the dimensionless, asymptotic statistic:
$$T(X;\eta)=\frac{\hat{\eta}-\eta}{\sqrt{\dfrac{\eta(1-\eta)}{n}}}\;\xrightarrow{d}\;N(0,1)$$
(here η is known, so it is used in the denominator instead of $\hat{\eta}$).

Rewriting the event: We are asked for the probability $P(0.16<\hat{\eta}<0.24)$, but to calculate it we need to rewrite the event until making T appear:
$$P(0.16<\hat{\eta}<0.24)=P\Bigg(\frac{0.16-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}}<\frac{\hat{\eta}-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}}<\frac{0.24-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}}\Bigg)$$
$$=P\Bigg(T<\frac{0.24-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\Bigg)-P\Bigg(T\le\frac{0.16-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\Bigg)=P(T<1.64)-P(T\le-1.64)$$
(In these calculations, we have standardized and then decomposed, but it is also possible to decompose and then standardize.) Now, let us assume that we have a table of the standard normal distribution including positive quantiles only. By using a simple plot with the density function of this distribution, it is easy to see (look at the areas) that for the second probability $P(T\le-1.64)=P(T\ge+1.64)=1-P(T<+1.64)$, so
$$P(T<1.64)-P(T\le-1.64)=P(T<1.64)-[1-P(T<1.64)]=2\cdot P(T<1.64)-1=2\cdot 0.9495-1=0.90.$$

Alternatively, by using the language R:
> pnorm(1.64,0,1) - pnorm(-1.64,0,1)
[1] 0.8989948

Conclusion: The probability of the event is 0.90, which means that the sample proportion of older homes

with unsafe wiring, calculated from the sample X = (X1,...,X270), will take a value between 0.16 and 0.24 with

this probability. As a percentage: the proportion of the 270 homes with unsafe wiring will be between 16%

and 24% with 90% certainty.

My notes:

Exercise 4pe-p

Simple random samples X = (X1,...,X11) and Y = (Y1,...,Y6) are taken from two independent populations
$$X\sim N(\mu_X=1,\ \sigma_X^2=1)\quad\text{and}\quad Y\sim N(\mu_Y=2,\ \sigma_Y^2=0.5)$$
Calculate or find:
(1) The probability $P(S_Y^2\le 1.5)$.
(2) The quantile c such that $P(\bar{X}>c)=0.25$.
(3) The probability $P(\bar{X}-0.1>0.1+\bar{Y})$.
(4) The quantile c such that $P\big(S_X^2/S_Y^2\le c\big)=0.9$.
(Advanced item) The probability $P(\bar{X}-0.1>0.1-\bar{Y})$.

Discussion: There are two independent normal populations whose parameters are known. The variances, not the standard deviations, are given. It is required to calculate probabilities or find quantiles for events involving the sample means and the sample quasivariances. In the first two sections, only one of the populations is involved. Sample sizes are 11 and 6, respectively. The variables X and Y are dimensionless, and so are both sides of the inequalities.

(1) The event involves the estimator $S_Y^2$, which reminds us of the statistic $T=\dfrac{(n_Y-1)S_Y^2}{\sigma_Y^2}\sim\chi^2_{n_Y-1}$. Then,
$$P(S_Y^2\le 1.5)=P\Big(\frac{(n_Y-1)S_Y^2}{\sigma_Y^2}\le\frac{(n_Y-1)\,1.5}{\sigma_Y^2}\Big)=P\Big(T\le\frac{(6-1)\,1.5}{0.5}\Big)=P(T\le 15)=0.99$$
(2) The event involves $\bar{X}$, so we think about the statistic $T=\dfrac{\bar{X}-\mu_X}{\sqrt{\sigma_X^2/n_X}}\sim N(0,1)$. Then,
$$0.25=P(\bar{X}>c)=P\Bigg(\frac{\bar{X}-\mu_X}{\sqrt{\sigma_X^2/n_X}}>\frac{c-\mu_X}{\sqrt{\sigma_X^2/n_X}}\Bigg)=P\Bigg(T>\frac{c-1}{\sqrt{1/11}}\Bigg)$$
or, equivalently,
$$1-0.25=0.75=P\Bigg(T\le\frac{c-1}{\sqrt{1/11}}\Bigg)$$
Now, the quantile found in the table of the standard normal distribution must verify
$$r_{0.25}=l_{0.75}=0.674=\frac{c-1}{\sqrt{1/11}}\;\rightarrow\;c=0.674\sqrt{\frac{1}{11}}+1=1.20$$

(3) To work with the means of two populations, we use $T=\dfrac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\sim N(0,1)$, so
$$P(\bar{X}-0.1>0.1+\bar{Y})=P(\bar{X}-\bar{Y}>0.2)=P\Bigg(T>\frac{0.2-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\Bigg)=P\Bigg(T>\frac{0.2-(1-2)}{\sqrt{\frac{1}{11}+\frac{0.5}{6}}}\Bigg)$$
$$=P\Bigg(T>\frac{0.2+1}{\sqrt{\frac{1}{11}+\frac{1}{12}}}\Bigg)=P(T>2.87)=1-P(T\le 2.87)=1-0.9979=0.0021$$

(4) To work with the variances of two populations, $T=\dfrac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2}\sim F_{n_X-1,n_Y-1}$ is used:
$$0.9=P\Big(\frac{S_X^2}{S_Y^2}\le c\Big)=P\Big(\frac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2}\le c\,\frac{\sigma_Y^2}{\sigma_X^2}\Big)=P\Big(T\le c\,\frac{\sigma_Y^2}{\sigma_X^2}\Big)=P\Big(T\le c\,\frac{0.5}{1}\Big)=P\Big(T\le\frac{c}{2}\Big)$$
The quantile found in the table of the distribution $F_{n_X-1,n_Y-1}=F_{11-1,6-1}=F_{10,5}$ is 3.30, which allows us to find the unknown c:
$$r_{0.1}=l_{0.9}=3.30=\frac{c}{2}\;\rightarrow\;c=6.60.$$
In R:
> qf(0.9, 10, 5)
[1] 3.297402

(Advanced item) In this case, allocating the two sample means on the first side of the inequality leads to
$$P(\bar{X}-0.1>0.1-\bar{Y})=P(\bar{X}+\bar{Y}>0.2)$$
We remember that
$$\bar{X}\sim N\Big(\mu_X,\frac{\sigma_X^2}{n_X}\Big)\quad\text{and}\quad\bar{Y}\sim N\Big(\mu_Y,\frac{\sigma_Y^2}{n_Y}\Big)$$
so the rules that govern the sums—and hence subtractions—of normally distributed variables imply both
$$\bar{X}-\bar{Y}\sim N\Big(\mu_X-\mu_Y,\ \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\Big)\quad\text{and}\quad\bar{X}+\bar{Y}\sim N\Big(\mu_X+\mu_Y,\ \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\Big)$$
(Note that in both cases the variances are added—uncertainty increases.) Although the difference is used more frequently, to compare two populations, the sampling distribution of the sum of the sample means is also known thanks to the rules for normal variables; alternatively, we could still use the first result by writing $\bar{X}+\bar{Y}=\bar{X}-(-\bar{Y})$ and using that $-\bar{Y}$ has mean $-\mu_Y$ and variance $\sigma_Y^2/n_Y$. Either way, after standardizing:
$$T=\frac{(\bar{X}+\bar{Y})-(\mu_X+\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\sim N(0,1)$$
This is the “mathematical tool” necessary to work with $\bar{X}+\bar{Y}$. Now,
$$P(\bar{X}-0.1>0.1-\bar{Y})=P(\bar{X}+\bar{Y}>0.2)=P\Bigg(T>\frac{0.2-(\mu_X+\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}}\Bigg)=P\Bigg(T>\frac{0.2-(1+2)}{\sqrt{\frac{1}{11}+\frac{0.5}{6}}}\Bigg)=P\Bigg(T>\frac{0.2-3}{\sqrt{\frac{1}{11}+\frac{1}{12}}}\Bigg)=P(T>-6.71)=1-P(T\le-6.71)=1$$
The quantile 6.71 is not usually in the tables of the N(0,1), so we can consider that $P(T\le-6.71)\approx 0$. Or, if we use the programming language R:
> 1-pnorm(-6.71,0,1)
[1] 1

Conclusion: For each case, we have selected the appropriate statistic. After completing the expression of the

event, the statistic T appears. Then, since the (sampling) distribution of T is known, the tables can be used to

calculate probabilities or to find quantiles. In the latter case, the unknown c is found after the quantile of T.
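All four items can also be checked numerically in R with the standard distribution functions:

pchisq(15, df = 5)                           # item (1): about 0.99
qnorm(0.75) * sqrt(1/11) + 1                 # item (2): c, about 1.20
1 - pnorm((0.2 + 1) / sqrt(1/11 + 0.5/6))    # item (3): about 0.0021
2 * qf(0.9, 10, 5)                           # item (4): c, about 6.60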

My notes:

Exercise 5pe-p

Suppose that you manage a bank where the amounts of daily deposits and daily withdrawals are given by independent random variables with normal distributions. For deposits, the mean is ₤12,000 and the standard deviation is ₤4,000; for withdrawals, the mean is ₤10,000 and the standard deviation is ₤5,000.
(a) For a week, calculate or bound the probability that the five withdrawals will add up to more than ₤55,000.
(b) For a particular day, calculate or bound the probability that withdrawals will exceed deposits by more than ₤5,000.
Imagine that you are to launch a new monthly product. A prospective study indicated that profits (in millions of pounds) can be modeled through the random quantity Q = (X+1)/2.325, where X follows a t distribution with twenty degrees of freedom.
(c) For a particular month, calculate or bound the probability that profits will be smaller than ₤10⁶ (one million pounds).
(Based on an exercise of Business Statistics, Douglas Downing and Jeffrey Clark, Barron's.)

Discussion: There are several suppositions implicit in the statement, namely: (i) the normal distribution can

reasonably be used to model the two variables of interest D and W; (ii) withdrawals and deposits are

independent; and (iii) X can reasonably be modeled by using the t distribution. These suppositions should

firstly be evaluated by using proper statistical techniques. To solve this exercise, the rules on sums and

differences of normally distributed variables must be used.

Identification of variables and distributions: If D and W represent the random variables daily sum

of deposits and daily sum of withdrawals, respectively, from the statement we have that

$$D\sim N(\mu_D=₤12{,}000,\ \sigma_D^2=₤^2\,4{,}000^2)\quad\text{and}\quad W\sim N(\mu_W=₤10{,}000,\ \sigma_W^2=₤^2\,5{,}000^2)$$

(a) Since the variables are measured daily, in a week we have five measurements (one for each working day).
Translation into the mathematical language: We are asked for the probability
$$P(W_1+W_2+W_3+W_4+W_5>55{,}000)=P\Big(\sum_{j=1}^5 W_j>55{,}000\Big)$$
Search for a known distribution: To calculate or bound this probability, we need to know the distribution of the sum or, alternatively, to relate it to a quantity whose distribution we know. By using the rules that govern the sums and subtractions of normal variables,
$$\sum_{j=1}^5 W_j\sim N(5\mu_W,\ 5\sigma_W^2)$$
Rewriting the event: We can easily rewrite the event in terms of the standardized version of this normal distribution:
$$P\Big(\sum_{j=1}^5 W_j>55{,}000\Big)=P\Bigg(\frac{\sum_{j=1}^5 W_j-5\mu_W}{\sqrt{5\sigma_W^2}}>\frac{55{,}000-5\mu_W}{\sqrt{5\sigma_W^2}}\Bigg)=P\Big(Z>\frac{55{,}000-50{,}000}{\sqrt{5\cdot 5{,}000^2}}\Big)=P(Z>0.4472)$$

Consulting the table: Finally, it is enough to consult the table of the standard normal distribution Z. On the one hand, the table gives values for the quantiles 0.44 and 0.45, so we could round 0.4472 to the closest value, 0.45, or, more precisely, we can bound the probability. On the other hand, our table provides lower-tail probabilities, so we consider the complementary events. By comparing the corresponding areas under the density, it is easy to deduce that
$$P(Z>0.44)>P(Z>0.4472)>P(Z>0.45)$$
$$1-P(Z\le 0.44)>P(Z>0.4472)>1-P(Z\le 0.45)$$
$$1-0.6700>P(Z>0.4472)>1-0.6736$$
$$0.3300>P(Z>0.4472)>0.3264$$
Then,
$$0.3264<P\Big(\sum_{j=1}^5 W_j>55{,}000\Big)<0.3300$$

Note: It is also possible to relate the total sum to the sample mean,
$$P\Big(\sum_{j=1}^5 W_j>55{,}000\Big)=P\Big(\frac{1}{5}\sum_{j=1}^5 W_j>\frac{55{,}000}{5}\Big)=P(\bar{W}>11{,}000)$$
and use that
$$\bar{W}=\frac{1}{5}\sum_{j=1}^5 W_j\sim N\Big(\mu_W,\frac{\sigma_W^2}{5}\Big)\;\rightarrow\;\frac{\bar{W}-\mu_W}{\sqrt{\sigma_W^2/5}}\sim N(0,1)$$

(b) Translation into the mathematical language: We are asked for the probability $P(W>D+5{,}000)$.
Search for a known distribution: To calculate or bound this probability, we rewrite the event until all random quantities are on the left side of the inequality:
$$P(W>D+5{,}000)=P(W-D>5{,}000)$$
Now we need to know the distribution of W − D or, alternatively, of a quantity involving this difference. By again using the rules that govern the sums and differences of normal variables, it holds that
$$W-D\sim N(\mu_W-\mu_D,\ \sigma_W^2+\sigma_D^2)=N\big(₤10{,}000-₤12{,}000,\ ₤^2 5{,}000^2+₤^2 4{,}000^2\big)$$
Rewriting the event: We can easily express the event in terms of the standardized version of W − D:
$$P(W-D>5{,}000)=P\Bigg(\frac{(W-D)-(\mu_W-\mu_D)}{\sqrt{\sigma_W^2+\sigma_D^2}}>\frac{5{,}000-(\mu_W-\mu_D)}{\sqrt{\sigma_W^2+\sigma_D^2}}\Bigg)=P\Big(Z>\frac{5{,}000-(-2{,}000)}{\sqrt{25\cdot 10^6+16\cdot 10^6}}\Big)=P\Big(Z>\frac{7\cdot 10^3}{\sqrt{41}\cdot 10^3}\Big)=P(Z>1.0932)$$

Consulting the table: We can bound the probability as follows (again comparing areas under the density):
$$P(Z>1.0900)>P(Z>1.0932)>P(Z>1.1000)$$
$$1-P(Z\le 1.0900)>P(Z>1.0932)>1-P(Z\le 1.1000)$$
$$1-0.8621>P(Z>1.0932)>1-0.8643$$
$$0.1379>P(Z>1.0932)>0.1357$$
Then,
$$0.1357<P(W>D+5{,}000)<0.1379$$

(c) Translation into the mathematical language: We are asked for $P\big(\frac{X+1}{2.325}\cdot 10^6<1\cdot 10^6\big)=P\big(\frac{X+1}{2.325}<1\big)$.
Search for a known distribution: We do not know the distribution of (X+1)/2.325, but we know that $X\sim t_{20}$.
Rewriting the event: The event can easily be rewritten in terms of this known distribution:
$$P\Big(\frac{X+1}{2.325}<1\Big)=P(X+1<2.325)=P(X<2.325-1)=P(X<1.325)$$
Consulting the table: Finally, it is enough to consult the table of the t distribution. The quantity 1.325 is in our table of lower-tail probabilities, so
$$P(X<1.325)=0.900$$

Conclusion: For a week, the probability that the five withdrawals will add up to more than ₤55,000 is around 0.33. For a particular day, the probability that withdrawals will exceed deposits by more than ₤5,000 is around 0.13. For a particular month, the probability that profits will be smaller than one million pounds is 0.9, that is, quite high.
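The bounded probabilities can be computed exactly in R (pnorm and pt give lower-tail probabilities):

1 - pnorm(5000 / sqrt(5 * 5000^2))         # (a): about 0.327
1 - pnorm(7000 / sqrt(5000^2 + 4000^2))    # (b): about 0.137
pt(1.325, df = 20)                         # (c): about 0.90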

My notes:

Exercise 6pe-p

To study the mean of a population variable X, μ = E(X), a simple random sample of size n is considered.

Imagine that we do not trust the first and the last data, so we think about using the statistic
$$\tilde{X}=\frac{1}{n-2}\sum_{j=2}^{n-1}X_j=\frac{1}{n-2}\big(X_2+X_3+\cdots+X_{n-1}\big)$$
Calculate the expectation and the variance of this statistic. Calculate the mean square error (MSE) and its limit when n tends to infinity. Study the consistency. Compare the previous error with that of the ordinary sample mean.

Discussion: The statement of this exercise is mathematical. Here we are interested in the mean. The quantity X is dimensionless. We cannot apply the definitions directly: the mean and the variance of $\tilde{X}$ must be written in terms of the mean and the variance of X by applying the basic properties of these measures.

Expectation and variance: The basic properties of the mean and the variance are applied:
$$E(\tilde{X})=E\Big(\frac{1}{n-2}(X_2+X_3+\cdots+X_{n-1})\Big)=\frac{1}{n-2}\big(E(X_2)+\cdots+E(X_{n-1})\big)=\frac{1}{n-2}(n-2)\mu=\mu$$
$$Var(\tilde{X})=Var\Big(\frac{1}{n-2}(X_2+X_3+\cdots+X_{n-1})\Big)=\frac{1}{(n-2)^2}\sum_{j=2}^{n-1}Var(X_j)=\frac{1}{(n-2)^2}(n-2)\sigma^2=\frac{\sigma^2}{n-2}$$
When n increases, that is, when the sample consists of more and more data, the limits are, respectively:
$$\lim_{n\to\infty}E(\tilde{X})=\mu\quad\text{and}\quad\lim_{n\to\infty}Var(\tilde{X})=\lim_{n\to\infty}\frac{\sigma^2}{n-2}=0$$
Consistency: The previous limits show that $\tilde{X}$ has some basic desirable properties: (asymptotic) unbiasedness and evanescent variance. This pair is equivalent to the evanescence of the mean square error (MSE), that is, the consistency in mean of order two—a sufficient, but not necessary, condition for the consistency in probability.

Comparison of errors:
$$MSE(\tilde{X})=\frac{\sigma^2}{n-2}\qquad\qquad MSE(\bar{X})=\frac{\sigma^2}{n}$$
Since σ² appears in the two positive quantities, by looking at the coefficients it is easy to see that
$$MSE(\bar{X})<MSE(\tilde{X})$$

(for n larger than 2). This result is due to the fact that the sample mean uses all the data available, though only

the number of data—not their quality, since all of them are supposed to follow the same distribution—is

considered in calculating the mean square error. In the limit, –2 is negligible. We can plot the coefficients

(they are also the mean square errors when σ=1).

# Grid of values for 'n'
n = seq(from=3, to=10, by=1)
# The two sequences of coefficients (they are also the MSEs when sigma = 1)
coeff1 = 1/(n-2)   # for the trimmed statistic
coeff2 = 1/n       # for the ordinary sample mean
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues))
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')

Conclusion: $\tilde{X}$ is a consistent estimator of μ, so it is appropriate for estimating μ. Nevertheless, when nothing suggests removing data, it is better to keep them in the sample.

Advanced theory: The estimator in the statement is the usual sample mean when the sample has n–2 data

instead of n (leaving out these two data can be seen as a sort of data treatment implemented in the method, not

in the previous analysis of data). When any of the two left out data is not trustable, using this estimator makes

sense; otherwise, it does not exploit the information available efficiently. On the other hand, the sample mean

can be affected by tiny or huge values (outliers). To make the sample mean robust, this estimator is sometimes

considered after ordering the data from the smallest to the largest; if X(j) is the j-th datum in the sample already

reordered:

$$\tilde{X}=\frac{1}{n-2}\sum_{j=2}^{n-1}X_{(j)}=\frac{1}{n-2}\big(X_{(2)}+X_{(3)}+\cdots+X_{(n-1)}\big)$$

This new robust estimator of the population mean μ is called trimmed sample mean, and any number of data

can be left out—not only two.

My notes:

Exercise 7pe-p

A population variable X follows the χ2 distribution with κ degrees of freedom. We consider a statistic T that

uses the information contained in the simple random sample X = (X1, X2,...,Xn). If

$$T(X)=T(X_1,X_2,\dots,X_n)=2\bar{X}-1,$$

calculate its expectation and variance. Calculate the mean square error of T. As an estimator of twice the

mean of the population law, is T a consistent estimator?

Hint: If X follows the χ2 distribution with κ degrees of freedom, μ = E(X) = κ and σ2 = Var(X) = 2κ.

Discussion: Even if a population is mentioned, this statement is mathematical. To calculate the value of

these two properties of the sampling distribution of T, we have to apply the general properties of the

expectation and the variance. The knowledge about the distribution of X will be used in the last steps. This is a

dimensionless quantity. The mean square error is defined in terms of these quantities.

Expectation or mean:
$$E(T(X))=E\Big(2\Big[\frac{1}{n}\sum_{j=1}^n X_j\Big]-1\Big)=\frac{2}{n}\sum_{j=1}^n E(X_j)-1=\frac{2}{n}\,n\,E(X)-1=2\kappa-1$$
(since μ = E(X) = κ).
Variance: By using the independence of the $X_j$ (simple random sample),
$$Var(T(X))=Var\Big(\frac{2}{n}\sum_{j=1}^n X_j-1\Big)=\Big(\frac{2}{n}\Big)^2\sum_{j=1}^n Var(X_j)=\frac{4}{n^2}\,n\,Var(X)=\frac{8\kappa}{n}$$
(since σ² = Var(X) = 2κ).
Mean square error: Since $b(T)=E(T)-2E(X)=2\kappa-1-2\kappa=-1$, then
$$MSE(T)=b(T)^2+Var(T)=1+\frac{8\kappa}{n}\;\xrightarrow[n\to\infty]{}\;1$$

Consistency: Although the variance of T tends to zero when n increases, the bias does not (thus, T is asymptotically biased). Hence, the mean square error does not tend to zero either, and nothing can be said about the consistency in probability this way (although we can say that it is not consistent in mean of order two).

Conclusion: Since the mean square error tends to 1, in general T is not a “good” estimator of 2μ even for

many data.
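A quick Monte Carlo sketch supporting this conclusion (κ, n and the number of replications are arbitrary choices for illustration):

set.seed(6)                       # arbitrary seed, illustration only
kappa = 5; n = 10000; reps = 2000
t.values = replicate(reps, 2 * mean(rchisq(n, df = kappa)) - 1)
mean((t.values - 2 * kappa)^2)    # empirical MSE of T; close to 1 = b(T)^2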

My notes:

Exercise 8pe-p

Given a simple random sample of size n = 2, that is, X = (X1, X2), the following estimators of μ = E(X) are

defined:

$$\hat{\mu}_1=\frac{1}{2}X_1+\frac{1}{2}X_2\qquad\qquad\hat{\mu}_2=\frac{1}{3}X_1+\frac{2}{3}X_2$$

2 2 3 3

1) Calculate their mean square error.

2) Calculate the relative efficiency. Which one would you use to estimate μ?

(Based on an exercise of Statistics for Business and Economics. Newbold, P., W. Carlson and B. Thorne. Pearson-Prentice Hall.)

Discussion: This statement is basically mathematical. The relative efficiency is defined in terms of the mean

square error of the estimators.

$$E(\hat{\mu}_1)=E\Big(\frac{1}{2}X_1+\frac{1}{2}X_2\Big)=\frac{1}{2}E(X)+\frac{1}{2}E(X)=\frac{1}{2}\mu+\frac{1}{2}\mu=\mu$$
$$E(\hat{\mu}_2)=E\Big(\frac{1}{3}X_1+\frac{2}{3}X_2\Big)=\frac{1}{3}E(X)+\frac{2}{3}E(X)=\frac{1}{3}\mu+\frac{2}{3}\mu=\mu$$
$$Var(\hat{\mu}_1)=\Big(\frac{1}{2}\Big)^2Var(X)+\Big(\frac{1}{2}\Big)^2Var(X)=\frac{1}{4}\sigma^2+\frac{1}{4}\sigma^2=\frac{1}{2}\sigma^2$$
$$Var(\hat{\mu}_2)=\Big(\frac{1}{3}\Big)^2Var(X)+\Big(\frac{2}{3}\Big)^2Var(X)=\frac{1}{9}\sigma^2+\frac{4}{9}\sigma^2=\frac{5}{9}\sigma^2$$
Mean square errors:
$$MSE(\hat{\mu}_1)=b(\hat{\mu}_1)^2+Var(\hat{\mu}_1)=[\mu-\mu]^2+\frac{1}{2}\sigma^2=\frac{1}{2}\sigma^2$$
$$MSE(\hat{\mu}_2)=b(\hat{\mu}_2)^2+Var(\hat{\mu}_2)=[\mu-\mu]^2+\frac{5}{9}\sigma^2=\frac{5}{9}\sigma^2$$

Since bias is zero for unbiased estimators, the mean square error is equal to the variance and we will prefer the

estimator with the smallest variance. An easy way of comparing two estimators consists in using the concept

of relative efficiency, which is a simple quotient (take into account which estimator you allocate in the

numerator). When this quotient is over one, the estimator in the denominator has smaller mean square error,

and vice versa. In this case,

$$e(\hat{\mu}_1,\hat{\mu}_2)=\frac{MSE(\hat{\mu}_2)}{MSE(\hat{\mu}_1)}=\frac{\frac{5}{9}\sigma^2}{\frac{1}{2}\sigma^2}=\frac{10}{9}>1\;\rightarrow\;\hat{\mu}_1\text{ is preferred for estimating }\mu.$$

Conclusion: Both estimators are unbiased while the first has smaller variance; then, the first is preferred. We have not mathematically proved that this first estimator minimizes the variance, so we cannot say that it is an efficient estimator.
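A short simulation sketch comparing the two variances (standard normal data and the number of replications are arbitrary choices for illustration):

set.seed(7)                    # arbitrary seed, illustration only
reps = 10000
x1 = rnorm(reps); x2 = rnorm(reps)
mu1 = x1 / 2 + x2 / 2
mu2 = x1 / 3 + 2 * x2 / 3
c(var1 = var(mu1), var2 = var(mu2))   # about 1/2 and 5/9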

My notes:

Exercise 9pe-p

The mean μ = E(X) of any population can be estimated from a simple random sample of size n through the sample mean $\bar{X}$. Prove that:
(a) This estimator is always consistent.
(b) For X normally distributed (normal population), this estimator is efficient.

Discussion: This statement is theoretical. The first section of this exercise needs calculations similar to

those of previous exercises. To prove the efficiency, we have to apply its definition.

(a) Consistency: The expectation of the sample mean is always—for any population—the population mean. Nevertheless, we repeat the calculations:
$$E(\bar{X})=E\Big(\frac{1}{n}\sum_{j=1}^n X_j\Big)=\frac{1}{n}\sum_{j=1}^n E(X_j)=\frac{1}{n}\,n\,E(X)=E(X)=\mu$$
The variance of the sample mean is always—for any population—the population variance divided by n. We repeat the calculations too, using the independence of the $X_j$ (simple random sample):
$$Var(\bar{X})=Var\Big(\frac{1}{n}\sum_{j=1}^n X_j\Big)=\frac{1}{n^2}\sum_{j=1}^n Var(X_j)=\frac{1}{n^2}\,n\,Var(X)=\frac{\sigma^2}{n}$$
The bias is defined as $b(\bar{X})=E(\bar{X})-\mu=0$. We prove the consistency (in probability) by using the sufficient—but not necessary—condition (consistency in mean of order two):
$$\lim_{n\to\infty}MSE(\bar{X})=\lim_{n\to\infty}\big[b(\bar{X})^2+Var(\bar{X})\big]=\lim_{n\to\infty}\Big[0+\frac{\sigma^2}{n}\Big]=0$$
Then, it is consistent in mean of order two and therefore in probability.

(b) Efficiency: It is necessary to prove that the two conditions of the definition are fulfilled:
i. The expectation of $\bar{X}$ is always μ = E(X), that is, $\bar{X}$ is always an unbiased estimator of μ.
ii. $\bar{X}$ has minimum variance, which happens—because of a theoretical result—when $Var(\bar{X})$ attains the Cramér–Rao lower bound
$$\frac{1}{n\cdot E\Big[\Big(\frac{\partial\log f(X;\theta)}{\partial\theta}\Big)^2\Big]}$$
where θ = μ in this case, and f(x;θ) is the probability function of the population law where the nonrandom variable x is substituted by the random variable X (otherwise, it is not possible to talk about expectation, since f(x;θ) is not random when θ is a parameter).

The unbiasedness is proved. On the other hand, we compute the Cramér–Rao lower bound step by step:
(1) Function (with X in place of x):
$$f(X;\mu)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(X-\mu)^2}{2\sigma^2}}$$
(2) Logarithm of the function:
$$\log f(X;\mu)=\log\Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)+\log\big(e^{-\frac{(X-\mu)^2}{2\sigma^2}}\big)=-\log\big(\sqrt{2\pi\sigma^2}\big)-\frac{(X-\mu)^2}{2\sigma^2}$$
(3) Partial derivative of the logarithm of the function:
$$\frac{\partial}{\partial\mu}\log f(X;\mu)=0-\frac{1}{2\sigma^2}\,2(X-\mu)(-1)=\frac{X-\mu}{\sigma^2}$$
(4) Expectation of the squared partial derivative of the logarithm of the function: In this step, we must rewrite the terms so as to make $\sigma^2=Var(X)=E\big((X-E(X))^2\big)=E\big((X-\mu)^2\big)$ appear:
$$E\Bigg[\Big(\frac{\partial\log f(X;\mu)}{\partial\mu}\Big)^2\Bigg]=E\Bigg[\Big(\frac{X-\mu}{\sigma^2}\Big)^2\Bigg]=\frac{1}{\sigma^4}E\big[(X-\mu)^2\big]=\frac{1}{\sigma^4}\,\sigma^2=\frac{1}{\sigma^2}$$
(5) Cramér–Rao lower bound:
$$\frac{1}{n\cdot E\Big[\Big(\frac{\partial\log f(X;\mu)}{\partial\mu}\Big)^2\Big]}=\frac{1}{n\cdot\frac{1}{\sigma^2}}=\frac{\sigma^2}{n}$$
The variance of the estimator, calculated in section (a), attains the bound and hence the estimator has minimum variance. Since both conditions are fulfilled, the efficiency is proved.

Conclusion: We have proved that the sample mean X is always—for any population—a consistent estimator

of the population mean μ. For a normal population, it is also efficient.

Advanced theory: When log[f(x;θ)] is twice differentiable with respect to θ, the Cramér–Rao bound can equivalently be written as
$$\frac{-1}{n\cdot E\Big[\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big]}$$
Concerning the regularity conditions, Wikipedia refers (http://en.wikipedia.org/wiki/Fisher_information) to eq. (2.5.16) of Theory of Point Estimation, Lehmann, E. L. and G. Casella, 1998. Springer. Let us assume that this alternative expression can be applied; then, step (3) would be
$$\frac{\partial^2}{\partial\mu^2}\log f(X;\mu)=\frac{\partial}{\partial\mu}\Big(\frac{X-\mu}{\sigma^2}\Big)=-\frac{1}{\sigma^2}$$
step (4) would be
$$E\Big[\frac{\partial^2\log f(X;\mu)}{\partial\mu^2}\Big]=E\Big[-\frac{1}{\sigma^2}\Big]=-\frac{1}{\sigma^2}$$
and, finally, step (5) would be
$$\frac{-1}{n\cdot E\Big[\frac{\partial^2\log f(X;\mu)}{\partial\mu^2}\Big]}=\frac{-1}{n\cdot\big(-\frac{1}{\sigma^2}\big)}=\frac{\sigma^2}{n}$$
We would have obtained the same result with easier calculations, although the fulfillment of the regularity conditions must be verified previously.
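A numerical illustration (σ, n and the settings are arbitrary choices) that the variance of the sample mean matches the Cramér–Rao bound σ²/n for normal data:

set.seed(8)                    # arbitrary seed, illustration only
sigma = 2; n = 25; reps = 10000
xbars = replicate(reps, mean(rnorm(n, mean = 0, sd = sigma)))
c(empirical = var(xbars), bound = sigma^2 / n)   # both about 0.16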

My notes:

Exercise 10pe-p

Let θ be the parameter of a population random variable X that follows a continuous uniform distribution on the interval [θ–2, θ+1], and let X = (X1,...,Xn) be a simple random sample; then,
(a) Plot the density function of the variable X.
(b) Study the consistency of the sample mean $\bar{X}$ when it is used to estimate the parameter θ.
(c) Study the efficiency of the sample mean $\bar{X}$ when it is used to estimate the parameter θ.
(d) Find an unbiased estimator of θ and study its consistency.
Hint: Use that E(X) = θ – 1/2 and Var(X) = 3/4.

Discussion: This statement is mathematical. We should know the density function of the continuous uniform

distribution, although it could also be deduced from the fact that all possible values have the same probability.

The quantity X is dimensionless.

(a) Density function: For this distribution, all values in the interval have the same probability, so the density function must be a flat curve: $f(x;\theta)=\frac{1}{3}$ for $x\in[\theta-2,\theta+1]$ and 0 otherwise (the plot is a horizontal segment at height 1/3 over the interval; there is a similar figure for any θ).
(b) Consistency: We apply the sufficient condition, the consistency in mean of order two:
$$\lim_{n\to\infty}MSE(\hat{\theta})=0\;\leftrightarrow\;\begin{cases}\lim_{n\to\infty}b(\hat{\theta})=0\\ \lim_{n\to\infty}Var(\hat{\theta})=0\end{cases}$$

(b1) Bias: By applying a property of the sample mean and the information of the statement,
$$E(\bar{X})=E(X)=\theta-\frac{1}{2}\;\rightarrow\;b(\bar{X})=E(\bar{X})-\theta=-\frac{1}{2}\;\rightarrow\;\lim_{n\to\infty}b(\bar{X})=-\frac{1}{2}$$
(It is asymptotically biased.) Since one condition of the pair is not verified, it is not necessary to check the other, and neither the fulfillment of the consistency in probability nor the opposite can be proved this way (though the estimator is not consistent in the mean-square sense).

(c) The definition of efficiency consists of two conditions: unbiasedness and minimum variance (the latter is checked by comparing the variance and the Cramér–Rao bound).
(c1) Unbiasedness: In the previous section it has been proved that $\bar{X}$ is a biased estimator of θ. The first condition does not hold, and hence it is not necessary to check the second one. The conclusion is that $\bar{X}$ is not an efficient estimator of θ.

(d) An unbiased estimator of θ and its consistency
In (b) we found that $b(\bar{X})=-\frac{1}{2}$, which suggests correcting the previous estimator by adding 1/2, that is: $\hat{\theta}=\bar{X}+\frac{1}{2}$. To study its consistency (in probability), we apply the sufficient condition mentioned in section (b) (the consistency in mean of order two).
(d1) Bias: By applying a property of the sample mean and the information of the statement,
$$E(\hat{\theta})=E(\bar{X})+\frac{1}{2}=\theta-\frac{1}{2}+\frac{1}{2}=\theta\;\rightarrow\;b(\hat{\theta})=E(\hat{\theta})-\theta=0\;\rightarrow\;\lim_{n\to\infty}b(\hat{\theta})=0$$
(d2) Variance: By applying a property of the sample mean and the information of the statement,
$$Var(\hat{\theta})=Var\Big(\bar{X}+\frac{1}{2}\Big)=Var(\bar{X})=\frac{Var(X)}{n}=\frac{3}{4n}\;\rightarrow\;\lim_{n\to\infty}Var(\hat{\theta})=0$$
As a conclusion, the mean square error (MSE) tends to zero and hence the proposed estimator $\hat{\theta}=\bar{X}+\frac{1}{2}$ is a consistent—in mean square error and hence in probability—estimator of θ.

Conclusion: We could prove neither the consistency nor the efficiency. Nevertheless, the bias has allowed

us to build an unbiased, consistent estimator of the parameter. The efficiency of this new estimator could be

studied, but it is not required in the statement.
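A quick R check of the bias correction (θ = 5, n = 1000 and the seed are arbitrary choices for illustration):

set.seed(9)                    # arbitrary seed, illustration only
theta = 5; n = 1000
x = runif(n, min = theta - 2, max = theta + 1)
c(xbar = mean(x), corrected = mean(x) + 0.5)   # the corrected value is close to theta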

My notes:

Exercise 11pe-p

A population random quantity X is supposed to follow a geometric distribution. Let X = (X1,...,Xn) be a simple

random sample. By applying the factorization theorem below, find a sufficient statistic T(X) = T(X1,...,Xn) for

the parameter. Give explanations.

Discussion: The factorization theorem can be applied both to prove that a given statistic is sufficient and to find sufficient statistics. On the other hand, for the distribution involved we know that the mass function is $f(x;\eta)=\eta\,(1-\eta)^{x-1}$, $x\in\{1,2,\dots\}$.

Likelihood function:
$$L(X;\eta)=\prod_{j=1}^n f(X_j;\eta)=\eta(1-\eta)^{X_1-1}\cdot\eta(1-\eta)^{X_2-1}\cdots\eta(1-\eta)^{X_n-1}=\eta^n(1-\eta)^{X_1-1+X_2-1+\cdots+X_n-1}=\eta^n(1-\eta)^{(\sum_{j=1}^n X_j)-n}$$

Theorem (factorization): A statistic T(X) is sufficient for the parameter if and only if the likelihood function can be written as $L(X;\eta)=g(T(X);\eta)\cdot h(X)$, where g depends on the sample only through T(X) and h does not depend on the parameter.
We must try allocating each term of the likelihood function:

➔ $\eta^n$ depends only on the parameter, not on the $X_j$. Then, it would be part of g.
➔ $(1-\eta)^{(\sum_{j=1}^n X_j)-n}$ depends on both the parameter and the data $X_j$, and these two kinds of information neither are mixed nor can mathematically be separated. Then, it would be part of g, and the only possible sufficient statistic, if the theorem holds, is $T=\sum_{j=1}^n X_j$.

By considering $g(T(X);\eta)=\eta^n(1-\eta)^{T(X)-n}$ and $h(X)=1$, the theorem holds and hence the statistic $T(X)=\sum_{j=1}^n X_j$ is sufficient for studying η. The idea behind this kind of statistic is that it “summarizes the important information (about the parameter)” contained in the sample. In fact, the statistic T has essentially the same information as any one-to-one transformation of it, particularly the sample mean, since $T(X)=\sum_{j=1}^n X_j=n\bar{X}$.

Conclusion: The factorization theorem has been used to find a sufficient statistic (for the parameter). Since

the total sum appears, we complete the expression to write the result in terms of the sample mean. Both

statistics contain the same information about the parameter of the distribution.

My notes:

For population variables X and Y, simple random samples of sizes nX and nY are taken. Calculate the mean square error of the following estimators, possibly by using proper statistics (involving them) whose sampling distribution is known.
(A) For any populations: $\bar{X}$, $\bar{X}-\bar{Y}$
(B) For Bernoulli populations: $\hat{\eta}$
(C) For normal populations: $V_X^2/V_Y^2$, $s_X^2/s_Y^2$, $S_X^2/S_Y^2$
Suppose that the two populations are independent. Study the consistency in mean of order two and then the consistency in probability.

Discussion: In this exercise, the most important estimators are involved. The basic properties of the expectation and the variance allow us to calculate the mean square error. In most cases, the estimators will be completed so that a proper quantity (with known sampling distribution) appears, whose properties can then be used. Although the estimators of the third section can be used for any X and Y, the calculations for normally distributed variables are easier due to the use of additional information—the knowledge about statistics and their sampling distribution. Thus, the results of this section are based on the normality of the variables X and Y. (Some of the quantities are also valid for any variables.)

The mean square errors are found for static situations, but the idea of limit involves dynamic situations. Statistically speaking, we want to study the behaviour of the estimators when the number of data increases—we can imagine a sequence of schemes where more and more data are added to the samples, that is, with the sample sizes always increasing. (From the mathematical point of view, limits must be studied for any possible way in which the sample sizes tend to infinity.)

Fortunately, the limits of the two-variable functions—sequences, really—that appear in this exercise can easily be solved either by decomposing them into two limits of one-variable functions or by bounding the two-variable sequences. That the limits are studied when nX and nY tend to infinity facilitates the calculations (e.g. a constant like –2 is negligible when it appears in a factor).

(a1) For the sample mean \( \bar{X} \)

It holds that

\[ E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{j=1}^{n} X_j\right) = \frac{1}{n}\sum_{j=1}^{n} E(X_j) = \frac{1}{n}\,n\,E(X) = E(X) = \mu \]

\[ Var(\bar{X}) = Var\!\left(\frac{1}{n}\sum_{j=1}^{n} X_j\right) = \frac{1}{n^2}\sum_{j=1}^{n} Var(X_j) = \frac{1}{n^2}\,n\,Var(X) = \frac{Var(X)}{n} = \frac{\sigma^2}{n} \]

\[ MSE(\bar{X}) = [E(\bar{X}) - \mu]^2 + Var(\bar{X}) = 0 + \frac{\sigma^2}{n} = \frac{\sigma^2}{n} \]

Then,
• The estimator \( \bar{X} \) is unbiased for μ, whatever the sample size.
• The estimator \( \bar{X} \) is consistent (in mean of order two and therefore in probability) for μ, since
\[ \lim_{n\to\infty} MSE(\bar{X}) = \lim_{n\to\infty} \frac{\sigma^2}{n} = 0 \]
It is sufficient and necessary that the sample size tends to infinity—see the mathematical appendix.
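A minimal simulation sketch of this result (the normal distribution and the values of μ and σ are assumed only for illustration):

# Empirical MSE of the sample mean versus the theoretical value sigma^2/n
set.seed(1)
mu = 10; sigma = 2; M = 10000
for (n in c(5, 50, 500)) {
  estimates = replicate(M, mean(rnorm(n, mean=mu, sd=sigma)))
  cat('n =', n, ' empirical MSE =', mean((estimates - mu)^2),
      ' sigma^2/n =', sigma^2/n, '\n')
}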

(a2) For the difference of the sample means \( \bar{X}-\bar{Y} \)

By using the previous results,

\[ E(\bar{X}-\bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_X - \mu_Y \]
\[ Var(\bar{X}-\bar{Y}) = Var(\bar{X}) + Var(\bar{Y}) = \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y} \]
\[ MSE(\bar{X}-\bar{Y}) = [E(\bar{X}-\bar{Y}) - (\mu_X-\mu_Y)]^2 + Var(\bar{X}-\bar{Y}) = \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y} \]

The mean square error of \( \bar{X}-\bar{Y} \) is the sum of the mean square errors of \( \bar{X} \) and \( \bar{Y} \). On the other hand,
• The estimator \( \bar{X}-\bar{Y} \) is unbiased for μX–μY, whatever the sample sizes.
• The estimator \( \bar{X}-\bar{Y} \) is consistent (in the mean-square sense and hence in probability) for μX–μY, as
\[ \lim_{n_X,\,n_Y\to\infty} MSE(\bar{X}-\bar{Y}) = \lim_{n_X,\,n_Y\to\infty} \left( \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y} \right) = 0 \]
It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

(b1) For the sample proportion \( \hat{\eta} \)

Since \( \hat{\eta} \) is a particular case of the sample mean (of Bernoulli variables),

\[ E(\hat{\eta}) = \mu = \eta \qquad Var(\hat{\eta}) = \frac{\sigma^2}{n} = \frac{\eta(1-\eta)}{n} \qquad MSE(\hat{\eta}) = [E(\hat{\eta}) - \eta]^2 + Var(\hat{\eta}) = \frac{\eta(1-\eta)}{n} \]

Then,
• The estimator \( \hat{\eta} \) is unbiased whatever the sample size.
• It is consistent for η, it being sufficient and necessary that the sample size tends to infinity.

(b2) For the difference of the sample proportions \( \hat{\eta}_X - \hat{\eta}_Y \)

Again, this is a particular case of the difference between sample means,

\[ E(\hat{\eta}_X - \hat{\eta}_Y) = \mu_X - \mu_Y = \eta_X - \eta_Y \]
\[ Var(\hat{\eta}_X - \hat{\eta}_Y) = \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y} = \frac{\eta_X(1-\eta_X)}{n_X} + \frac{\eta_Y(1-\eta_Y)}{n_Y} \]
\[ MSE(\hat{\eta}_X - \hat{\eta}_Y) = \frac{\eta_X(1-\eta_X)}{n_X} + \frac{\eta_Y(1-\eta_Y)}{n_Y} \]

Then,
• The estimator \( \hat{\eta}_X - \hat{\eta}_Y \) is unbiased for ηX–ηY, whatever the sample sizes.
• It is also consistent for ηX–ηY, it being sufficient and necessary that the two sample sizes tend to infinity.

(c1) For the variance of the sample \( V^2 \)

By using \( T = \dfrac{nV^2}{\sigma^2} \sim \chi^2_n \) and the properties of the chi-square distribution,

\[ E(V^2) = E\!\left(\frac{\sigma^2}{n}\frac{nV^2}{\sigma^2}\right) = \frac{\sigma^2}{n} E\!\left(\frac{nV^2}{\sigma^2}\right) = \frac{\sigma^2}{n}\,n = \sigma^2 \]
\[ Var(V^2) = Var\!\left(\frac{\sigma^2}{n}\frac{nV^2}{\sigma^2}\right) = \frac{\sigma^4}{n^2} Var\!\left(\frac{nV^2}{\sigma^2}\right) = \frac{\sigma^4}{n^2}\,2n = \frac{2\sigma^4}{n} \]
\[ MSE(V^2) = [E(V^2) - \sigma^2]^2 + Var(V^2) = \frac{2\sigma^4}{n} \]

Then,
• The estimator \( V^2 \) is unbiased for σ², whatever the sample size.
• The estimator \( V^2 \) is consistent (in mean of order two and therefore in probability) for σ², since
\[ \lim_{n\to\infty} MSE(V^2) = \lim_{n\to\infty} \frac{2\sigma^4}{n} = 0 \]
It is sufficient and necessary that the sample size tends to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the variance. (For the expectation, it is easy to find in the literature direct calculations that lead to the same value for any variables—not necessarily normal.)
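A quick simulation check of E(V²) = σ² and MSE(V²) = 2σ⁴/n (a normal population with known mean is assumed; the parameter values are illustrative):

# V^2 uses the known population mean mu
set.seed(1)
mu = 0; sigma = 2; n = 20; M = 10000
V2 = replicate(M, mean((rnorm(n, mean=mu, sd=sigma) - mu)^2))
c(mean(V2), sigma^2)                        # unbiasedness
c(mean((V2 - sigma^2)^2), 2*sigma^4/n)      # empirical vs theoretical MSE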

(c2) For the quotient between the variances of the samples \( V_X^2 / V_Y^2 \)

By using \( T = \dfrac{V_X^2}{V_Y^2}\dfrac{\sigma_Y^2}{\sigma_X^2} \sim F_{n_X,\,n_Y} \) and the properties of the F distribution,

\[ E\!\left(\frac{V_X^2}{V_Y^2}\right) = \frac{\sigma_X^2}{\sigma_Y^2}\, E\!\left(\frac{V_X^2\,\sigma_Y^2}{V_Y^2\,\sigma_X^2}\right) = \frac{n_Y}{n_Y-2}\,\frac{\sigma_X^2}{\sigma_Y^2} \quad (n_Y > 2) \]

\[ Var\!\left(\frac{V_X^2}{V_Y^2}\right) = \frac{\sigma_X^4}{\sigma_Y^4}\, Var\!\left(\frac{V_X^2\,\sigma_Y^2}{V_Y^2\,\sigma_X^2}\right) = \frac{2\,n_Y^2\,(n_X+n_Y-2)}{n_X\,(n_Y-2)^2\,(n_Y-4)}\,\frac{\sigma_X^4}{\sigma_Y^4} \quad (n_Y > 4) \]

\[ MSE\!\left(\frac{V_X^2}{V_Y^2}\right) = \left[E\!\left(\frac{V_X^2}{V_Y^2}\right) - \frac{\sigma_X^2}{\sigma_Y^2}\right]^2 + Var\!\left(\frac{V_X^2}{V_Y^2}\right) = \left\{\left[\frac{n_Y}{n_Y-2}-1\right]^2 + \frac{2\,n_Y^2\,(n_X+n_Y-2)}{n_X\,(n_Y-2)^2\,(n_Y-4)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} \quad (n_Y > 4) \]

Then,
• The estimator \( V_X^2/V_Y^2 \) is biased for σX²/σY², but it is asymptotically unbiased since
\[ \lim_{n_X,\,n_Y\to\infty} E\!\left(\frac{V_X^2}{V_Y^2}\right) = \lim_{n_Y\to\infty} \frac{n_Y}{n_Y-2}\,\frac{\sigma_X^2}{\sigma_Y^2} = \frac{\sigma_X^2}{\sigma_Y^2}\,\lim_{n_Y\to\infty} \frac{1}{1-\frac{2}{n_Y}} = \frac{\sigma_X^2}{\sigma_Y^2} \]
Mathematically, only nY must tend to infinity. Statistically, since populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator \( V_X^2/V_Y^2 \) is consistent (in mean of order two and therefore in probability) for σX²/σY², since it is asymptotically unbiased and
\[ \lim_{n_X,\,n_Y\to\infty} Var\!\left(\frac{V_X^2}{V_Y^2}\right) = \frac{\sigma_X^4}{\sigma_Y^4}\,\lim_{n_X,\,n_Y\to\infty} \frac{2\,n_Y^2\,(n_X+n_Y-2)}{n_X\,(n_Y-2)^2\,(n_Y-4)} = \frac{\sigma_X^4}{\sigma_Y^4}\,\lim_{n_X,\,n_Y\to\infty} \frac{2\left(\frac{1}{n_Y}+\frac{1}{n_X}-\frac{2}{n_X n_Y}\right)}{\left(1-\frac{2}{n_Y}\right)^2\left(1-\frac{4}{n_Y}\right)} = 0 \]
The numerator tends to zero if and only if both sample sizes tend to infinity. In short, it is sufficient and necessary that the two sample sizes tend to infinity—this limit has been studied in the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the quotient of variances.
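A simulation sketch of the mean square error of the quotient (standard normal populations with known means equal to zero are assumed, so the target quotient σX²/σY² equals 1; the sizes are illustrative):

# Empirical vs theoretical MSE of V_X^2 / V_Y^2
set.seed(1)
nX = 12; nY = 12; M = 20000
ratio = replicate(M, mean(rnorm(nX)^2) / mean(rnorm(nY)^2))
theo = (nY/(nY-2) - 1)^2 + 2*nY^2*(nX+nY-2) / (nX*(nY-2)^2*(nY-4))
c(mean((ratio - 1)^2), theo)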

(c3) For the sample variance \( s^2 \)

By using \( T = \dfrac{n s^2}{\sigma^2} \sim \chi^2_{n-1} \) and the properties of the chi-square distribution,

\[ E(s^2) = E\!\left(\frac{\sigma^2}{n}\frac{n s^2}{\sigma^2}\right) = \frac{\sigma^2}{n}\,(n-1) = \frac{n-1}{n}\,\sigma^2 \]
\[ Var(s^2) = \frac{\sigma^4}{n^2}\, Var\!\left(\frac{n s^2}{\sigma^2}\right) = \frac{\sigma^4}{n^2}\,2(n-1) = \frac{2(n-1)}{n^2}\,\sigma^4 \]
\[ MSE(s^2) = [E(s^2)-\sigma^2]^2 + Var(s^2) = \left[\frac{n-1}{n}\sigma^2-\sigma^2\right]^2 + \frac{2(n-1)}{n^2}\sigma^4 = \left(\frac{2}{n}-\frac{1}{n^2}\right)\sigma^4 \]

Then,
• The estimator \( s^2 \) is biased but asymptotically unbiased (for σ²), since
\[ \lim_{n\to\infty} E(s^2) = \sigma^2 \lim_{n\to\infty} \frac{n-1}{n} = \sigma^2 \lim_{n\to\infty}\left(1-\frac{1}{n}\right) = \sigma^2 \]
It is sufficient and necessary that the sample size tends to infinity—see the mathematical appendix.
• The estimator \( s^2 \) is consistent (in mean of order two and therefore in probability) for σ², since
\[ \lim_{n\to\infty} MSE(s^2) = \lim_{n\to\infty}\left(\frac{2}{n}-\frac{1}{n^2}\right)\sigma^4 = 0 \]
It is sufficient and necessary that the sample size tends to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the variance. (For the expectation, it is easy to find in the literature direct calculations that lead to the same value for any variables—not necessarily normal.)

(c4) For the quotient between the sample variances \( s_X^2 / s_Y^2 \)

By using \( T = \dfrac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2} = \dfrac{n_X(n_Y-1)}{n_Y(n_X-1)}\,\dfrac{s_X^2\,\sigma_Y^2}{s_Y^2\,\sigma_X^2} \sim F_{n_X-1,\,n_Y-1} \) and the properties of the F distribution,

\[ E\!\left(\frac{s_X^2}{s_Y^2}\right) = \frac{n_Y(n_X-1)\,\sigma_X^2}{n_X(n_Y-1)\,\sigma_Y^2}\, E\!\left(\frac{n_X(n_Y-1)\,s_X^2\,\sigma_Y^2}{n_Y(n_X-1)\,s_Y^2\,\sigma_X^2}\right) = \frac{n_Y(n_X-1)\,\sigma_X^2}{n_X(n_Y-1)\,\sigma_Y^2}\,\frac{n_Y-1}{(n_Y-1)-2} = \frac{n_Y(n_X-1)}{n_X(n_Y-3)}\,\frac{\sigma_X^2}{\sigma_Y^2} \quad (n_Y-1>2) \]

\[ Var\!\left(\frac{s_X^2}{s_Y^2}\right) = \frac{n_Y^2(n_X-1)^2\,\sigma_X^4}{n_X^2(n_Y-1)^2\,\sigma_Y^4}\,\frac{2(n_Y-1)^2(n_X-1+n_Y-1-2)}{(n_X-1)(n_Y-1-2)^2(n_Y-1-4)} = \frac{2\,n_Y^2\,(n_X-1)(n_X+n_Y-4)}{n_X^2\,(n_Y-3)^2\,(n_Y-5)}\,\frac{\sigma_X^4}{\sigma_Y^4} \quad (n_Y-1>4) \]

\[ MSE\!\left(\frac{s_X^2}{s_Y^2}\right) = \left\{\left[\frac{n_Y(n_X-1)}{n_X(n_Y-3)}-1\right]^2 + \frac{2\,n_Y^2\,(n_X-1)(n_X+n_Y-4)}{n_X^2\,(n_Y-3)^2\,(n_Y-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} \quad (n_Y-1>4) \]

Then,
• The estimator \( s_X^2/s_Y^2 \) is biased for σX²/σY², but it is asymptotically unbiased since
\[ \lim_{n_X,\,n_Y\to\infty} E\!\left(\frac{s_X^2}{s_Y^2}\right) = \frac{\sigma_X^2}{\sigma_Y^2}\,\lim_{n_X,\,n_Y\to\infty} \frac{n_X n_Y - n_Y}{n_X n_Y - 3 n_X} = \frac{\sigma_X^2}{\sigma_Y^2}\,\lim_{n_X,\,n_Y\to\infty} \frac{1-\frac{1}{n_X}}{1-\frac{3}{n_Y}} = \frac{\sigma_X^2}{\sigma_Y^2} \]
It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator \( s_X^2/s_Y^2 \) is consistent (in mean of order two and therefore in probability) for σX²/σY², as it is asymptotically unbiased and
\[ \lim_{n_X,\,n_Y\to\infty} Var\!\left(\frac{s_X^2}{s_Y^2}\right) = \frac{\sigma_X^4}{\sigma_Y^4}\,\lim_{n_X,\,n_Y\to\infty} \frac{2\left(\frac{1}{n_Y}-\frac{1}{n_X n_Y}\right)\left(\frac{1}{n_Y}+\frac{1}{n_X}-\frac{4}{n_X n_Y}\right)}{\left(1-\frac{3}{n_Y}\right)^2\left(1-\frac{5}{n_Y}\right)} = 0 \]
It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the quotient of variances.

(c5) For the sample quasivariance \( S^2 \)

By using \( T = \dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \) and the properties of the chi-square distribution,

\[ E(S^2) = E\!\left(\frac{\sigma^2}{n-1}\frac{(n-1)S^2}{\sigma^2}\right) = \frac{\sigma^2}{n-1}\,(n-1) = \sigma^2 \]
\[ Var(S^2) = \frac{\sigma^4}{(n-1)^2}\, Var\!\left(\frac{(n-1)S^2}{\sigma^2}\right) = \frac{\sigma^4}{(n-1)^2}\,2(n-1) = \frac{2\sigma^4}{n-1} \]
\[ MSE(S^2) = [E(S^2)-\sigma^2]^2 + Var(S^2) = \frac{2\sigma^4}{n-1} \]

Then,
• The estimator \( S^2 \) is unbiased for σ², whatever the sample size.
• The estimator \( S^2 \) is consistent (in mean of order two and therefore in probability) for σ², since
\[ \lim_{n\to\infty} MSE(S^2) = \lim_{n\to\infty} \frac{2\sigma^4}{n-1} = 0 \]
It is sufficient and necessary that the sample size tends to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the variance. (For the expectation, it is easy to find in the literature direct calculations that lead to the same value for any variables—not necessarily normal.)
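A quick simulation check (in R, the function var() computes precisely the quasivariance S²; the parameter values are illustrative):

# E(S^2) = sigma^2 and MSE(S^2) = 2*sigma^4/(n-1)
set.seed(1)
sigma = 2; n = 20; M = 10000
S2 = replicate(M, var(rnorm(n, mean=0, sd=sigma)))
c(mean(S2), sigma^2)                              # unbiasedness
c(mean((S2 - sigma^2)^2), 2*sigma^4/(n-1))        # empirical vs theoretical MSE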

(c6) For the quotient between the sample quasivariances \( S_X^2 / S_Y^2 \)

By using \( T = \dfrac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2} \sim F_{n_X-1,\,n_Y-1} \) and the properties of the F distribution,

\[ E\!\left(\frac{S_X^2}{S_Y^2}\right) = \frac{\sigma_X^2}{\sigma_Y^2}\, E\!\left(\frac{S_X^2\,\sigma_Y^2}{S_Y^2\,\sigma_X^2}\right) = \frac{\sigma_X^2}{\sigma_Y^2}\,\frac{n_Y-1}{(n_Y-1)-2} = \frac{n_Y-1}{n_Y-3}\,\frac{\sigma_X^2}{\sigma_Y^2} \quad (n_Y-1>2) \]

\[ Var\!\left(\frac{S_X^2}{S_Y^2}\right) = \frac{\sigma_X^4}{\sigma_Y^4}\,\frac{2(n_Y-1)^2(n_X-1+n_Y-1-2)}{(n_X-1)(n_Y-1-2)^2(n_Y-1-4)} = \frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\,\frac{\sigma_X^4}{\sigma_Y^4} \quad (n_Y-1>4) \]

\[ MSE\!\left(\frac{S_X^2}{S_Y^2}\right) = \left\{\left[\frac{n_Y-1}{n_Y-3}-1\right]^2 + \frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} \quad (n_Y-1>4) \]

Then,
• The estimator \( S_X^2/S_Y^2 \) is biased for σX²/σY², but it is asymptotically unbiased since
\[ \lim_{n_X,\,n_Y\to\infty} E\!\left(\frac{S_X^2}{S_Y^2}\right) = \frac{\sigma_X^2}{\sigma_Y^2}\,\lim_{n_Y\to\infty} \frac{1-\frac{1}{n_Y}}{1-\frac{3}{n_Y}} = \frac{\sigma_X^2}{\sigma_Y^2} \]
Mathematically, only nY must tend to infinity. Statistically, since populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator \( S_X^2/S_Y^2 \) is consistent (in mean of order two and therefore in probability) for σX²/σY², as it is asymptotically unbiased and
\[ \lim_{n_X,\,n_Y\to\infty} Var\!\left(\frac{S_X^2}{S_Y^2}\right) = \frac{\sigma_X^4}{\sigma_Y^4}\,\lim_{n_X,\,n_Y\to\infty} \frac{2\left(1-\frac{1}{n_Y}\right)^2\left(\frac{1}{n_Y}+\frac{1}{n_X}-\frac{4}{n_X n_Y}\right)}{\left(1-\frac{1}{n_X}\right)\left(1-\frac{3}{n_Y}\right)^2\left(1-\frac{5}{n_Y}\right)} = 0 \]
It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the quotient of variances.

Conclusion: For the most important estimators, the mean square error has been calculated either directly (in a few cases) or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved. Some limits of functions of two variables arose. These kinds of limits are not trivial in general, as there is an infinite number of ways for the sizes to tend to infinity. Nevertheless, those appearing here could be calculated directly or after doing some simple algebraic transformations (multiplying and dividing by the proper quantity, as they were limits of sequences of the indeterminate form infinity-over-infinity).

On the other hand, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.

My notes:

In the following situations, compare the mean square error of the following estimators when simple random samples, taken from normal populations, are considered:

(A) \( V^2 \), \( s^2 \) and \( S^2 \)

(B) \( \dfrac{V_X^2}{V_Y^2} \), \( \dfrac{s_X^2}{s_Y^2} \) and \( \dfrac{S_X^2}{S_Y^2} \) (Consider only the case nX = n = nY)

In the second section, suppose that the populations are independent.

Discussion: The expressions of the mean square error of these estimators have been calculated in another exercise. Comparing the coefficients is easy in some cases, but sequences may sometimes cross one another and the comparisons must be done analytically—by solving equalities and inequalities—or graphically. We plot the sequences (lines between dots are used to facilitate the identification).

The mean square errors were found for static situations, but the idea of limit involves dynamic situations. By using a computer, it is also possible to study—either analytically or graphically—the asymptotic behaviour of the estimators (but it is not a "whole mathematical proof"). It is worth noticing that the formulas and results of this exercise are valid for normal populations (because of the theoretical results on which they are based); in the general case, the expressions for the mean square error of these estimators are more complex. For two populations, there is an infinite number of mathematical ways for the two sample sizes to tend to infinity; the case nX = n = nY will be considered.

(A) For \( V^2 \), \( s^2 \) and \( S^2 \)

The expressions of their mean square error are:

\[ MSE(V^2) = \frac{2}{n}\,\sigma^4 \qquad MSE(s^2) = \left(\frac{2}{n}-\frac{1}{n^2}\right)\sigma^4 \qquad MSE(S^2) = \frac{2}{n-1}\,\sigma^4 \]

Since σ⁴ appears in all these positive quantities, by looking at the coefficients it is easy to see that, for n larger than two,

\[ MSE(s^2) < MSE(V^2) < MSE(S^2) \]

That is, the sequences—indexed by n—do not cross one another. We can plot the coefficients (they are also the mean square errors when σ = 1).

# Grid of values for 'n'

n = seq(from=2,to=10,by=1)

# The three sequences of coefficients

coeff1 = 2/n

coeff2 = 2/n - 1/(n^2)

coeff3 = 2/(n-1)

# The plot

allValues = c(coeff1, coeff2, coeff3)

yLim = c(min(allValues), max(allValues));

x11(); par(mfcol=c(1,4))

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')

plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')

plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')

points(n, coeff2, type='b')

points(n, coeff3, type='b')

Asymptotically, the three estimators behave similarly, since \( \dfrac{2}{n}-\dfrac{1}{n^2} \approx \dfrac{2}{n} \approx \dfrac{2}{n-1} \) for large n.

(B) For \( \dfrac{V_X^2}{V_Y^2} \), \( \dfrac{s_X^2}{s_Y^2} \) and \( \dfrac{S_X^2}{S_Y^2} \)

The expressions of their mean square error, when nX = n = nY, are:

\[ MSE\!\left(\frac{V_X^2}{V_Y^2}\right) = \left\{\left[\frac{n}{n-2}-1\right]^2 + \frac{2n^2(n+n-2)}{n(n-2)^2(n-4)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \left\{\left[\frac{n}{n-2}-1\right]^2 + \frac{4n(n-1)}{(n-2)^2(n-4)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} \quad (n>4) \]

\[ MSE\!\left(\frac{s_X^2}{s_Y^2}\right) = \left\{\left[\frac{n(n-1)}{n(n-3)}-1\right]^2 + \frac{2n^2(n-1)(n+n-4)}{n^2(n-3)^2(n-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \left\{\left[\frac{n-1}{n-3}-1\right]^2 + \frac{4(n-1)(n-2)}{(n-3)^2(n-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} \quad (n-1>4) \]

\[ MSE\!\left(\frac{S_X^2}{S_Y^2}\right) = \left\{\left[\frac{n-1}{n-3}-1\right]^2 + \frac{2(n-1)^2(n+n-4)}{(n-1)(n-3)^2(n-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \left\{\left[\frac{n-1}{n-3}-1\right]^2 + \frac{4(n-1)(n-2)}{(n-3)^2(n-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} \quad (n-1>4) \]

For equal sample sizes, the mean square error of the last two estimators is the same (but they may behave differently under other criteria different to the mean square error, e.g. even their expectation). We can plot the coefficients (they are also the mean square errors when σX = σY), for n > 5.

# Grid of values for 'n'

n = seq(from=6,to=15,by=1)

# The three sequences of coefficients

coeff1 = ((n/(n-2))-1)^2 + (4*n*(n-1))/(((n-2)^2)*(n-4))

coeff2 = (((n-1)/(n-3))-1)^2 + (4*(n-1)*(n-2))/(((n-3)^2)*(n-5))

coeff3 = coeff2

# The plot

allValues = c(coeff1, coeff2, coeff3)

yLim = c(min(allValues), max(allValues));

x11(); par(mfcol=c(1,4))

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')

plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')

plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')

points(n, coeff2, type='b')

points(n, coeff3, type='b')

This shows that, for normal populations and samples of sizes nX = n = nY, it seems that

\[ MSE\!\left(\frac{V_X^2}{V_Y^2}\right) \overset{?}{\leq} MSE\!\left(\frac{s_X^2}{s_Y^2}\right) = MSE\!\left(\frac{S_X^2}{S_Y^2}\right) \]

and the sequences do not cross one another. Really, a figure is not a mathematical proof, so we do the following calculations:

\[ \left(\frac{n}{n-2}-1\right)^2 + \frac{4n(n-1)}{(n-2)^2(n-4)} \overset{?}{\leq} \left(\frac{n-1}{n-3}-1\right)^2 + \frac{4(n-1)(n-2)}{(n-3)^2(n-5)} \]

\[ \frac{4(n-4)+4n(n-1)}{(n-2)^2(n-4)} \overset{?}{\leq} \frac{4(n-5)+4(n-1)(n-2)}{(n-3)^2(n-5)} \;\leftrightarrow\; \frac{n-4+n^2-n}{(n-2)^2(n-4)} \overset{?}{\leq} \frac{n^2-2n-3}{(n-3)^2(n-5)} \]

\[ \frac{(n-2)(n+2)}{(n-2)^2(n-4)} \overset{?}{\leq} \frac{(n-3)(n+1)}{(n-3)^2(n-5)} \;\leftrightarrow\; (n+2)(n-3)(n-5) \overset{?}{\leq} (n+1)(n-2)(n-4) \]

\[ n^3-6n^2-n+30 \overset{?}{\leq} n^3-5n^2+2n+8 \;\leftrightarrow\; 22 \leq n(n+3) \]

This inequality is true for n ≥ 4, since it is true for n = 4 and the right-hand side increases with n. Thus, we can guarantee that, for n > 5,

\[ MSE\!\left(\frac{V_X^2}{V_Y^2}\right) \leq MSE\!\left(\frac{s_X^2}{s_Y^2}\right) = MSE\!\left(\frac{S_X^2}{S_Y^2}\right) \]

Asymptotically, by using equivalent infinites,

\[ \lim_{n_X,\,n_Y\to\infty} MSE\!\left(\frac{V_X^2}{V_Y^2}\right) = \lim_{n_X,\,n_Y\to\infty} \left\{\left[\frac{n_Y}{n_Y-2}-1\right]^2 + \frac{2n_Y^2(n_X+n_Y-2)}{n_X(n_Y-2)^2(n_Y-4)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \lim_{n_X,\,n_Y\to\infty} \left\{\left[\frac{n_Y}{n_Y}-1\right]^2 + \frac{2n_Y^2(n_X+n_Y)}{n_X n_Y^2\, n_Y}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \lim_{n_X,\,n_Y\to\infty} \frac{2(n_X+n_Y)}{n_X n_Y}\,\frac{\sigma_X^4}{\sigma_Y^4} = 0 \]

\[ \lim_{n_X,\,n_Y\to\infty} MSE\!\left(\frac{s_X^2}{s_Y^2}\right) = \lim_{n_X,\,n_Y\to\infty} \left\{\left[\frac{n_Y(n_X-1)}{n_X(n_Y-3)}-1\right]^2 + \frac{2n_Y^2(n_X-1)(n_X+n_Y-4)}{n_X^2(n_Y-3)^2(n_Y-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \lim_{n_X,\,n_Y\to\infty} \left\{\left[\frac{n_Y n_X}{n_X n_Y}-1\right]^2 + \frac{2n_Y^2 n_X(n_X+n_Y)}{n_X^2 n_Y^2\, n_Y}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \lim_{n_X,\,n_Y\to\infty} \frac{2(n_X+n_Y)}{n_X n_Y}\,\frac{\sigma_X^4}{\sigma_Y^4} = 0 \]

\[ \lim_{n_X,\,n_Y\to\infty} MSE\!\left(\frac{S_X^2}{S_Y^2}\right) = \lim_{n_X,\,n_Y\to\infty} \left\{\left[\frac{n_Y-1}{n_Y-3}-1\right]^2 + \frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\right\}\frac{\sigma_X^4}{\sigma_Y^4} = \lim_{n_X,\,n_Y\to\infty} \frac{2(n_X+n_Y)}{n_X n_Y}\,\frac{\sigma_X^4}{\sigma_Y^4} = 0 \]

The three estimators behave similarly, since the quantitative behaviour of their mean square errors is characterized by the same limit, namely:

\[ \lim_{n_X,\,n_Y\to\infty} \frac{2(n_X+n_Y)}{n_X n_Y}\,\frac{\sigma_X^4}{\sigma_Y^4} = 0. \]

(It is worth noticing that this common asymptotic behaviour arises when the limits are solved by using equivalent infinites—it cannot be seen when the limits are solved in other ways.)
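Both the proved inequality and the common asymptotic behaviour can be checked numerically with a few R lines (a check, of course, not a proof); here σX = σY and nX = n = nY are assumed:

# Exact coefficients of the two MSE sequences and the common approximation 4/n
n = c(6, 10, 100, 1000)
coeffV = (n/(n-2) - 1)^2 + 4*n*(n-1) / ((n-2)^2 * (n-4))
coeffS = ((n-1)/(n-3) - 1)^2 + 4*(n-1)*(n-2) / ((n-3)^2 * (n-5))
all(coeffV <= coeffS)                   # TRUE: the inequality holds (n > 5)
cbind(n, coeffV, coeffS, approx = 4/n)  # the columns get closer as n grows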

Conclusion: The expressions of the mean square error of these estimators allow us to compare them, to study their consistency and even their rate of convergence. We have proved the following result:

Proposition
(1) For a normal population, when n > 2,
\[ MSE(s^2) < MSE(V^2) < MSE(S^2) \]
(2) For two independent normal populations, when nX = n = nY (n > 5),
\[ MSE\!\left(\frac{V_X^2}{V_Y^2}\right) \leq MSE\!\left(\frac{s_X^2}{s_Y^2}\right) = MSE\!\left(\frac{S_X^2}{S_Y^2}\right) \]

Note: For one population, \( V^2 \) has higher error than \( s^2 \), even if the information about the value of the population mean μ is used by the former while it is estimated in the other two estimators. For two populations, the information about the value of the two population means μX and μY is used in the first quotient while they must be estimated in the other two estimators. Either way, the population mean in itself does not play an important role in studying the variance, which is based on relative distances, but any estimation using the same data reduces the amount of information available and the degrees of freedom by one unit.

Again, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.

My notes:

Exercise 14pe-p (*)

For population variables X and Y, simple random samples of size nX and nY are taken. Calculate the mean square error of the following estimators (use results of previous exercises).

(A) For two independent Bernoulli populations: \( \tfrac{1}{2}(\hat{\eta}_X + \hat{\eta}_Y) \) and \( \hat{\eta}_p \)

(B) For two independent normal populations: \( \tfrac{1}{2}(V_X^2+V_Y^2) \), \( \tfrac{1}{2}(s_X^2+s_Y^2) \), \( \tfrac{1}{2}(S_X^2+S_Y^2) \), \( V_p^2 \), \( s_p^2 \) and \( S_p^2 \)

where

\[ \hat{\eta}_p = \frac{n_X \hat{\eta}_X + n_Y \hat{\eta}_Y}{n_X+n_Y} \qquad V_p^2 = \frac{n_X V_X^2 + n_Y V_Y^2}{n_X+n_Y} \qquad s_p^2 = \frac{n_X s_X^2 + n_Y s_Y^2}{n_X+n_Y} \qquad S_p^2 = \frac{(n_X-1) S_X^2 + (n_Y-1) S_Y^2}{n_X+n_Y-2} \]

(Similarly for Y.) Try to compare the mean square errors. Study the consistency in mean of order two and then the consistency in probability.

Discussion: The expressions of the mean square error of the basic estimators involved in this exercise have been calculated in another exercise, and they will be used in calculating the mean square errors of the new estimators. The errors are calculated for static situations, but limits are studied in dynamic situations. Comparing the coefficients is easy in some cases, but sequences can sometimes cross one another and the comparisons must be done analytically—by solving equalities and inequalities—or graphically. By using a computer, it is also possible to study—either analytically or graphically—the behaviour of the estimators. The results obtained here are valid for two independent Bernoulli populations and two independent normal populations, respectively. On the other hand, we must find the expression of the error for the new estimators based on semisums:

\[ MSE\!\left(\tfrac{1}{2}(\hat{\theta}_1+\hat{\theta}_2)\right) = \left[E\!\left(\tfrac{1}{2}(\hat{\theta}_1+\hat{\theta}_2)\right)-\theta\right]^2 + Var\!\left(\tfrac{1}{2}(\hat{\theta}_1+\hat{\theta}_2)\right) \]

and, for unbiased, independent estimators,

\[ MSE\!\left(\tfrac{1}{2}(\hat{\theta}_1+\hat{\theta}_2)\right) = 0 + \tfrac{1}{4}\left[Var(\hat{\theta}_1)+Var(\hat{\theta}_2)\right] \]

(A) For Bernoulli populations: \( \tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y) \) and \( \hat{\eta}_p \)

(a1) For the semisum of the sample proportions \( \tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y) \)

By using previous results and that μ = η and σ² = η(1−η),

\[ E\!\left(\tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y)\right) = \tfrac{1}{2}[E(\hat{\eta}_X)+E(\hat{\eta}_Y)] = \tfrac{1}{2}(\eta_X+\eta_Y) = \eta \]

\[ Var\!\left(\tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y)\right) = \tfrac{1}{4}[Var(\hat{\eta}_X)+Var(\hat{\eta}_Y)] = \tfrac{1}{4}\left(\frac{\eta_X(1-\eta_X)}{n_X}+\frac{\eta_Y(1-\eta_Y)}{n_Y}\right) = \frac{1}{4}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\eta(1-\eta) \]

\[ MSE\!\left(\tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y)\right) = \left[\tfrac{1}{2}(\eta_X+\eta_Y)-\eta\right]^2 + \frac{1}{4}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\eta(1-\eta) = \frac{1}{4}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\eta(1-\eta) \]

(here ηX = η = ηY, since both populations share the same parameter). Then,
• The estimator \( \tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y) \) is unbiased for η, whatever the sample sizes.
• The estimator \( \tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y) \) is consistent (in the mean-square sense and therefore in probability) for η, since
\[ \lim_{n_X,\,n_Y\to\infty} MSE\!\left(\tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y)\right) = \lim_{n_X,\,n_Y\to\infty} \frac{1}{4}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\eta(1-\eta) = 0 \]
It is sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.

(a2) For the pooled sample proportion \( \hat{\eta}_p \)

Firstly, we write \( \hat{\eta}_p = \frac{1}{n_X+n_Y}(n_X\hat{\eta}_X + n_Y\hat{\eta}_Y) \). Now, by using previous results,

\[ E(\hat{\eta}_p) = \frac{1}{n_X+n_Y}[n_X E(\hat{\eta}_X) + n_Y E(\hat{\eta}_Y)] = \frac{n_X\eta_X+n_Y\eta_Y}{n_X+n_Y} = \eta \]

\[ Var(\hat{\eta}_p) = \frac{1}{(n_X+n_Y)^2}[n_X^2\, Var(\hat{\eta}_X) + n_Y^2\, Var(\hat{\eta}_Y)] = \frac{n_X\eta_X(1-\eta_X)+n_Y\eta_Y(1-\eta_Y)}{(n_X+n_Y)^2} = \frac{\eta(1-\eta)}{n_X+n_Y} \]

\[ MSE(\hat{\eta}_p) = \left[\frac{n_X\eta_X+n_Y\eta_Y}{n_X+n_Y}-\eta\right]^2 + \frac{n_X\eta_X(1-\eta_X)+n_Y\eta_Y(1-\eta_Y)}{(n_X+n_Y)^2} = \frac{\eta(1-\eta)}{n_X+n_Y} \]

Then,
• The estimator \( \hat{\eta}_p \) is unbiased for η, whatever the sample sizes.
• The estimator \( \hat{\eta}_p \) is consistent (in mean of order two and therefore in probability) for η, since
\[ \lim_{n_X,\,n_Y\to\infty} MSE(\hat{\eta}_p) = \lim_{n_X,\,n_Y\to\infty} \frac{\eta(1-\eta)}{n_X+n_Y} = 0 \]
If the mean square error is compared with those of the two populations, we can see that the new denominator is the sum of both sample sizes. Again, it is worth noticing that it is sufficient and necessary that at least one sample size tends to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(a3) Comparison of \( \tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y) \) and \( \hat{\eta}_p \)

Case nX = n = nY

\[ MSE\!\left(\tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y)\right) = \frac{\eta(1-\eta)}{2n} = MSE(\hat{\eta}_p) \]

In fact, by looking at the expressions of the estimators themselves, \( \hat{\eta}_p = \tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y) \) in this case.

General case

The expressions of their mean square error are (the sample proportion is unbiased):

\[ MSE\!\left(\tfrac{1}{2}(\hat{\eta}_X+\hat{\eta}_Y)\right) = \frac{1}{4}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\eta(1-\eta) \qquad MSE(\hat{\eta}_p) = \frac{\eta(1-\eta)}{n_X+n_Y} \]

Then

\[ \frac{1}{4}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right) \leq \frac{1}{n_X+n_Y} \;\leftrightarrow\; (n_X+n_Y)\frac{n_X+n_Y}{n_X n_Y} \leq 4 \;\leftrightarrow\; n_X^2+n_Y^2+2n_Xn_Y \leq 4n_Xn_Y \;\leftrightarrow\; (n_X-n_Y)^2 \leq 0 \]

Then, the pooled estimator is always better than or equal to the semisum of the sample proportions. Both estimators have the same mean square error—their behaviour may be different under criteria other than the mean square error—only when nX = nY. Thus, (nX–nY)² can be seen as a measure of the convenience of using the pooled sample proportion, since it shows how different the two errors are. The inequality also shows a symmetric situation, in the sense that it does not matter which sample size is bigger: the measure depends on the difference. We have proved the following result:

Proposition
For two independent Bernoulli populations with the same parameter, the pooled sample proportion has smaller or equal mean square error than the semisum of the sample proportions. Besides, both are equivalent only when the sample sizes are equal.

We can plot the coefficients (they are also the mean square errors when η(1–η) = 1) for a sequence of sample sizes, indexed by k, such that nY(k) = 2nX(k), for example (but this is only one possible way for the sample sizes to tend to infinity):

# Grid of values for 'n'

c = 2

n = seq(from=2,to=10,by=1)

# The sequences of coefficients

coeff1 = (1 + 1/c)/(4*n)

coeff2 = 1/((1+c)*n)

# The plot

allValues = c(coeff1, coeff2)

yLim = c(min(allValues), max(allValues));

x11(); par(mfcol=c(1,3))

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')

plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')

points(n, coeff2, type='b')

The reader can repeat this figure by using values closer to and farther from 1 than c(k) = 2.

(b1) For the semisum of the variances of the samples \( \tfrac{1}{2}(V_X^2+V_Y^2) \)

By using previous results,

\[ E\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \tfrac{1}{2}[E(V_X^2)+E(V_Y^2)] = \tfrac{1}{2}(\sigma_X^2+\sigma_Y^2) = \sigma^2 \]

\[ Var\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \tfrac{1}{4}[Var(V_X^2)+Var(V_Y^2)] = \tfrac{1}{4}\left(\frac{2\sigma_X^4}{n_X}+\frac{2\sigma_Y^4}{n_Y}\right) = \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\sigma^4 \]

\[ MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \left[\tfrac{1}{2}(\sigma_X^2+\sigma_Y^2)-\sigma^2\right]^2 + \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\sigma^4 = \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\sigma^4 \]

Then,
• The estimator \( \tfrac{1}{2}(V_X^2+V_Y^2) \) is unbiased for σ², whatever the sample sizes.
• The estimator \( \tfrac{1}{2}(V_X^2+V_Y^2) \) is consistent (in the mean-square sense and therefore in probability) for σ², since
\[ \lim_{n_X,\,n_Y\to\infty} MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \lim_{n_X,\,n_Y\to\infty} \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\sigma^4 = 0 \]
It is sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.

(b2) For the semisum of the sample variances \( \tfrac{1}{2}(s_X^2+s_Y^2) \)

By using previous results,

\[ E\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \tfrac{1}{2}[E(s_X^2)+E(s_Y^2)] = \tfrac{1}{2}\left[\frac{n_X-1}{n_X}\sigma_X^2+\frac{n_Y-1}{n_Y}\sigma_Y^2\right] = \tfrac{1}{2}\left(\frac{n_X-1}{n_X}+\frac{n_Y-1}{n_Y}\right)\sigma^2 \]

\[ Var\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \tfrac{1}{4}[Var(s_X^2)+Var(s_Y^2)] = \tfrac{1}{4}\left[\frac{2(n_X-1)}{n_X^2}\sigma_X^4+\frac{2(n_Y-1)}{n_Y^2}\sigma_Y^4\right] = \tfrac{1}{2}\left[\frac{n_X-1}{n_X^2}+\frac{n_Y-1}{n_Y^2}\right]\sigma^4 \]

\[ MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \left[\tfrac{1}{2}\left(\frac{n_X-1}{n_X}+\frac{n_Y-1}{n_Y}\right)\sigma^2-\sigma^2\right]^2 + \tfrac{1}{2}\left[\frac{n_X-1}{n_X^2}+\frac{n_Y-1}{n_Y^2}\right]\sigma^4 \]
\[ = \frac{2n_Xn_Y(n_X+n_Y)-(n_X-n_Y)^2}{4n_X^2n_Y^2}\,\sigma^4 = \left[\frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right) - \left(\frac{n_X-n_Y}{2n_Xn_Y}\right)^2\right]\sigma^4 \]

Then,
• The estimator \( \tfrac{1}{2}(s_X^2+s_Y^2) \) is biased but asymptotically unbiased for σ², since
\[ \lim_{n_X,\,n_Y\to\infty} E\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \sigma^2 \lim_{n_X,\,n_Y\to\infty} \tfrac{1}{2}\left(\frac{n_X-1}{n_X}+\frac{n_Y-1}{n_Y}\right) = \sigma^2\,\tfrac{1}{2}(1+1) = \sigma^2 \]
It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.
• The estimator \( \tfrac{1}{2}(s_X^2+s_Y^2) \) is consistent (in the mean-square sense and therefore in probability) for σ², because it is asymptotically unbiased and
\[ \lim_{n_X,\,n_Y\to\infty} Var\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \sigma^4 \lim_{n_X,\,n_Y\to\infty} \tfrac{1}{2}\left(\frac{n_X-1}{n_X^2}+\frac{n_Y-1}{n_Y^2}\right) = 0 \]
Again, it is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

(b3) For the semisum of the sample quasivariances \( \tfrac{1}{2}(S_X^2+S_Y^2) \)

By using previous results,

\[ E\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \tfrac{1}{2}[E(S_X^2)+E(S_Y^2)] = \tfrac{1}{2}(\sigma_X^2+\sigma_Y^2) = \sigma^2 \]

\[ Var\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \tfrac{1}{4}\left[\frac{2\sigma_X^4}{n_X-1}+\frac{2\sigma_Y^4}{n_Y-1}\right] = \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right)\sigma^4 \]

\[ MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \left[\tfrac{1}{2}(\sigma_X^2+\sigma_Y^2)-\sigma^2\right]^2 + \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right)\sigma^4 = \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right)\sigma^4 \]

Then,
• The estimator \( \tfrac{1}{2}(S_X^2+S_Y^2) \) is unbiased for σ², whatever the sample sizes.
• The estimator \( \tfrac{1}{2}(S_X^2+S_Y^2) \) is consistent (in the mean-square sense and therefore in probability) for σ², since
\[ \lim_{n_X,\,n_Y\to\infty} MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \lim_{n_X,\,n_Y\to\infty} \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right)\sigma^4 = 0 \]
It is sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.

(b4) For the pooled variance of the samples \( V_p^2 \)

We can write \( V_p^2 = \frac{n_X V_X^2 + n_Y V_Y^2}{n_X+n_Y} = \frac{1}{n_X+n_Y}(n_X V_X^2 + n_Y V_Y^2) \). By using previous results,

\[ E(V_p^2) = \frac{n_X E(V_X^2)+n_Y E(V_Y^2)}{n_X+n_Y} = \frac{n_X\sigma_X^2+n_Y\sigma_Y^2}{n_X+n_Y} = \sigma^2 \]

\[ Var(V_p^2) = \frac{n_X^2\, Var(V_X^2)+n_Y^2\, Var(V_Y^2)}{(n_X+n_Y)^2} = 2\,\frac{n_X\sigma_X^4+n_Y\sigma_Y^4}{(n_X+n_Y)^2} = \frac{2}{n_X+n_Y}\,\sigma^4 \]

\[ MSE(V_p^2) = \left[\frac{n_X\sigma_X^2+n_Y\sigma_Y^2}{n_X+n_Y}-\sigma^2\right]^2 + 2\,\frac{n_X\sigma_X^4+n_Y\sigma_Y^4}{(n_X+n_Y)^2} = \frac{2}{n_X+n_Y}\,\sigma^4 \]

Then,
• The estimator \( V_p^2 \) is unbiased for σ², whatever the sample sizes.
• The estimator \( V_p^2 \) is consistent (in mean of order two and therefore in probability) for σ², since
\[ \lim_{n_X,\,n_Y\to\infty} MSE(V_p^2) = \sigma^4 \lim_{n_X,\,n_Y\to\infty} \frac{2}{n_X+n_Y} = 0 \]
It is worth noticing that it is sufficient and necessary that at least one sample size tends to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(b5) For the pooled sample variance \( s_p^2 \)

We can write \( s_p^2 = \frac{n_X s_X^2 + n_Y s_Y^2}{n_X+n_Y} = \frac{1}{n_X+n_Y}(n_X s_X^2 + n_Y s_Y^2) \). By using previous results,

\[ E(s_p^2) = \frac{n_X E(s_X^2)+n_Y E(s_Y^2)}{n_X+n_Y} = \frac{(n_X-1)\sigma_X^2+(n_Y-1)\sigma_Y^2}{n_X+n_Y} = \frac{n_X+n_Y-2}{n_X+n_Y}\,\sigma^2 \]

\[ Var(s_p^2) = \frac{n_X^2\, Var(s_X^2)+n_Y^2\, Var(s_Y^2)}{(n_X+n_Y)^2} = 2\,\frac{(n_X-1)\sigma_X^4+(n_Y-1)\sigma_Y^4}{(n_X+n_Y)^2} = 2\,\frac{n_X+n_Y-2}{(n_X+n_Y)^2}\,\sigma^4 \]

\[ MSE(s_p^2) = \left[\frac{n_X+n_Y-2}{n_X+n_Y}\,\sigma^2-\sigma^2\right]^2 + 2\,\frac{n_X+n_Y-2}{(n_X+n_Y)^2}\,\sigma^4 = \frac{4+2(n_X+n_Y-2)}{(n_X+n_Y)^2}\,\sigma^4 = \frac{2}{n_X+n_Y}\,\sigma^4 \]

Then,
• The estimator \( s_p^2 \) is biased for σ², but asymptotically unbiased:
\[ \lim_{n_X,\,n_Y\to\infty} E(s_p^2) = \sigma^2 \lim_{n_X,\,n_Y\to\infty} \frac{n_X+n_Y-2}{n_X+n_Y} = \sigma^2 \]
(The calculation above for the mean suggests that a –2 in the denominator of the definition would provide an unbiased estimator—see the estimator in the following section.)
• The estimator \( s_p^2 \) is consistent (in mean of order two and therefore in probability) for σ², since
\[ \lim_{n_X,\,n_Y\to\infty} MSE(s_p^2) = \sigma^4 \lim_{n_X,\,n_Y\to\infty} \frac{2}{n_X+n_Y} = 0 \]
It is worth noticing that it is sufficient and necessary that at least one sample size tends to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(b6) For the pooled sample quasivariance \( S_p^2 \)

We can write \( S_p^2 = \frac{(n_X-1)S_X^2+(n_Y-1)S_Y^2}{n_X+n_Y-2} = \frac{1}{n_X+n_Y-2}\left[(n_X-1)S_X^2+(n_Y-1)S_Y^2\right] \). By using previous results,

\[ E(S_p^2) = \frac{(n_X-1)E(S_X^2)+(n_Y-1)E(S_Y^2)}{n_X+n_Y-2} = \frac{(n_X-1)\sigma_X^2+(n_Y-1)\sigma_Y^2}{n_X+n_Y-2} = \sigma^2 \]

\[ Var(S_p^2) = \frac{(n_X-1)^2\, Var(S_X^2)+(n_Y-1)^2\, Var(S_Y^2)}{(n_X+n_Y-2)^2} = 2\,\frac{(n_X-1)\sigma_X^4+(n_Y-1)\sigma_Y^4}{(n_X+n_Y-2)^2} = \frac{2}{n_X+n_Y-2}\,\sigma^4 \]

\[ MSE(S_p^2) = \left[\frac{(n_X-1)\sigma_X^2+(n_Y-1)\sigma_Y^2}{n_X+n_Y-2}-\sigma^2\right]^2 + 2\,\frac{(n_X-1)\sigma_X^4+(n_Y-1)\sigma_Y^4}{(n_X+n_Y-2)^2} = \frac{2}{n_X+n_Y-2}\,\sigma^4 \]

Then,
• The estimator \( S_p^2 \) is unbiased for σ², whatever the sample sizes.
• The estimator \( S_p^2 \) is consistent (in mean of order two and therefore in probability) for σ², since
\[ \lim_{n_X,\,n_Y\to\infty} MSE(S_p^2) = \lim_{n_X,\,n_Y\to\infty} \frac{2\sigma^4}{n_X+n_Y-2} = 0 \]
It is worth noticing that it is sufficient and necessary that at least one sample size tends to infinity, but not both. In this case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(b7) Comparison of \( \tfrac{1}{2}(V_X^2+V_Y^2) \), \( \tfrac{1}{2}(s_X^2+s_Y^2) \), \( \tfrac{1}{2}(S_X^2+S_Y^2) \), \( V_p^2 \), \( s_p^2 \) and \( S_p^2 \)

Case nX = n = nY

\[ MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \tfrac{1}{2}\left(\tfrac{2}{n}\right)\sigma^4 = \tfrac{1}{n}\,\sigma^4 \]
\[ MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \left(\tfrac{1}{2}\cdot\tfrac{2}{n} - 0\right)\sigma^4 = \tfrac{1}{n}\,\sigma^4 \]
\[ MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \tfrac{1}{2}\left(\tfrac{2}{n-1}\right)\sigma^4 = \tfrac{1}{n-1}\,\sigma^4 \]
\[ MSE(V_p^2) = \tfrac{2}{2n}\,\sigma^4 = \tfrac{1}{n}\,\sigma^4 \qquad MSE(s_p^2) = \tfrac{2}{2n}\,\sigma^4 = \tfrac{1}{n}\,\sigma^4 \qquad MSE(S_p^2) = \tfrac{2}{2n-2}\,\sigma^4 = \tfrac{1}{n-1}\,\sigma^4 \]

Since σ⁴ appears in all these positive quantities, by looking at the coefficients it is easy to see the relation

\[ MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2) = MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) \]

(For individual estimators, the order \( MSE(s^2) < MSE(V^2) < MSE(S^2) \) was obtained in another exercise.) This relation has been obtained for the case nX = n = nY and (independent) normal populations. We can plot the coefficients (they are also the mean square errors when σ = 1).

# Grid of values for 'n'

n = seq(from=10,to=20,by=1)

# The three sequences of coefficients

coeff1 = 1/n

coeff2 = coeff1

coeff3 = 1/(n-1)

coeff4 = coeff1

coeff5 = coeff1

coeff6 = coeff3

# The plot

allValues = c(coeff1, coeff2, coeff3, coeff4, coeff5, coeff6)

yLim = c(min(allValues), max(allValues));

x11(); par(mfcol=c(1,7))

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')

plot(n, coeff2, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='l')

plot(n, coeff3, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')

plot(n, coeff4, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 4', type='l')

plot(n, coeff5, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 5', type='l')

plot(n, coeff6, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='Coefficients 6', type='b')

plot(n, coeff1, xlim=c(min(n),max(n)), ylim = yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')

points(n, coeff2, type='l')

points(n, coeff3, type='b')

points(n, coeff4, type='l')

points(n, coeff5, type='l')

points(n, coeff6, type='b')

By using this code, it is also possible to study—either analytically or graphically—the asymptotic behaviour of these estimators (but only with simulated data of some particular distributions for X, which would not be a "whole mathematical proof"). It is worth noticing that the formulas obtained in this exercise are valid for normal populations (because of the theoretical results on which they are based). In the general case, the expressions for the mean square error of these estimators are more complex.

General case

The expressions of their mean square error are:

\[ MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\sigma^4 \]
\[ MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \left[\frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)-\left(\frac{n_X-n_Y}{2n_Xn_Y}\right)^2\right]\sigma^4 \]
\[ MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right)\sigma^4 \]
\[ MSE(V_p^2) = \frac{2}{n_X+n_Y}\,\sigma^4 \qquad MSE(s_p^2) = \frac{2}{n_X+n_Y}\,\sigma^4 \qquad MSE(S_p^2) = \frac{2}{n_X+n_Y-2}\,\sigma^4 \]

We have simplified the expressions as much as possible, and now a general comparison can be tackled by doing some pairwise comparisons. Firstly, by looking at the coefficients,

\[ MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) \leq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) < MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) \]

and the equality is reached only when nX = n = nY. On the other hand,

\[ MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2) \]

Now, we would like to allocate \( V_p^2 \), \( s_p^2 \) and \( S_p^2 \) in the first chain. To compare \( V_p^2 \) and \( s_p^2 \) with \( \tfrac{1}{2}(V_X^2+V_Y^2) \),

\[ \frac{2}{n_X+n_Y} \leq \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right) \;\leftrightarrow\; 4n_Xn_Y \leq (n_X+n_Y)^2 \;\leftrightarrow\; 4n_Xn_Y \leq n_X^2+n_Y^2+2n_Xn_Y \;\leftrightarrow\; 0 \leq (n_X-n_Y)^2 \]

That is,

\[ MSE(V_p^2) = MSE(s_p^2) \leq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \]

and the equality is attained only when nX = n = nY. To compare \( S_p^2 \) with \( \tfrac{1}{2}(V_X^2+V_Y^2) \),

\[ \frac{2}{n_X+n_Y-2} \leq \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right) \;\leftrightarrow\; 4n_Xn_Y \leq (n_X+n_Y)(n_X+n_Y-2) \;\leftrightarrow\; 2(n_X+n_Y) \leq (n_X-n_Y)^2 \]

That is,

\[ MSE(S_p^2) \leq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \text{ if } 2(n_X+n_Y) \leq (n_X-n_Y)^2 \qquad MSE(S_p^2) \geq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \text{ if } 2(n_X+n_Y) \geq (n_X-n_Y)^2 \]

Intuitively, in the region around the bisector line the difference of the sample sizes is small, and therefore the pooled sample variance is worse; on the other hand, in the complementary region the square of the difference is bigger than twice the sum of the sizes, and, therefore, the pooled sample variance is better. The frontier seems to be parabolic. Some work can be done to find the frontier determined by the equality and the two regions on both sides—this is done in the mathematical appendix. Now, we write some "brute-force" lines for the computer to plot the points in the frontier:

N = 100

vectorNx = vector(mode="numeric", length=0)

vectorNy = vector(mode="numeric", length=0)

for (nx in 1:N)

{

for (ny in 1:N)

{

if (2*(nx+ny)==(nx-ny)^2) { vectorNx = c(vectorNx, nx); vectorNy = c(vectorNy, ny) }

}

}

plot(vectorNx, vectorNy, xlim = c(0,N+1), ylim = c(0,N+1), xlab='nx', ylab='ny', main=paste('Frontier of the region'), type='p')

To compare \( S_p^2 \) with \( \tfrac{1}{2}(S_X^2+S_Y^2) \),

\[ \frac{2}{n_X+n_Y-2} \leq \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right) \;\leftrightarrow\; 4(n_X-1)(n_Y-1) \leq (n_X+n_Y-2)^2 \;\leftrightarrow\; 0 \leq \left[(n_X-1)-(n_Y-1)\right]^2 \]

which always holds. That is,

\[ MSE(S_p^2) \leq MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) \]

and the equality is attained only if the sample sizes are the same.

We can summarize all the results of this section in the following statement:

Proposition
For two independent normal populations, when nX = n = nY,
\[ MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2) = MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) \]
In the general case, when the sample sizes can be different,
(a) \( MSE\!\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) \leq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \)
(b) \( MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) < MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) \)
(c) \( MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2) \)
(d) \( MSE(V_p^2) = MSE(s_p^2) \leq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \)
(e) \( MSE(S_p^2) \leq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \) if \( 2(n_X+n_Y) \leq (n_X-n_Y)^2 \), and \( MSE(S_p^2) \geq MSE\!\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) \) if \( 2(n_X+n_Y) \geq (n_X-n_Y)^2 \)
(f) \( MSE(S_p^2) \leq MSE\!\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) \)

Note: I have tried to compare \( V_p^2 \), \( s_p^2 \) and \( S_p^2 \) with \( \tfrac{1}{2}(s_X^2+s_Y^2) \), but I have not managed to solve the inequalities. On the other hand, these relations show that, for two independent normal populations, there exist estimators with smaller mean square error than the pooled sample variance \( S_p^2 \). Nevertheless, there are other criteria different to the mean square error, and, additionally, the pooled sample variance has also some advantages (see the advanced theory at the end).

Conclusion: For some pooled estimators, the mean square errors have been calculated either directly or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved. By using theoretical expressions for the mean square error, the behaviour of the pooled estimators for the proportion (Bernoulli populations) and for the variance (normal populations) has been compared with "natural" estimators consisting of the semisum of the individual estimators for each population.

Once more, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:
(a) The error can be measured by using a quantity different to the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.

Advanced Theory: The previous estimators can be written as a sum \( \omega_X\hat{\theta}_X + \omega_Y\hat{\theta}_Y \) with weights \( \omega = (\omega_X, \omega_Y) \) such that \( \omega_X + \omega_Y = 1 \). As regards the interpretation of the weights, they can be seen as a measure of the importance that each estimator is given in the global formula. For some weights that depend on the sample sizes, it is possible for one estimator to acquire all the importance when the sample sizes increase in the proper way. On the contrary, when the weights are constant the possible effect—positive or negative—due to each estimator is bounded. The errors were calculated when the data are representative of the population, but if the quality of one sample is always small, the other sample cannot do the whole estimation if the weights do not depend on the sizes.
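A small simulation sketch of this remark for the proportion (all parameter values are illustrative): with very different sample sizes, the size-based weights (the pooled estimator) clearly beat the constant weights 1/2 (the semisum).

# Constant weights 1/2 versus size-based weights, with nX << nY
set.seed(1)
eta = 0.4; nX = 10; nY = 1000; M = 10000
semisum = replicate(M, 0.5*mean(rbinom(nX,1,eta)) + 0.5*mean(rbinom(nY,1,eta)))
w = nX/(nX+nY)                      # size-based weight: the pooled estimator
pooled  = replicate(M, w*mean(rbinom(nX,1,eta)) + (1-w)*mean(rbinom(nY,1,eta)))
c(mean((semisum-eta)^2), mean((pooled-eta)^2), eta*(1-eta)/(nX+nY))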

My notes:

Exercise 1pe

We have reliable information that suggests the probability distribution with density function

\[ f(x;\theta) = \frac{2}{\theta^2}(\theta - x), \quad x \in [0, \theta], \]

as a model for studying the population quantity X. Let X = (X1,...,Xn) be a simple random sample.
(a) Apply the method of the moments to find an estimator \( \hat{\theta}_M \) of the parameter θ.
(b) Calculate the bias and the mean square error of the estimator \( \hat{\theta}_M \).
(c) Study the consistency of \( \hat{\theta}_M \).
(d) Try to apply the maximum likelihood method to find an estimator \( \hat{\theta}_{ML} \) of the parameter θ.
(e) Obtain estimators of the mean and the variance.
Hint: Use that μ = E(X) = θ/3 and E(X²) = θ²/6.

Discussion: This statement is mathematical. The assumptions are supposed to have been checked. We are

given the density function of the distribution of X (a dimensionless quantity). The exercise involves two

methods of estimation, the definition of the bias, the mean square error and the sufficient condition for the

consistency (in probability). The two first population moments are provided.

Note: If E(X) and E(X²) had not been given in the statement, they could have been calculated by applying the definition and solving the integrals,

\[ E(X) = \int_{-\infty}^{+\infty} x\, f(x;\theta)\,dx = \int_0^\theta x\,\frac{2(\theta-x)}{\theta^2}\,dx = \frac{2}{\theta^2}\left(\int_0^\theta x\theta\,dx - \int_0^\theta x^2\,dx\right) = \frac{2}{\theta^2}\left(\theta\left[\frac{x^2}{2}\right]_0^\theta - \left[\frac{x^3}{3}\right]_0^\theta\right) = \frac{2}{\theta^2}\left(\frac{\theta^3}{2}-\frac{\theta^3}{3}\right) = \frac{2}{\theta^2}\,\frac{\theta^3}{6} = \frac{\theta}{3} \]

\[ E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^\theta x^2\,\frac{2(\theta-x)}{\theta^2}\,dx = \frac{2}{\theta^2}\left(\int_0^\theta x^2\theta\,dx - \int_0^\theta x^3\,dx\right) = \frac{2}{\theta^2}\left(\frac{\theta^4}{3}-\frac{\theta^4}{4}\right) = \frac{2}{\theta^2}\,\frac{\theta^4}{12} = \frac{\theta^2}{6} \]

(a) Method of the moments

(a1) Population and sample moments: The distribution has only one parameter, so one equation suffices. By using the information in the hint:

\[ \mu_1(\theta) = \frac{\theta}{3} \qquad\text{and}\qquad m_1(x_1,x_2,\ldots,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \]

(a2) System of equations:

\[ \mu_1(\theta) = m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \frac{\theta}{3} = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\rightarrow\; \theta_0 = \frac{3}{n}\sum_{j=1}^n x_j = 3\bar{x} \]

(a3) The estimator: It is obtained after substituting the lower-case letters xj by upper-case letters Xj:

\[ \hat{\theta}_M = \frac{3}{n}\sum_{j=1}^n X_j = 3\bar{X} \]

(b1) Bias:

\[ E(\hat{\theta}_M) = E(3\bar{X}) = 3E(\bar{X}) = 3E(X) = 3\,\frac{\theta}{3} = \theta \]

where we have used the properties of the expectation, a property of the sample mean and the information in the statement. Now

\[ b(\hat{\theta}_M) = E(\hat{\theta}_M) - \theta = \theta - \theta = 0 \]

and we can see that the estimator is unbiased (we could see it also from the calculation of the expectation).

(b2) Mean square error: We do not usually apply the definition \( MSE(\hat{\theta}_M) = E((\hat{\theta}_M-\theta)^2) \) but a property derived from it, for which we need to calculate the variance:

\[ Var(\hat{\theta}_M) = Var(3\bar{X}) = 3^2\,\frac{Var(X)}{n} = \frac{9}{n}\left[E(X^2)-E(X)^2\right] = \frac{9}{n}\left[\frac{\theta^2}{6}-\left(\frac{\theta}{3}\right)^2\right] = \frac{9}{n}\,\theta^2\left(\frac{1}{6}-\frac{1}{9}\right) = \frac{9}{n}\,\frac{\theta^2}{18} = \frac{\theta^2}{2n} \]

where we have used the properties of the variance, a property of the sample mean and the information in the statement. Then

\[ MSE(\hat{\theta}_M) = b(\hat{\theta}_M)^2 + Var(\hat{\theta}_M) = 0 + \frac{\theta^2}{2n} = \frac{\theta^2}{2n} \]

(c) Consistency

We try applying the sufficient condition \( \lim_{n\to\infty} MSE(\hat{\theta}) = 0 \) or, equivalently, \( \lim_{n\to\infty} b(\hat{\theta}) = 0 \) together with \( \lim_{n\to\infty} Var(\hat{\theta}) = 0 \). Since

\[ \lim_{n\to\infty} MSE(\hat{\theta}_M) = \lim_{n\to\infty} \frac{\theta^2}{2n} = 0, \]

it is concluded that the estimator is consistent (in mean of order two and hence in probability) for estimating θ.
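The consistency can be illustrated by simulation. Sampling below uses the inverse of the distribution function of this model, F(x) = 1 − (1 − x/θ)², which gives X = θ(1 − √U) for U uniform on (0,1); the parameter value is illustrative.

# Monte Carlo sketch for theta_hat = 3*mean(X)
set.seed(1)
theta = 4; M = 10000
for (n in c(10, 100, 1000)) {
  estimates = replicate(M, 3*mean(theta*(1 - sqrt(runif(n)))))
  cat('n =', n, ' empirical MSE =', mean((estimates - theta)^2),
      ' theta^2/(2n) =', theta^2/(2*n), '\n')
}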

(d1) Likelihood function: The density function is \( f(x;\theta) = \frac{2}{\theta^2}(\theta-x) \) for \( 0 \leq x \leq \theta \), so

\[ L(x_1,x_2,\ldots,x_n;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{2^n}{\theta^{2n}}\prod_{j=1}^n(\theta-x_j) \]

(d2) Optimization problem: First, we try to find the maximum by applying the technique based on the derivatives. The logarithm function is applied,

\[ \log[L(x_1,x_2,\ldots,x_n;\theta)] = n\log(2) - 2n\log(\theta) + \sum_{j=1}^n \log(\theta-x_j) \]

\[ 0 = \frac{d}{d\theta}\log[L(x_1,x_2,\ldots,x_n;\theta)] = 0 - \frac{2n}{\theta} + \sum_{j=1}^n \frac{1}{\theta-x_j} \;\rightarrow\; ? \]

Then, we realize that global minima and maxima cannot always be found through the derivatives (only if they are also local extremes). In this case, it is difficult even to know whether L monotonically decreases with θ or not, since part of L increases and another decreases—which one changes more? We study the j-th element of the product, that is, f(xj;θ). Its first derivative is

\[ f'(x_j;\theta) = 2\,\frac{\theta^2 - (\theta-x_j)\,2\theta}{\theta^4} = 2\,\frac{\theta(2x_j-\theta)}{\theta^4}, \quad\text{so it has an extreme at } \theta = 2x_j \]

This implies that L is the product of n terms, each having its extreme at a different point, so L does not change monotonically with the parameter θ.

(e) Estimators of the mean and the variance

To obtain estimators of the mean, we take into account that \( \mu = E(X) = \frac{\theta}{3} \) and apply the plug-in principle:

\[ \hat{\mu}_M = \frac{\hat{\theta}_M}{3} = \frac{3\bar{X}}{3} = \bar{X} \qquad\qquad \hat{\mu}_{ML} = \frac{\hat{\theta}_{ML}}{3} = \frac{\max_j\{X_j\}}{3} \]

To obtain estimators of the variance, since \( \sigma^2 = Var(X) = \frac{\theta^2}{18} \),

\[ \hat{\sigma}^2_M = \frac{\hat{\theta}_M^2}{18} = \frac{(3\bar{X})^2}{18} = \frac{\bar{X}^2}{2} \qquad\qquad \hat{\sigma}^2_{ML} = \;? \]

Conclusion: The method of the moments is applied to obtain an estimator that is unbiased for any sample size n and has good behaviour when used with large n (many data). The maximum likelihood method cannot be applied, since it is difficult to optimize the likelihood function by considering either its expression or the behaviour of the density function.

My notes:

Exercise 2pe

Let X be a random variable following the Rayleigh distribution, whose probability function is

\[ f(x;\theta) = \frac{x}{\theta^2}\,e^{-\frac{x^2}{2\theta^2}}, \quad x \geq 0, \;(\theta > 0) \]

such that \( E(X) = \theta\sqrt{\frac{\pi}{2}} \) and \( Var(X) = \frac{4-\pi}{2}\,\theta^2 \). Let X = (X1,...,Xn) be a simple random sample.
(a) Apply the method of the moments to find an estimator \( \hat{\theta}_M \) of the parameter θ.
(b) For \( \hat{\theta}_M \), calculate the bias and the mean square error, and study the consistency.
(c) Apply the maximum likelihood method to find an estimator \( \hat{\theta}_{ML} \) of the parameter θ.

In probability theory and statistics, the Rayleigh distribution is a continuous probability distribution for positive-valued random variables.

A Rayleigh distribution is often observed when the overall magnitude of a vector is related to its directional components. One example

where the Rayleigh distribution naturally arises is when wind velocity is analyzed into its orthogonal 2-dimensional vector components.

Assuming that the magnitudes of each component are uncorrelated, normally distributed with equal variance, and zero mean, then the

overall wind speed (vector magnitude) will be characterized by a Rayleigh distribution. A second example of the distribution arises in the

case of random complex numbers whose real and imaginary components are i.i.d. (independently and identically distributed) Gaussian

with equal variance and zero mean. In that case, the absolute value of the complex number is Rayleigh-distributed. The distribution is

named after Lord Rayleigh.

Discussion: This is a theoretical exercise where we must apply two methods of point estimation. The basic

properties must be considered for the estimator obtained through the first method.

Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions "of different type"):

\[ E(X) = \int_0^{+\infty} x\,\frac{x}{\theta^2}\,e^{-\frac{x^2}{2\theta^2}}\,dx = \left[-x\,e^{-\frac{x^2}{2\theta^2}}\right]_0^{\infty} + \int_0^{\infty} e^{-\frac{x^2}{2\theta^2}}\,dx = 0 + \int_0^{\infty} e^{-t^2}\,\sqrt{2\theta^2}\,dt = \sqrt{2\theta^2}\,\frac{\sqrt{\pi}}{2} = \theta\sqrt{\frac{\pi}{2}} \]

where \( \int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx \) has been used with
• \( u = x \rightarrow u' = 1 \)
• \( v' = \frac{x}{\theta^2}\,e^{-\frac{x^2}{2\theta^2}} \rightarrow v = \int \frac{x}{\theta^2}\,e^{-\frac{x^2}{2\theta^2}}\,dx = -e^{-\frac{x^2}{2\theta^2}} \)

Then, we have applied the change of variable

\[ \frac{x}{\sqrt{2\theta^2}} = t \;\rightarrow\; x = t\sqrt{2\theta^2} \;\rightarrow\; dx = \sqrt{2\theta^2}\,dt \]

We calculate the variance by using the first two moments. For the second moment, integration by parts can be applied again (now with \( u = x^2 \rightarrow u' = 2x \) and the same v):

\[ E(X^2) = \int_0^{\infty} x^2\,\frac{x}{\theta^2}\,e^{-\frac{x^2}{2\theta^2}}\,dx = \left[-x^2\, e^{-\frac{x^2}{2\theta^2}}\right]_0^{\infty} + \int_0^{\infty} 2x\,e^{-\frac{x^2}{2\theta^2}}\,dx = 0 + \left[-2\theta^2\, e^{-\frac{x^2}{2\theta^2}}\right]_0^{\infty} = 2\theta^2 \]

The variance is \( Var(X) = E(X^2) - E(X)^2 = 2\theta^2 - \theta^2\,\frac{\pi}{2} = \frac{4-\pi}{2}\,\theta^2 \). (In substituting the limits, that \( e^{x} \) changes faster than \( x^k \) for any k has been taken into account. On the other hand, in an advanced table of integrals, like those physicists or engineers use, one can find \( \int_0^{+\infty} e^{-t^2}\,dt = \frac{\sqrt{\pi}}{2} \).)

(a) Method of the moments

Since there appears only one parameter in the density function, one equation suffices; moreover, since the expression of μ = E(X) involves θ, the equation and the solution are:

\[ \mu_1(\theta) = \bar{x} \;\rightarrow\; \theta\sqrt{\frac{\pi}{2}} = \bar{x} \;\rightarrow\; \theta = \sqrt{\frac{2}{\pi}}\,\bar{x} \;\rightarrow\; \hat{\theta}_M = \sqrt{\frac{2}{\pi}}\,\bar{X} \]

(b) Bias, mean square error and consistency

Bias: \( E(\hat{\theta}_M) = \sqrt{\frac{2}{\pi}}\,E(\bar{X}) = \sqrt{\frac{2}{\pi}}\,\theta\sqrt{\frac{\pi}{2}} = \theta \), so \( b(\hat{\theta}_M) = E(\hat{\theta}_M) - \theta = \theta - \theta = 0 \) and \( \hat{\theta}_M \) is an unbiased estimator of θ.

Variance: \( Var(\hat{\theta}_M) = \frac{2}{\pi}\,Var(\bar{X}) = \frac{2}{\pi}\,\frac{Var(X)}{n} = \frac{2}{\pi}\,\frac{(4-\pi)\,\theta^2}{2n} = \frac{4-\pi}{\pi n}\,\theta^2 \)

Mean square error: \( MSE(\hat{\theta}_M) = b(\hat{\theta}_M)^2 + Var(\hat{\theta}_M) = 0 + \frac{4-\pi}{\pi n}\,\theta^2 = \frac{4-\pi}{\pi n}\,\theta^2 \)

Consistency: \( \lim_{n\to\infty} MSE(\hat{\theta}_M) = \lim_{n\to\infty} \frac{4-\pi}{\pi n}\,\theta^2 = 0 \), and therefore \( \hat{\theta}_M \) is consistent (for θ).

(c) Maximum likelihood method

Likelihood function:

\[ L(X;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{x_1}{\theta^2}\,e^{-\frac{x_1^2}{2\theta^2}} \cdots \frac{x_n}{\theta^2}\,e^{-\frac{x_n^2}{2\theta^2}} = \left(\prod_{j=1}^n x_j\right)\frac{1}{\theta^{2n}}\,e^{-\frac{1}{2\theta^2}\sum_{j=1}^n x_j^2} \]

Log-likelihood function: To facilitate the differentiation, \( \theta^{2n} \) is moved to the numerator and a property of the logarithm is applied.

\[ \log(L(X;\theta)) = \log\!\left(\prod_{j=1}^n x_j\right) - \frac{\sum_{j=1}^n x_j^2}{2\theta^2} - 2n\log(\theta) \]

\[ 0 = \frac{d}{d\theta}\log(L(X;\theta)) = \frac{\sum_{j=1}^n x_j^2}{\theta^3} - \frac{2n}{\theta} \;\rightarrow\; \frac{2n}{\theta} = \frac{\sum_{j=1}^n x_j^2}{\theta^3} \;\rightarrow\; \theta^2 = \frac{\sum_{j=1}^n x_j^2}{2n} \]

Now we prove the condition on the second derivative:

\[ \frac{d^2}{d\theta^2}\log(L(X;\theta)) = \frac{d}{d\theta}\left(\frac{\sum_{j=1}^n x_j^2}{\theta^3} - \frac{2n}{\theta}\right) = -3\,\frac{\sum_{j=1}^n x_j^2}{\theta^4} + \frac{2n}{\theta^2} \]

The first term is negative and the second is positive, but it is difficult to check qualitatively whether the second is larger in absolute value than the first. Then, the extreme obtained is substituted:

\[ \left.\frac{d^2}{d\theta^2}\log(L(X;\theta))\right|_{\theta^2 = \frac{\sum_j x_j^2}{2n}} = -3\sum_{j=1}^n x_j^2\,\frac{(2n)^2}{\left(\sum_{j=1}^n x_j^2\right)^2} + 2n\,\frac{2n}{\sum_{j=1}^n x_j^2} = -3\,\frac{4n^2}{\sum_{j=1}^n x_j^2} + \frac{4n^2}{\sum_{j=1}^n x_j^2} = -2\,\frac{4n^2}{\sum_{j=1}^n x_j^2} < 0 \]

The estimator:

\[ \hat{\theta}_{ML} = \sqrt{\frac{\sum_{j=1}^n X_j^2}{2n}} \]
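The two estimators can be compared by simulation. Sampling below uses the inverse of the distribution function of the Rayleigh model, F(x) = 1 − e^{−x²/(2θ²)}, which gives X = θ√(−2 log U) for U uniform on (0,1); the parameter values are illustrative.

# Empirical MSEs of the two estimators on simulated Rayleigh data
set.seed(1)
theta = 2; n = 200; M = 5000
thetaM  = replicate(M, sqrt(2/pi) * mean(theta*sqrt(-2*log(runif(n)))))
thetaML = replicate(M, sqrt(sum((theta*sqrt(-2*log(runif(n))))^2) / (2*n)))
c(mean((thetaM - theta)^2), mean((thetaML - theta)^2))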

Discussion: The Rayleigh distribution is one of the few cases for which the two methods provide different

estimators of the parameter. In the first case, we could easily calculate the mean and the variance, as the

estimator was linear in Xj; nevertheless, in the second case the nonlinearities Xj2 and the square root make

those calculations difficult.

My notes:

Exercise 3pe

Before commercializing a new model of light bulb, a deep statistical study on its duration (measured in days, d) must be carried out. The population variable duration is expected to follow the exponential probability model:

\[ f(x;\lambda) = \lambda\, e^{-\lambda x}, \quad x \geq 0 \]

(a) Apply the method of the moments to find an estimator of the parameter λ.
(b) Apply the maximum likelihood method to find an estimator of the parameter λ.
(c) Find a sufficient statistic (see the hint below).
(d) Prove that \( \bar{X} \) is not an efficient estimator of λ.
(e) Prove that \( \bar{X} \) is a consistent estimator of λ⁻¹.
(f) Prove that \( \bar{X} \) is an efficient estimator of λ⁻¹. To cope with this, use the alternative, equivalent notation in terms of θ = λ⁻¹. Now you must prove that \( \bar{X} \) is an efficient estimator of θ, and you can easily calculate \( \frac{d}{d\theta} \), while only experts can calculate \( \frac{d}{d\lambda^{-1}} \).
(g) The empirical part of the study, based on the measurement of 55 independent light bulbs, has yielded a total sum of \( \sum_{j=1}^{55} x_j = 598\;d \). Introduce this information in the expressions obtained in previous sections to give final estimates of λ.
(h) Give an estimate of the mean μ = E(X).
Hint: For section (c), apply the factorization theorem and make it clear how the two parts are. In the theorem: (1) g and h are nonnegative; (2) T cannot depend on θ; (3) g depends only on the sample and the parameter, and it depends on the sample through T; (4) h can be 1; and (5) since h is any function of the sample, it may involve T.

Discussion: First of all, the supposition that the exponential distribution can reasonably be used to model the

variable duration should be tested. One aim of this exercise is to show how many methods and properties

involved in previous exercises can be involved in the same statistical analysis. The quality of the estimators

obtained is studied. (See the appendixes to see how the mean and the variance of this distribution could be

calculated, if necessary.)

(a1) Population and sample moments: The population distribution has only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\lambda)=E(X)=\frac{1}{\lambda}\qquad\text{and}\qquad m_1(x_1,x_2,\dots,x_n)=\frac{1}{n}\sum_{j=1}^n x_j=\bar x$$

(a2) System of equations: Since the parameter of interest λ appears in the first moment of X, the solution is:

$$\mu_1(\lambda)=m_1(x_1,x_2,\dots,x_n)\ \rightarrow\ \frac{1}{\lambda}=\frac{1}{n}\sum_{j=1}^n x_j=\bar x\ \rightarrow\ \lambda=\Bigl(\frac{1}{n}\sum_{j=1}^n x_j\Bigr)^{-1}=\frac{1}{\bar x}$$

(a3) The estimator:

$$\hat\lambda_M=\Bigl(\frac{1}{n}\sum_{j=1}^n X_j\Bigr)^{-1}=\frac{1}{\bar X}$$

(b1) Likelihood function: For an exponential random variable the density function is f(x;λ) = λe^{−λx}, so we write the product and join the similar terms:

$$L(x_1,x_2,\dots,x_n;\lambda)=\prod_{j=1}^n f(x_j;\lambda)=\lambda e^{-\lambda x_1}\cdot\lambda e^{-\lambda x_2}\cdots\lambda e^{-\lambda x_n}=\lambda^n e^{-\lambda\sum_{j=1}^n x_j}$$

(b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log[L(x_1,x_2,\dots,x_n;\lambda)]=\log[\lambda^n]+\log[e^{-\lambda\sum_{j=1}^n x_j}]=n\log[\lambda]-\lambda\sum_{j=1}^n x_j$$

The population distribution has only one parameter, and hence a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

$$0=\frac{d}{d\lambda}\log[L(x_1,x_2,\dots,x_n;\lambda)]=\frac{n}{\lambda}-\sum_{j=1}^n x_j\ \rightarrow\ \lambda_0=\Bigl(\frac{1}{n}\sum_{j=1}^n x_j\Bigr)^{-1}=\frac{1}{\bar x}$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\lambda^2}\log[L(x_1,x_2,\dots,x_n;\lambda)]=-\frac{n}{\lambda^2}<0$$

which holds for any value, particularly for λ₀ = 1/x̄.

(b3) The estimator:

$$\hat\lambda_{ML}=\Bigl(\frac{1}{n}\sum_{j=1}^n X_j\Bigr)^{-1}=\frac{1}{\bar X}$$

Both to prove that a given statistic is sufficient and to find a sufficient statistic, we apply the factorization theorem (see the hint).

(c1) Likelihood function: Computed previously, it is $L(X_1,X_2,\dots,X_n;\lambda)=\lambda^n e^{-\lambda\sum_{j=1}^n X_j}$.

(c2) Theorem: $L(X_1,X_2,\dots,X_n;\lambda)=g(T(X_1,X_2,\dots,X_n);\lambda)\cdot h(X_1,X_2,\dots,X_n)$. We must analyse each term of the likelihood function.

➔ $\lambda^n$ depends only on the parameter, so it would be part of g.

➔ $e^{-\lambda\sum_{j=1}^n X_j}$ depends on both the parameter and the sample, and it is not possible to separate mathematically both types of information; then, this term would be part of g too. Moreover, the only candidate to be a sufficient statistic is $T(X)=T(X_1,\dots,X_n)=\sum_{j=1}^n X_j$.

Since the condition holds for $g(T(X_1,X_2,\dots,X_n);\lambda)=\lambda^n e^{-\lambda\sum_{j=1}^n X_j}$ and $h(X_1,X_2,\dots,X_n)=1$, the statistic $T(X)=\sum_{j=1}^n X_j$ is sufficient. This means that it “summarizes the important information (about the parameter)” contained in the sample. The previous statistic contains the same information as any one-to-one transformation of it, concretely the sample mean $T(X)=\frac{1}{n}\sum_{j=1}^n X_j=\bar X$.

The definition of efficiency consists of two conditions: unbiasedness and minimum variance (the latter is checked by comparing the variance with the Cramér-Rao bound).

(d1) Unbiasedness: By applying a property of the sample mean and the information of the statement,

$$E(\bar X)=E(X)=\frac{1}{\lambda}\ \rightarrow\ b(\bar X)=E(\bar X)-\lambda=\frac{1}{\lambda}-\lambda\neq 0$$

The first condition does not hold for all values of λ, and hence it is not necessary to check the second one.

Note: The previous bias is zero when 1/λ − λ = 0 ↔ λ = ±√1 → λ = 1 (for f(x) to be a probability function, λ must be positive, so the solution −1 is not taken into account). Thus, when λ = 1, the estimator may still be efficient if the second condition holds.

To prove the consistency (in probability), we will apply the following sufficient condition (consistency in mean of order two):

$$\lim_{n\to\infty} MSE(\hat\theta)=0\ \leftrightarrow\ \begin{cases}\lim_{n\to\infty} b(\hat\theta)=0\\[2pt] \lim_{n\to\infty} Var(\hat\theta)=0\end{cases}$$

(e1) Bias: By applying a property of the sample mean and the information of the statement,

$$E(\bar X)=E(X)=\frac{1}{\lambda}\ \rightarrow\ b(\bar X)=E(\bar X)-\lambda^{-1}=\frac{1}{\lambda}-\frac{1}{\lambda}=0\ \rightarrow\ \lim_{n\to\infty}b(\bar X)=\lim_{n\to\infty}0=0$$

(e2) Variance: By applying a property of the sample mean and the information of the statement,

$$Var(\bar X)=\frac{Var(X)}{n}=\frac{1}{\lambda^2 n}\ \rightarrow\ \lim_{n\to\infty}Var(\bar X)=\lim_{n\to\infty}\frac{1}{\lambda^2 n}=0$$

As a conclusion, the mean square error (MSE) tends to zero, which is sufficient—but not necessary—for the consistency (in probability).

For section (f), the alternative notation f(x;θ) = (1/θ)e^{−x/θ} is used, where θ = λ⁻¹.

(f1) Unbiasedness: By applying a property of the sample mean and the information of the statement,

$$E(\bar X)=E(X)=\theta\ \rightarrow\ b(\bar X)=E(\bar X)-\lambda^{-1}=\theta-\theta=0$$

The first condition holds, and hence it is necessary to check the second one.

(f2) Minimum variance: We compare the variance and the Cramér-Rao bound. The variance is:

$$Var(\bar X)=\frac{Var(X)}{n}=\frac{\theta^2}{n}$$

On the other hand, the bound is calculated step by step:

i. Function (with X in place of x):

$$f(X;\theta)=\frac{1}{\theta}e^{-X/\theta}$$

ii. Logarithm of the function:

$$\log[f(X;\theta)]=\log(\theta^{-1})+\log(e^{-X/\theta})=-\log(\theta)-\frac{X}{\theta}$$

iii. Partial derivative of the logarithm of the function:

$$\frac{\partial\log[f(X;\theta)]}{\partial\theta}=-\frac{1}{\theta}-X\cdot\frac{-1}{\theta^2}=-\frac{1}{\theta}+\frac{X}{\theta^2}$$

iv. Expectation of the squared partial derivative of the logarithm of the function: We rewrite the expression so as to make $\sigma^2=Var(X)=E\bigl((X-E(X))^2\bigr)=E\bigl((X-\mu)^2\bigr)$ appear. In this case, it also holds that $\sigma^2=Var(X)=E\bigl((X-\theta)^2\bigr)=\theta^2$. Then

$$E\Bigl[\Bigl(\frac{\partial\log[f(X;\theta)]}{\partial\theta}\Bigr)^2\Bigr]=E\Bigl[\Bigl(\frac{X-\theta}{\theta^2}\Bigr)^2\Bigr]=\frac{E[(X-\theta)^2]}{\theta^4}=\frac{Var(X)}{\theta^4}=\frac{\theta^2}{\theta^4}=\frac{1}{\theta^2}$$

v. Theoretical Cramér-Rao lower bound:

$$\frac{1}{n\cdot E\Bigl[\Bigl(\frac{\partial\log[f(X;\theta)]}{\partial\theta}\Bigr)^2\Bigr]}=\frac{1}{n\cdot\frac{1}{\theta^2}}=\frac{\theta^2}{n}$$

The variance of the estimator attains the bound, so the estimator has minimum variance. The fulfillment of the two conditions proves that X̄ is an efficient estimator of λ⁻¹ = θ.

(g) Estimation of λ

It is necessary to use the only information available: $\sum_{j=1}^{55} x_j = 598\,d$.

From the method of the moments: $\hat\lambda_M=\Bigl(\frac{1}{n}\sum_{j=1}^n x_j\Bigr)^{-1}=\Bigl(\frac{598}{55}\,d\Bigr)^{-1}=0.09197\,d^{-1}$.

From the maximum likelihood method, since the same estimator was obtained: $\hat\lambda_{ML}=0.09197\,d^{-1}$.

(h) Estimation of μ

Since $\mu=E(X)=\frac{1}{\lambda}$, an estimator of λ induces, by applying the plug-in principle, an estimator of μ:

From the method of the moments: $\hat\mu_M=(\hat\lambda_M)^{-1}=\frac{598\,d}{55}=10.87\,d$.

From the maximum likelihood method: $\hat\mu_{ML}=10.87\,d$.

Note: From the numerical point of view, calculating 598/55 is expected to have a smaller error than calculating 1/0.091973.

Conclusion: We can see that for the exponential model the two methods provide the same estimator of λ. The estimator obtained has then been used to obtain an estimator of the population mean. The estimated mean duration of the new model of light bulb is 10.87 days. On the other hand, some desirable properties of the estimator have been proved. A different, equivalent notation has been used to facilitate the proof of one of these properties, which emphasizes the importance of notation in doing calculations.
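As a minimal R sketch (merely reproducing the arithmetic of sections (g) and (h) with the data of the statement):

n = 55
sumx = 598                      # total duration, in days
lambdaHat = n/sumx; lambdaHat   # ML/moments estimate of lambda: 0.09197 per day
muHat = sumx/n; muHat           # plug-in estimate of the mean duration: 10.87 days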

My notes:

Confidence Intervals

[CI] Methods for Estimating

Remark 1ci: Confidence can be interpreted as a probability (so it is, although we sometimes use a 0-to-100 scale). See remark 1pt,

in the appendix of Probability Theory, on the interpretation of the concept of probability.

Remark 2ci: Since there is an infinite number of pairs of quantiles a1 and a2 such that P (a 1≤T ≤a2 )=1−α , those

determining tails of probability α/2 are considered by convention. This criterion is also applied for two-tailed hypothesis tests.

Remark 3ci: When the Central Limit Theorem can be applied, asymptotic results on averages are relatively independent of the initial population. Therefore, in some exercises there are no suppositions on the distribution of the population variables.

Exercise 1ci-m

To forecast the yearly inflation (in percent, %), a simple random sample has been gathered:

1.5 2.1 1.9 2.3 2.5 3.2 3.0

It is assumed that the variable inflation follows a normal distribution.

(a) By using these data, construct a 99% confidence interval for the mean of the inflation.

(b) Experts have the opinion that the previous interval is too wide, and they want a total length of one unit. Find the level of confidence for this new interval.

(c) Construct a confidence interval of 90% for the standard deviation.

Discussion: The intervals will be built by applying the method of the pivot, and then the expression of the margin of error is determined. Since variances are nonnegative by definition and the positive branch of the square root function is strictly increasing, the interval for the standard deviation is obtained by applying the square root to the interval for the variance.

X ≡ Predicted inflation (of one country)  X ~ N(μ, σ²)

Sample information

Theoretical (simple random) sample: X1,..., X7 s.r.s. → n = 7

Empirical sample: x1,..., x7 → 1.5 2.1 1.9 2.3 2.5 3.2 3.0

In this exercise, we know the values of the sample xi. This allows calculating any quantity we want.

(a) Confidence interval for the mean: To choose the proper pivot, we take into account:
• The variable of interest follows a normal distribution.
• The population variance σ² is unknown, so it must be estimated by the sample (quasi)variance.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.
From a table of statistics (e.g. in [T]), the pivot

$$T(X;\mu)=\frac{\bar X-\mu}{\sqrt{S^2/n}}\sim t_{n-1}$$

is selected. Then

$$1-\alpha=P\Bigl(-r_{\alpha/2}\le\frac{\bar X-\mu}{\sqrt{S^2/n}}\le +r_{\alpha/2}\Bigr)=P\Bigl(-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\le\bar X-\mu\le +r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr)=P\Bigl(\bar X+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\ \ge\ \mu\ \ge\ \bar X-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr)$$

so

$$I_{1-\alpha}=\Bigl[\bar X-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}},\ \bar X+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr]$$

where $r_{\alpha/2}$ is the quantile such that $P(T>r_{\alpha/2})=\alpha/2$. Let us calculate the quantities in the formula:
• $\bar x=\frac{1}{7}\sum_{j=1}^7 x_j=\frac{16.5\%}{7}=2.36\%$
• The level of confidence is 99%, and hence α = 0.01. The quantile is found in the table of the t distribution with κ = 7−1 degrees of freedom: $r_{\alpha/2}=r_{0.01/2}=r_{0.005}=3.71$
• By using the data, $S^2=\frac{1}{7-1}\sum_{j=1}^7(x_j-\bar x)^2=\frac{1}{6}\bigl[(1.5\%-2.36\%)^2+\cdots+(3.0\%-2.36\%)^2\bigr]=0.36\,\%^2$
• Finally, n = 7

$$I_{0.99}=\Bigl[2.36\%-3.71\sqrt{\tfrac{0.36\%^2}{7}},\ 2.36\%+3.71\sqrt{\tfrac{0.36\%^2}{7}}\Bigr]=[1.52\%,\ 3.20\%]$$

whose length is 3.20% − 1.52% = 1.68%.

(b) Confidence level: The length of the interval, the distance between the two endpoints, is twice the margin of error when T follows a symmetric distribution.

$$L=\Bigl(\bar X+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr)-\Bigl(\bar X-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr)=2r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}$$

In this section L is given and α must be found; nevertheless, it is necessary to find $r_{\alpha/2}$ first:

$$r_{\alpha/2}=\frac{L\sqrt n}{2S}=\frac{1\%\cdot\sqrt 7}{2\cdot 0.6\%}=2.20$$

> 1-pt(2.20, 7-1)
[1] 0.03505109

In the table of the t law it is found that α/2 = 0.035, so α = 0.07 and 1−α = 0.93. The confidence level is 93%.

(c) Confidence interval for the standard deviation: To choose the new statistic:
• The variable of interest follows a normal distribution.
• The quantity of interest is the standard deviation σ.
• The population mean μ is unknown.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.
From a table of statistics (e.g. in [T]), the proper pivot

$$T(X;\sigma)=\frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}$$

is selected. Then

$$1-\alpha=P\Bigl(l_{\alpha/2}\le\frac{(n-1)S^2}{\sigma^2}\le r_{\alpha/2}\Bigr)=P\Bigl(\frac{l_{\alpha/2}}{(n-1)S^2}\le\frac{1}{\sigma^2}\le\frac{r_{\alpha/2}}{(n-1)S^2}\Bigr)=P\Bigl(\frac{(n-1)S^2}{l_{\alpha/2}}\ \ge\ \sigma^2\ \ge\ \frac{(n-1)S^2}{r_{\alpha/2}}\Bigr)$$

and hence the interval is

$$I_{1-\alpha}=\Bigl[\frac{(n-1)S^2}{r_{\alpha/2}},\ \frac{(n-1)S^2}{l_{\alpha/2}}\Bigr]$$

• Sample size n = 7, so n−1 = 6
• $S^2=0.36\,\%^2$
• Since α = 0.1 and κ = n−1 = 6, the quantiles are $l_{0.05}=1.64$ and $r_{0.05}=12.6$

> qchisq(c(0.05, 1-0.05), 7-1)
[1] 1.635383 12.591587

By substituting and applying the square root function, the interval is

$$I_{0.9}=\Bigl[\sqrt{\frac{6\cdot 0.36\%^2}{12.6}},\ \sqrt{\frac{6\cdot 0.36\%^2}{1.64}}\Bigr]=[0.414\%,\ 1.148\%]$$

Conclusion: The length in section (b) is smaller than in section (a); that is, the interval is narrower and the confidence is smaller.
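The three sections can be checked with a short R sketch; recall that var() and sd() in R use the divisor n−1, that is, they compute the quasivariance and the quasi-standard deviation.

x = c(1.5, 2.1, 1.9, 2.3, 2.5, 3.2, 3.0)
n = length(x)
# (a) 99% t-interval for the mean
mean(x) + c(-1,1)*qt(1-0.01/2, n-1)*sqrt(var(x)/n)
# (b) confidence level for a total length of one unit
r = 1*sqrt(n)/(2*sd(x))
1 - 2*(1-pt(r, n-1))                      # about 0.93
# (c) 90% chi-squared interval for the standard deviation
sqrt((n-1)*var(x)/qchisq(c(1-0.05, 0.05), n-1))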

My notes:

Exercise 2ci-m

In the library of a university, the mean duration (in days, d) of the borrowing period seems to be 20d. A simple random sample of 100 books is analysed, and the values 18d and 8d² are obtained for the sample mean and the sample variance, respectively. Construct a 99% confidence interval for the mean duration of the borrowings to check if the initial population value is inside.

Discussion: For so many data, asymptotic results are considered. The method of the pivotal quantity can also be applied. The dimension of the variable duration is time, while the unit of measurement is days.

X ≡ Duration (of one borrowing)  X ~ ?

Sample information:
Theoretical (simple random) sample: X1,...,X100 s.r.s. → n = 100
Empirical sample: x1,...,x100 → $\bar x=18\,d$, $s^2=8\,d^2$

The values x_j of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, and, therefore, formulas must be written in terms of $\bar X$ and $S^2$.

Confidence interval: To select the pivot, we take into account:
• Nothing is said about the probability distribution of the variable of interest
• The sample size is big, n = 100 (>30), so an asymptotic expression can be considered
• The population variance is unknown, but it is estimated through the sample variance
From a table of statistics (e.g. in [T]), the pivot

$$T(X;\mu)=\frac{\bar X-\mu}{\sqrt{S^2/n}}\ \xrightarrow{d}\ N(0,1)$$

is chosen, where S² is the sample quasivariance. By applying the method of the pivotal quantity:

$$1-\alpha=P(l_{\alpha/2}\le T(X;\mu)\le r_{\alpha/2})=P\Bigl(-r_{\alpha/2}\le\frac{\bar X-\mu}{\sqrt{S^2/n}}\le +r_{\alpha/2}\Bigr)=P\Bigl(\bar X+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\ \ge\ \mu\ \ge\ \bar X-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr)$$

Then, the interval is

$$I_{1-\alpha}=\Bigl[\bar X-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}},\ \bar X+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr]$$

where $r_{\alpha/2}$ is the quantile such that $P(Z>r_{\alpha/2})=\alpha/2$.

• Sample mean $\bar x=18\,d$
• For a confidence of 99%, α = 0.01 and $r_{\alpha/2}=2.58$
• To calculate S², the property $(n-1)S^2=\sum_{j=1}^n(x_j-\bar x)^2=n s^2$ is used: $S^2=\frac{n}{n-1}s^2=\frac{100}{99}\,8\,d^2=8.1\,d^2$
• n = 100

The interval is

$$I_{0.99}=\Bigl[18\,d-2.58\sqrt{\tfrac{8.1\,d^2}{100}},\ 18\,d+2.58\sqrt{\tfrac{8.1\,d^2}{100}}\Bigr]=[17.27\,d,\ 18.73\,d]$$

Conclusion: The estimate of the mean duration of the borrowings belongs to the interval obtained with 99% confidence. The initial value μ = 20d is not inside this high-confidence interval; that is, it is not supported by the data. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
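A minimal R sketch with the summary statistics of the statement:

n = 100; xbar = 18; s2 = 8                  # sample mean and sample variance (divisor n)
S2 = n/(n-1)*s2                             # sample quasivariance
xbar + c(-1,1)*qnorm(1-0.01/2)*sqrt(S2/n)   # about [17.27, 18.73]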

My notes:

Exercise 3ci-m

The accounting firm Price Waterhouse periodically monitors the U.S. Postal Service's performance. One

parameter of interest is the percentage of mail delivered on time. In a simple random sample of 332,000

mailed items, Price Waterhouse determined that 282,200 items were delivered on time (Tampa Tribune, March

26, 1995.) Use this information to estimate with 99% confidence the true percentage of items delivered on time by the U.S. Postal Service.

(Taken from: Statistics. J.T. McClave and T. Sincich. Pearson.)

Discussion: The population is characterized by a Bernoulli variable, since for each item there are only two

possible values. We must construct a confidence interval for the proportion (a percent is a proportion

expressed in a 0-to-100 scale). Proportions have no dimension.

X ≡ Delivered on time (one item) ? X ~ B(η)

Confidence interval

For this kind of population and amount of data, we use the statistic:

$$T(X;\eta)=\frac{\hat\eta-\eta}{\sqrt{\dfrac{?(1-?)}{n}}}\ \xrightarrow{d}\ N(0,1)$$

where ? is substituted by η or $\hat\eta$. For confidence intervals η is unknown and no value is supposed, and hence it is estimated through the sample proportion. By applying the method of the pivot:

$$1-\alpha=P(l_{\alpha/2}\le T(X;\eta)\le r_{\alpha/2})=P\Bigl(-r_{\alpha/2}\le\frac{\hat\eta-\eta}{\sqrt{\frac{\hat\eta(1-\hat\eta)}{n}}}\le +r_{\alpha/2}\Bigr)=P\Bigl(\hat\eta+r_{\alpha/2}\sqrt{\tfrac{\hat\eta(1-\hat\eta)}{n}}\ \ge\ \eta\ \ge\ \hat\eta-r_{\alpha/2}\sqrt{\tfrac{\hat\eta(1-\hat\eta)}{n}}\Bigr)$$

Then, the interval is

$$I_{1-\alpha}=\Bigl[\hat\eta-r_{\alpha/2}\sqrt{\tfrac{\hat\eta(1-\hat\eta)}{n}},\ \hat\eta+r_{\alpha/2}\sqrt{\tfrac{\hat\eta(1-\hat\eta)}{n}}\Bigr]$$

Substitution: We calculate the quantities in the formula,
• n = 332000
• $\hat\eta=\frac{282200}{332000}=0.850$
• 99% → 1−α = 0.99 → α = 0.01 → α/2 = 0.005 → $r_{\alpha/2}=r_{0.005}=l_{0.995}=2.58$

So

$$I_{0.99}=\Bigl[0.850-2.58\sqrt{\tfrac{0.850(1-0.850)}{332000}},\ 0.850+2.58\sqrt{\tfrac{0.850(1-0.850)}{332000}}\Bigr]=[0.848,\ 0.852]$$

Conclusion: With a confidence of 0.99, measured in a 0-to-1 scale, the value of η will be in the interval obtained. On average, 99% of the times the method applied provides a right interval. Nonetheless, frequently we do not know the real η and therefore we will never know whether the method has failed or not. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

My notes:

Exercise 4ci-m

Two independent groups, A and B, consist of 100 people each of whom have a disease. A serum is given to

group A but not to group B, which are termed treatment and control groups, respectively; otherwise, the two

groups are treated identically. Two simple random samples have yielded that in the two groups, 75 and 65

people, respectively, recover from the disease. To study the effect of the serum, build a 95% confidence

interval for the difference ηA–ηB. Does the interval contain the case ηA = ηB?

Discussion: There are two independent Bernoulli populations. The interval for the difference of proportions is built by applying the method of the pivot. Proportions are, by definition, dimensionless quantities.

A ≡ Recovering (an individual of the treatment group)?  A ~ B(ηA)
B ≡ Recovering (an individual of the control group)?  B ~ B(ηB)

• There are two independent Bernoulli populations
• Both sample sizes are large, 100, so an asymptotic approximation can be applied
From a table of statistics (e.g. in [T]), the following pivot is selected

$$T(A,B;\eta_A,\eta_B)=\frac{(\hat\eta_A-\hat\eta_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\dfrac{\hat\eta_B(1-\hat\eta_B)}{n_B}}}\ \xrightarrow{d}\ N(0,1)$$

By applying the method of the pivotal quantity:

$$1-\alpha=P(l_{\alpha/2}\le T(A,B;\eta_A,\eta_B)\le r_{\alpha/2})\approx P\Bigl(-r_{\alpha/2}\le\frac{(\hat\eta_A-\hat\eta_B)-(\eta_A-\eta_B)}{\sqrt{\frac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\frac{\hat\eta_B(1-\hat\eta_B)}{n_B}}}\le +r_{\alpha/2}\Bigr)$$

$$=P\Bigl((\hat\eta_A-\hat\eta_B)+r_{\alpha/2}\sqrt{\tfrac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\tfrac{\hat\eta_B(1-\hat\eta_B)}{n_B}}\ \ge\ \eta_A-\eta_B\ \ge\ (\hat\eta_A-\hat\eta_B)-r_{\alpha/2}\sqrt{\tfrac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\tfrac{\hat\eta_B(1-\hat\eta_B)}{n_B}}\Bigr)$$


(3) The interval:

$$I_{1-\alpha}=\Bigl[(\hat\eta_A-\hat\eta_B)-r_{\alpha/2}\sqrt{\tfrac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\tfrac{\hat\eta_B(1-\hat\eta_B)}{n_B}},\ (\hat\eta_A-\hat\eta_B)+r_{\alpha/2}\sqrt{\tfrac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\tfrac{\hat\eta_B(1-\hat\eta_B)}{n_B}}\Bigr]$$

where $r_{\alpha/2}$ is the value of the standard normal distribution such that $P(Z>r_{\alpha/2})=\alpha/2$.

• $n_A=100$ and $n_B=100$.
• Theoretical (simple random) sample: A1,...,A100 s.r.s. (each value is 1 or 0).
  Empirical sample: a1,...,a100 → $\sum_{j=1}^{100}a_j=75$ → $\hat\eta_A=\frac{1}{100}\sum_{j=1}^{100}a_j=\frac{75}{100}=0.75$
  Theoretical (simple random) sample: B1,...,B100 s.r.s. (each value is 1 or 0).
  Empirical sample: b1,...,b100 → $\sum_{j=1}^{100}b_j=65$ → $\hat\eta_B=\frac{1}{100}\sum_{j=1}^{100}b_j=\frac{65}{100}=0.65$.
• 95% → 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → $r_{\alpha/2}=1.96$.

Then,

$$I_{0.95}=(0.75-0.65)\mp 1.96\sqrt{\frac{0.75(1-0.75)}{100}+\frac{0.65(1-0.65)}{100}}=[-0.0263,\ 0.226]$$

Conclusion: The lack-of-effect case (ηA = ηB) cannot be excluded when the decision has 95% confidence. Since η ∈ (0,1), any “reasonable” estimator of η should provide values in this range or close to it. Because of the natural uncertainty of the sampling process (randomness and variability), in this case the smaller endpoint of the interval was −0.0263, which can be interpreted as being 0. When an interval of high confidence is far from 0, the case ηA = ηB can clearly be discarded or rejected. Finally, it is important to notice that a confidence interval can be used to make decisions about hypotheses on the parameter values—it is equivalent to a two-sided hypothesis test, as the interval is also two-sided. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Advanced theory: When the assumption ηA = η = ηB seems reasonable (notice that this case is included in the 95% confidence interval just calculated), it makes sense to try to estimate the common variance of the estimator as well as possible. This can be done by using the pooled sample proportion $\hat\eta_p=\frac{n_A\hat\eta_A+n_B\hat\eta_B}{n_A+n_B}$ in estimating η(1−η) for the denominator; nonetheless, the pooled estimator should not be considered in the numerator, as $(\hat\eta_p-\hat\eta_p)=0$ whatever the data are. The statistic would be:

$$\tilde T(A,B)=\frac{(\hat\eta_A-\hat\eta_B)-(\eta_A-\eta_B)}{\sqrt{\dfrac{\hat\eta_p(1-\hat\eta_p)}{n_A}+\dfrac{\hat\eta_p(1-\hat\eta_p)}{n_B}}}\ \xrightarrow{d}\ N(0,1)$$

$$\tilde I_{1-\alpha}=\Bigl[(\hat\eta_A-\hat\eta_B)-r_{\alpha/2}\sqrt{\tfrac{\hat\eta_p(1-\hat\eta_p)}{n_A}+\tfrac{\hat\eta_p(1-\hat\eta_p)}{n_B}},\ (\hat\eta_A-\hat\eta_B)+r_{\alpha/2}\sqrt{\tfrac{\hat\eta_p(1-\hat\eta_p)}{n_A}+\tfrac{\hat\eta_p(1-\hat\eta_p)}{n_B}}\Bigr]$$

The quantities involved in the previous formula are
• $n_A=100$ and $n_B=100$
• $\hat\eta_A=0.75$ and $\hat\eta_B=0.65$; since $n_A=n_B=n$, the pooled estimate is

$$\hat\eta_p=\frac{n_A\hat\eta_A+n_B\hat\eta_B}{n_A+n_B}=\frac{n(\hat\eta_A+\hat\eta_B)}{2n}=\frac{0.75+0.65}{2}=0.70$$

• 95% → 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → $r_{\alpha/2}=1.96$

Then,

$$\tilde I_{0.95}=(0.75-0.65)\mp 1.96\sqrt{\frac{2\cdot 0.70(1-0.70)}{100}}=[-0.0270,\ 0.227]$$

One way to measure how different the results are consists in directly comparing the length—twice the margin of error—in both cases:

$$L=0.226-(-0.0263)=0.2523\qquad\qquad \tilde L=0.227-(-0.0270)=0.254$$

Even if the latter length is larger, it is theoretically more trustworthy than the former when ηA = η = ηB is true. The general expressions of these lengths can be found too:

$$L=2r_{\alpha/2}\sqrt{\frac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\frac{\hat\eta_B(1-\hat\eta_B)}{n_B}}\qquad\qquad \tilde L=2r_{\alpha/2}\sqrt{\frac{\hat\eta_p(1-\hat\eta_p)}{n_A}+\frac{\hat\eta_p(1-\hat\eta_p)}{n_B}}$$

Another way to measure how different the results are can be based on comparing the statistics:

$$\tilde T(A,B)=T(A,B)\cdot\frac{\sqrt{\frac{\hat\eta_A(1-\hat\eta_A)}{n_A}+\frac{\hat\eta_B(1-\hat\eta_B)}{n_B}}}{\sqrt{\frac{\hat\eta_p(1-\hat\eta_p)}{n_A}+\frac{\hat\eta_p(1-\hat\eta_p)}{n_B}}}\ \rightarrow\ \frac{L}{\tilde L}=\frac{\tilde T}{T}\quad(\text{so } L\cdot T=\tilde L\cdot\tilde T)$$

Thus, the quantity

$$\frac{\sqrt{\frac{\hat\eta_A(1-\hat\eta_A)}{n}+\frac{\hat\eta_B(1-\hat\eta_B)}{n}}}{\sqrt{\frac{\hat\eta_p(1-\hat\eta_p)}{n}+\frac{\hat\eta_p(1-\hat\eta_p)}{n}}}=\sqrt{\frac{\hat\eta_A(1-\hat\eta_A)+\hat\eta_B(1-\hat\eta_B)}{2\hat\eta_p(1-\hat\eta_p)}}=0.994$$

can be seen as a measure of the effect of using the pooled sample proportion. This effect is small in this exercise, but it could be higher in other situations. As regards the case ηA = η = ηB, it is also included in this interval, which is not worthy as it has been used as an assumption; nevertheless, the exclusion of this case would have contradicted the initial assumption.
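The unpooled and pooled intervals, and the effect of pooling, can be reproduced with the following R sketch:

nA = 100; nB = 100; pA = 0.75; pB = 0.65
r = qnorm(1-0.05/2)                       # 1.96
se  = sqrt(pA*(1-pA)/nA + pB*(1-pB)/nB)   # unpooled standard error
pP  = (nA*pA + nB*pB)/(nA + nB)           # pooled sample proportion, 0.70
seP = sqrt(pP*(1-pP)*(1/nA + 1/nB))       # pooled standard error
(pA-pB) + c(-1,1)*r*se                    # about [-0.0263, 0.226]
(pA-pB) + c(-1,1)*r*seP                   # about [-0.0270, 0.227]
se/seP                                    # effect of pooling, about 0.994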

My notes:

Remark 4ci: In calculating the minimum sample size to guarantee a given precision by applying the method based on the margin of error, the result is obtained using other results: the theorem giving the sampling distribution of the pivot T and the method of the pivot. When the proper statistic T is based on the supposition that the population variable X follows a given parametric probability distribution, the whole process can be seen as a parametric approach; when T is based on an asymptotic result, the nonparametric Central Limit Theorem is indirectly being applied. On the other hand, the method based on Chebyshev's inequality is valid whichever the probability distribution of the population variable X and nonnegative function h(x). The Central Limit Theorem, being a nonparametric result, seems more powerful than Chebyshev's inequality, which is based on a rough bounding (see the appendixes). As a consequence, we expect the method based on this inequality to overestimate the minimum sample size. On the contrary, the number provided by the method based on the margin of error may be less trustable if the assumptions on which it is based are false.

Remark 5ci: Once there is a discrete quantity in an equation, the unknown cannot take any possible value. This implies that, strictly speaking, equalities like

$$E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\qquad\qquad\frac{\sigma^2}{nE^2}=\alpha$$

may never be fulfilled for continuous E, α, σ and discrete n. Solving the equality and rounding the result upward is an alternative way to solving the inequalities

$$E_g\ge E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\qquad\qquad\frac{\sigma^2}{nE_g^2}\le\alpha$$

where the purpose is to find the minimum n for which the (possible discrete values of the) margin of error is smaller than or equal to the given precision E_g.

Exercise 1ci-s

The lengths (in millimeters, mm) of metal rods produced by an industrial process are normally distributed

with a standard deviation of 1.8mm. Based on a simple random sample of nine observations from this

population, the 99% confidence interval was found for the population mean length to extend from 194.65mm

to 197.75mm. Suppose that a production manager believes that the interval is too wide for practical use and,

instead, requires a 99% confidence interval extending no further than 0.50mm on each side of the sample

mean. How large a sample is needed to achieve such an interval? Apply both the method based on the

confidence interval and the method based on the Chebyshev's inequality.

(From: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

Discussion: There is one normal population with known standard deviation. By using a sample of nine

elements, a 99% confidence interval was built, I1 = [194.65mm, 197.75mm], of length 197.75mm – 194.65mm

= 3.1mm and margin of error 3.1mm/2 = 1.55mm. A narrower interval is desired, and the number of data

necessary in the new sample must be calculated. More data will be necessary for the new margin of error to be

smaller (0.50 < 1.55) while the other quantities—standard deviation and confidence—are the same.

X ≡ Length (of one metal rod)  X ~ N(μ, σ² = 1.8²mm²)

Sample information:
Theoretical (simple random) sample: X1,..., Xn s.r.s. (the lengths of n rods are taken)

Margin of error:
We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot to take the expression from the formula of the interval:

$$I_{1-\alpha}=\Bigl[\bar X-r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}},\ \bar X+r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr]$$

If we remembered the expression, we can use it directly. Either way, the margin of error (for one normal population with known variance) is:

$$E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$

Sample size

Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the given $E_g$:

$$E_g\ge E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\ \rightarrow\ E_g^2\ge r_{\alpha/2}^2\frac{\sigma^2}{n}\ \rightarrow\ n\ge r_{\alpha/2}^2\frac{\sigma^2}{E_g^2}=\Bigl(2.58\cdot\frac{1.8\,mm}{0.5\,mm}\Bigr)^2=86.27\ \rightarrow\ n\ge 87$$

since $r_{\alpha/2}=r_{0.01/2}=r_{0.005}=2.58$. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)

Method based on the Chebyshev's inequality: For unbiased estimators, it holds that:

$$P(|\hat\theta-\theta|\ge E)=P(|\hat\theta-E(\hat\theta)|\ge E)\le\frac{Var(\hat\theta)}{E^2}\le\alpha$$

With $Var(\hat\theta)=Var(\bar X)=\frac{\sigma^2}{n}$:

$$\frac{\sigma^2}{nE_g^2}\le\alpha\ \rightarrow\ n\ge\frac{1}{\alpha}\Bigl(\frac{\sigma}{E_g}\Bigr)^2=\frac{1}{0.01}\cdot\frac{1.8^2\,mm^2}{0.5^2\,mm^2}=1296\ \rightarrow\ n\ge 1296$$

Conclusion: At least n data are necessary to guarantee that the margin of error is at most 0.50mm (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error |θ−θ̂| will be smaller than E_g with a probability of 1−α = 0.99, but larger with a probability of α = 0.01). Any number of data larger than n would guarantee—and go beyond—the precision desired. As expected, more data are necessary (87 > 9) to increase the accuracy (narrower interval) with the same confidence. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
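Both minimum sample sizes can be computed in R (a sketch; the table value 2.58 is used for the quantile, as in the solution, whereas qnorm(0.995) would give the slightly smaller 2.5758):

sigma = 1.8; Eg = 0.5; alpha = 0.01
r = 2.58                        # table value of the standard normal quantile
ceiling((r*sigma/Eg)^2)         # margin-of-error method: 87
ceiling(sigma^2/(alpha*Eg^2))   # Chebyshev method: 1296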

My notes:

Exercise 1ci

The mark of an aptitude exam follows a normal distribution with standard deviation equal to 28.2. A simple

random sample with nine students yields the following results:

$$\sum_{j=1}^9 x_j=1{,}098\qquad\qquad\sum_{j=1}^9 x_j^2=138{,}148$$

a) Find a 90% confidence interval for the population mean μ.

b) Discuss without calculations whether the length of a 95% confidence interval will be smaller, greater

or equal to the length of the interval of the previous section.

c) How large must the minimum sample size be to obtain a 90% confidence interval with length (distance

between the endpoints) equal to 10? Apply the method based on the confidence interval and also the

method based on the Chebyshev's inequality.

Discussion: The supposition that the normal distribution is an appropriate model for the variable mark

should be evaluated. The method of the pivot will be applied. After obtaining the theoretical expression of the

interval, it is possible to reason on the relation confidence-length. Given the length of the interval, the

expression also allows us to calculate the minimum number of data necessary. The mark can be seen as a

quantity without any dimension. Finally, it is worth noticing that an approximation is used, since the mark is a

discrete variable while the normal distribution is continuous.

X ≡ Mark (of one student)  X ~ N(μ, σ² = 28.2²)

Sample information:
Theoretical (simple random) sample: X1,..., X9 s.r.s. (the marks of nine students are to be taken) → n = 9
Empirical sample: x1,...,x9 → $\sum_{j=1}^9 x_j=1{,}098$, $\sum_{j=1}^9 x_j^2=138{,}148$ (the marks have been taken)

We can see that the sample values x_j themselves are unknown in this exercise; instead, information calculated from them is provided; this information must be sufficient for carrying out the calculations.

a) Method of the pivotal quantity: To choose the proper statistic with which the confidence interval is calculated, we take into account that:
• The variable follows a normal distribution
• We are given the value of the population standard deviation σ
• The sample size is small, n = 9, so asymptotic formulas cannot be applied
The pivot

$$T(X;\mu)=\frac{\bar X-\mu}{\sqrt{\sigma^2/n}}\sim N(0,1)$$

is selected. Then

$$1-\alpha=P(l_{\alpha/2}\le T(X;\mu)\le r_{\alpha/2})=P\Bigl(-r_{\alpha/2}\le\frac{\bar X-\mu}{\sqrt{\sigma^2/n}}\le +r_{\alpha/2}\Bigr)=P\Bigl(\bar X+r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\ \ge\ \mu\ \ge\ \bar X-r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr)$$

so

$$I_{1-\alpha}=\Bigl[\bar X-r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}},\ \bar X+r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr]$$

where $r_{\alpha/2}$ is the value of the standard normal distribution verifying $P(Z>r_{\alpha/2})=\alpha/2$, that is, the value such that an area equal to α/2 is on the right (upper tail).

• $\bar x=\frac{1}{9}\sum_{j=1}^9 x_j=\frac{1{,}098}{9}=122$
• A 90% confidence level implies that α = 0.1, and the quantile $r_{\alpha/2}=r_{0.05}=1.645$ is in the table.
• From the statement, σ = 28.2
• Finally, n = 9

$$I_{0.9}=\Bigl[122-1.645\,\frac{28.2}{\sqrt 9},\ 122+1.645\,\frac{28.2}{\sqrt 9}\Bigr]=[106.54,\ 137.46]$$

b) Length of the interval: To answer this question it is possible to argue that, when all the parameters but the length are fixed, if higher certainty is desired it is necessary to widen the interval, that is, to increase the distance between the two endpoints. The formal way to justify this idea consists in using the formula of the interval:

$$L=\Bigl(\bar X+r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr)-\Bigl(\bar X-r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr)=2r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}$$

Now, if σ and n remain unchanged, to study how L changes with α it is enough to see how the quantile “moves”. For the 95% interval:
• α = 0.05 → α decreases with respect to the value in section (a)
• Now $r_{\alpha/2}$ must leave less area (probability) on the right → $r_{\alpha/2}$ increases → L increases
In short, when the tails (α) get smaller the interval (1−α) gets wider, and vice versa.

c) Sample size:

Method based on the confidence interval: Now the 90% confidence interval of the first section is revisited. For given α and L_g, the value of n must be found. From the expression of the length,

$$L_g\ge L=2r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\ \rightarrow\ L_g^2\ge 4r_{\alpha/2}^2\frac{\sigma^2}{n}\ \rightarrow\ n\ge\Bigl(2r_{\alpha/2}\frac{\sigma}{L_g}\Bigr)^2=\Bigl(2\cdot 1.645\cdot\frac{28.2}{10}\Bigr)^2=86.08\ \rightarrow\ n\ge 87$$

(Only when inverting must the inequality be changed.)

Method based on the Chebyshev's inequality: For unbiased estimators,

$$P(|\hat\theta-\theta|\ge E)=P(|\hat\theta-E(\hat\theta)|\ge E)\le\frac{Var(\hat\theta)}{E^2}\le\alpha$$

with $Var(\hat\theta)=Var(\bar X)=\frac{\sigma^2}{n}$. Since the length is twice the margin of error, $E_g=L_g/2$:

$$\frac{\sigma^2}{nE_g^2}\le\alpha\ \rightarrow\ n\ge\frac{\sigma^2}{\alpha E_g^2}=\frac{28.2^2}{0.1\cdot(10/2)^2}=318.10\ \rightarrow\ n\ge 319$$

Conclusion: Given the other quantities, confidence grows with the length, and vice versa. If a value greater than n were considered, a higher-accuracy interval would be obtained; nevertheless, in practice this would usually also imply a higher expense of both time and money. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
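A short R sketch reproducing the interval and the two minimum sample sizes:

n = 9; sigma = 28.2; alpha = 0.1
xbar = 1098/n
r = qnorm(1-alpha/2)                # 1.645
xbar + c(-1,1)*r*sigma/sqrt(n)      # about [106.54, 137.46]
Lg = 10
ceiling((2*r*sigma/Lg)^2)           # margin-of-error method: 87
ceiling(sigma^2/(alpha*(Lg/2)^2))   # Chebyshev method: 319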

My notes:

Exercise 2ci

A 64-element simple random sample of petrol consumption (litres per 100 kilometers, u) in private cars has

been taken, yielding a mean consumption of 9.36u and a standard deviation of 1.4u. Then:

a) Obtain a 96% confidence interval for the mean consumption.

b) Assume both normality (for the consumption) and variance σ2 = 2u2. How large must the sample be if,

with the same confidence, we want the maximum error to be a quarter of litre? Apply the method

based on the confidence interval and the method based on the Chebyshev's inequality.

(From 2007's exams for accessing to the Spanish university.)

Discussion: For 64 data, asymptotic results can be applied. The method of the pivotal quantity will be

applied. The role of the number 100 is no other than being part of the units in which the data are measured.

For the second section, additional suppositions—added by myself—are considered; in a real-world situation

they should be evaluated.

C ≡ Consumption (of one private car, measured in litres per 100 kilometers) C~?

Sample information:

Theoretical (simple random) sample: C = (C1,...,C64) s.r.s. → n = 64

Empirical sample: c = (c1,...,c64) → c̄=9.36u , s=1.4 u

The values c_j of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, so formulas must involve $\bar C$ and $S^2$.

a) To select the pivot, we take into account:
• Nothing is said about the probability distribution of the variable of interest
• The sample size is big, n = 64 (>30), so an asymptotic expression can be used
• The population variance is unknown, but it is estimated by the sample variance
The pivot

$$T(C;\mu)=\frac{\bar C-\mu}{\sqrt{S^2/n}}\ \xrightarrow{d}\ N(0,1)$$

is chosen, where S² will be calculated by applying the relation $n s^2=(n-1)S^2$. By applying the method of the pivot:

$$1-\alpha=P(l_{\alpha/2}\le T(C;\mu)\le r_{\alpha/2})=P\Bigl(-r_{\alpha/2}\le\frac{\bar C-\mu}{\sqrt{S^2/n}}\le +r_{\alpha/2}\Bigr)=P\Bigl(\bar C+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\ \ge\ \mu\ \ge\ \bar C-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr)$$

Then, the confidence interval is


$$I_{1-\alpha}=\Bigl[\bar C-r_{\alpha/2}\sqrt{\tfrac{S^2}{n}},\ \bar C+r_{\alpha/2}\sqrt{\tfrac{S^2}{n}}\Bigr]$$

where $r_{\alpha/2}$ is the quantile such that $P(Z>r_{\alpha/2})=\alpha/2$.

• Sample mean $\bar c=9.36\,u$.
• For a confidence of 96%, α = 0.04 and $r_{\alpha/2}=r_{0.04/2}=r_{0.02}=l_{0.98}=2.054$.
• The sample quasivariance is $S^2=\frac{n}{n-1}s^2=\frac{64}{63}\,1.4^2\,u^2=1.99\,u^2$.
• Finally, n = 64.

The interval is

$$I_{0.96}=\Bigl[9.36\,u-2.054\sqrt{\tfrac{1.99\,u^2}{64}},\ 9.36\,u+2.054\sqrt{\tfrac{1.99\,u^2}{64}}\Bigr]=[9.00\,u,\ 9.72\,u]$$

b) Method based on the confidence interval: To select the pivot, we take into account the new suppositions:
• The variable of interest follows a normal distribution
• The population mean is being studied
• The population variance is known
From a table of statistics (e.g. in [T]), the following pivot is selected (now the exact sampling distribution is known, instead of the asymptotic distribution)

$$T(C;\mu)=\frac{\bar C-\mu}{\sqrt{\sigma^2/n}}\sim N(0,1)$$

By doing calculations similar to those of the previous section or exercise, the interval is

$$I_{1-\alpha}=\Bigl[\bar C-r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}},\ \bar C+r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr]$$

from which the expression of the margin of error is obtained, namely $E=r_{\alpha/2}\sqrt{\sigma^2/n}$. Values can be substituted either before or after manipulating the inequality; this time let us use numbers from the beginning:

$$E_g=\frac{1}{4}u\ge E=2.054\sqrt{\frac{2\,u^2}{n}}\ \rightarrow\ \frac{1}{16}u^2\ge 2.054^2\,\frac{2\,u^2}{n}\ \rightarrow\ n\ge 16\cdot 2.054^2\cdot 2=135.01\ \rightarrow\ n\ge 136$$

(When inverting, the inequality must be changed.)

Method based on the Chebyshev's inequality: For unbiased estimators,

$$P(|\hat\theta-\theta|\ge E)=P(|\hat\theta-E(\hat\theta)|\ge E)\le\frac{Var(\hat\theta)}{E^2}\le\alpha$$

with $Var(\hat\theta)=Var(\bar C)=\frac{\sigma^2}{n}$:

$$\frac{\sigma^2}{nE_g^2}\le\alpha\ \rightarrow\ n\ge\frac{\sigma^2}{\alpha E_g^2}=\frac{2\,u^2}{0.04\cdot\bigl(\frac{1}{4}u\bigr)^2}=800$$

Conclusion: The unknown mean petrol consumption of the population of private cars belongs to the interval obtained with 96% confidence. For 64 data, the margin of error was $2.054\sqrt{1.99\,u^2/64}=0.36\,u$, while 136 data are needed for the margin to be 1/4 u = 0.25u. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

My notes:

Exercise 3ci

You have been hired by a consortium of dairy farmers to conduct a survey about the consumption of milk.

Based on results from a pilot study, assume that σ = 8.7oz. Suppose that the amount of milk is normally

distributed. If you want to estimate the mean amount of milk consumed daily by adults:

(a) How many adults must you survey if you want 95% confidence that your sample mean is in error by no

more than 0.5oz? Apply both the method based on the confidence interval and the method based on the

Chebyshev's inequality.

(b) Calculate the margin of error if the number of data in the sample were twice the minimum (rounded)

value that you obtained. Is now the margin of error half the value it was?

(Based on an exercise of: Elementary Statistics. Triola M.F. Pearson.)

A fluid ounce (abbreviated fl oz, fl. oz. or oz. fl., old forms ℥, fl ℥, f℥, ƒ ℥) is a unit of volume (also called capacity) typically used for

measuring liquids. It is equivalent to approximately 30 millilitres. Whilst various definitions have been used throughout history, two

remain in common use: the imperial and the United States customary fluid ounce. An imperial fluid ounce is 1⁄20 of a imperial pint, 1⁄160

of an imperial gallon or approximately 28.4 ml. A US fluid ounce is 1⁄16 of a US fluid pint, 1⁄128 of a US fluid gallon or approximately

29.6 ml. The fluid ounce is distinct from the ounce, a unit of mass; however, it is sometimes referred to simply as an "ounce" where

context makes the meaning clear.

Discussion: There is one normal population with known standard deviation. In both sections, the answer can

be found by using the expression of the margin of error.

X ≡ Amount of milk (consumed daily by an adult)  X ~ N(μ, σ² = 8.7²oz²)

Sample information:

Theoretical (simple random) sample: X1,...,Xn s.r.s. (the amount is measured for n adults)

We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot to take the expression from the formula of the interval:

$$I_{1-\alpha}=\Bigl[\bar X-r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}},\ \bar X+r_{\alpha/2}\sqrt{\tfrac{\sigma^2}{n}}\Bigr]$$

If we remembered the expression, we can directly use it. Either way, the margin of error (for one normal population with known variance) is:

$$E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}$$

(a) Sample size

Method based on the confidence interval: The equation involves four quantities, and we can calculate any of them once the others are known. Here:

$$E_g\ge E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\ \rightarrow\ E_g^2\ge r_{\alpha/2}^2\frac{\sigma^2}{n}\ \rightarrow\ n\ge r_{\alpha/2}^2\frac{\sigma^2}{E_g^2}=1.96^2\,\frac{8.7^2\,oz^2}{0.5^2\,oz^2}=1163.08\ \rightarrow\ n\ge 1164$$

since $r_{\alpha/2}=r_{0.05/2}=r_{0.025}=1.96$. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)

Method based on the Chebyshev's inequality: For unbiased estimators,

$$P(|\hat\theta-\theta|\ge E)=P(|\hat\theta-E(\hat\theta)|\ge E)\le\frac{Var(\hat\theta)}{E^2}\le\alpha$$

with $Var(\hat\theta)=Var(\bar X)=\frac{\sigma^2}{n}$:

$$\frac{\sigma^2}{nE_g^2}\le\alpha\ \rightarrow\ n\ge\frac{1}{\alpha}\frac{\sigma^2}{E_g^2}=\frac{1}{0.05}\cdot\frac{8.7^2\,oz^2}{0.5^2\,oz^2}=6055.2\ \rightarrow\ n\ge 6056$$

(b) Margin of error for the doubled sample size:

Way 1: Just by substituting,

$$E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}=1.96\sqrt{\frac{8.7^2\,oz^2}{2\cdot 1164}}=0.3534\,oz$$

When the sample size is doubled, the margin of error is not reduced by half but by less than this amount.

Way 2: By working with the general expression,

$$\tilde E=r_{\alpha/2}\sqrt{\frac{\sigma^2}{2n}}=\frac{1}{\sqrt 2}\,r_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}=\frac{E}{\sqrt 2}=\frac{0.5\,oz}{\sqrt 2}=0.3535\,oz$$

Now it is easy to see that if the sample size is multiplied by 2, the margin of error is divided by √2. Besides, more generally:

Proposition
For the confidence interval estimation of the mean of a normal population with known variance, based on the method of the pivot, when the sample size is multiplied by any scalar c the margin of error is divided by √c.

(Notice that 0.5 is slightly smaller than the real margin of error after rounding n upward; that is why there is a small difference between the results of the two ways.)

Conclusion: At least 1164 or 6056 data, depending on the method, are necessary to guarantee that the margin of error is at most 0.5oz (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error |θ−θ̂| will be smaller than E_g with a probability of 1−α = 0.95, but larger with a probability of α = 0.05). When the sample size is multiplied by c, the margin of error is divided by √c. Using more data would also guarantee the precision desired. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
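An R sketch for the sample sizes and for the effect of doubling n:

sigma = 8.7; Eg = 0.5; alpha = 0.05
r = qnorm(1-alpha/2)                # 1.96
n1 = ceiling((r*sigma/Eg)^2); n1    # margin-of-error method: 1164
ceiling(sigma^2/(alpha*Eg^2))       # Chebyshev method: 6056
r*sigma/sqrt(2*n1)                  # doubled sample: about 0.3534 oz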

My notes:

Exercise 4ci

A company makes two products, A and B, that can be considered independent and whose demands follow the distributions N(μA, σA² = 70²u²) and N(μB, σB² = 60²u²), respectively. After analysing 500 shops, the two simple random samples yield ā = 156 and b̄ = 128.

(a) Build 95 and 98 percent confidence intervals for the difference between the population means.

(b) What are the margins of error? If sales are measured in the unit u = number of boxes, what is the unit of measure of the margin of error?

(c) A margin of error equal to 10 is desired, how many shops are necessary? Apply both the method based

on the confidence interval and the method based on the Chebyshev's inequality.

(d) If only product A is considered, as if product B had not been analysed, how many shops are necessary

to guarantee a margin of error equal to 10? Again, apply the two methods.

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)

company. an organization that makes or sells goods or that sells services: 'My father works for an insurance company.' 'IBM is one of the

biggest companies in the electronics industry.'

factory. a place where goods such as furniture, carpets, curtains, clothes, plates, toys, bicycles, sports equipment, drinks and packaged

food are produced: 'The company's UK factory produces 500 golf trolleys a week.'

industry. (1) all the people, factories, companies etc involved in a major area of production: 'the steel industry', 'the clothing industry'

(2) all industries considered together as a single thing: 'Industry has developed rapidly over the years at the expense of agriculture.'

mill. (1) a place where a particular type of material is made: 'a cotton mill', 'a textile mill', 'a steel mill', 'a paper mill' (2) a place where

flour is made from grain: 'a flour mill'

plant. a factory or building where vehicles, engines, weapons, heavy machinery, drugs or industrial chemicals are produced, where chemical processes are carried out, or where power is generated: 'Vauxhall-Opel's UK car plants', 'Honda's new engine plant at Swindon', 'a sewage plant', 'a wood treatment plant', 'ICI's ₤100m plant', 'the Sellafield nuclear reprocessing plant in Cumbria'

works. an industrial building where materials such as cement, steel, and bricks are produced, or where industrial processes are carried

out: 'The drop in car and van sales has led to redundancies in the country's steel works.'

Discussion: The supposition that the normal distribution is appropriate to model both variables should be statistically proved. The independence of the two populations should be tested as well. The method of the pivot will be applied. After obtaining the theoretical expression of the interval, it is possible to argue about the relation confidence-length. Given the length of the interval, the expression allows us to calculate the minimum number of data necessary. The numbers of units demanded can be seen as dimensionless quantities. An approximation is implicitly being used in this exercise, since the number of units demanded is a discrete variable while the normal distribution is continuous.

The variables are
A ≡ Number of units of product A sold (in one shop)  A ~ N(μA, σA² = 70²u²)
B ≡ Number of units of product B sold (in one shop)  B ~ N(μB, σB² = 60²u²)

(a1) To select the pivot:
• There are two independent normal populations
• We are interested in μA − μB
• Variances are known

$$T(A,B;\mu_A,\mu_B)=\frac{(\bar A-\bar B)-(\mu_A-\mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A}+\dfrac{\sigma_B^2}{n_B}}}\sim N(0,1)$$

(a2) Event rewriting:

$$1-\alpha=P(l_{\alpha/2}\le T(A,B;\mu_A,\mu_B)\le r_{\alpha/2})=P\Bigl((\bar A-\bar B)+r_{\alpha/2}\sqrt{\tfrac{\sigma_A^2}{n_A}+\tfrac{\sigma_B^2}{n_B}}\ \ge\ \mu_A-\mu_B\ \ge\ (\bar A-\bar B)-r_{\alpha/2}\sqrt{\tfrac{\sigma_A^2}{n_A}+\tfrac{\sigma_B^2}{n_B}}\Bigr)$$

(a3) The interval:

$$I_{1-\alpha}=\Bigl[(\bar A-\bar B)-r_{\alpha/2}\sqrt{\tfrac{\sigma_A^2}{n_A}+\tfrac{\sigma_B^2}{n_B}},\ (\bar A-\bar B)+r_{\alpha/2}\sqrt{\tfrac{\sigma_A^2}{n_A}+\tfrac{\sigma_B^2}{n_B}}\Bigr]$$

• $\bar a=156\,u$ and $\bar b=128\,u$
• $\sigma_A^2=70^2\,u^2$ and $\sigma_B^2=60^2\,u^2$
• $n_A=500$ and $n_B=500$
• At 95%: 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → $r_{\alpha/2}=r_{0.025}=l_{0.975}=1.96$
• At 98%: 1−α = 0.98 → α = 0.02 → α/2 = 0.01 → $r_{\alpha/2}=r_{0.01}=l_{0.99}=2.326$

Thus, at 95%

$$I_{0.95}=\Bigl[(156-128)-1.96\sqrt{\tfrac{70^2}{500}+\tfrac{60^2}{500}},\ (156-128)+1.96\sqrt{\tfrac{70^2}{500}+\tfrac{60^2}{500}}\Bigr]=[19.92,\ 36.08]$$

and at 98%

$$I_{0.98}=\Bigl[(156-128)-2.326\sqrt{\tfrac{70^2}{500}+\tfrac{60^2}{500}},\ (156-128)+2.326\sqrt{\tfrac{70^2}{500}+\tfrac{60^2}{500}}\Bigr]=[18.41,\ 37.59]$$

(b) Margin of error: Regarding the units, they can be treated as any other algebraic letter representing a numerical quantity. The quantile and the sample sizes are dimensionless, while the variances are expressed in the unit u²—because of the square in the definition σ² = E([X−E(X)]²)—when data X are measured in the unit u. At 95%

$$E_{0.95}=r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}+\frac{\sigma_B^2}{n_B}}=1.96\sqrt{\frac{70^2+60^2}{500}}\sqrt{u^2}=8.08\,u$$

and at 98%

$$E_{0.98}=2.326\sqrt{\frac{70^2+60^2}{500}}\sqrt{u^2}=9.59\,u$$

(c) Method based on the confidence interval: Since here both sample sizes are equal to the number of shops n,

$$E_g\ge E=r_{\alpha/2}\sqrt{\frac{\sigma_A^2+\sigma_B^2}{n}}\ \rightarrow\ E_g^2\ge r_{\alpha/2}^2\,\frac{\sigma_A^2+\sigma_B^2}{n}\ \rightarrow\ n\ge r_{\alpha/2}^2\,\frac{\sigma_A^2+\sigma_B^2}{E_g^2}$$

$$n\ge 1.96^2\,\frac{70^2u^2+60^2u^2}{10^2u^2}=326.54\ \rightarrow\ n\ge 327\qquad\text{and}\qquad n\ge 2.326^2\,\frac{70^2u^2+60^2u^2}{10^2u^2}=459.87\ \rightarrow\ n\ge 460$$

Method based on the Chebyshev's inequality: For unbiased estimators:

$$P(|\hat\theta-\theta|\ge E)=P(|\hat\theta-E(\hat\theta)|\ge E)\le\frac{Var(\hat\theta)}{E^2}\le\alpha$$

With $Var(\hat\theta)=Var(\bar A)+Var(\bar B)=\frac{\sigma_A^2}{n}+\frac{\sigma_B^2}{n}$:

$$\frac{\sigma_A^2+\sigma_B^2}{nE_g^2}\le\alpha\ \rightarrow\ n\ge\frac{\sigma_A^2+\sigma_B^2}{\alpha E_g^2}$$

so

$$n\ge\frac{70^2u^2+60^2u^2}{0.05\cdot 10^2u^2}=1700\qquad\text{and}\qquad n\ge\frac{70^2u^2+60^2u^2}{0.02\cdot 10^2u^2}=4250$$

(d) Method based on the confidence interval: In this case, when the method of the pivotal quantity is applied (we do not repeat the calculations here), the interval and the margin of error are, respectively,

$$I_{1-\alpha}=\Bigl[\bar A-r_{\alpha/2}\sqrt{\tfrac{\sigma_A^2}{n_A}},\ \bar A+r_{\alpha/2}\sqrt{\tfrac{\sigma_A^2}{n_A}}\Bigr]\qquad\text{and}\qquad E=r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}}$$

(Note that this case can be thought of as a particular case where the second population has values B = 0, μB = 0 and σB² = 0.) Then,

$$E_g\ge E=r_{\alpha/2}\sqrt{\frac{\sigma_A^2}{n_A}}\ \rightarrow\ E_g^2\ge r_{\alpha/2}^2\,\frac{\sigma_A^2}{n_A}\ \rightarrow\ n_A\ge r_{\alpha/2}^2\,\frac{\sigma_A^2}{E_g^2}$$

and hence at 95% and 98%, respectively,

$$n_A\ge 1.96^2\,\frac{70^2u^2}{10^2u^2}=188.24\ \rightarrow\ n_A\ge 189\qquad\text{and}\qquad n_A\ge 2.326^2\,\frac{70^2u^2}{10^2u^2}=265.10\ \rightarrow\ n_A\ge 266$$

Method based on the Chebyshev's inequality:

$$P(|\hat\theta-\theta|\ge E)\le\frac{Var(\hat\theta)}{E^2}\le\alpha\qquad\text{with}\qquad Var(\hat\theta)=Var(\bar A)=\frac{\sigma_A^2}{n_A}\ \rightarrow\ n_A\ge\frac{\sigma_A^2}{\alpha E_g^2}$$

so

$$n_A\ge\frac{70^2u^2}{0.05\cdot 10^2u^2}=980\qquad\text{and}\qquad n_A\ge\frac{70^2u^2}{0.02\cdot 10^2u^2}=2450$$

Conclusion: As expected, when the probability of the tails α decreases, the margin of error—and hence the length—increases. For either one or two products and a given margin of error, the more confidence (less significance) we want, the more data we need. Since 500 shops were actually considered to attain this margin of error, there has been a waste of time and money—fewer shops would have sufficed for the desired accuracy (95% or 98%). When two independent quantities are added or subtracted, the error or uncertainty of the result can be as large as the total of the two individual errors or uncertainties; this also holds for random quantities (if they are dependent, a correction term—covariance—appears). For this reason, to guarantee the same margin of error, more data are necessary in each of the two samples—notice that for two populations the minimum value is larger than or equal to the sum of the minimum values that would be necessary for each population individually (for the same precision and confidence). The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
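The intervals, margins of error and minimum sample sizes can be reproduced with this R sketch (only the 95% sample sizes are printed; replacing the quantile gives the 98% ones):

nA = 500; nB = 500; d = 156 - 128
vA = 70^2; vB = 60^2
for (r in c(qnorm(0.975), qnorm(0.99))) {   # 1.96 and 2.326
  E = r*sqrt(vA/nA + vB/nB)                 # margin of error
  print(c(d - E, d + E, E))                 # interval endpoints and margin
}
Eg = 10; r = qnorm(0.975)
ceiling(r^2*(vA + vB)/Eg^2)                 # two products, 95%: 327
ceiling((vA + vB)/(0.05*Eg^2))              # Chebyshev, two products: 1700
ceiling(r^2*vA/Eg^2)                        # product A alone, 95%: 189
ceiling(vA/(0.05*Eg^2))                     # Chebyshev, product A alone: 980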

My notes:

Hypothesis Tests

Remark 1ht: Like confidence, the concept of significance can be interpreted as a probability (so they are, although we sometimes

use a 0-to-100 scale). See remark 1pt, in the appendix of Probability Theory, on the interpretation of the concept of probability.

Remark 2ht: The quantities α, p-value, β, 1–β and φ are probabilities, so their values must be between 0 and 1.

Remark 3ht: For two-tailed tests, since there is an infinite number of pairs of quantiles such that P (a 1≤T 0≤a2 )=1−α ,

those that determine tails of probability α/2 are considered by convention. This criterion is also applied for confidence intervals.

Remark 4ht: To apply the second methodology, binding the p-value is sometimes enough to compare it with α. To do that, the

proper closest value included in the table is used.

Remark 5ht: In calculating the p-value for two-tailed tests, by convention the probability of the tail determined by T0(x,y) is

doubled. When T0(X,Y) follows an asymmetric distribution, it is difficult to identify the tail if the value of T0(x,y) is close to the

median. In fact, knowing the median is not necessary, since if we select the wrong tail, twice its probability will be greater than 1

and we will realize that the other tail must have been considered. Alternatively, it is always possible to calculate the two

probabilities (on the left and on the right) and double the minimum of them (this is useful in writing code for software programs).

Remark 6ht: When more than one test can be applied to make a decision about the same hypotheses, the most powerful should be

considered (if it exists).

Remark 7ht: After making a decision, it is possible to evaluate the strength with which it was made: for the first methodology, by comparing the distance from the statistic to the critical values—or, better, the area between this set of values and the density function of T0—and, for the second methodology, by looking at the magnitude of the p-value.

Remark 8ht: For small sample sizes, n=2 or n=3, the critical region—obtained by applying any methodology—can be plotted in the two- or three-dimensional space.

[HT] Parametric

Remark 9ht: There are four types of pair of hypotheses:

(1) simple versus simple

(2) simple versus one-sided composite

(3) one-sided composite versus one-sided composite

(4) simple versus two-sided composite

We will directly apply Neyman-Pearson's lemma for the first case. When the solution of the first case does not depend upon any

particular value of the parameter θ1 under H1, the same test will be uniformly most powerful for the second case. In addition, when

there is a uniformly most powerful test for the second case, it will also be uniformly most powerful for the third case.

Remark 10ht: Given H0 and α, different decisions can be made for one- and two-tailed tests. That is why: (i) describing the details of the framework is of great importance in Statistics; and (ii) as a general rule, all trustworthy information must be used, which implies that a one-sided test should be used when there is information that strongly suggests so—compare the estimate calculated from the sample with the hypothesized values.

Remark 11ht: For parametric tests, α(θ) = P(Reject H0 | θ∈Θ0) and 1−β(θ) = P(Reject H0 | θ∈Θ1), so to plot the power function φ(θ) = P(Reject H0 | θ∈Θ0∪Θ1) it is usually enough to enter θ∈Θ0 in the analytical expression of 1−β(θ). This is the method that we have used in some exercises where the computer has been used.

Remark 12ht: A reasonable testing process should verify that

1−β(θ1 )=P (T 0 ∈Rc ∣ θ∈Θ1) > P (T 0 ∈ Rc ∣ θ∈Θ0 ) = α(θ 0 )

with 1–β(θ1) ≈ α(θ0) when θ1 ≈ θ0. This can be noticed in the power functions plotted in some exercises, where there is a local

minimum at θ0.

Remark 13ht: Since one-sided tests are, in their range of parameter values, more powerful than the corresponding two-sided test, the best way of testing an equality consists in accepting it when it is compared with the two types of inequality. Similarly, the best way to test an inequality consists in accepting it when it is allocated either in the null hypothesis or in the alternative hypothesis. (These ideas, among others, are rigorously explained in the materials of professor Alfonso Novales Cinca.)

[HT-p] Based on T

Exercise 1ht-T

The lifetime of a machine (measured in years, y) follows a normal distribution with variance equal to 4y². A

simple random sample of size 100 yields a sample mean equal to 1.3y. Test the null hypothesis that the

population mean is equal to 1.5y, by applying a two-tailed test with 5 percent significance level. What is the

type I error? Calculate the type II error when the population mean is 2y. Find the general expression of the

type II error and then use a computer to plot the power function.

Discussion: First of all, the supposition that the normal distribution reasonably explains the lifetime of the

machine should be evaluated by using proper statistical techniques. Nevertheless, the purpose of this exercise

is basically to apply the decision-making methodologies.

Statistic: Since
• There is one normal population
• The population variance is known
the statistic

$$T(X;\mu)=\frac{\bar X-\mu}{\sqrt{\sigma^2/n}}\sim N(0,1)$$

is selected from a table of statistics (e.g. in [T]). Two particular cases of T will be used:

$$T_0(X)=\frac{\bar X-\mu_0}{\sqrt{\sigma^2/n}}\sim N(0,1)\qquad\text{and}\qquad T_1(X)=\frac{\bar X-\mu_1}{\sqrt{\sigma^2/n}}\sim N(0,1)$$

To apply any of the two methodologies, the value of T0 at the specific sample x = (x1,...,x100) is necessary:

$$T_0(x)=\frac{\bar x-\mu_0}{\sqrt{\sigma^2/n}}=\frac{1.3-1.5}{\sqrt{4/100}}=\frac{-0.2\cdot 10}{2}=-1$$

Hypotheses:

$$H_0:\ \mu=\mu_0=1.5\qquad\text{and}\qquad H_1:\ \mu=\mu_1\neq 1.5$$

Decision: To make the final decision about the hypotheses, two main methodologies are available. To apply

the first one, the critical values a1 and a2 that determine the rejection region are found by applying the

definition of type I error, with α = 0.05 at μ0 = 1.5, and the criterion of leaving half the probability in each tail:

$$\alpha(1.5)=P(\text{Type I error})=P(\text{Reject }H_0\mid H_0\text{ true})=P(T(X;\mu)\in R_c\mid H_0)=P(\{T_0(X)<a_1\}\cup\{T_0(X)>a_2\})$$

$$\begin{cases}\dfrac{\alpha(1.5)}{2}=P(T_0(X)<a_1)\ \rightarrow\ a_1=l_{\alpha/2}=-1.96\\[4pt]\dfrac{\alpha(1.5)}{2}=P(T_0(X)>a_2)\ \rightarrow\ a_2=r_{\alpha/2}=+1.96\end{cases}\ \rightarrow\ R_c=\{T_0(X)<-1.96\}\cup\{T_0(X)>+1.96\}=\{|T_0(X)|>+1.96\}$$

The decision is: $T_0(x)=-1\ \rightarrow\ T_0(x)\notin R_c$ → H0 is not rejected.

To apply the second methodology, the p-value is calculated:

$$pV=P(|T_0(X)|>|-1|)=2\cdot P(T_0(X)<-1)=2\cdot 0.1587=0.32\ \rightarrow\ pV=0.32>0.05=\alpha\ \rightarrow\ H_0\text{ is not rejected.}$$

Type II error: To calculate β, we have to work under H1, that is, with T1. Nonetheless, the critical region is expressed in terms of T0. Thus, the mathematical trick of adding and subtracting the same quantity is applied:

$$\beta(\mu_1)=P(\text{Type II error})=P(\text{Accept }H_0\mid H_1\text{ true})=P(T_0(X)\notin R_c\mid H_1)=P(-1.96\le T_0(X)\le +1.96\mid H_1)$$

$$=P\Bigl(-1.96\le\frac{\bar X-\mu_1+\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\le +1.96\ \Big|\ H_1\Bigr)=P\Bigl(-1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\le T_1(X)\le +1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\Bigr)$$

$$=P\Bigl(T_1(X)\le +1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\Bigr)-P\Bigl(T_1(X)<-1.96-\frac{\mu_1-\mu_0}{\sqrt{\sigma^2/n}}\Bigr)$$

For the particular value μ1 = 2,

> pnorm(-0.54,0,1)-pnorm(-4.46,0,1)

β(2) = P ( T 1 ( X )≤−0.54 )− P ( T 1 ( X )<−4.46 ) =0.29 [1] 0.2945944

By using a computer, many more values μ1 ≠ 2 can be considered so as to numerically determine the power

curve 1–β(μ1) of the test and to plot the power function.

ϕ(μ) = P(Reject H₀) = { α(μ) if μ ∈ Θ₀ ; 1−β(μ) if μ ∈ Θ₁ }

# Population

variance = 4

# Sample and inference

n = 100

alpha = 0.05

theta0 = 1.5 # Value under the null hypothesis H0

q = qnorm(1-alpha/2,0,1)

theta1 = seq(from=0,to=+3,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pnorm(+q-(paramSpace-theta0)/sqrt(variance/n),0,1) + pnorm(-q-(paramSpace-theta0)/sqrt(variance/n),0,1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Conclusion: The hypothesis that 1.5y is the mean of the distribution of the lifetime is not rejected. As

expected, when the true value is supposed to be 2, far from 1.5, the probability of rejecting 1.5 is 1–β(2) =

0.71, that is, high. This value has been calculated by hand; additionally, after finding the analytical expression

of the curve 1–β, also by hand, the computer allows the power function to be plotted. This theoretical curve,

not depending on the sample information, is symmetric with respect to μ0 = 1.5. (Remember: statistical results

depend on: the assumptions, the methods, the certainty and the data.)

My notes:

Exercise 2ht-T

A company produces electric devices operated by a thermostatic control. The standard deviation of

the temperature at which these controls actually operate should not exceed 2.0ºF. For a simple

random sample of 20 of these controls, the sample quasi-standard deviation of operating

temperatures was 2.39ºF. Stating any assumptions you need (write them), test at the 5% level the null

hypothesis that the population standard deviation is not larger than 2.0ºF against the alternative that

it is. Apply the two methodologies and calculate the type II error at σ² = 4.5 ºF². Use a computer to plot the power function. On the other hand, between the two alternative hypotheses H₁: σ = σ₁ > 2 or H₁: σ = σ₁ ≠ 2, which one would you have selected? Why?

Hint: Be careful to use S2 and σ2 wherever you work with a variance instead of a standard deviation.

(Based on an exercise of Statistics for Business and Economics. Newbold, P., W.L. Carlson and B.M. Thorne. Pearson.)

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)

actual = real (as opposed what is believed, planned or expected): 'People think he is over fifty but his actual age is forty-eight.' 'Although

buses are supposed to run every fifteen minutes, the actual waiting time can be up to an hour.'

present/current = happening or existing now: 'No one can drive that car in its present condition.' 'Her current boyfriend works for Shell.'

LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)

“Device” is a noun. A can-opener is a device. “Devise” is a verb. You can devise a plan for opening a can with a sharp rock instead. Only

in law is “devise” properly used as a noun, meaning something deeded in a will.

Discussion: Because of the mathematical theorems available, we are able to study the variance only for

normally distributed random variables. Thus, we need the supposition that the temperature follows a normal

distribution. In practice, this normality should be evaluated.

Statistic: Since

• There is one normal population

• The population mean is unknown

the following (dimensionless) statistic, involving the sample quasivariance, is chosen:

T(X; σ) = (n−1)S²/σ² ∼ χ²_{n−1}

We will work with the two following particular cases:

T₀(X) = (n−1)S²/σ₀² ∼ χ²_{n−1}   and   T₁(X) = (n−1)S²/σ₁² ∼ χ²_{n−1}

To make the decision, we need to evaluate the statistic T0 at the specific data available x:

T₀(x) = (20−1)·2.39² ºF² / (2² ºF²) = 27.13

Hypotheses: H₀: σ = σ₀ ≤ 2 and H₁: σ = σ₁ > 2 (equivalently, H₀: σ² ≤ 4 ºF² against H₁: σ² > 4 ºF²). Then,

Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.05 at σ₀² = 4 ºF²:

α(4) = P(Type I error) = P(Reject H₀ | H₀ true) = P(T(X; θ) ∈ Rc | H₀) = P(T₀(X) > a)

→ a = r_α = r₀.₀₅ = 30.14 → Rc = {T₀(X) > 30.14}

To make the final decision: T 0 ( x)=27.13 < 30.14 → T 0 ( x)∉ Rc → H0 is not rejected.

The second methodology requires the calculation of the p-value:

> 1 - pchisq(27.13, 20-1)
[1] 0.1016613

→ pV = 0.102 > 0.05 = α → H₀ is not rejected.
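As a complement (not in the original solution), a minimal R sketch of the same computations:

# Minimal sketch of the chi-square test for the standard deviation
n = 20; S2 = 2.39^2        # sample quasivariance (in square ºF)
sigma0_2 = 2^2             # variance under H0
T0 = (n - 1) * S2 / sigma0_2      # 27.13
qchisq(1 - 0.05, n - 1)           # critical value, 30.14
1 - pchisq(T0, n - 1)             # p-value, 0.102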

Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

β(σ₁²) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(X) ∉ Rc | H₁) = P(T₀(X) ≤ 30.14 | H₁)

= P((n−1)S²/σ₀² ≤ 30.14 | H₁) = P([(n−1)S²/σ₁²]·[σ₁²/σ₀²] ≤ 30.14 | H₁) = P(T₁(X) ≤ 30.14·σ₀²/σ₁²)

For the particular value σ₁² = 4.5 ºF²,

β(4.5) = P(T₁(X) ≤ 30.14·4/4.5) = P(T₁(X) ≤ 26.79) = 0.89

> pchisq(26.79, 20-1)
[1] 0.8903596

By using a computer, many other values σ₁² ≠ 4.5 ºF² can be considered so as to numerically determine the power curve 1−β(σ₁²) of the test and to plot the power function.

ϕ(σ²) = P(Reject H₀) = { α(σ²) if σ² ∈ Θ₀ ; 1−β(σ²) if σ² ∈ Θ₁ }

# Sample and inference

n = 20

alpha = 0.05

theta0 = 4 # Value under the null hypothesis H0

q = qchisq(1-alpha,n-1)

theta1 = seq(from=4,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Conclusion: The null hypothesis H₀: σ = σ₀ ≤ 2 is not rejected. When any of the factors involved (assumptions, methods, certainty, data) is different, the decision might be the opposite. As regards the most appropriate alternative hypothesis, the value of S

suggests that the test with σ1 > 2 is more powerful than the test with σ1 ≠ 2 (the test with σ1 < 2 against

the equality would be the least powerful as both the methodologies—H0 is the default hypothesis—and the

data “tend to help H0”). (Remember: statistical results depend on: the assumptions, the methods, the certainty

and the data.)

My notes:

Exercise 3ht-T

Let X = (X1,...,Xn) be a simple random sample with 25 data taken from a normal population variable X. The

sample information is summarized in

Σ_{j=1}^{25} xⱼ = 105   and   Σ_{j=1}^{25} xⱼ² = 579.24

(a) Should the hypothesis H0: σ2 = 4 be rejected when H1: σ2 > 4 and α = 0.05? Calculate β(5).

(b) And when H1: σ2 ≠ 4 and α = 0.05? Calculate β(5).

Use a computer to plot the power function.

Discussion: The supposition that the normal distribution is appropriate to model X should be statistically proved; nevertheless, this exercise is purely theoretical, so we take the supposition for granted.

Statistic: Since

• There is one normal population

• The population mean is unknown

the statistic

T(X; σ) = n s²/σ² ∼ χ²_{n−1}

is selected from a table of statistics (e.g. in [T]). We will work with the two following particular cases:

T₀(X) = n s²/σ₀² ∼ χ²_{n−1}   and   T₁(X) = n s²/σ₁² ∼ χ²_{n−1}

To make the decision, we need to evaluate the statistic at the specific data available x:

T₀(x) = 25·[(1/25)Σ xⱼ² − ((1/25)Σ xⱼ)²] / 4 = 25·5.53/4 = 34.56

where, to calculate the sample variance, the general property s² = (1/n)Σ_{j=1}^{n} Xⱼ² − ((1/n)Σ_{j=1}^{n} Xⱼ)² has been used.

Hypotheses: H₀: σ² = σ₀² = 4 and H₁: σ² = σ₁² > 4

Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.05 at σ₀² = 4:

α(4) = P(Type I error) = P(Reject H₀ | H₀ true) = P(T(X; θ) ∈ Rc | H₀) = P(T₀(X) > a)

→ a = r_α = r₀.₀₅ = 36.4 → Rc = {T₀(X) > 36.4}

To make the final decision: T 0 ( x)=34.56 < 36.4 → T 0 ( x)∉ Rc → H0 is not rejected.

The second methodology requires the calculation of the p-value:

> 1 - pchisq(34.56, 25-1)
[1] 0.07519706

→ pV = 0.075 > 0.05 = α → H₀ is not rejected.
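As a complement (not in the original solution), a minimal R sketch of section (a) from the summary statistics given in the statement:

# Minimal sketch of section (a) from the sums of the data
n = 25; sum_x = 105; sum_x2 = 579.24
s2 = sum_x2/n - (sum_x/n)^2      # sample variance, 5.53
T0 = n * s2 / 4                  # 34.56
qchisq(1 - 0.05, n - 1)          # critical value, 36.42
1 - pchisq(T0, n - 1)            # p-value, 0.075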

Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

β(σ₁²) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(X) ∉ Rc | H₁) = P(T₀(X) ≤ 36.4 | H₁)

= P(n s²/σ₀² ≤ 36.4 | H₁) = P([n s²/σ₁²]·[σ₁²/σ₀²] ≤ 36.4 | H₁) = P(T₁(X) ≤ 36.4·σ₀²/σ₁²)

For the particular value σ₁² = 5,

β(5) = P(T₁(X) ≤ 36.4·4/5) = P(T₁(X) ≤ 29.12) = 0.78

> pchisq(29.12, 25-1)
[1] 0.7843527

By using a computer, many other values σ₁² ≠ 5 can be considered so as to numerically determine the power curve 1−β(σ₁²) of the test and to plot the power function.

ϕ(σ²) = P(Reject H₀) = { α(σ²) if σ² ∈ Θ₀ ; 1−β(σ²) if σ² ∈ Θ₁ }

# Sample and inference

n = 25

alpha = 0.05

theta0 = 4 # Value under the null hypothesis H0

q = qchisq(1-alpha,n-1)

theta1 = seq(from=4,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Hypotheses: H₀: σ² = σ₀² = 4 and H₁: σ² = σ₁² ≠ 4

For these hypotheses,

Decision: Now there are two tails, determined by two critical values a₁ and a₂ that are found by applying the definition of type I error, with α = 0.05 at σ₀² = 4, and the criterion of leaving half the probability in each tail:

α(4) = P(Type I error) = P(Reject H₀ | H₀ true) = P(T(X; θ) ∈ Rc | H₀) = P(T₀(X) < a₁) + P(T₀(X) > a₂)

We always consider two tails with the same probability,

α(4)/2 = P(T₀(X) < a₁) → a₁ = r_{1−α/2} = 12.4
α(4)/2 = P(T₀(X) > a₂) → a₂ = r_{α/2} = 39.4

→ Rc = {T₀(X) < 12.4} ∪ {T₀(X) > 39.4}

To make the final decision: T 0 ( x)=34.56 → T 0 ( x)∉Rc → H0 is not rejected

To base the decision on the p-value, we calculate twice the probability of the tail:

pV = 2·P(T₀(X) > 34.56) = 2·0.075 = 0.15

> 1 - pchisq(34.56, 25-1)
[1] 0.07519706

→ pV = 0.15 > 0.05 = α → H₀ is not rejected

Note: The wrong tail would have been selected if we had obtained a p-value bigger than 1.

β(σ₁²) = P(Type II error) = P(Accept H₀ | H₁ true) = 1 − P(T(X; θ) ∈ Rc | H₁)

= 1 − [P(n s²/σ₀² < 12.4 | H₁) + P(n s²/σ₀² > 39.4 | H₁)]

= 1 − P([n s²/σ₁²]·[σ₁²/σ₀²] < 12.4 | H₁) − 1 + P([n s²/σ₁²]·[σ₁²/σ₀²] ≤ 39.4 | H₁)

= P(T₁(X) ≤ 39.4·σ₀²/σ₁²) − P(T₁(X) < 12.4·σ₀²/σ₁²)

For the particular value σ₁² = 5,

β(5) = P(T₁(X) ≤ 31.52) − P(T₁(X) < 9.92) = 0.86 − 0.0051 = 0.85

> pchisq(c(9.92, 31.52), 25-1)
[1] 0.00513123 0.86065162

# Sample and inference

n = 25

alpha = 0.05

theta0 = 4 # Value under the null hypothesis H0

q = qchisq(c(alpha/2,1-alpha/2),25-1)

theta1 = seq(from=0,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pchisq(q[2]*theta0/paramSpace, n-1) + pchisq(q[1]*theta0/paramSpace, n-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Comparison of the power functions: For the one-tailed test, the power of the test at σ12 = 5 is 1–β(5) =

1–0.78 = 0.22, while for the two-tailed test it is 1–β(5) = 1–0.85 = 0.15. As expected, this latter test has

smaller power (higher type II error), since in the former test additional information is being used when one tail

is previously discarded. Now we compare the power functions of the two tests graphically, for the common

values (> 4), by using the code

# Sample and inference

n = 25

alpha = 0.05

theta0 = 4 # Value under the null hypothesis H0

q = qchisq(c(alpha/2,1-alpha/2),25-1)

theta1 = seq(from=0,to=15,0.01)

paramSpace1 = sort(unique(c(theta1,theta0)))

PowerFunction1 = 1 - pchisq(q[2]*theta0/paramSpace1, n-1) +

pchisq(q[1]*theta0/paramSpace1, n-1)

q = qchisq(1-alpha,n-1)

theta1 = seq(from=4,to=15,0.01)

paramSpace2 = sort(unique(c(theta1,theta0)))

PowerFunction2 = 1 - pchisq(q*theta0/paramSpace2, n-1)

plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta',

ylab='Probability of rejecting theta0', main='Power Function', type='l')

lines(paramSpace2, PowerFunction2, lty=2)

It can be noticed that the curve of the one-sided test lies above the curve of the two-sided test for any σ² > 4, which makes it uniformly more powerful. In this exercise, from the sample information we could have calculated the estimator S² of σ² so as to see whether its value is far from 4 and, therefore, whether one of the two one-sided tests should be considered better.

Conclusion: The hypothesis that the population variance is equal to 4 is not rejected in either of the two sections. Although it has not happened in this case, different decisions may be made for the one- and two-tailed cases. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)

My notes:

Exercise 4ht-T

Imagine that you are hired as a cook. Not an ordinary one but a “statistical cook.” For a normal population,

in testing the two hypotheses

H₀: σ² = σ₀² = 4
H₁: σ² = σ₁² > 4

the data (a sample x of size n = 11 such that S² = 7.6 u²) and the significance (α = 0.05) have led to rejecting the null hypothesis because

r₀.₀₅ = 18.31 < 19 = T₀(x)

The factors involved in the decision are:

• Methodology

• Statistic T₀

• Form of the alternative hypothesis H₁

• Significance α

• Data x

(Figure: edu.glogster.com)

Since the chef—your boss—wants the null hypothesis H0 not to be rejected, find three different ways to

scientifically make the opposite decision by changing any of the previous factors. Give qualitative

explanations and, if possible, quantitative ones.

Discussion: Metaphorically, Statistics can be thought of as the kitchen with its utensils and appliances, the

first two factors as the recipe, and the next three items as the ingredients—if H1, α or x are inappropriate, there

is little to do and it does not matter how good the kitchen, the recipe and you are. Our statistical knowledge

allows us to change only the last three elements. The statistic to study the variance of a normal population is

T(X) = (n−1)S²/σ² ∼ χ²_{n−1}   so, under H₀,   T₀(x) = (n−1)S²/σ₀² = (11−1)·7.6 u² / (4 u²) = 76/4 = 19.

Qualitative reasoning: By looking at the figure above, we consider that:

A) If a two-tailed test is considered (H₁: σ² = σ₁² ≠ 4), the critical value would be r_{α/2} instead of r_α and, then, the evaluation T₀(x) may not lie in the rejection region (tails).

B) Equivalently, for the original one-tailed test, the critical value r_α increases when the significance α decreases, perhaps with the same implication as in the previous item.

C) Finally, for the same one-sided alternative hypothesis and significance, that is, for the same critical value r_α, the evaluation T₀(x) would lie out of the critical region (tail) if the data x—the values themselves or only the sample size—are such that T₀(x) < r_α = 18.31.

D) Additionally, a fourth way could consist of some combinations of the previous ways.

Quantitative reasoning: The previous qualitative explanations can be supported with calculations.

A) For the two-tailed test, now the critical value would be r₀.₀₅/₂ = r₀.₀₂₅ = 20.48. Then

T₀(x) = 19 < 20.48 = r₀.₀₂₅ → T₀(x) ∉ Rc → H₀ is not rejected.

B) The same effect is obtained if, for the original one-tailed H₁, the significance is taken to be 0.025 instead of 0.05. Any other value smaller than 0.025 would lead to the same result. Is 0.025—suggested by the previous item—the smallest possible value? The answer is found by using the p-value, since it is sometimes defined as the smallest significance level at which the null hypothesis is rejected. Then, since

pV = P(X more rejecting than x | H₀ true) = P(T₀(X) > 19) = 0.0403

> 1 - pchisq(19, 11-1)
[1] 0.04026268

any significance α < 0.0403 = pV satisfies pV > α → H₀ is not rejected.

C) Finally, for the original test and the same value for n, since

T̃₀(x) = (n−1)S̃²/σ₀² = [S̃²/S²]·[(n−1)S²/σ₀²] = [S̃²/(7.6 u²)]·19

the opposite decision would be made for any sample quasivariance such that

S̃² < 18.31·7.6 u²/19 = 7.324 u² → T̃₀(x) ∉ Rc → H₀ is not rejected

On the other hand, for the original test and the same value for S, since

T̃₀(x) = (ñ−1)S²/σ₀² = [(ñ−1)/(n−1)]·[(n−1)S²/σ₀²] = [(ñ−1)/(11−1)]·19

the opposite decision would be made for any sample size such that

ñ < 18.31·(11−1)/19 + 1 = 10.63684 ↔ ñ ≤ 10 → T̃₀(x) ∉ Rc → H₀ is not rejected

D) Some combinations of the previous changes can easily be proved to lead to not rejecting H₀.
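As a complement (not in the original solution), a minimal R sketch of the thresholds found in item C):

# Minimal sketch of the bounds in item C)
r = qchisq(1 - 0.05, 11 - 1)    # critical value, 18.31
r * 7.6 / 19                    # bound for the quasivariance, about 7.32
r * (11 - 1) / 19 + 1           # bound for the sample size, about 10.64 (so n <= 10)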

Conclusion: This exercise highlights how careful one must be in either writing or reading statistical works.

My notes:

Exercise 5ht-T

The distribution of a variable is supposed to be normally distributed in two independent biological

populations. The two population variances must be compared. After gathering information through simple

random samples of sizes nX = 11, nY = 10, respectively, we are given the value of the estimators

S_X² = [1/(n_X−1)] Σ_{j=1}^{n_X} (xⱼ − x̄)² = 6.8   and   s_Y² = (1/n_Y) Σ_{j=1}^{n_Y} (yⱼ − ȳ)² = 7.1

Test, at the significance level α = 0.1 (the value used below), the hypotheses:

(a) H₀: σ_X = σ_Y against H₁: σ_X < σ_Y

(b) H₀: σ_X = σ_Y against H₁: σ_X > σ_Y

(c) H₀: σ_X = σ_Y against H₁: σ_X ≠ σ_Y

In each section, calculate the analytical expression of the type II error and plot the power function by using a

computer.

Discussion: In a real-world situation, suppositions should be proved. We must pay careful attention to the

details: the sample quasivariance is provided for one group, while the sample variance is given for the other.

Statistic: Since

• There are two independent normal populations

• The population means are unknown

the statistic

T(X, Y; σ_X, σ_Y) = [S_X²/σ_X²] / [S_Y²/σ_Y²] = (S_X²/S_Y²)·(σ_Y²/σ_X²) ∼ F_{n_X−1, n_Y−1}

is selected from a table of statistics (e.g. in [T]). It will be used in two forms (we can write σ_X²/σ_Y² = θ₁):

T₀(X, Y) = S_X²/S_Y² ∼ F_{n_X−1, n_Y−1}   and   T₁(X, Y) = (1/θ₁)·(S_X²/S_Y²) ∼ F_{n_X−1, n_Y−1}

(On the other hand, the pooled sample variance S_p² should not be considered even under H₀: σ_X = σ = σ_Y, as T₀ = S_p²/S_p² = 1 whatever the data are.) To apply any of the two methodologies we need to evaluate T₀ at the samples x and y:

T₀(x, y) = S_X²/S_Y² = 6.8 / [(10/(10−1))·7.1] = 0.86

Since we were given the sample quasivariance of population X, but the sample variance of population Y, the general property n s² = (n−1)S² has been used to calculate S_Y².
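As a complement (not in the original solution), a minimal R sketch of this evaluation:

# Minimal sketch of the evaluation of T0
nx = 11; ny = 10
SX2 = 6.8                     # quasivariance of X (given)
SY2 = ny * 7.1 / (ny - 1)     # quasivariance of Y from its sample variance, 7.89
T0 = SX2 / SY2                # 0.86
pf(T0, nx - 1, ny - 1)        # left-tail probability, 0.41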

(a) One-tailed alternative hypothesis σX < σY

Or, equivalently, H₀: σ_X²/σ_Y² = θ₀ = 1 and H₁: σ_X²/σ_Y² = θ₁ < 1

For these hypotheses,

Decision: To determine the critical region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.1 at θ₀ = 1:

α(1) = P(Type I error) = P(Reject H₀ | H₀ true) = P(T(X, Y) < a | H₀) = P(T₀(X, Y) < a)

→ 0.1 = P(T₀(X, Y) < a) = P(1/T₀(X, Y) > 1/a) → 1/a = 2.35

(From the definition of the F distribution, it is easy to see that if X follows a F_{k1,k2} then 1/X follows a F_{k2,k1}. We use this property to consult our table.)

→ a = r_{1−α} = 1/2.35 = 0.43 → Rc = {T₀(X, Y) < 0.43}

To make the final decision about the hypotheses:

T₀(x, y) = 0.86 → T₀(x, y) ∉ Rc → H₀ is not rejected.

The second methodology requires the calculation of the p-value:

pV = P(X, Y more rejecting than x, y | H₀ true) = P(T₀(X, Y) < T₀(x, y)) = P(T₀(X, Y) < 0.86) = 0.41

> pf(0.86, 11-1, 10-1)
[1] 0.406005

→ pV = 0.41 > 0.1 = α → H₀ is not rejected.

Power function: To calculate β, we have to work under H1, that is, with T1. Since in this case the critical region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

β(θ₁) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(X, Y) ∉ Rc | H₁) = P(T₀(X, Y) ≥ 0.43 | H₁)

= P(S_X²/S_Y² ≥ 0.43 | H₁) = P((1/θ₁)·(S_X²/S_Y²) ≥ 0.43/θ₁ | H₁) = P(T₁(X, Y) ≥ 0.43/θ₁) = 1 − P(T₁(X, Y) < 0.43/θ₁)

By using a computer, many values θ₁ can be considered so as to determine the power curve 1−β(θ₁) of the test and to plot the power function.

ϕ(θ) = P(Reject H₀) = { α(θ) if θ ∈ Θ₀ ; 1−β(θ) if θ ∈ Θ₁ }

# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(alpha,nx-1,ny-1)

theta1 = seq(from=0,to=1,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = pf(q/paramSpace, nx-1, ny-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

(b) One-tailed alternative hypothesis σX > σY

Or, equivalently, H₀: σ_X²/σ_Y² = θ₀ = 1 and H₁: σ_X²/σ_Y² = θ₁ > 1

For these hypotheses,

Decision: To apply the methodology based on the rejection region, the critical value a is found by applying

the definition of type I error, with α = 0.1 at θ0 = 1:

α (1)= P (Type I error )=P ( Reject H 0 ∣ H 0 true)= P (T ( X , Y )>a ∣ H 0 )=P (T 0 ( X , Y )> a)

→ a=r α=2.42 → Rc = {T 0 ( X , Y )> 2.42 }

The final decision is: T₀(x, y) = 0.86 → T₀(x, y) ∉ Rc → H₀ is not rejected.

The second methodology requires the calculation of the p-value:

pV = P(X, Y more rejecting than x, y | H₀ true) = P(T₀(X, Y) > T₀(x, y)) = P(T₀(X, Y) > 0.86) = 1 − 0.41 = 0.59

> pf(0.86, 11-1, 10-1)
[1] 0.406005

→ pV = 0.59 > 0.1 = α → H₀ is not rejected.

β(θ₁) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(X, Y) ∉ Rc | H₁) = P(T₀(X, Y) ≤ 2.42 | H₁)

= P(S_X²/S_Y² ≤ 2.42 | H₁) = P((1/θ₁)·(S_X²/S_Y²) ≤ 2.42/θ₁ | H₁) = P(T₁(X, Y) ≤ 2.42/θ₁)

By using a computer, many values θ1 can be considered so as to plot the power function.

# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(1-alpha,nx-1,ny-1)

theta1 = seq(from=1,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pf(q/paramSpace, nx-1, ny-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

(c) Two-tailed alternative hypothesis σX ≠ σY

Hypotheses: H₀: σ_X² = σ_Y² and H₁: σ_X² ≠ σ_Y²

Or, equivalently, H₀: σ_X²/σ_Y² = θ₀ = 1 and H₁: σ_X²/σ_Y² = θ₁ ≠ 1

For these hypotheses,

Decision: For the first methodology, the critical region must be determined by applying the definition of type I error, with α = 0.1 at θ₀ = 1, and the criterion of leaving half the probability in each tail:

α(1) = P(Type I error) = P(Reject H₀ | H₀ true) = P(T₀(X, Y) < a₁) + P(T₀(X, Y) > a₂)

α(1)/2 = P(T₀(X, Y) < a₁) → a₁ = l_{α/2} = 0.33
α(1)/2 = P(T₀(X, Y) > a₂) → a₂ = r_{α/2} = 3.14

> qf(c(0.05, 0.95), 11-1, 10-1)
[1] 0.3310838 3.1372801

→ Rc = {T₀(X, Y) < 0.33} ∪ {T₀(X, Y) > 3.14}

T₀(x, y) = 0.86 → T₀(x, y) ∉ Rc → H₀ is not rejected.

To apply the methodology based on the p-value, we calculate the median qf(0.5, 11-1, 10-1) = 1.007739; thus, since T₀(x, y) is in the left-hand tail:

pV = P(X, Y more rejecting than x, y | H₀ true) = 2·P(T₀(X, Y) < T₀(x, y)) = 2·P(T₀(X, Y) < 0.86) = 2·0.41 = 0.82

→ pV =0.82> 0.1=α → H0 is not rejected.

If you cannot calculate the median, try the tail you trust most and change it if a value bigger than 1 is obtained

after doubling the probability.

β(θ₁) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(X, Y) ∉ Rc | H₁)

= P(0.33 ≤ S_X²/S_Y² ≤ 3.14 | H₁) = P(0.33/θ₁ ≤ (1/θ₁)·(S_X²/S_Y²) ≤ 3.14/θ₁ | H₁)

= P(0.33/θ₁ ≤ T₁(X, Y) ≤ 3.14/θ₁) = P(T₁(X, Y) ≤ 3.14/θ₁) − P(T₁(X, Y) < 0.33/θ₁)

By using a computer, many values θ1 can be considered in order to plot the power function.

# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)

theta1 = seq(from=0,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pf(q[2]/paramSpace, nx-1, ny-1) + pf(q[1]/paramSpace, nx-1, ny-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Comparison of the power functions: Now we compare the power functions of the three tests

graphically, by using the code

# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)

theta1 = seq(from=0,to=15,0.01)

paramSpace1 = sort(unique(c(theta1,theta0)))

PowerFunction1 = 1 - pf(q[2]/paramSpace1, nx-1, ny-1) + pf(q[1]/paramSpace1, nx-1, ny-1)

q = qf(alpha,nx-1,ny-1)

theta1 = seq(from=0,to=1,0.01)

paramSpace2 = sort(unique(c(theta1,theta0)))

PowerFunction2 = pf(q/paramSpace2, nx-1, ny-1)

q = qf(1-alpha,nx-1,ny-1)

theta1 = seq(from=1,to=15,0.01)

paramSpace3 = sort(unique(c(theta1,theta0)))

PowerFunction3 = 1 - pf(q/paramSpace3, nx-1, ny-1)

plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta',

ylab='Probability of rejecting theta0', main='Power Function', type='l')

lines(paramSpace2, PowerFunction2, lty=2)

lines(paramSpace3, PowerFunction3, lty=2)

It can be seen that the curves of the one-sided tests lie above the curve of the two-sided test for any θ₁—in its region, each one-sided test has more power than the two-sided test, since additional information is used when one tail is discarded. Then, each of the two one-sided tests is uniformly more powerful than the two-sided test in their respective common domains.

Conclusion: The hypothesis that the population variance is equal in the two biological populations is not rejected when tested against any of the three alternative hypotheses. Although it has not happened in this case, different decisions can be made for the one- and two-tailed tests. In this exercise, the empirical value T₀(x, y) = S_X²/S_Y² = 0.86 suggests the alternative hypothesis H₁: σ_X²/σ_Y² < 1. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)

My notes:

Exercise 6ht-T

Two simple random samples of 700 citizens of Italy and Russia yielded, respectively, that 53% of Italian

people and 47% of Russian people wish to visit Spain within the next ten years. Should we conclude, with

confidence 0.99, the Italians' desire is higher than the Russians'? Determine the critical region and make a

decision. What is the type I error? Calculate the p-value and apply the methodology based on the p-value to

make a decision.

1) Allocate the question in the null hypothesis. Calculate the type II error for the value –0.1.

2) Allocate the question in the alternative hypothesis. Calculate the type II error for the value +0.1.

Use a computer to plot the power function.

Discussion: After reading the statement (possibly twice, if necessary), we realize that there are two independent populations whose citizens have been asked a question with two possible answers (a dichotomous situation). Then, each individual can be thought of as—modeled through—a Bernoulli variable. In practice,

the implicit supposition that the same parameter η governs the behaviour of all the individuals should still be

evaluated for each population (a sort of homogeneity to analyse whether or not several subpopulations should

be considered). The independence of the two populations should be studied as well. Either way, in this

exercise we will merely apply the testing methodologies.

The sample proportions of those who said 'yes' are given: η̂_I = 0.53 and η̂_R = 0.47, respectively. If η_I and η_R are the theoretical proportions of the populations, that is, the quantities we want to compare, we need to test the hypothesis η_I > η_R (one-tailed test).

Should this hypothesis be written as a null or as an alternative hypothesis? In general, since we fix the

type I error in our methodologies, strong sample evidence is necessary to reject H0. Thus, the decision of

allocating the condition to be tested in H0 or H1 depends on our choice (usually on what “making a wrong

decision” means or implies for the specific framework we are working in). We are going to solve both cases.

From a theoretical point of view, H₀: η_I ≥ η_R is essentially the same as H₀: η_I = η_R.

As a final remark, in this exercise it holds that 0.53 + 0.47 = 1; this happens just by chance, since these

two quantities are independent and can take any value in [0,1]. On the other hand, proportions are always

dimensionless.

Statistic: Since

• There are two independent Bernoulli populations

• The sample sizes are larger than 30

the statistic

T(I, R) = [(η̂_I − η̂_R) − (η_I − η_R)] / √(?_I(1−?_I)/n_I + ?_R(1−?_R)/n_R) →d N(0,1)

is selected, where each ? must be substituted by the best possible information: supposed or estimated. Two particular versions of this statistic will be used:

T₀(I, R) = [(η̂_I − η̂_R) − θ₀] / √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) →d N(0,1)

and

T₁(I, R) = [(η̂_I − η̂_R) − θ₁] / √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) →d N(0,1)

To determine the critical region or to calculate the p-value, both under H0, we need the value of the statistic for the particular samples available:

T₀(i, r) = [(0.53 − 0.47) − 0] / √(0.53(1−0.53)/700 + 0.47(1−0.47)/700) = 2.25

1) Question in H0

Hypotheses: If we want to allocate the question in the null hypothesis to reject it only when the data strongly

suggest so,

H₀: η_I − η_R = θ₀ ≥ 0 and H₁: η_I − η_R = θ₁ < 0

By looking at the alternative hypothesis, we deduce the form of the critical region: Rc = {(η̂_I − η̂_R) < c}. The quantity c can be thought of as a margin over θ₀ not to exclude cases where η_I − η_R = θ₀ = 0 really holds while values slightly smaller than θ₀ are due to mere random effects.

Decision: To apply the first methodology, the critical value a that determines the rejection region is found by

applying the definition of type I error, with the value α = 1 – 0.99 = 0.01 at θ0 = 0:

α (0) = P (Type I error) = P( Reject H 0 | H 0 true) = P( T (I , R)∈R c | H 0 )= P (T 0 ( I , R)<a)

→ a=l 0.01=−2.326 → Rc = {T 0 ( I , R)<−2.326}

The decision is: T 0 ( i , r )=2.25 → T 0 (i , r )∉ Rc → H0 is not rejected.

As regards the value of the type I error, it is α by definition. The second methodology is based on the

calculation of the p-value:

pV =P ( I , R more rejecting than i , r | H 0 true)= P (T 0 ( I , R) < T 0 (i , r ))

=P (T 0 (I , R) < 2.25)=0.988

→ pV =0.988 > 0.01=α → H0 is not rejected.
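As a complement (not in the original solution), a minimal R sketch covering both allocations of the question:

# Minimal sketch of the two-proportion statistic and both p-values
ni = 700; nr = 700; pI = 0.53; pR = 0.47
se = sqrt(pI*(1-pI)/ni + pR*(1-pR)/nr)   # 0.0267
T0 = (pI - pR) / se                      # 2.25
pnorm(T0)                                # p-value when the question is in H0, 0.988
1 - pnorm(T0)                            # p-value when the question is in H1, 0.0122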

Type II error: To calculate β, we have to work under H1. Since the critical region is expressed in terms of T0 and we must use T1, we are going to apply the mathematical trick of adding and subtracting the same quantity:

β(θ₁) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(I, R) ∉ Rc | H₁)

= P([(η̂_I − η̂_R) − θ₀] / √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) ≥ −2.326 | H₁)

= P([(η̂_I − η̂_R) − θ₁]/√(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) + θ₁/√(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) ≥ −2.326 | H₁)   (using θ₀ = 0)

= P(T₁(I, R) ≥ −2.326 − θ₁/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700))

For the particular value θ₁ = −0.1,

β(−0.1) = P(T₁(I, R) ≥ −2.326 − (−0.1)/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700)) = P(T₁(I, R) ≥ 1.42) = 0.078

By using a computer, many other values θ₁ ≠ −0.1 can be considered so as to numerically determine the power curve 1−β(θ₁) of the test and to plot the power function.

ϕ(θ) = P(Reject H₀) = { α(θ) if θ ∈ Θ₀ ; 1−β(θ) if θ ∈ Θ₁ }

# Sample and inference

ni = 700; nr = 700

sPi = 0.53; sPr = 0.47

alpha = 0.01

theta0 = 0 # Value under the null hypothesis H0

q = qnorm(alpha,0,1)

theta1 = seq(from=-0.25,to=0,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = pnorm(q-paramSpace/sqrt(sPi*(1-sPi)/ni + sPr*(1-sPr)/nr),0,1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

2) Question in H1

Hypotheses: If we want to allocate the question in the alternative hypothesis to accept it only when the data

strongly suggest so,

H₀: η_I − η_R = θ₀ ≤ 0 and H₁: η_I − η_R = θ₁ > 0

By looking at the alternative hypothesis, we deduce the form of the critical region: Rc = {(η̂_I − η̂_R) > c}. The quantity c can be thought of as a margin over θ₀ not to exclude cases where η_I − η_R = θ₀ = 0 really holds while values slightly larger than θ₀ are due to mere random effects.

Decision: To apply the first methodology, the critical value a is calculated as follows:

α (0)= P (Type I error) = P( Reject H 0 | H 0 true) = P( T (I , R)∈R c | H 0 )= P (T 0 ( I , R)>a)

→ a=r 0.01=2.326 → Rc = {T 0 ( I , R)> 2.326 }

The decision is: T 0 ( i , r )=2.25 → T 0 (i , r )∉ Rc → H0 is not rejected.

The second methodology consists in doing:

pV =P ( I , R more rejecting than i , r | H 0 true)= P (T 0 (I , R) > T 0 (i , r ))

=P (T 0 (I , R) > 2.25)=1−P (T 0 ( I , R) ≤ 2.25)=0.0122

→ pV =0.0122 > 0.01=α → H0 is not rejected.

β(θ₁) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T₀(I, R) ∉ Rc | H₁)

= P([(η̂_I − η̂_R) − θ₀] / √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) ≤ 2.326 | H₁)

= P([(η̂_I − η̂_R) − θ₁]/√(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) + θ₁/√(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) ≤ 2.326 | H₁)   (using θ₀ = 0)

= P(T₁(I, R) ≤ 2.326 − θ₁/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700))

For the particular value θ₁ = 0.1,

β(0.1) = P(T₁(I, R) ≤ 2.326 − 0.1/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700)) = P(T₁(I, R) ≤ −1.42) = 0.078

By using a computer, many more values θ₁ ≠ 0.1 can be considered so as to numerically determine the power curve 1−β(θ₁) of the test and to plot the power function.

ϕ(θ) = P(Reject H₀) = { α(θ) if θ ∈ Θ₀ ; 1−β(θ) if θ ∈ Θ₁ }

ni = 700; nr = 700

sPi = 0.53; sPr = 0.47

alpha = 0.01

theta0 = 0 # Value under the null hypothesis H0

q = qnorm(1-alpha,0,1)

theta1 = seq(from=0,to=+0.25,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pnorm(q-paramSpace/sqrt(sPi*(1-sPi)/ni + sPr*(1-sPr)/nr),0,1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Conclusion: The hypothesis that the two proportions are equal is not rejected when the question is allocated

in either the alternative or the null hypothesis (the best way of testing an equality). That is, it seems that both

populations wish to visit Spain with the same desire. The sample information η̂_I = 0.53 and η̂_R = 0.47

suggested the alternative hypothesis H1: ηI – ηR > 0. The two power functions show how symmetric the

situations are. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the

data.)

Advanced theory: Under the hypothesis H₀: η_I = η = η_R, it makes sense to try to estimate the common variance η(1−η) of the estimator—in the denominator—as well as possible. This can be done by using the pooled sample proportion η̂_p = (n_I η̂_I + n_R η̂_R)/(n_I + n_R). Nevertheless, the pooled estimator should not be considered in the numerator, since (η̂_p − η̂_p) = 0 whatever the data are. Now, the statistic under the null hypothesis is:

T̃₀(I, R) = [(η̂_I − η̂_R) − θ₀] / √(η̂_p(1−η̂_p)/n_I + η̂_p(1−η̂_p)/n_R)

= T₀(I, R) · √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) / √(η̂_p(1−η̂_p)/n_I + η̂_p(1−η̂_p)/n_R) →d N(0,1)

Then,

η̂_p = (700·0.53 + 700·0.47)/(700 + 700) = (0.53 + 0.47)/(1 + 1) = 1/2 = 0.5

→ √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) / √(η̂_p(1−η̂_p)/n_I + η̂_p(1−η̂_p)/n_R) = 0.9981983

→ T̃₀(i, r) = 2.25·0.9981983 = 2.24.

The same decisions are made with T₀ and T̃₀ because of the little effect of using η̂_p in this exercise (see the value of the quotient of square roots above); in other situations, the two ways may lead to paradoxical results. As regards the calculations of the type II error, both the mathematical trick of multiplying and dividing by the same quantity and the mathematical trick of adding and subtracting the same quantity should be applied now. For part 1):

β(θ₁) = P(Type II error) = P(Accept H₀ | H₁ true) = P(T̃₀(I, R) ∉ Rc | H₁)

= P([(η̂_I − η̂_R) − θ₀] / √(η̂_p(1−η̂_p)/n_I + η̂_p(1−η̂_p)/n_R) ≥ −2.326 | H₁)

= P([(η̂_I − η̂_R) − θ₁ + θ₁] / √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) ≥ −2.326 · √(η̂_p(1−η̂_p)/n_I + η̂_p(1−η̂_p)/n_R) / √(η̂_I(1−η̂_I)/n_I + η̂_R(1−η̂_R)/n_R) | H₁)

= P(T₁(I, R) ≥ −2.326·1.002 − θ₁/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700))

= P(T₁(I, R) ≥ −2.330 − θ₁/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700))

For the particular value θ₁ = −0.1,

β(−0.1) = P(T₁(I, R) ≥ −2.330 − (−0.1)/√(0.53(1−0.53)/700 + 0.47(1−0.47)/700)) = P(T₁(I, R) ≥ 1.41) = 0.079.

Similarly for part 2).
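As a complement (not in the original solution), a minimal R sketch of the pooled variant:

# Minimal sketch of the pooled-proportion statistic
ni = 700; nr = 700; pI = 0.53; pR = 0.47
p_pool = (ni*pI + nr*pR) / (ni + nr)                  # 0.5
se_unpooled = sqrt(pI*(1-pI)/ni + pR*(1-pR)/nr)
se_pooled = sqrt(p_pool*(1-p_pool)/ni + p_pool*(1-p_pool)/nr)
se_unpooled / se_pooled                               # quotient of square roots, 0.998
(pI - pR) / se_pooled                                 # modified statistic, 2.24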

My notes:

[HT-p] Based on Λ

Exercise 1ht-Λ

A random quantity X follows a Poisson distribution. Let X = (X1,...,Xn) be a simple random sample. By

applying the results involving Neyman-Pearson's lemma and the likelihood ratio, study the critical region

(estimator that arises and form) for the following pairs of hypotheses.

{H₀: λ = λ₀, H₁: λ = λ₁}   {H₀: λ = λ₀, H₁: λ = λ₁ > λ₀}   {H₀: λ = λ₀, H₁: λ = λ₁ < λ₀}   {H₀: λ ≤ λ₀, H₁: λ = λ₁ > λ₀}   {H₀: λ ≥ λ₀, H₁: λ = λ₁ < λ₀}

Discussion: This is a theoretical exercise where no assumption should be evaluated. First of all, Neyman-Pearson's lemma will be applied. We expect the maximum-likelihood estimator of the parameter—calculated

in a previous exercise—and the “usual” critical region form to appear. If the critical region does not depend on

any particular value θ1, the uniformly most powerful test will have been found.

For the Poisson distribution,

Hypothesis test

H₀: λ = λ₀  against  H₁: λ = λ₁

L(X; λ) = λ^{Σⱼ Xⱼ} e^{−nλ} / Πⱼ Xⱼ!   and   Λ(X; λ₀, λ₁) = L(X; λ₀)/L(X; λ₁) = (λ₀/λ₁)^{Σⱼ Xⱼ} e^{−n(λ₀−λ₁)}

Rejection region:

Rc = {Λ < k} = {(λ₀/λ₁)^{Σⱼ Xⱼ} e^{−n(λ₀−λ₁)} < k} = {(Σⱼ Xⱼ)·log(λ₀/λ₁) − n(λ₀−λ₁) < log(k)}

= {(Σⱼ Xⱼ)·log(λ₀/λ₁) < log(k) + n(λ₀−λ₁)} = {n X̄·log(λ₀/λ₁) < log(k) + n(λ₀−λ₁)}

• if λ₁ < λ₀ then log(λ₀/λ₁) > 0 and hence Rc = {X̄ < [log(k) + n(λ₀−λ₁)] / [n log(λ₀/λ₁)]}

• if λ₁ > λ₀ then log(λ₀/λ₁) < 0 and hence Rc = {X̄ > [log(k) + n(λ₀−λ₁)] / [n log(λ₀/λ₁)]}

This suggests the estimator X̄ = λ̂_ML (calculated in a previous exercise) and regions of the form

Rc = {Λ < k} = ⋯ = {λ̂_ML < c} = ⋯ = {T₀ < a}   or   Rc = {Λ < k} = ⋯ = {λ̂_ML > c} = ⋯ = {T₀ > a}

Hypothesis tests

H₀: λ = λ₀ against H₁: λ = λ₁ > λ₀   and   H₀: λ = λ₀ against H₁: λ = λ₁ < λ₀

In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ₁ since it only depends upon λ₀ through λ̂_ML or T₀:

α = P(Type I error) = P(T₀ < a)   or   α = P(Type I error) = P(T₀ > a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests

H₀: λ ≤ λ₀ against H₁: λ = λ₁ > λ₀   and   H₀: λ ≥ λ₀ against H₁: λ = λ₁ < λ₀

A uniformly most powerful test for H₀: λ = λ₀ is also uniformly most powerful for H₀: λ ≤ λ₀ (respectively, H₀: λ ≥ λ₀).

The same study can be repeated for other models. For the exponential distribution,

Hypothesis test

H₀: λ = λ₀  against  H₁: λ = λ₁

L(X; λ) = λⁿ e^{−λ Σⱼ Xⱼ}   and   Λ(X; λ₀, λ₁) = L(X; λ₀)/L(X; λ₁) = (λ₀/λ₁)ⁿ e^{−(λ₀−λ₁) Σⱼ Xⱼ}

{( λ0 n −(λ −λ )∑

) }{ ( λλ )−(λ −λ ) ∑ }

n

X n

0

Rc = { Λ < k } = e 0 1 j=1 j

< k = n log 0 1 X j < log(k )

λ1 1

j=1

{ n λ

( )} {

λ

= (λ1−λ 0) ∑ j=1 X j < log(k )−n log λ 0 = (λ 1−λ 0) n X̄ < log( k )−n log λ 0

1 1

( )}

Now it is necessary that λ 1≠λ 0 and

{ }{

λ

log (k )−n log λ 0( )= 1<

}

1 n (λ 1−λ 0)

• ̄ >

if λ 1< λ 0 then (λ 1−λ 0 )< 0 and Rc = X

n(λ 1−λ0 ) ̄

X λ0

log (k )−n log ( )

λ1

{ }{

λ0

log(k )−n log (λ ) = 1 >

}

1 n (λ 1−λ 0)

• ̄ <

if λ 1> λ 0 then (λ 1−λ 0 )> 0 and Rc = X

n(λ 1−λ0 ) ̄

X λ0

log(k )−n log ( )

λ1

1 ̂

This suggests the estimator =λ ML (calculated in a previous exercise) and regions of the form

X̄

Rc = {Λ< k } = ⋯= { λ̂ ML <c }= ⋯= {T 0 < a } or Rc = {Λ< k } = ⋯= { λ^ ML >c }= ⋯= {T 0 > a }

Hypothesis tests

H₀: λ = λ₀ against H₁: λ = λ₁ > λ₀   and   H₀: λ = λ₀ against H₁: λ = λ₁ < λ₀

In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ₁ since it only depends upon λ₀ through λ̂_ML or T₀:

α = P(Type I error) = P(T₀ < a)   or   α = P(Type I error) = P(T₀ > a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests

H₀: λ ≤ λ₀ against H₁: λ = λ₁ > λ₀   and   H₀: λ ≥ λ₀ against H₁: λ = λ₁ < λ₀

A uniformly most powerful test for H₀: λ = λ₀ is also uniformly most powerful for H₀: λ ≤ λ₀ (respectively, H₀: λ ≥ λ₀).

For the Bernoulli distribution,

Hypothesis test

H₀: η = η₀  against  H₁: η = η₁

L(X; η) = η^{Σⱼ Xⱼ} (1−η)^{n−Σⱼ Xⱼ}   and   Λ(X; η₀, η₁) = L(X; η₀)/L(X; η₁) = (η₀/η₁)^{Σⱼ Xⱼ} · [(1−η₀)/(1−η₁)]^{n−Σⱼ Xⱼ}

Rejection region:

Rc = {Λ < k} = {(η₀/η₁)^{Σⱼ Xⱼ} [(1−η₀)/(1−η₁)]^{n−Σⱼ Xⱼ} < k}

= {(Σⱼ Xⱼ) log(η₀/η₁) + (n − Σⱼ Xⱼ) log[(1−η₀)/(1−η₁)] < log(k)}

= {(Σⱼ Xⱼ)[log(η₀/η₁) − log((1−η₀)/(1−η₁))] < log(k) − n log[(1−η₀)/(1−η₁)]}

= {n X̄ · log[η₀(1−η₁)/(η₁(1−η₀))] < log(k) − n log[(1−η₀)/(1−η₁)]}

• if η₁ < η₀ then log[η₀(1−η₁)/(η₁(1−η₀))] > 0 and Rc = {X̄ < [log(k) − n log((1−η₀)/(1−η₁))] / [n log(η₀(1−η₁)/(η₁(1−η₀)))]}

• if η₁ > η₀ then log[η₀(1−η₁)/(η₁(1−η₀))] < 0 and Rc = {X̄ > [log(k) − n log((1−η₀)/(1−η₁))] / [n log(η₀(1−η₁)/(η₁(1−η₀)))]}

This suggests the estimator X̄ = η̂_ML (calculated in a previous exercise) and regions of the form

Rc = {Λ < k} = ⋯ = {η̂_ML < c} = ⋯ = {T₀ < a}   or   Rc = {Λ < k} = ⋯ = {η̂_ML > c} = ⋯ = {T₀ > a}

Hypothesis tests

H₀: η = η₀ against H₁: η = η₁ > η₀   and   H₀: η = η₀ against H₁: η = η₁ < η₀

In applying the methodologies, and given α, the same critical value c or a will be obtained for any η₁ since it only depends upon η₀ through η̂_ML or T₀:

α = P(Type I error) = P(T₀ < a)   or   α = P(Type I error) = P(T₀ > a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests

H₀: η ≤ η₀ against H₁: η = η₁ > η₀   and   H₀: η ≥ η₀ against H₁: η = η₁ < η₀

A uniformly most powerful test for H₀: η = η₀ is also uniformly most powerful for H₀: η ≤ η₀ (respectively, H₀: η ≥ η₀).

For the normal distribution with known variance σ²,

Hypothesis test

H₀: μ = μ₀  against  H₁: μ = μ₁

L(X; μ) = [1/(2πσ²)]^{n/2} e^{−(1/(2σ²)) Σⱼ (Xⱼ−μ)²}

and

Λ(X; μ₀, μ₁) = L(X; μ₀)/L(X; μ₁) = e^{−(1/(2σ²))[Σⱼ(Xⱼ−μ₀)² − Σⱼ(Xⱼ−μ₁)²]} = e^{−(1/(2σ²))[n(μ₀²−μ₁²) − 2(μ₀−μ₁) Σⱼ Xⱼ]}

= e^{[(μ₀−μ₁)/σ²] Σⱼ Xⱼ − n(μ₀²−μ₁²)/(2σ²)}

Rejection region:

Rc = {Λ < k} = {e^{[(μ₀−μ₁)/σ²] Σⱼ Xⱼ − n(μ₀²−μ₁²)/(2σ²)} < k} = {[(μ₀−μ₁)/σ²] Σⱼ Xⱼ − n(μ₀²−μ₁²)/(2σ²) < log(k)}

= {(μ₀−μ₁)(Σⱼ Xⱼ) < log(k) σ² + (n/2)(μ₀²−μ₁²)} = {(μ₀−μ₁)·n X̄ < log(k) σ² + (n/2)(μ₀²−μ₁²)}

• if μ₁ < μ₀ then (μ₀−μ₁) > 0 and Rc = {X̄ < [log(k) σ² + (n/2)(μ₀²−μ₁²)] / [n(μ₀−μ₁)]}

• if μ₁ > μ₀ then (μ₀−μ₁) < 0 and Rc = {X̄ > [log(k) σ² + (n/2)(μ₀²−μ₁²)] / [n(μ₀−μ₁)]}

This suggests the estimator X̄ = μ̂_ML (calculated in a previous exercise) and regions of the form

Rc = {Λ < k} = ⋯ = {μ̂_ML < c} = ⋯ = {T₀ < a}   or   Rc = {Λ < k} = ⋯ = {μ̂_ML > c} = ⋯ = {T₀ > a}

Hypothesis tests

H₀: μ = μ₀ against H₁: μ = μ₁ > μ₀   and   H₀: μ = μ₀ against H₁: μ = μ₁ < μ₀

In applying the methodologies, and given α, the same critical value c or a will be obtained for any μ₁ since it only depends upon μ₀ through μ̂_ML or T₀:

α = P(Type I error) = P(T₀ < a)   or   α = P(Type I error) = P(T₀ > a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests

H₀: μ ≤ μ₀ against H₁: μ = μ₁ > μ₀   and   H₀: μ ≥ μ₀ against H₁: μ = μ₁ < μ₀

A uniformly most powerful test for H₀: μ = μ₀ is also uniformly most powerful for H₀: μ ≤ μ₀ (respectively, H₀: μ ≥ μ₀).

Conclusion: Well-known theoretical results have been applied to study the optimal form of the critical region for different pairs of hypotheses. Since both the likelihood ratio and the maximum likelihood estimator use the likelihood function, the critical region of the tests can be expressed in terms of this estimator.
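As a complement (not in the original solution), a minimal R simulation of the Poisson case, with illustrative values, showing that the log-likelihood ratio is a monotone (here decreasing) function of X̄, so that {Λ < k} is equivalent to a region of the form {X̄ > c} when λ₁ > λ₀:

# Minimal sketch: log-likelihood ratio versus the sample mean (Poisson case)
set.seed(1)
n = 30; lambda0 = 2; lambda1 = 3                 # lambda1 > lambda0
xbar = replicate(1000, mean(rpois(n, lambda0)))
logLR = n * xbar * log(lambda0/lambda1) - n * (lambda0 - lambda1)
plot(xbar, logLR)   # strictly decreasing: small Lambda corresponds to large xbar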

My notes:

Exercise 1ht-av

The fog index is used to measure the reading difficulty of a written text: The higher the value of the index, the

more difficult the reading level. We want to know if the reading difficulty index is different for three

magazines: Scientific American, Fortune, and the New Yorker. Three independent random samples of 6

advertisements were taken, and the fog indices for the 18 advertisements were measured, as recorded in the

following table

SCIENTIFIC AMERICAN FORTUNE NEW YORKER

15.75 12.63 9.27

11.55 11.46 8.28

11.16 10.77 8.15

9.92 9.93 6.37

9.23 9.87 6.37

8.20 9.42 5.66

Apply an analysis of variance, at the significance level α = 0.01 (the value used in the solution), to test whether the average level of difficulty is the same in the three magazines.

(From Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

Discussion: The analysis of variance can be applied when populations are normally distributed and their variances are equal, that is, X_p ∼ N(μ_p, σ_p²) with σ_p = σ, ∀p. These suppositions should be evaluated (this will be done at the end of the exercise). If the equality of the means is rejected, additional analyses would be necessary to identify which means are different—this information is not provided by the analysis of variance. On the other hand, the calculations involved in this analysis are so tedious that almost everybody uses the computer. Finally, the unit of measurement of the index, denoted u, is unknown to us.

Statistic: There is one factor identifying the population out of the three possible ones (we do not consider other magazines), so a one-factor fixed-effects analysis will be applied. The statistic is

T(X_SA, X_FO, X_NY) = MSG/MSW   with   T₀ = MSG/MSW ∼ F_{P−1, n−P} ≡ F_{3−1, 18−3} ≡ F_{2, 15}

Some calculations are necessary to evaluate the statistic T(x_SA, x_FO, x_NY). First of all, we look at the three sample means:

X̄_SA = (1/6) Σⱼ X_SA,j = (15.75 u + ⋯ + 8.20 u)/6 = 10.97 u

X̄_FO = (1/6) Σⱼ X_FO,j = (12.63 u + ⋯ + 9.42 u)/6 = 10.68 u

X̄_NY = (1/6) Σⱼ X_NY,j = (9.27 u + ⋯ + 5.66 u)/6 = 7.35 u

The magnitude of the first and the third seems quite different, which suggests that the population means may be different. Nevertheless, we should not trust intuition.

X̄ = (1/18) Σⱼ Xⱼ = (15.75 u + ⋯ + 5.66 u)/18 = 9.67 u

SSG = Σ_{p=1}^{P} n_p (X̄_p − X̄)² = n_SA·(X̄_SA − X̄)² + n_FO·(X̄_FO − X̄)² + n_NY·(X̄_NY − X̄)²

= 6·(10.97 u − 9.67 u)² + 6·(10.68 u − 9.67 u)² + 6·(7.35 u − 9.67 u)² = 48.53 u²

MSG = SSG/(P−1) = 48.53 u²/(3−1) = 24.26 u²

SSW = Σ_{p=1}^{P} Σ_{j=1}^{n_p} (X_{p,j} − X̄_p)² = Σⱼ (X_SA,j − X̄_SA)² + Σⱼ (X_FO,j − X̄_FO)² + Σⱼ (X_NY,j − X̄_NY)²

= (15.75 u − 10.97 u)² + ⋯ + (8.20 u − 10.97 u)²
+ (12.63 u − 10.68 u)² + ⋯ + (9.42 u − 10.68 u)²
+ (9.27 u − 7.35 u)² + ⋯ + (5.66 u − 7.35 u)²
= 52.22 u²

MSW = SSW/(n−P) = 52.22 u²/(18−3) = 3.48 u²

and, finally,

T₀(x_SA, x_FO, x_NY) = MSG/MSW = 24.26 u²/3.48 u² = 6.97

Hypotheses: H₀: μ₁ = μ₂ = ⋯ = μ_P and H₁: ∃ a, b / μ_a ≠ μ_b

For this statistic, applying the definition of type I error with α = 0.01,

→ a = r_α = 6.359 → Rc = {T₀(X_SA, X_FO, X_NY) > 6.359}

> qf(0.99, 2, 15)
[1] 6.358873

Decision: Finally, it is necessary to check whether this region, "suggested by H₀," is compatible with the value that the data provide for the statistic. If they are not compatible, because the value seems extreme when the hypothesis is true, we will trust the data and reject the hypothesis H₀.

Since T₀(x_SA, x_FO, x_NY) = 6.97 > 6.359 → T₀(x) ∈ Rc → H₀ is rejected.

The second methodology is based on the calculation of the p-value:

pV =P (( X SA , X FO , X NY ) more rejecting than (x SA , x FO , x NY ) ∣ H 0 true)

=P (T 0 ( X SA , X FO , X NY )>T 0 ( x SA , x FO , x NY ))=P (T 0 >6.97)=0.0072

→ pV =0.007243< 0.01=α → H0 is rejected. > 1-pf(6.97, 2, 15)

[1] 0.007235116

Conclusion: As suggested by the sample means, the population means of the three magazines are not equal

with a confidence of 0.99, measured in a 0-to-1 scale. Pairwise comparisons could be applied to identify the

differences.

We have not done the calculations by hand but using the programming language R. The code is:

# To enter the three samples

SA = c(15.75, 11.55, 11.16, 9.92, 9.23, 8.20)

FO = c(12.63, 11.46, 10.77, 9.93, 9.87, 9.42)

NY = c(9.27, 8.28, 8.15, 6.37, 6.37, 5.66)

# To join the samples in a unique vector

Data = c(SA, FO, NY)

# To calculate the sample mean of the three groups and the total sample mean

mean(SA) ; mean(FO) ; mean(NY) ; mean(Data)

# To calculate the measures and the statistic (for large datasets, the previous means should have been saved)

SSG = 6*((mean(SA) - mean(Data))^2) + 6*((mean(FO) - mean(Data))^2) + 6*((mean(NY) - mean(Data))^2)

MSG = SSG/(3-1)

SSW = sum((SA - mean(SA))^2) + sum((FO - mean(FO))^2) + sum((NY - mean(NY))^2)

MSW = SSW/(18-3)

T0 = MSG/MSW

# To find the quantile 'a' that determines the critical region

a = qf(0.99, 2, 15)

# To calculate the p-value

pValue = 1 - pf(T0, 2, 15)

(In the console, write the name of a quantity to print its value.)

Statistical software programs have many built-in functions to apply the most basic methods. Now we use R to obtain the analysis of variance table. As regards the syntax, it is based on the linear regression framework, X_{p,j} = μ_p + ε_{p,j}, where this linear dependence of X on the factor effect μ_p is denoted by Data ~ Group (see the call to the function aov below).

## After running the first block of lines of the previous code:

# To create a vector with the membership labels

Group = factor(c(rep("SA",length(SA)), rep("FO",length(FO)), rep("NY",length(NY))))

# To apply a one-factor analysis of variance

objectAV = aov(Data ~ Group)

# To print the table with the results

summary(objectAV)

            Df Sum Sq Mean Sq F value  Pr(>F)
Group        2  48.53  24.264    6.97 0.00723 **
Residuals   15  52.22   3.481
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Compare these quantities with those obtained in the previous calculations.) An equivalent way of applying

the analysis of variance with R consists in substituting the lines

# To apply a one-factor analysis of variance

objectAV = aov(Data ~ Group)

# To print the table with the results

summary(objectAV)

by the lines

# To fit a linear regression model

Model = lm(Data ~ Group)

# To apply and print the analysis of variance

anova(Model)

By using a computer it is also easy to evaluate the fulfillment of the assumptions.

# To enter the three samples

SA = c(15.75, 11.55, 11.16, 9.92, 9.23, 8.20)

FO = c(12.63, 11.46, 10.77, 9.93, 9.87, 9.42)

NY = c(9.27, 8.28, 8.15, 6.37, 6.37, 5.66)

# To join the samples in a unique vector

Data = c(SA, FO, NY)

# To create a vector with the membership labels

Group = factor(c(rep("SA",length(SA)), rep("FO",length(FO)), rep("NY",length(NY))))

# To test the normality of the sample SA by applying two different hypothesis tests

shapiro.test(SA)

ks.test(SA, "pnorm", mean=mean(SA), sd=sd(SA))

# To test the normality of the sample FO by applying two different hypothesis tests

shapiro.test(FO)

ks.test(FO, "pnorm", mean=mean(FO), sd=sd(FO))

# To test the normality of the sample NY by applying two different hypothesis tests

shapiro.test(NY)

ks.test(NY, "pnorm", mean=mean(NY), sd=sd(NY))

# To test the equality of the variances

bartlett.test(Data ~ Group)

My notes:

[HT] Nonparametric

Remark 14ht: Nonparametric methods involve questions not based on parameters, and therefore it is not usually necessary to

evaluate some kinds of supposition that were present in the parametric hypothesis tests.

Exercise 1ht-np

Occupational Hazards. The following table is based on data from the U.S. Department of Labor, Bureau of

Labor Statistics.

                                Police   Cashiers   Taxi Drivers   Guards
Homicide                          82        107          70          59
Cause of death other than
homicide                          92          9          29          42

(total n = 490)

A) Use the data in the table, coming from a simple random sample, to test the claim that occupation is

independent of whether the cause of death was homicide. Use a significance α = 0.05 and apply a

nonparametric chi-square test.

B) Does any particular occupation appear to be most prone to homicides? If so, which one?

(Based on an exercise of Essentials of Statistics, Mario F. Triola, Pearson)

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)

job. Your job is what you do to earn your living: 'You'll never get a job if you don't have any qualifications.' 'She'd like to change her job

but can't find anything better.' Your job is also the particular type of work that you do: 'John's new job sounds really interesting.' 'I know

she works for the BBC but I'm not sure what job she does.' A job may be full-time or part-time (NOT half-time or half-day): 'All she

could get was a part-time job at a petrol station.'

do (for a living). When you want to know about the type of work that someone does, the usual questions are What do you do? What

does she do for a living? etc 'What does your father do?' - 'He's a police inspector.'

occupation. Occupation and job have similar meanings. However, occupation is far less common than job and is used mainly in formal

and official styles: 'Please give brief details of your employment history and present occupation.' 'People in manual occupations seem to

suffer less from stress.'

post/position. The particular job that you have in a company or organization is your post or position: 'She's been appointed to the post of

deputy principal.' 'He's applied for the position of sales manager.' Post and position are used mainly in formal styles and ofter refer to

jobs which have a lot of responsability.

career. Your career is your working life, or the series of jobs that you have during your working life: 'The scandal brought his career in

politics to a sudden end.' 'Later on in his career, he became first secretary at the British Embassy in Washington.' Your career is also the

particular kind of work for which you are trained and that you intend to do for a long time: 'I wanted to find out more about careers in

publishing.'

trade. A trade is a type of work in which you do or make things with your hands: 'Most of the men had worked in skilled trades such as

carpentry or printing.' 'My grandfather was a bricklayer by trade.'

profession. A profession is a type of work such as medicine, teaching, or law which requires a high level of training or education: 'Until

recently, medicine has been a male-dominated profession.' 'She entered the teaching profession in 1987.'

LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)

occupations. The words people use affectionately, humorously, or disparagingly to describe their own occupations are their own affair.

They may say, “I'm in show business” (or, more likely, “show biz”), or “I'm in the advertising racket,” or “I'm in the oil game,” or “I'm in

the garment line.” But outsiders should use more caution, more discretion, and more precision. For instance, it is improper to write, “Mr.

Danaher has been in the law business in Washington.” Law is a profession. Similarly, to say someone is “in the teaching game” would

undoubtedly give offense to teachers. Unless there is some special reason to be slangy or colloquial, the advisable thing to do is to accord

every occupation the dignity it deserves.

Discussion: In this exercise, it is clear from the statement that we need to test the independence of two variables. A particular sample (x1,...,x490) was grouped and we are given the absolute frequencies in the empirical table. By looking at the table, the cashier occupation appears to be the most prone to homicides.

Statistic: Since we have to apply a test of independence, from a table of statistics (e.g. in [T]) we select

T₀(X) = Σ_{l=1}^{L} Σ_{k=1}^{K} (N_{lk} − ê_{lk})² / ê_{lk} →d χ²_{(L−1)(K−1)}

for L and K classes, respectively.

Hypotheses: The null hypothesis supposes that the two variables are independent,

H₀: X, Y independent and H₁: X, Y dependent

or, probabilistically,

H₀: f(x, y) = f_X(x)·f_Y(y) and H₁: f(x, y) ≠ f_X(x)·f_Y(y)

This implies that the probability at any cell is the product of the marginal probabilities of its row and column. Note that two underlying probability distributions are supposed for X and Y, although we do not care about them, and we will directly estimate the probabilities from the empirical table. As

T₀(x) = (82 − 318·174/490)²/(318·174/490) + ⋯ + (42 − 172·101/490)²/(172·101/490) = 65.52

This value, calculated under H0 and using the data, is necessary both to determine the critical region and to

calculate the p-value.

On the other hand, for any chi-square test T₀ is a nonnegative measure of the dissimilarity between the two tables; therefore, a value close to zero means that the two tables are similar, while the critical region is always of the form Rc = {T₀(X) > a}, where

T₀(X) →d χ²_{(L−1)(K−1)} ≡ χ²_{(2−1)(4−1)} ≡ χ²₃

For the first methodology, to calculate a the definition of type I error is applied with α = 0.05:

α = P(Type I error) = P(Reject H₀ | H₀ true) = P(T(X) ∈ Rc | H₀) ≈ P(T₀(X) > a)

→ a = r_α = 7.81 → Rc = {T₀(X) > 7.81}

Since T₀(x) = 65.52 > 7.81 → T₀(x) ∈ Rc → H₀ is rejected.

If we apply the methodology based on the p-value,

pV = P(T0(X) > T0(x)) = P(T0(X) > 65.52) = 3.885781·10⁻¹⁴

> 1-pchisq(65.52, 3)
[1] 3.885781e-14

→ pV < 0.05 = α → H0 is rejected.

Instead of using the computer, we can consider the last value in our table to bind the p-value (statisticians

want to discover its value, while we want only to check whether or not it is smaller than α):

pV = P (T 0 ( X )>65.52) < P(T 0 ( X )>11.3)=0.01 → pV <0.01<0.05=α → H0 is rejected

Conclusion: The hypothesis that the two variables are independent is rejected. This means that there seems

to be a correlation between occupation and cause of death. (Remember: statistical results depend on: the

assumptions, the methods, the certainty and the data.)
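As a cross-check, R's built-in function chisq.test() runs the same test directly on the contingency table. Since the full empirical table is not reproduced here, the interior cells of the matrix below are hypothetical—only the cells 82 and 42 and the margins 318, 172, 174 and 101 come from the exercise:

# Hypothetical 2x4 table (rows and columns match the known margins); replace
# the invented interior cells with the real ones to reproduce T0(x) = 65.52
counts <- matrix(c(82, 100, 77, 59,
                   92,  20, 18, 42), nrow = 2, byrow = TRUE)
result <- chisq.test(counts)   # Pearson's chi-square test of independence
result$statistic               # value of T0(x)
result$parameter               # degrees of freedom: (L-1)(K-1) = 3
result$expected                # expected absolute frequencies e_lk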

My notes:

Exercise 2ht-np

World War II Bomb Hits in London. To carry out an analysis, South London was divided into 576 areas. For

the variable N ≡ number of bombs in the k-th area (any), a simple random sample (x1,...,x576) was gathered

and grouped in the following table:

EMPIRICAL
Number of Bombs      0     1     2    3    4   5 or more
Number of Regions    229   211   93   35   7   1           n = 576

Data taken from: An application of the Poisson distribution. Clarke, R. D. Journal of the Institute of Actuaries [JIA] (1946) 72: 481

http://www.actuaries.org.uk/research-and-resources/documents/application-poisson-distribution

(1) Test at 95% confidence whether N can be supposed to follow a Poisson distribution.

(2) Test at 95% confidence whether N can be supposed to follow a Poisson distribution with λ = 0.8.

Discussion: We must apply the chi-square methodology to study if the data statistically fit the models

specified. In the second section, a value for the parameter is given. For this probability model, the mass function is f(x; λ) = (λ^x/x!)·e^(−λ). We have to calculate or estimate the probabilities in order to obtain the expected absolute frequencies. Finally, by using the statistic T0 we will compare the two tables and make a decision.

Statistic: Since we have to apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select

T0(X) = Σ_{k=1..K} (N_k − ê_k)² / ê_k →d χ²_{K−s−1}

where K is the number of classes and s is the number of parameters of F0 that we need to estimate so as to use

this distribution for obtaining the class probabilities or approximations of them.

Hypotheses: For this nonparametric goodness-of-fit test, the hypotheses are

H0: N ∼ F0 = Pois(λ) and H1: N ∼ F ≠ Pois(λ)

(It can be thought that both hypotheses are composite.) To fill in the expected table (under H0), the formula ê_k = n·p̂_k will be applied. To estimate p̂_k the supposed distribution under H0 must be used. And to use the distribution, an estimator λ̂ of the parameter is necessary. Once we have this estimator, the probabilities are calculated by using software, tables, or the mass function plus the plug-in principle: f(x; λ̂).

On the other hand, to estimate λ we take into account that for this distribution the expectation (and also

the variance) is equal to the parameter. Since the sample mean estimates the expectation, in this case it can be

used to estimate λ too. (If we had not remembered this property, we would have applied the method of the

moments or the maximum likelihood method to obtain this estimator.) Then,

λ̂ = μ̂ = x̄ = (1/576)·Σ_{j=1..576} x_j

Since our data are grouped, we can imagine that (look at the table): 229 data are 0's, 211 are 1's, 93 are 2's, 35

are 3's, 7 are 4's, and, finally, 1 is unknown but equal or higher than 5, so we can consider 5 or even 6.

λ̂ = (229·0 + 211·1 + 93·2 + 35·3 + 7·4 + 1·5)/576 = 0.93

By using the plug-in principle and the calculator we obtain

p̂0 = P_λ̂(X=0) = f_λ̂(0) = (0.93⁰/0!)·e^(−0.93) = 0.395      p̂1 = P_λ̂(X=1) = f_λ̂(1) = (0.93¹/1!)·e^(−0.93) = 0.367
p̂2 = P_λ̂(X=2) = f_λ̂(2) = (0.93²/2!)·e^(−0.93) = 0.171      p̂3 = P_λ̂(X=3) = f_λ̂(3) = (0.93³/3!)·e^(−0.93) = 0.0529
p̂4 = P_λ̂(X=4) = f_λ̂(4) = (0.93⁴/4!)·e^(−0.93) = 0.0123     p̂5+ = 1 − P_λ̂(X ≤ 4) = 1 − (0.395 + ⋯ + 0.0123) = 0.00270

Poisson (λ = 0.93)
Values           0       1       2       3        4        5 or more
Probabilities    0.395   0.367   0.171   0.0529   0.0123   0.00270     (total 1)

EXPECTED
Number of Bombs      0        1        2       3       4      5 or more
Number of Regions    227.26   211.35   98.28   30.47   7.08   1.55        n = 576

We have really done the calculations with the programming language R. By using a calculator, some quantities may be slightly different due to technical effects (number of decimal digits, accuracy, etc.).

> dpois(c(0,1,2,3,4), 0.93)
[1] 0.39455371 0.36693495 0.17062475 0.05289367 0.01229778
> 1 - sum(dpois(c(0,1,2,3,4), 0.93))
[1] 0.002695135
> 576*dpois(c(0,1,2,3,4), 0.93)
[1] 227.262937 211.354532 98.279857 30.466756 7.083521
> 576*(1 - sum(dpois(c(0,1,2,3,4), 0.93)))
[1] 1.552398

To guarantee the quality of the chi-square methodology, the expected absolute frequencies are usually required

to be larger than four (≥5). For this reason, we merge the last two classes in both the empirical and the

expected tables.

EMPIRICAL
Number of Bombs      0     1     2    3    4 or more
Number of Regions    229   211   93   35   7+1=8        n = 576

EXPECTED
Number of Bombs      0        1        2       3       4 or more
Number of Regions    227.26   211.35   98.28   30.47   7.08+1.55=8.63   n = 576

T0(x) = (229 − 227.26)²/227.26 + ⋯ + (8 − 8.63)²/8.63 = 1.019

We have calculated the value of T0 with the computer too:

> empirical = c(229, 211, 93, 35, 8)

> expected = 576*c(dpois(c(0,1,2,3), 0.93), (1-sum(dpois(c(0,1,2,3), 0.93))))

> sum(((empirical-expected)^2)/expected)

[1] 1.018862

For this kind of test, the critical region always has the following form: Rc = {T0(X) > a}.

Decision: There are K = 5 classes (after merging two of them) and s = 1 estimation, so

T0(X) →d χ²_{K−s−1} ≡ χ²_{5−1−1} ≡ χ²_3

If we apply the methodology based on the critical region, the necessary quantile a is calculated from the definition of the type I error, with the given α = 0.05:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 7.81 → Rc = {T0(X) > 7.81}
> qchisq(1-0.05, 3)
[1] 7.814728

Then, the decision is: T0(x) = 1.019 < 7.81 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the alternative methodology based on the p-value,

pV = P(T0(X) > 1.019) = 0.80
> 1 - pchisq(1.019, 3)
[1] 0.7966547

→ pV = 0.80 > 0.05 = α → H0 is not rejected.

(2) Fit to a member of the Poisson family

Hypotheses: For this nonparametric goodness-of-fit test, the hypotheses are

H0: N ∼ F0 = Pois(0.8) and H1: N ∼ F ≠ Pois(0.8)

(It can be thought that the null hypothesis is simple while the alternative hypothesis is composite.) To fill in the expected table (under H0), the formula e_k = n·p_k will be applied, where the probabilities can be taken from a table or can be calculated by substituting in the mass function f(x; λ):

p0 = P_λ(X=0) = f_λ(0) = (0.8⁰/0!)·e^(−0.8) = 0.449      p1 = P_λ(X=1) = f_λ(1) = (0.8¹/1!)·e^(−0.8) = 0.359
p2 = P_λ(X=2) = f_λ(2) = (0.8²/2!)·e^(−0.8) = 0.144      p3 = P_λ(X=3) = f_λ(3) = (0.8³/3!)·e^(−0.8) = 0.0383
p4 = P_λ(X=4) = f_λ(4) = (0.8⁴/4!)·e^(−0.8) = 0.00767    p5+ = 1 − P_λ(X ≤ 4) = ⋯ = 0.00141

Poisson (λ = 0.8)
Values           0       1       2       3        4         5 or more
Probabilities    0.449   0.359   0.144   0.0383   0.00767   0.00141     (total 1)

EXPECTED
Number of Bombs      0        1        2       3       4      5 or more
Number of Regions    258.81   207.05   82.82   22.09   4.42   0.813       n = 576

> dpois(c(0,1,2,3,4), 0.8)
[1] 0.449328964 0.359463171 0.143785269 0.038342738 0.007668548
> 1 - sum(dpois(c(0,1,2,3,4), 0.8))
[1] 0.00141131
> 576*dpois(c(0,1,2,3,4), 0.8)
[1] 258.813483 207.050787 82.820315 22.085417 4.417083
> 576*(1 - sum(dpois(c(0,1,2,3,4), 0.8)))
[1] 0.8129146

As in the previous case, we merge the last two classes for all the expected absolute frequencies to be larger than four:

EMPIRICAL
Number of Bombs      0     1     2    3    4 or more
Number of Regions    229   211   93   35   7+1=8        n = 576

EXPECTED
Number of Bombs      0        1        2       3       4 or more
Number of Regions    258.81   207.05   82.82   22.09   4.42+0.813=5.233   n = 576

> empirical = c(229, 211, 93, 35, 8)

> expected = 576*c(dpois(c(0,1,2,3), 0.8), (1-sum(dpois(c(0,1,2,3), 0.8))))

> sum(((empirical-expected)^2)/expected)

[1] 13.77982

so

T0(x) = (229 − 258.81)²/258.81 + ⋯ + (8 − 5.233)²/5.233 = 13.78

On the other hand, for this kind of test the critical region always has the form Rc = {T0(X) > a}, where

T0(X) →d χ²_{K−s−1} ≡ χ²_{5−1−0} ≡ χ²_4

Now T0 follows the χ2 distribution with 4 degrees of freedom—it was 3 in the previous section.

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 9.49 → Rc = {T0(X) > 9.49}
> qchisq(1-0.05, 4)
[1] 9.487729

Then, the decision is: T0(x) = 13.78 > 9.49 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 13.78) = 0.0080
> 1 - pchisq(13.78, 4)
[1] 0.00803134

→ pV = 0.0080 < 0.05 = α → H0 is rejected.

Conclusion: The hypothesis that bomb hits can reasonably be modeled by using the Poisson family has not been rejected. In this case, data provided an estimate λ̂ = 0.93. Nevertheless, when the value λ = 0.8 is imposed, the hypothesis that bomb hits can be modeled by using a Pois(λ=0.8) model is rejected. This proves that:

i. Even a quite reasonable model may not fit the data if inappropriate parameter values are considered. This emphasizes the importance of using good parameter estimation methods.

ii. Estimating the parameter value was better than fixing a value close to the estimate. As statisticians say: “let the data talk”. This highlights the necessity of testing all suppositions, which implies that nonparametric procedures should sometimes be applied before the parametric ones: in this case, before supposing that the Poisson family is proper and imposing a value for the parameter, the whole Poisson family must be considered.

(Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)

Advanced theory: Mendenhall, W., D.D. Wackerly and R.L. Scheaffer say (Mathematical Statistics with Applications, Duxbury Press) that the expected absolute frequencies can be as low as 1 for some situations, according to Cochran, W.G., “The χ2 Test of Goodness of Fit”, Annals of Mathematical Statistics, 23 (1952), pp. 315–345. To take the most advantage of this exercise, we repeat the previous calculations without merging the last two classes.

(1) Fit to the Poisson family

We evaluate T0, which is necessary to apply any of the two methodologies.

T0(x) = (229 − 227.26)²/227.26 + ⋯ + (1 − 1.55)²/1.55 = 1.167

Now there are K = 6 classes and s = 1 estimation, so T0(X) →d χ²_{K−s−1} ≡ χ²_{6−1−1} ≡ χ²_4. If we apply the methodology based on the critical region, the necessary quantile a is calculated from the definition of the type I error, with the given α = 0.05:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 9.49 → Rc = {T0(X) > 9.49}
> qchisq(1-0.05, 4)
[1] 9.487729

Then, the decision is: T0(x) = 1.167 < 9.49 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the alternative methodology based on the p-value,

pV = P(T0(X) > 1.167) = 0.88
> 1-pchisq(1.167, 4)
[1] 0.8835014

→ pV = 0.88 > 0.05 = α → H0 is not rejected.

(2) Fit to the member Pois(λ = 0.8)

We calculate the value of T0:

T0(x) = (229 − 258.81)²/258.81 + ⋯ + (1 − 0.813)²/0.813 = 13.87

Since K = 6 and s = 0, now T0(X) →d χ²_{K−s−1} ≡ χ²_{6−1−0} ≡ χ²_5. Then,

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 11.07 → Rc = {T0(X) > 11.07}
> qchisq(1-0.05, 5)
[1] 11.0705

Then, the decision is: T0(x) = 13.87 > 11.07 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 13.87) = 0.0165
> 1-pchisq(13.87, 5)
[1] 0.01645663

→ pV = 0.0165 < 0.05 = α → H0 is rejected.

In both sections the same decisions have been made, which implies that this is one of those situations where

merging the last two classes does not seem essential.
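The whole goodness-of-fit computation of section (1) can also be scripted compactly in R (a sketch with our own variable names; the merged classes and the manual degrees of freedom K − s − 1 are those used above):

observed <- c(229, 211, 93, 35, 8)                      # merged empirical table
lambda   <- 0.93                                        # estimated from the data
probs    <- c(dpois(0:3, lambda), 1 - ppois(3, lambda)) # class probabilities
expected <- 576*probs
T0 <- sum((observed - expected)^2/expected)             # 1.019
1 - pchisq(T0, df = length(observed) - 1 - 1)           # p-value with 3 degrees of freedom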

My notes:

Exercise 3ht-np

Three financial products have been commercialized and the presence of interest in them has been registered for some individuals. It is possible to imagine different situations where the following data could have been obtained.

           Product 1   Product 2   Product 3
Group 1    10          18          9           37
Group 2    20          13          15          48
           30          31          24          85

(a) A simple random sample of 48 people of the second group was allocated after considering the variable product. Test at α = 0.01 whether this variable follows the distribution determined by the sample of the first group.

(b) A simple random sample of 85 people with interest in any of the products was allocated after considering the two variables group and product. Test at α = 0.01 the independence of the two variables.

(c) From two independent groups, simple random samples of 37 people and 48 people are surveyed,

respectively. Test at α = 0.01 the homogeneity of the distribution of the variable product in the groups.

Discussion: In this exercise, the same table is looked at as containing data obtained from three different

schemes. The chi-square methodology will be applied in all sections through three kinds of test: goodness-of-fit, independence and homogeneity. In the first case, a probability distribution F0 is specified, while in the last two cases the underlying distributions have no interest by themselves.

Statistic: To apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select

T0(X) = Σ_{k=1..K} (N_k − ê_k)² / ê_k →d χ²_{K−s−1}

where there are K classes and s parameters must be estimated to determine the probabilities.

Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis assumes that the theoretical

probabilities of the second group follow the probabilities determined by the sample of the first group. If Fk

represents the distribution of the variable product in the k-th population,

H0: F2 ∼ F1 and H1: F2 ∼ F ≠ F1

The variable of the first group determines the following distribution F1:

Value          1       2       3
Probability    10/37   18/37   9/37

Now, under H0 the formula e_k = n·p_k allows us to fill in the expected table. Then,

T0(x) = (20 − 48·10/37)²/(48·10/37) + (13 − 48·18/37)²/(48·18/37) + (15 − 48·9/37)²/(48·9/37) = 9.34

On the other hand, for this kind of test the critical region always has the form Rc = {T0(X) > a}.

Decision: Since there are K = 3 classes and s = 0 (no parameter has to be estimated to determine the probabilities),

T0(X) →d χ²_{K−s−1} ≡ χ²_{3−0−1} ≡ χ²_2

If we apply the methodology based on the critical region, the necessary quantile a is calculated from the definition of the type I error, with the given α = 0.01:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 9.21 → Rc = {T0(X) > 9.21}

Then, the decision is: T0(x) = 9.34 > 9.21 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) ≈ P(T0(X) > 9.34) < P(T0(X) > 9.21) = 1 − 0.99 = 0.01

(9.34 is not in our table while 9.21 is.)

→ pV < 0.01 = α → H0 is rejected.

Statistic: To apply a test of independence, from a table of statistics (e.g. in [T]) we select

T0(X) = Σ_{l=1..L} Σ_{k=1..K} (N_lk − ê_lk)² / ê_lk →d χ²_{(L−1)(K−1)}

for L and K classes, respectively. Underlying distributions are supposed—but not specified—for the variables

X and Y, and the probabilities are directly estimated from the sample information.

Hypotheses: For a nonparametric independence test, the null hypothesis assumes that the probability at any cell is the product of the marginal probabilities of its row and column,

H0: X, Y independent and H1: X, Y dependent

or, probabilistically,

H0: f(x, y) = fX(x)·fY(y) and H1: f(x, y) ≠ fX(x)·fY(y)

Under H0, the formula ê_lk = n·p̂_lk = n·p̂_l·p̂_k = n·(N_l./n)·(N._k/n) allows us to fill in the expected table.

Then,

T0(x) = (10 − 37·30/85)²/(37·30/85) + ⋯ + (15 − 48·24/85)²/(48·24/85) = 4.29

For this kind of test, the critical region always has the following form: Rc = {T0(X) > a}.

Decision: There are L = 2 and K = 3 classes, respectively, so

T0(X) →d χ²_{(L−1)(K−1)} ≡ χ²_{(2−1)(3−1)} ≡ χ²_2

For the first methodology, to calculate the quantile a the definition of type I error is applied with α = 0.01:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 9.21 → Rc = {T0(X) > 9.21}

Then, the decision is: T0(x) = 4.29 < 9.21 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) ≈ P(T0(X) ≥ 4.29) > P(T0(X) > 4.61) = 1 − 0.9 = 0.1

→ pV > 0.1 > 0.01 = α → H0 is not rejected.

Statistic: To apply a test of homogeneity, from a table of statistics (e.g. in [T]) we select

T0(X) = Σ_{l=1..L} Σ_{k=1..K} (N_lk − ê_lk)² / ê_lk →d χ²_{(L−1)(K−1)}

for L groups and K classes. An underlying distribution is supposed—but not specified—for the variable X, and

the probabilities are directly estimated from the sample information. (Note that the membership of a group can

be seen as the value of a factor.)

Hypotheses: For a nonparametric homogeneity test, the null hypothesis assumes that the marginal

probabilities in any column are the same for the two groups, that is, are independent of the group or stratum.

This means that the variable of interest X follows the same probability distribution in each (sub)group or

stratum. If G represents the variable group, mathematically,

H0: F(x | G) = F(x) and H1: F(x | G) ≠ F(x)

Under H0, the formula ê_lk = n_l·p̂_lk = n_l·p̂_k = n_l·N._k/n allows us to fill in the expected table.

Then

T0(x) = (10 − 37·30/85)²/(37·30/85) + ⋯ + (15 − 48·24/85)²/(48·24/85) = 4.29

For this kind of test, the critical region always has the following form: Rc = {T0(X) > a}, with

T0(X) →d χ²_{(L−1)(K−1)} ≡ χ²_{(2−1)(3−1)} ≡ χ²_2

If we apply the methodology based on the critical region, to calculate the quantile a the definition of type I error is applied with α = 0.01:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 9.21 → Rc = {T0(X) > 9.21}

Then, the decision is: T0(x) = 4.29 < 9.21 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the methodology based on the p-value,

pV = P(T0(X) > T0(x)) ≈ P(T0(X) > 4.29) > P(T0(X) > 4.61) = 1 − 0.9 = 0.1

(4.29 is not in our table while 4.61 is.)

→ pV > 0.1 > 0.01 = α → H0 is not rejected.

Conclusion (advanced): Neither the independence nor the homogeneity has been rejected, while the

hypothesis supposing that the variable product follows in population 2 the distribution determined by the

sample of the group 1 has been rejected. On the one hand, the distribution determined by one sample, involved

in section (a), is in general different to the common supposed underlying distribution involved in section (b),

which is estimated by using the samples of both groups. Thus, it can be thought that this underlying

distribution “is between the two samples”, by which we can justify the decisions made in (a), (b) and (c).

Group 2 has more weight in determining that distribution, since it has more elements. It is worth noticing the

similarity between the independence and the homogeneity tests: same distribution and evaluation for the

statistic, same critical region, et cetera. (As regards the application of the methodologies, binding the p-value

is sometimes enough to discover whether it is smaller than α or not, but in general statisticians want to find its

value.)
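The three sections can be cross-checked with R's built-in function chisq.test(), which covers both the goodness-of-fit scheme of (a) and the independence/homogeneity scheme of (b) and (c):

# (a) goodness of fit of group 2 to the distribution F1 determined by group 1
chisq.test(c(20, 13, 15), p = c(10, 18, 9)/37)    # T0(x) = 9.34 with 2 df
# (b) and (c): independence and homogeneity share the same computation
counts <- matrix(c(10, 18, 9,
                   20, 13, 15), nrow = 2, byrow = TRUE)
chisq.test(counts)                                # T0(x) = 4.29 with 2 df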

My notes:

[HT] Parametric and Nonparametric

Exercise 1ht

To test if a coin is fair or not, it has independently been tossed 100,000 times (the outputs are a simple

random sample), and 50,347 of them were heads. Should the fairness of the coin, as null hypothesis, be

rejected when α = 0.1?

(a) Apply a parametric test. By using a computer, plot the power function.

(b) Apply the nonparametric chi-square goodness-of-fit test.

(c) Apply the nonparametric position signs test.

Discussion: In this exercise, no supposition should be evaluated: in (a) because the Bernoulli model is “the

only proper one” to model a coin, and in (b) and (c) because they involve nonparametric tests. The sections of

this exercise need the same calculations as in previous exercises.

Statistic: From a table of statistics (e.g. in [T]), since the population variable is Bernoulli and the asymptotic framework can be considered (since n is big), the statistic

T(X; η) = (η̂ − η) / √(?(1−?)/n) →d N(0,1)

is selected, where the symbol ? is substituted by the best information available. In testing hypotheses, it will be used in two forms:

T0(X) = (η̂ − η0) / √(η0(1−η0)/n) →d N(0,1) and T1(X) = (η̂ − η1) / √(η1(1−η1)/n) →d N(0,1)

where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we

do not have nor suppose this information when T is used to build a confidence interval, or for tests with two

populations). Regardless of the methodology to be applied, the following value will be necessary:

T0(x) = (50,347/100,000 − 1/2) / √((1/2)·(1 − 1/2)/100,000) = 2.19

where η0 = 1/2 when the coin is supposed to be fair.

Hypotheses: Since a parametric test must be applied, the coin—dichotomic situation—is modeled by a Bernoulli random variable, and the hypotheses are

H0: η = η0 = 1/2 and H1: η = η1 ≠ 1/2

Note that the question is about the value of the parameter η while the Bernoulli distribution is supposed under both hypotheses; in some nonparametric tests, this distribution is not even supposed in general (although the only reasonable distribution to model a coin is the Bernoulli). For this kind of alternative hypothesis, the critical region takes the form Rc = {|T0(X)| > a}.

Decision: To determine Rc, the quantiles are calculated from the type I error with α = 0.1 at η0 = 1/2:

α(1/2) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X; θ) ∈ Rc | H0) = P(|T0(X)| > a)

→ a = r_{α/2} = 1.645 → Rc = {|T0(X)| > 1.645}

Thus, the decision is: T0(x) = 2.19 > 1.645 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(|T0(X)| > |T0(x)|) = 2·P(T0(X) < −2.19) = 2·0.0143 = 0.0248

→ pV = 0.0248 < 0.1 = α → H0 is rejected.

Power function: To calculate β, we have to work under H1. Since in this case the critical region is already expressed in terms of T0 and we must use T1, we apply the mathematical tricks of multiplying and dividing by the same quantity and of adding and subtracting the same quantity:

β(η1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X) ∉ Rc | H1) = P(|T0(X)| ≤ 1.645 | H1)

= P( −1.645 ≤ (η̂ − η0)/√(η0(1−η0)/n) ≤ +1.645 | H1 )

= P( −1.645·√(η0(1−η0))/√(η1(1−η1)) ≤ (η̂ − η0)/√(η1(1−η1)/n) ≤ +1.645·√(η0(1−η0))/√(η1(1−η1)) | H1 )

= P( −1.645·√(η0(1−η0))/√(η1(1−η1)) − √n·(η1−η0)/√(η1(1−η1)) ≤ (η̂ − η1)/√(η1(1−η1)/n) ≤ +1.645·√(η0(1−η0))/√(η1(1−η1)) − √n·(η1−η0)/√(η1(1−η1)) | H1 )

= P( T1 ≤ [+1.645·√(η0(1−η0)) − √n·(η1−η0)]/√(η1(1−η1)) ) − P( T1 < [−1.645·√(η0(1−η0)) − √n·(η1−η0)]/√(η1(1−η1)) )


By using a computer, many more values η1 ≠ 0.5 can be considered to plot the power function

φ(η) = P(Reject H0) = α(η) if η ∈ Θ0, and 1 − β(η) if η ∈ Θ1

# Sample and inference

n = 100000

alpha = 0.1

theta0 = 0.5 # Value under the null hypothesis H0

q = qnorm(c(alpha/2, 1-alpha/2),0,1)

theta1 = seq(from=0,to=1,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1-pnorm((q[2]*sqrt(theta0*(1-theta0))-sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)),0,1) +

pnorm((q[1]*sqrt(theta0*(1-theta0))-sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)),0,1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Statistic: To apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select

T0(X) = Σ_{k=1..K} (N_k − ê_k)² / ê_k →d χ²_{K−s−1}

where there are K classes, and s parameters have to be estimated to determine F0 and hence the probabilities.

Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis supposes that the sample was

generated by a Bernoulli distribution with η0 = 1/2, while the alternative hypothesis supposes that it was

generated by a different distribution (Bernoulli or not, although this distribution is here “the reasonable way”

of modeling a coin).

H0: X ∼ F0 = B(1/2) and H1: X ∼ F ≠ B(1/2)

For the distribution F0, the table of probabilities is

Value          −1 (tail)   +1 (head)
Probability    1/2         1/2

and, under H0, the formula e_k = n·p_k = n·P_θ(k-th class) = 100,000·(1/2) = 50,000 allows us to fill in the expected table, and

T0(x) = Σ_{k=1,2} (n_k − e_k)²/e_k = (50,347 − 50,000)²/50,000 + (49,653 − 50,000)²/50,000 = 4.82

On the other hand, for this kind of test, the critical region always has the following form: Rc = {T0(X) > a}.

Decision: There are K = 2 classes and s = 0 (no parameter has been estimated), so

T0(X) →d χ²_{K−s−1} ≡ χ²_{2−1−0} ≡ χ²_1

If we apply the methodology based on the critical region, the definition of type I error, with α = 0.1, is applied to calculate the quantile a:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = r_α = 2.71 → Rc = {T0(X) > 2.71}

Then, the decision is: T0(x) = 4.82 > 2.71 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x) = P(T0(X) > T0(x)) ≈ P(T0(X) > 4.82) < P(T0(X) > 3.84) = 0.05

→ pV < 0.05 < 0.1 = α → H0 is rejected.

Note: Binding the p-value is sometimes enough to make the decision—4.82 is not in our table while 3.84 is.

Statistic: To apply a position sign test, from a table of statistics (e.g. in [T]) we select

T0(X) = Number{X_j − θ0 > 0} ∼ Bin(n, P(X_j > θ0))

Here θ0 = 0 and P(X_j > 0) = 1/2, so Me(T0) = E(T0) = n/2.

Hypotheses: For a nonparametric position test, if head and tail are equivalently translated into the numbers +1 and −1, respectively, the hypotheses are

H0: Me(X) = θ0 = 0 and H1: Me(X) = θ1 ≠ 0

For these hypotheses, the critical region takes the form Rc = {|T0(X) − n/2| > a}. We need the evaluation

|T0(x) − 100,000/2| = |50,347 − 50,000| = 347

Decision: In the first methodology, the quantile a is calculated by applying the definition of the type I error with α = 0.1. On the one hand, we know the distribution of T0, while, on the other hand, Rc was easily written in terms of T0 − n/2, whose distribution is involved in a well-known asymptotic result—the Central Limit Theorem for the Bin(n, 1/2). (Moreover, the probabilities of the binomial distribution are not tabulated for n = 100,000.) Then,

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc)

= P(|T0(X) − n/2| > a) = P( |T0(X) − n/2|/√(n·(1/2)·(1 − 1/2)) > a/√(n·(1/2)·(1 − 1/2)) ) ≈ P(|Z| > 2a/√n)

→ r_{α/2} = 1.645 = 2a/√n → a ≈ 1.645·√100,000/2 ≈ 260.097 → Rc = {|T0(X) − n/2| > 260.097}

The final decision is: |T0(x) − 100,000/2| = 347 > 260.097 = a → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P( |T0(X) − n/2|/√(n·(1/2)·(1 − 1/2)) > |T0(x) − n/2|/√(n·(1/2)·(1 − 1/2)) ) ≈ P( |Z| > (50,347 − 50,000)/√(100,000·(1/2)·(1 − 1/2)) )

= P(|Z| > 2.19) = 2·P(Z < −2.19) = 2·0.0143 = 0.0248

→ pV = 0.0248 < 0.1 = α → H0 is rejected.

Conclusion: (1) In this case the three different tests agree to make the same decision, but this may not

happen in other situations. When it is possible to compare the power functions and there exists a uniformly

most powerful test, the decision of the most powerful should be considered. In general, (proper) parametric

tests are expected to have more power than the nonparametric ones in testing the same hypotheses. (2) With

two classes, the chi-square test does not distinguish any two distributions such that the two class probabilities

are (½, ½), that is, in this case the test provides a decision about the symmetry of the distribution (chi-square

tests work with class probabilities, not with the distributions themselves). (3) In this exercise the parametric

test and the nonparametric test of the signs are essentially the same. (Remember: statistical results depend on:

the assumptions, the methods, the certainty and the data.)
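The decisions of sections (a) and (b) can be reproduced with R's built-in tests; note that prop.test() reports the square of the z statistic, and 2.19² = 4.82 is exactly the chi-square value obtained in (b):

prop.test(50347, 100000, p = 0.5, correct = FALSE)  # asymptotic: X-squared = 4.82
binom.test(50347, 100000, p = 0.5)                  # exact binomial version of the sign test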

My notes:

PE – CI – HT

Exercise 1pe-ci-ht

From a previous pilot study, concerning the monthly amount of money (in $) that male and female students of

a community spend on cell phones, the following hypotheses are reasonably supported:

i. The variable amount of money follows a normal distribution in both populations.

ii. The population means are μM = $14.2 and μF = $13.5, respectively.

iii. The two populations are independent.

Two independent simple random samples of sizes nM = 53 and nF = 49 are considered, from which the

following statistics have been calculated:

S_M² = 4.99 $² and S_F² = 5.02 $²

Then,

A) Calculate the probability P(M̄ − F̄ ≤ 1.27). Repeat the calculations with the supposition that σM and σF are equal.

B) Build a 95% confidence interval for the quotient σM/σF.

Discussion: The pilot statistical study mentioned in the statement should cover the evaluation of all

suppositions. The hypothesis that σM = σF should be evaluated as well. The interval will be built by applying

the method of the pivot.

M ≡ Amount of money spent by a male (one) M ~ N(μM, σM2)

F ≡ Amount of money spent by a female (one) F ~ N(μF, σF2)

• There are two independent normal populations

• Standard deviations σM and σF are unknown, and we compare them through σM/σF

T(M, F; μM, μF) = [(M̄ − F̄) − (μM − μF)] / √(S_M²/n_M + S_F²/n_F) ∼ t_κ with

κ = (S_M²/n_M + S_F²/n_F)² / [ (1/(n_M−1))·(S_M²/n_M)² + (1/(n_F−1))·(S_F²/n_F)² ]

T(M, F; μM, μF) = [(M̄ − F̄) − (μM − μF)] / √(S_p²/n_M + S_p²/n_F) ∼ t_{n_M+n_F−2} with

S_p² = (n_M·s_M² + n_F·s_F²)/(n_M + n_F − 2) = [(n_M−1)·S_M² + (n_F−1)·S_F²]/(n_M + n_F − 2)

T(M, F; σM, σF) = (S_M²/σ_M²) / (S_F²/σ_F²) = (S_M²·σ_F²)/(S_F²·σ_M²) ∼ F_{n_M−1, n_F−1}

Because of the information available, the first and the second statistics allow studying M – F (the second for

the particular case where σM = σF), while the third allows studying σM/σF.

(A) Probability

P(M̄ − F̄ ≤ 1.27) = P( [(M̄ − F̄) − (μM − μF)]/√(S_M²/n_M + S_F²/n_F) ≤ [1.27 − (μM − μF)]/√(S_M²/n_M + S_F²/n_F) )

= P( T ≤ [1.27 − (14.2 − 13.5)]/√(4.99/53 + 5.02/49) ) = P(T ≤ 1.29)

with T ∼ t_κ where

κ = (4.99/53 + 5.02/49)² / [ (1/(53−1))·(4.99/53)² + (1/(49−1))·(5.02/49)² ] = 99.33

Should we round this value downward, κ = 99, or upward, κ = 100? We will use this exercise to show that

➢ For large values of κ1 and κ2, the t distribution provides close values

➢ For a large value of κ, the t distribution provides values close to those of the standard normal distribution (the tκ distribution tends with κ to the standard normal distribution)

By using the programming language R:

• If we round κ = 99.33 down to 99, the probability is
> pt(1.29, 99)
[1] 0.8999721
• If we round κ = 99.33 up to 100, the probability is
> pt(1.29, 100)
[1] 0.8999871
• If we use the standard normal distribution instead, the probability is
> pnorm(1.29)
[1] 0.9014747

On the other hand, when the variances are supposed to be equal they can and should be estimated jointly by

using the pooled sample variance.

S_p² = [(n_M−1)·S_M² + (n_F−1)·S_F²]/(n_M + n_F − 2) = [(53−1)·4.99 + (49−1)·5.02]/(53 + 49 − 2) $² = 5.0044 $² ≈ 5 $²

Then,

P(M̄ − F̄ ≤ 1.27) = P( T ≤ [1.27 − (14.2 − 13.5)]/√(5/53 + 5/49) ) = P(T ≤ 1.29) with T ∼ t_{53+49−2} ≡ t_100

• By using the table of the t distribution, the probability is 0.9.
• By using the language R, the probability is
> pt((1.27-14.2+13.5)/sqrt((5/53)+(5/49)), 100)
[1] 0.8993372

(B) Method of the pivotal quantity:

1 − α = P( l_{α/2} ≤ (S_M²·σ_F²)/(S_F²·σ_M²) ≤ r_{α/2} ) = P( l_{α/2}·S_F²/S_M² ≤ σ_F²/σ_M² ≤ r_{α/2}·S_F²/S_M² ) = P( S_M²/(l_{α/2}·S_F²) ≥ σ_M²/σ_F² ≥ S_M²/(r_{α/2}·S_F²) )

so confidence intervals for σM²/σF² and σM/σF are respectively given by

I_{1−α} = [ S_M²/(r_{α/2}·S_F²), S_M²/(l_{α/2}·S_F²) ] and then I_{1−α} = [ √(S_M²/(r_{α/2}·S_F²)), √(S_M²/(l_{α/2}·S_F²)) ]

In the calculations, multiplying by a quantity and inverting can be applied in either order.

• S_M² = 4.99 $² and S_F² = 5.02 $²
• 95% → 1 − α = 0.95 → α/2 = 0.025 → l_{α/2} = 0.57 and r_{α/2} = 1.76
> qf(c(0.025, 1-0.025), 53-1, 49-1)
[1] 0.5723433 1.7583576

Then

I_0.95 = [ √(4.99/(1.76·5.02)), √(4.99/(0.57·5.02)) ] = [0.75, 1.32]

Conclusion: First of all, in this case there is very little difference between the two ways of estimating the

variance. On the other hand, as the variances are related through a quotient, the interpretation is not direct: the

dimensionless, multiplicative factor c in σM2 = cσF2 is, with 95% confidence, in the interval obtained. The

interval (with dimensionless endpoints) contains the value 1, so it may happen that the variability of the

amount of money spent is the same for males and females—we cannot reject this hypothesis (note that

confidence intervals can be used to make decisions). (Remember: statistical results depend on: the

assumptions, the methods, the certainty and the data.)
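The key quantities of both sections can be reproduced in R (a sketch; the variable names are ours):

sM2 <- 4.99; sF2 <- 5.02; nM <- 53; nF <- 49
(sM2/nM + sF2/nF)^2 / ((sM2/nM)^2/(nM - 1) + (sF2/nF)^2/(nF - 1))  # kappa = 99.33
q <- qf(c(0.025, 0.975), nM - 1, nF - 1)   # quantiles l and r of the F(52, 48)
sqrt(sM2/(rev(q)*sF2))                     # interval for sigma_M/sigma_F: [0.75, 1.32]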

My notes:

Exercise 2pe-ci-ht

The electric light bulbs of manufacturer X have a mean lifetime of 1400 hours (h), while those of

manufacturer Y have a mean lifetime of 1200h. Simple random samples of 125 bulbs of each brand are tested.

From these datasets the sample quasivariances Sx2 = 156h2 and Sy2 = 159h2 are computed. If manufacturers

are supposed to be independent and their lifetimes are supposed to be normally distributed:

a) Build a 99% confidence interval for the quotient of standard deviations σX/σY. Is the value σX/σY = 1, that is, the case σX = σY, included in the interval?

b) By using the proper statistic T, find k such that P(X̄ − Ȳ ≤ k) = 0.4.

Hint: (i) Firstly, build an interval for the quotient σX²/σY²; secondly, apply the positive square root function. (ii) If a random variable ξ follows a F_{124, 124} then P(ξ ≤ 0.628) = 0.005 and P(ξ ≤ 1.59) = 0.995. (iii) If ξ follows a t_248, then P(ξ ≤ −0.25) = 0.4

(Based on an exercise of Statistics, Spiegel, M.R., and L.J. Stephens, McGraw–Hill.)

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B.Heaton. Longman.)

electric means carrying, producing, produced by, powered by, or charged with electricity: 'an electric wire', 'an electric generator', 'an

electric shock', 'an electric current', 'an electric light bulb', 'an electric toaster'. For machines and devices that are powered by electricity

but do not have transistors, microchips, valves, etc, use electric (NOT electronic): 'an electric guitar', 'an electric train set', 'an electric

razor'.

electrical means associated with electricity: 'electrical systems', 'a course in electrical engineering', 'an electrical engineer'. To refer to the

general class of things that are powered by electricity, use electrical (NOT electric): 'electrical equipment', 'We stock all the latest

electrical kitchen appliances'.

electronic is used to refer to equipment which is designed to work by means of an electric current passing through a large number of

transistors, microchips, valves etc, and components of this equipment: 'an electronic calculator', 'tiny electronic components'. Compare:

'an electronic calculator' BUT 'an electric oven'. An electronic system is one that uses equipment of this type: 'electronic surveillance', 'e-

mail' (=electronic mail, a system for sending messages very quickly by means of computers).

electronics (WITH s) refers to (1) the branch of science and technology concerned with the study, design or use of electronic equipment: 'a student of electronics' (2) (used as a modifier) anything that is connected with this branch: 'the electronics industry'.

Discussion: There are two independent normal populations. All suppositions should be evaluated. Their

means are known while their variances are estimated from samples of size 125. A 99% confidence interval for

σX/σY is required. The interval will be built by applying the method of the pivot. If the value σX/σY=1 belongs

to this interval of confidence 0.99, the probability of the second section can reasonably be calculated under the

supposition σX=σ=σY—this implies that the common variance σ2 is jointly estimated by using the pooled

sample quasivariance Sp2. On the other hand, this exercise shows the natural order in which the statistical

techniques must sometimes be applied in practice: the supposition σX=σY is empirically supported—by

applying a confidence interval or a hypothesis test—before using it in calculating the probability. Since the

standard deviations have the same units of measurement as the data (hours), their quotient is dimensionless,

and so are the endpoints of the interval.

X ≡ Lifetime of a light bulb of manufacturer X X ~ N(μX=1400h, σX2)

Y ≡ Lifetime of a light bulb of manufacturer Y Y ~ N(μY=1200h, σY2)

Selection of the statistics: We know that:

• There are two independent normal populations

• The standard deviations σX and σY are unknown, and we compare them through σX/σY

From a table of statistics (e.g. in [T]) we select a (dimensionless) statistic. To compare the variances of two independent normal populations, we have two candidates:

T(X, Y; σX, σY) = (V_X²·σ_Y²)/(V_Y²·σ_X²) ∼ F_{n_X, n_Y} and T(X, Y; σX, σY) = (S_X²·σ_Y²)/(S_Y²·σ_X²) ∼ F_{n_X−1, n_Y−1}

where V_X² = (1/n)·Σ_{j=1..n} (X_j − μ)² and S_X² = (1/(n−1))·Σ_{j=1..n} (X_j − X̄)², respectively (similarly for population Y). We would use the first if we were given V_X² and V_Y² or we had enough information to calculate them (we know the means but not the data themselves). In this exercise we can use only the second statistic.

(a) Confidence interval

1 − α = P(l_{α/2} ≤ T ≤ r_{α/2}) = P( l_{α/2} ≤ (S_X²·σ_Y²)/(S_Y²·σ_X²) ≤ r_{α/2} ) = P( l_{α/2}·S_Y²/S_X² ≤ σ_Y²/σ_X² ≤ r_{α/2}·S_Y²/S_X² ) = P( S_X²/(l_{α/2}·S_Y²) ≥ σ_X²/σ_Y² ≥ S_X²/(r_{α/2}·S_Y²) )

so confidence intervals for σX²/σY² and σX/σY are respectively given by

I_{1−α} = [ S_X²/(r_{α/2}·S_Y²), S_X²/(l_{α/2}·S_Y²) ] and I_{1−α} = [ √(S_X²/(r_{α/2}·S_Y²)), √(S_X²/(l_{α/2}·S_Y²)) ]

(In the previous calculations, multiplying by a quantity and inverting can be applied either way.)

• S_X² = 156 h² and S_Y² = 159 h²
• 99% → 1 − α = 0.99 → α = 0.01 → α/2 = 0.005 → P(ξ ≤ l_{α/2}) = 0.005 and P(ξ ≤ r_{α/2}) = 0.995 → l_{α/2} = 0.628 and r_{α/2} = 1.59

where the information in (ii) of the hint has been used. Then

I_0.99 = [ √(156/(1.59·159)), √(156/(0.628·159)) ] = [0.786, 1.25]

The value σX/σY = 1 is in the interval of confidence 0.99 (99%), so the supposition σX = σY is strongly supported.

(b) Probability

To work with the difference of the means of two independent normal populations when σX = σY, we consider:

T(X, Y; μX, μY) = [(X̄ − Ȳ) − (μX − μY)] / √(S_p²/n_X + S_p²/n_Y) ∼ t_{n_X+n_Y−2}

where S_p² = [(n_X−1)·S_X² + (n_Y−1)·S_Y²]/(n_X + n_Y − 2) = (124·156 h² + 124·159 h²)/(125 + 125 − 2) = 157.5 h² is the pooled sample quasivariance.

0.4 = P(X̄ − Ȳ ≤ k) = P( [(X̄ − Ȳ) − (μX − μY)]/√(S_p²/n_X + S_p²/n_Y) ≤ [k − (μX − μY)]/√(S_p²/n_X + S_p²/n_Y) ) = P( T ≤ [k − (1400 − 1200)]/√(157.5/125 + 157.5/125) )

Now, by using the information in (iii) of the hint,

l_0.4 = −0.25 = [k − (1400h − 1200h)]/√(157.5h²/125 + 157.5h²/125) → k = 200h − 0.25·√(2·157.5h²/125) = 199.60h

Conclusion: A confidence interval has been obtained for the quotient of the standard deviations. The

dimensionless value of θ = σX/σY is between 0.786 and 1.250 with confidence 99%; alternatively, as the

standard deviations are related through a quotient, an equivalent interpretation is the following: the

(dimensionless) multiplicative factor θ in σX=θ·σY is, with 99% confidence, in the interval obtained. Since the

value θ = 1 is in this high-confidence interval, it may happen that the variability of the two lifetimes is the

same—we cannot reject this hypothesis (note that confidence intervals can be used to make decisions);

besides, it is reasonable to use the supposition σX=σY in calculating the probability of the second section. If

any two simple random samples of size 125 were considered, the difference of the sample means will be

smaller than 199.60 with a probability of 0.4. Once two particular samples are substituted, randomness is not

involved any more and the inequality ̄x − ̄y ≤k =199.60 is true or false. The endpoints of the interval have

no dimension, like the quotient σX/σY or the multiplicative factor c. (Remember: statistical results depend on:

the assumptions, the methods, the certainty and the data.)
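The quantiles given in the hint, and hence the final answers, can be reproduced in R:

qf(c(0.005, 0.995), 124, 124)                # 0.628 and 1.59, as in hint (ii)
sqrt(156/(1.59*159)); sqrt(156/(0.628*159))  # endpoints 0.786 and 1.25
qt(0.4, 248)                                 # -0.25, as in hint (iii)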

My notes:

Exercise 3pe-ci-ht

In 1990, 25% of births were by mothers of more than 30 years of age. This year a simple random sample of

size 120 births has been taken, yielding the result that 34 of them were by mothers of over 30 years of age.

a) With a significance of 10%, can it be accepted that the proportion of births by mothers of over 30 years of age is still ≤ 25%, against that it has increased? Select the statistic, write the critical region and make a decision. Calculate the p-value and make a decision. If the critical region is Rc = {η̂ > 0.30}, calculate β (probability of the type II error) for η1 = 0.35. Plot the power function with the help of a computer.

b) Obtain a 90% confidence interval for the proportion. Use it to make a decision about the value of η, which is equivalent to having applied a two-sided (nondirectional) hypothesis test in the first section.

(First half of section a and first calculation in b, from 2007's exams for accessing the Spanish university; I have added the other parts.)

Discussion: In this exercise, no supposition should be evaluated. The number 30 plays a role only in defining the population under study. The Bernoulli model is “the only proper one” to register the presence-absence of a condition. Percents must be rewritten in a 0-to-1 scale. Since the default option is that the proportion has not changed, the equality is allocated in the null hypothesis. On the other hand, proportions are dimensionless by definition.

Statistic: From a table of statistics (e.g. in [T]), since the population variable follows the Bernoulli distribution and the asymptotic framework can be considered (large n), the statistic

T(X; η) = (η̂ − η) / √(?(1−?)/n) →d N(0,1)

is selected, where the symbol ? is substituted by the best information available: η or η̂. In testing hypotheses, it will be used in two forms:

T0(X) = (η̂ − η0) / √(η0(1−η0)/n) →d N(0,1) and T1(X) = (η̂ − η1) / √(η1(1−η1)/n) →d N(0,1)

where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we do not have this information when T is used to build a confidence interval, like in the next section). Regardless of the testing methodology to be applied, the evaluation of the statistic is necessary to make the decision. Since η0 = 0.25,

T0(x) = (34/120 − 0.25) / √(0.25·(1 − 0.25)/120) = 0.843

Hypotheses:

H0: η ≤ η0 = 0.25 and H1: η > 0.25

For this alternative hypothesis, the critical region takes the form

Rc = {η̂ > c} = { (η̂ − η0)/√(η0(1−η0)/n) > (c − η0)/√(η0(1−η0)/n) } = {T0 > a}

Decision: To determine Rc, the quantile is calculated from the type I error with α = 0.1 at η0 = 0.25:

α(0.25) = P(Type I error) = P(Reject H0 | H0 true) = P(T0 > a)

→ a = r_0.1 = l_0.9 = 1.28 → Rc = {T0(X) > 1.28}

Now, the decision is: T0(x) = 0.843 < 1.28 → T0(x) ∉ Rc → H0 is not rejected.

p-value: pV = P(T0(X) > T0(x)) ≈ P(Z > 0.843) = 0.200

→ pV = 0.200 > 0.1 = α → H0 is not rejected.

Type II error: To calculate β, we have to work under H1. Since the critical region has been expressed in terms of T0, and we must use T1, we could apply the mathematical trick of adding and subtracting the same quantity. Nevertheless, this way is useful when the value c in Rc = {η̂ > c} has not been calculated yet; now, since we have been told that Rc = {η̂ > 0.3}, it is easier to directly standardize with η1:

β(η1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X) ∉ Rc | H1) = P(η̂ ≤ 0.3 | H1)

= P( (η̂ − η1)/√(η1(1−η1)/n) ≤ (0.3 − η1)/√(η1(1−η1)/n) | H1 ) = P( T1 ≤ (0.3 − η1)/√(η1(1−η1)/n) )

For the particular value η1 = 0.35,

β(0.35) = P( T1 ≤ (0.3 − 0.35)/√(0.35·(1 − 0.35)/120) ) = P(T1 ≤ −1.15) = 0.125
> pnorm(-1.15,0,1)
[1] 0.125

By using a computer, many more values η1 ≠ 0.35 can be considered to plot the power function

φ(η) = P(Reject H0) = α(η) if η ∈ Θ0, and 1 − β(η) if η ∈ Θ1

# Sample and inference

n = 120

alpha = 0.1

theta0 = 0.25 # Value under the null hypothesis H0

c = 0.3

theta1 = seq(from=0.25,to=1,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pnorm((c-paramSpace)/sqrt(paramSpace*(1-paramSpace)/n),0,1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

(b) Confidence interval

Statistic: From a table of statistics (e.g. in [T]), the same statistic is selected,

T(X; η) = (η̂ − η) / √(?(1−?)/n) →d N(0,1)

where the symbol ? is substituted by the best information available. In testing hypotheses we were also studying the unknown quantity η, although it was provisionally supposed to be known under the hypotheses; for confidence intervals, we are not working under any hypothesis and η must be estimated in the denominator:

T(X; η) = (η̂ − η) / √(η̂(1−η̂)/n) →d N(0,1)

The interval is obtained with the same calculations as in previous exercises involving a Bernoulli population,

I_{1−α} = [ η̂ − r_{α/2}·√(η̂(1−η̂)/n), η̂ + r_{α/2}·√(η̂(1−η̂)/n) ]

where r_{α/2} is the value of the standard normal distribution such that P(Z > r_{α/2}) = α/2. By using

• n = 120.
• Sample proportion: η̂ = 34/120 = 0.283.
• 90% → 1 − α = 0.9 → α = 0.1 → α/2 = 0.05 → r_0.05 = l_0.95 = 1.645.

the particular interval (for these data) appears

I_0.9 = [ 0.283 − 1.645·√(0.283·(1 − 0.283)/120), 0.283 + 1.645·√(0.283·(1 − 0.283)/120) ] = [0.215, 0.351]

Thinking about the interval as an acceptance region, since η0=0.25 ∈ I the hypothesis that η may still be

0.25 is not rejected.

Conclusion: With confidence 90%, the proportion of births by mothers of over 30 years of age seems to be

0.25 at most. The same decision is still made by considering the confidence interval that would correspond to

a two-sided (nondirectional) test with the same confidence, that is, by allowing the new proportion to be

different because it had severely increased or decreased. (Remember: statistical results depend on: the

assumptions, the methods, the certainty and the data.)
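The confidence interval of section (b) takes one line in R:

34/120 + c(-1, 1)*qnorm(1 - 0.1/2)*sqrt((34/120)*(1 - 34/120)/120)  # [0.215, 0.351]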

My notes:

Exercise 4pe-ci-ht

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

f(x; θ) = θ·x^(θ−1) if 0 ≤ x ≤ 1, and 0 otherwise

A) Apply the method of the moments to find an estimator of the parameter θ.

B) Apply the maximum likelihood method to find an estimator of the parameter θ.

C) Use the estimators obtained to build others for the mean μ and the variance σ2.

D) Let X = (X1,...,Xn) be a simple random sample. By applying the results involving Neyman-Pearson's lemma and the likelihood ratio, study the critical region for the following pairs of hypotheses:

{H0: θ = θ0, H1: θ = θ1}   {H0: θ = θ0, H1: θ = θ1 > θ0}   {H0: θ = θ0, H1: θ = θ1 < θ0}
{H0: θ ≤ θ0, H1: θ = θ1 > θ0}   {H0: θ ≥ θ0, H1: θ = θ1 < θ0}

Hint: Use that E(X) = θ/(θ+1) and E(X2) = θ/(θ+2).

Discussion: This statement is basically mathematical. The random variable X is dimensionless. (This

probability distribution, with standard power function density, is a particular case of the Beta distribution.)

Note: If E(X) had not been given in the statement, it could have been calculated by integrating:

E(X) = ∫ x·f(x; θ)dx = ∫_0^1 x·θ·x^(θ−1) dx = θ·∫_0^1 x^θ dx = θ·[x^(θ+1)/(θ+1)]_0^1 = θ/(θ+1)

Besides, E(X²) could have been calculated as follows:

E(X²) = ∫ x²·f(x; θ)dx = ∫_0^1 x²·θ·x^(θ−1) dx = θ·∫_0^1 x^(θ+1) dx = θ·[x^(θ+2)/(θ+2)]_0^1 = θ/(θ+2)

Now,

μ = E(X) = θ/(θ+1) and σ² = Var(X) = E(X²) − E(X)² = θ/(θ+2) − (θ/(θ+1))² = θ/[(θ+2)·(θ+1)²]

A) Method of the moments

a1) Population and sample moments: There is only one parameter—one equation is needed. The first-order moments of the model X and the sample x are, respectively,

μ1(θ) = E(X) = θ/(θ+1) and m1(x1, x2,..., xn) = (1/n)·Σ_{j=1..n} x_j = x̄

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first equation suffices:

μ1(θ) = m1(x1, x2,..., xn) → θ/(θ+1) = x̄ → θ = θ·x̄ + x̄ → θ = x̄/(1−x̄)

a3) The estimator:

θ̂_M = X̄/(1−X̄)

B) Maximum likelihood method

b1) Likelihood function: For this probability distribution, the density function is f(x; θ) = θ·x^(θ−1), so

L(x1, x2,..., xn; θ) = Π_{j=1..n} f(x_j; θ) = Π_{j=1..n} θ·x_j^(θ−1) = θ^n·(Π_{j=1..n} x_j)^(θ−1)

b2) Optimization problem: The logarithm function is applied to make calculations easier,

log[L(x1, x2,..., xn; θ)] = n·log(θ) + (θ−1)·log(Π_{j=1..n} x_j)

To find the local or relative extreme values, the necessary condition is:

0 = (d/dθ) log[L(x1, x2,..., xn; θ)] = n/θ + log(Π_{j=1..n} x_j) → θ0 = −n / log(Π_{j=1..n} x_j)

To verify that the only candidate is a (local) maximum, the sufficient condition is:

(d²/dθ²) log[L(x1, x2,..., xn; θ)] = (d/dθ)[n/θ + log(Π_{j=1..n} x_j)] = −n/θ² < 0

The second derivative is always negative, also at the value θ0.

b3) The estimator:

θ̂_ML = −n / log(Π_{j=1..n} X_j)

C) Estimation of μ and σ²

c1) For the mean: By applying the plug-in principle,

From the method of the moments: μ̂_M = θ̂_M/(θ̂_M + 1) = [X̄/(1−X̄)] / [X̄/(1−X̄) + 1] = X̄.

From the maximum likelihood method: μ̂_ML = θ̂_ML/(θ̂_ML + 1) = n / [n − log(Π_{j=1..n} X_j)].

c2) For the variance: Instead of substituting in the large expression of σ², we use functional notation:

From the method of the moments: σ̂²_M = σ²(θ̂_M), with σ²(θ) and θ̂_M given above.

From the maximum likelihood method: σ̂²_ML = σ²(θ̂_ML), with σ²(θ) and θ̂_ML given above.

D) Hypothesis test {H0: θ = θ0, H1: θ = θ1}

The likelihood function and the likelihood ratio are

L(X; θ) = θ^n·(Π_{j=1..n} X_j)^(θ−1) and Λ(X; θ0, θ1) = L(X; θ0)/L(X; θ1) = (θ0/θ1)^n·(Π_{j=1..n} X_j)^(θ0−θ1)

Then, the critical or rejection region is

Rc = {Λ < k} = { (θ0/θ1)^n·(Π_{j=1..n} X_j)^(θ0−θ1) < k } = { n·log(θ0/θ1) + (θ0−θ1)·log(Π_{j=1..n} X_j) < log(k) }

= { (θ0−θ1)·log(Π_{j=1..n} X_j) < log(k) − n·log(θ0/θ1) }

Writing the inequality in terms of θ̂_ML = −n/log(Π_{j=1..n} X_j) (note that log(Π X_j) < 0, since 0 ≤ X_j ≤ 1), it is necessary that θ1 ≠ θ0 and

• if θ1 < θ0 then (θ0−θ1) > 0 and hence Rc = { θ̂_ML < −n(θ0−θ1) / [log(k) − n·log(θ0/θ1)] }

• if θ1 > θ0 then (θ0−θ1) < 0 and hence Rc = { θ̂_ML > −n(θ0−θ1) / [log(k) − n·log(θ0/θ1)] }

In short,

Rc = {Λ < k} = ⋯ = {θ̂_ML < c} or Rc = {Λ < k} = ⋯ = {θ̂_ML > c}

The form of the critical region can qualitatively be justified as follows: if θ1 < θ0, the hypothesis H1 will be accepted when an estimator of θ is in the lower tail, and vice versa.

Hypothesis tests {H0: θ = θ0, H1: θ = θ1 > θ0} and {H0: θ = θ0, H1: θ = θ1 < θ0}

In applying the methodologies, the same critical value c will be obtained for any θ1, since it only depends upon θ0 through θ̂_ML. This implies that the uniformly most powerful test has been found.

Hypothesis tests {H0: θ ≤ θ0, H1: θ = θ1 > θ0} and {H0: θ ≥ θ0, H1: θ = θ1 < θ0}

A uniformly most powerful test for H0: θ = θ0 is also uniformly most powerful for H0: θ ≤ θ0.

Conclusion: For the probability distribution determined by the function given, two methods of points

estimation have been applied. In this case, the two methods provide different estimators. By applying the

plug-in principle, estimators of the mean and the variance have also been obtained. The form of the critical

region has been studied by applying the Neyman-Pearson's lemma and the likelihood ratio.
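Since this density is the particular case Beta(θ, 1), the two estimators can be compared by simulation in R (a sketch; the true value θ = 2 and the sample size are arbitrary choices):

set.seed(1)
theta <- 2; n <- 200
x <- rbeta(n, theta, 1)              # sample from f(x; theta) = theta*x^(theta-1)
c(moments = mean(x)/(1 - mean(x)),   # method of the moments
  ml = -n/sum(log(x)))               # maximum likelihood; both should be near 2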

My notes:

Additional Exercises

Exercise 1ae

Assume that the height (in centimeters, cm) of any student of a group follows a normal distribution with

variance 55cm2. If a simple random sample of 25 students is considered, calculate the probability that the

sample quasivariance will be bigger than 64.625cm2.

Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable

height should be evaluated by using proper statistical techniques.

Identification of the variable and selection of the statistic : The variable is the height, the

population distribution is normal, the sample size is 25, and we are asked for the probability of an event

expressed in terms of one of the usual statistics: P (S 2 > 64.625).

Search for a known distribution: Since we do not know the sampling distribution of S2, we cannot

calculate this probability directly. Instead, just after reading 'sample quasivariance' we should think about the

following theoretical result

( n−1)S 2 2 ( 25−1) S 2

T= 2

∼ χn −1 , or, in this case, T = 2

∼ χ225−1 ,

σ 55 cm

Rewriting the event: The event has to be rewritten by completing some terms until (the dimensionless

statistic) T appears. Additionally, when the table of the χ 2 distribution gives lower-tail probabilities P(X ≤ x), it

is necessary to consider the complementary event:

P(S² > 64.625 cm²) = P( (25−1)·S²/(55 cm²) > (25−1)·64.625 cm²/(55 cm²) ) = P(T > 28.2) = 1 − P(T ≤ 28.2) = 1 − 0.75 = 0.25.

In these calculations, one property of the transformations has been applied: multiplying or dividing by a

positive quantity does not modify an inequality.

Conclusion: The probability of the event is 0.25. This means that S2 will sometimes take a value bigger than

64.625cm2, when evaluated at specific data x coming from the population distribution.
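The probability can be verified with R (the tabulated value 0.75 is an approximation):

1 - pchisq((25 - 1)*64.625/55, df = 25 - 1)   # approximately 0.25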

My notes:

Exercise 2ae

Let X be a random variable with probability function

f(x; θ) = θ·x^(θ−1)/3^θ, x ∈ [0,3]

such that E(X) = 3θ/(θ+1). Supposing a simple random sample X = (X1,...,Xn), apply the method of the moments to find an estimator θ̂_M of the parameter θ.

Discussion: This statement is mathematical. Although it is given, the expectation of X could be calculated as follows:

μ1(θ) = E(X) = ∫ x·f(x; θ)dx = ∫_0^3 x·θ·x^(θ−1)/3^θ dx = (θ/3^θ)·[x^(θ+1)/(θ+1)]_0^3 = (θ/3^θ)·3^(θ+1)/(θ+1) = 3θ/(θ+1)

Method of the moments

Population and sample moments:

μ1(θ) = 3θ/(θ+1) and m1(x1, x2,..., xn) = (1/n)·Σ_{j=1..n} x_j = x̄

System of equations: Since the parameter θ appears in the first-order moment of X, the first equation is sufficient to apply the method:

μ1(θ) = m1(x1, x2,..., xn) → 3θ/(θ+1) = x̄ → 3θ = θ·x̄ + x̄ → θ·(3−x̄) = x̄ → θ = x̄/(3−x̄)

The estimator:

θ̂_M = X̄/(3−X̄)
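A simulated sanity check can be run in R; note that X/3 follows a Beta(θ, 1) distribution, so samples of X are easy to generate (the value θ = 2 is an arbitrary choice):

set.seed(1)
theta <- 2
x <- 3*rbeta(500, theta, 1)   # sample with density theta*x^(theta-1)/3^theta
mean(x)/(3 - mean(x))         # should be close to theta = 2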

My notes:

Exercise 3ae

A poll of 1000 individuals, being a simple random sample, over the age of 65 years was taken to determine

the percent of the population in this age group who had an Internet connection. It was found that 387 of the

1000 had one. Find a 95% confidence interval for η.

(Taken from an exercise of Statistics, Spiegel and Stephens, Mc-Graw Hill)

Discussion: Asymptotic results can be applied for this large sample of a Bernoulli population. The cutoff

age value determines the population of the statistical analysis, but it plays no other role. Both η and η^ are

dimensionless.

Identification of the variable: Having the connection or not is a dichotomic situation; then

X ≡ Connected (an individual)? X ~ Bern(η)

• There is one Bernoulli population

• The sample size is big, n = 1000, so an asymptotic approximation can be applied

A statistic is selected from a table (e.g. in [T]):

T(X; η) = (η̂ − η) / √(η̂(1−η̂)/n) →d N(0,1)

1 − α = P(l_{α/2} ≤ T(X; η) ≤ r_{α/2}) = P( −r_{α/2} ≤ (η̂ − η)/√(η̂(1−η̂)/n) ≤ +r_{α/2} )

= P( −r_{α/2}·√(η̂(1−η̂)/n) ≤ η̂ − η ≤ +r_{α/2}·√(η̂(1−η̂)/n) )

= P( −η̂ − r_{α/2}·√(η̂(1−η̂)/n) ≤ −η ≤ −η̂ + r_{α/2}·√(η̂(1−η̂)/n) )

= P( η̂ + r_{α/2}·√(η̂(1−η̂)/n) ≥ η ≥ η̂ − r_{α/2}·√(η̂(1−η̂)/n) )

(3) The interval: Then,

I_{1−α} = [ η̂ − r_{α/2}·√(η̂(1−η̂)/n), η̂ + r_{α/2}·√(η̂(1−η̂)/n) ]

where r α / 2 is the value of the standard normal distribution verifying P( Z> r α /2 )=α /2.

• n = 1000
• Theoretical (simple random) sample: X1,...,X1000 s.r.s. (each value is 1 or 0)
  Empirical sample: x1,...,x1000 → Σ_{j=1..1000} x_j = 387 → η̂ = (1/1000)·Σ_{j=1..1000} x_j = 387/1000 = 0.387
• 95% → 1 − α = 0.95 → α = 0.05 → α/2 = 0.025 → r_{α/2} = 1.96

Finally,

I_0.95 = [ 0.387 − 1.96·√(0.387·(1 − 0.387)/1000), 0.387 + 1.96·√(0.387·(1 − 0.387)/1000) ] = [0.357, 0.417]
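In R, the same interval is obtained in one line:

0.387 + c(-1, 1)*qnorm(1 - 0.05/2)*sqrt(0.387*(1 - 0.387)/1000)   # [0.357, 0.417]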

Conclusion: The unknown proportion of individuals over the age of 65 with an Internet connection is inside the range [0.357, 0.417] with a probability of 0.95, and outside the interval with a probability of 0.05. Perhaps a 0-to-100 scale facilitates the interpretation: the percentage of individuals is in [35.7%, 41.7%] with 95% confidence. Proportions and probabilities are always dimensionless quantities, even when expressed as percentages.
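In R, the interval can be reproduced directly from the formula (a minimal sketch; the base function prop.test gives a related interval with a continuity correction, so its output differs slightly):

n <- 1000
eta.hat <- 387 / n
r <- qnorm(1 - 0.05/2)                          # 1.96
margin <- r * sqrt(eta.hat * (1 - eta.hat) / n)
c(eta.hat - margin, eta.hat + margin)           # about [0.357, 0.417]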

My notes:

Exercise 4ae

A company is interested in studying its clients' behaviour. For this purpose, the time between consecutive demands of service is modelled by a random variable whose density function is:

f(x; θ) = (1/θ) e^{−(x−2)/θ} ,  x ≥ 2,  (θ > 0)

Consider the estimator θ̂_M = X̄ − 2 of the parameter θ.

1st Is it an unbiased estimator of the parameter? Why?

2nd Calculate its mean square error. Is it a consistent estimator of the parameter? Why?

Note: E(X) = θ + 2 and Var(X) = θ²

Discussion: The two sections are based on the calculation of the mean and the variance of the estimator

given in the statement. Then, the formulas of the bias and the mean square error must be used. Finally, the

limit of the mean square error is studied.

E(θ̂_M) = E(X̄ − 2) = E(X̄) − 2 = E(X) − 2 = θ + 2 − 2 = θ

Var(θ̂_M) = Var(X̄ − 2) = Var(X̄) = Var(X)/n = σ²/n = θ²/n

since subtracting a constant changes the mean but not the variance.

Unbiasedness: The estimator is unbiased, as the expression of the mean shows. Alternatively, we calculate the bias:

b(θ̂_M) = E(θ̂_M) − θ = θ − θ = 0

Mean square error:

MSE(θ̂_M) = b(θ̂_M)² + Var(θ̂_M) = 0² + θ²/n = θ²/n

The population variance θ² does not depend on the sample, particularly not on the sample size n. Then,

lim_{n→∞} MSE(θ̂_M) = lim_{n→∞} θ²/n = 0

Note: In general, the population variance can be finite or infinite (for some “strange” probability distributions we do not consider in this subject). If the variance is infinite, σ² = ∞, neither Var(θ̂_M) nor MSE(θ̂_M) exists, in the sense that they are infinite; in this particular exercise it is finite, θ² < ∞. In the former case, the mean square error would not exist and the consistency (in probability) could not be studied in this way. In the latter case, the mean square error exists and tends to zero (consistency in the mean-square sense), which is sufficient for the estimator of θ to be consistent (in probability).

Conclusion: The calculations of the mean and the variance are quite easy. They show that the estimator is

unbiased and, if the variance is finite, consistent.
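The behaviour of θ̂_M = X̄ − 2 can also be checked by simulation in R. The sketch below assumes the hypothetical value θ = 1.5; a sample from the translated exponential distribution is obtained as 2 + rexp(n, rate = 1/θ):

set.seed(1)
theta <- 1.5                              # hypothetical true value of the parameter
estimates <- replicate(5000, mean(2 + rexp(50, rate = 1/theta)) - 2)
mean(estimates)                           # close to theta = 1.5 (unbiasedness)
var(estimates)                            # close to theta^2/n = 1.5^2/50 = 0.045
mean(2 + rexp(1e6, rate = 1/theta)) - 2   # close to theta for large n (consistency)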

Advanced Theory: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions “of different types”):

E(X) = ∫₋∞⁺∞ x f(x; θ) dx = ∫₂^∞ x (1/θ) e^{−(x−2)/θ} dx = [ −x e^{−(x−2)/θ} ]₂^∞ − ∫₂^∞ 1·(−e^{−(x−2)/θ}) dx

= [ −x e^{−(x−2)/θ} − θ e^{−(x−2)/θ} ]₂^∞ = [ −(x+θ) e^{−(x−2)/θ} ]₂^∞ = 0 + (2+θ) = 2 + θ .

That ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx has been used with

• u = x → u' = 1
• v' = (1/θ) e^{−(x−2)/θ} → v = ∫ (1/θ) e^{−(x−2)/θ} dx = −e^{−(x−2)/θ}

On the other hand, e^x grows faster than x^k for any k, which justifies the zero limits at infinity above. To calculate E(X²):

E(X²) = ∫₋∞⁺∞ x² f(x; θ) dx = ∫₂^∞ x² (1/θ) e^{−(x−2)/θ} dx = [ −x² e^{−(x−2)/θ} ]₂^∞ + 2 ∫₂^∞ x e^{−(x−2)/θ} dx

= (2² − 0) + 2θ ∫₂^∞ x (1/θ) e^{−(x−2)/θ} dx = 4 + 2θμ = 4 + 2θ(2+θ) = 2θ² + 4θ + 4 .

Again, integration by parts has been applied: ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx with

• u = x² → u' = 2x
• v' = (1/θ) e^{−(x−2)/θ} → v = ∫ (1/θ) e^{−(x−2)/θ} dx = −e^{−(x−2)/θ}

σ² = E(X²) − E(X)² = 2θ² + 4θ + 4 − (θ+2)² = 2θ² + 4θ + 4 − θ² − 4θ − 4 = θ² .

Regarding the original probability distribution: (i) the expression reminds us of the exponential distribution; (ii) the term x−2 suggests a translation; and (iii) the variance θ² is the same as the variance of the exponential distribution. After translating all possible values x, the mean is also translated but the variance is not. Thus, the distribution of the statement is a translation of the exponential distribution, which has two equivalent notations (in terms of the mean θ or of the rate λ = 1/θ).

My notes:

Exercise 5ae

Is There Intelligent Life on Other Planets? In a 1997 Marist Institute survey of 935 randomly selected

Americans, 60% of the sample answered “yes” to the question “Do you think there is intelligent life on other

planets?” (http://maristpoll.marist.edu/tag/mipo/). Let's use this sample estimate to calculate a 90%

confidence interval for the proportion of all Americans who believe there is intelligent life on other planets.

What are the margin of error and the length of the interval?

(From Mind on Statistics. Utts, J.M., and R.F. Heckard. Thomson)

LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)

American. Many Canadians and Latin Americans are understandably irritated when U.S. citizens refer to themselves simply as

“Americans.” Canadians (and only Canadians) use the term “North American” to include themselves in a two-member group with their

neighbor to the south, though geographers usually include Mexico in North America. When addressing an international audience composed largely of people from the Americas, it is wise to consider their sensitivities.

However, it is pointless to try to ban this usage in all contexts. Outside of the Americas, “American” is universally understood.
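The solution follows the same lines as Exercise 3ae. As a sketch of the requested quantities in R (here the 0.95 quantile is used because 1 − α = 0.90, and the printed values are approximate):

n <- 935
eta.hat <- 0.60
r <- qnorm(1 - 0.10/2)                          # 1.645
margin <- r * sqrt(eta.hat * (1 - eta.hat) / n)
margin                                          # margin of error, about 0.026
c(eta.hat - margin, eta.hat + margin)           # about [0.574, 0.626]
2 * margin                                      # length of the interval, about 0.053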
