This document is a synopsis of the material covered in the Lecture Notes, section 4.1, as well as
PowerPoint slides 4.1a.pptx and 4.1b.pptx. References here will be taken from both sources.
It begins with a brief review of general Discrete Probability Models (4.1a.pptx), followed by an
important special case known as the Binomial Distribution (4.1b.pptx).

From Chapter 2, recall that a numerical random variable X can either be discrete (i.e., it takes its
values in “jumps,” such as X = shoe size: 6, 6½, 7, 7½, etc.) or continuous (i.e., it takes its values
along the continuous real number line, and thus can in principle be measured to arbitrary precision,
such as X = exact foot length). Random variables of both types can be used to define events in a
sample space of outcomes from an experiment, and the probabilities of those events can be calculated
using the methods of Chapter 3. We consider discrete probability models here; continuous models, which are more
complicated, are discussed in section 4.2. For example, page 4.1-2 shows the outcomes of
“ordered pairs” (Die1, Die2) resulting from the experiment of rolling two dice. We can define
many discrete random variables on this space, such as X = “the difference ‘Die1 – Die2’ of the two
values,” or X = “the larger of the two values.” But here, X = “the sum ‘Die1 + Die2’ of the two
values (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),” and we can list the probability of each of these events,
i.e., P(X = 2), P(X = 3), P(X = 4),…, P(X = 12), in a probability table. If the dice are fair, i.e., if
each of the “ordered pairs” in the sample space shown is equally likely (probability 1/36), then the table and
the corresponding probability histogram (which is symmetric, since the dice are fair) are both shown
on page 4.1-3.¹ Recall that, similar to a density histogram for sample data, the area of each
rectangle measures the probability of its corresponding X-value, hence the total area = 1. This
example demonstrates the properties of a general discrete distribution on a population, including
the formal mathematical definitions of the population mean μ and population variance σ², via
the formulas and examples on pages 4.1-4 through 4.1-7 of the notes, slides 4 through 8 of
4.1a.pptx, and slide 2 of 4.1b.pptx.
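
The two-dice example can be checked numerically. The sketch below is not taken from the notes or slides. It assumes fair dice and the standard definitions μ = Σ x·P(X = x) and σ² = Σ (x – μ)²·P(X = x), and the names `outcomes`, `pmf`, `mean`, and `variance` are chosen here for illustration only. Exact fractions are used so that the table entries appear as multiples of 1/36.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (die1, die2).
# Assumption: fair dice, so every ordered pair has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))

# Probability table for X = die1 + die2.
pmf = {}
for d1, d2 in outcomes:
    total = d1 + d2
    pmf[total] = pmf.get(total, Fraction(0)) + Fraction(1, len(outcomes))

# Population mean and variance of X from the probability table.
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

for x in sorted(pmf):
    print(f"P(X = {x:2d}) = {pmf[x]}")
print("total probability =", sum(pmf.values()))
print("mean =", mean, " variance =", variance)
```

The probabilities rise from 1/36 at X = 2 to 1/6 at X = 7 and fall back symmetrically, and they sum to 1, matching the "total area = 1" property of the histogram.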

We now discuss a very particular scenario. Imagine that we have a binary population, that is, a
population divided into two categories, for example, Male vs. Female, Diseased vs. Non-diseased,
Treatment vs. Control, etc. Traditionally, we label these categories as “Success” and “Failure,”
depending on which of these two represents the category of interest in the context of the study. In
addition, suppose we also know the probability of Success (and hence Failure) in the population.
It is important to note that since this is a population characteristic, i.e., a parameter (rather than
a sample characteristic, i.e., a statistic), its value should technically be denoted by a Greek letter,
just like μ (“mu”) and σ (“sigma”). The traditional symbol chosen for P(Success) is π (“pi”), and
hence, P(Failure) = 1 – π.² Next, we intend to select a random sample of n individuals from this
population, in such a way that the “Success or Failure” outcome of any single selected individual is
statistically independent of the “Success or Failure” outcome of any other selected individual.

¹ If the dice are not fair, then these probabilities must be computed differently, and the resulting histogram would not
be symmetric, but skewed toward the higher probabilities (see pages 4.1-6, 4.1-10, and slides 7, 8, 13 through 17 of
4.1a.pptx).
² This is not the same π = 3.14159… from elementary geometry! This is a probability, hence between 0 and 1.
Unfortunately, many authors choose the Roman letter “p” to denote P(Success), and q = 1 – p = P(Failure), as in the
Hardy-Weinberg Law of genetics. But this use can be easily confused with the “p-value” of a sample… another
important statistical concept… so I will stick with the symbol π.

(For example, the Male/Female sex of any one individual conveys no information about the
Male/Female sex of any other individual in the sample.) The random sample that results from
meeting these two criteria –

• statistical independence between individual binary outcomes, and
• constant probability π of Success per outcome

– is said to form a sequence of n Bernoulli trials, and is perhaps best modeled by n consecutive,
random, independent tosses of a coin, where the binary outcomes correspond to P(Heads) = π, and
P(Tails) = 1 – π. Last but not least, we need to define a discrete random variable X on this sample
of n binary outcomes: X = “the number of Successes (e.g., Heads) in the sample.” Thus, the
possible values that X can assume would be 0, 1, 2, 3, …, n, and we wish to know how to compute
the probability of each of these events. That is, we wish to construct a probability table, listing
P(X = 0), P(X = 1), P(X = 2),…, P(X = n), or in shorthand notation, P(X = x) for x = 0, 1, 2,…, n.
The formula is “derived” – with the use of examples – on pages 4.1-15 through 4.1-19 in the notes,
as well as slides 4 through 14 in 4.1b.pptx, and is reproduced below.³

n x
   (1   )
n x

 x

The variable X is said to follow a Binomial Distribution with parameters n and π, and is written as
X ~ Bin(n, π). Furthermore, the mean and variance for this distribution can be mathematically
proved to be μ = nπ and σ² = nπ(1 – π), respectively.
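
The binomial probability formula and the stated mean and variance can be checked with a short sketch. It is not from the notes. The function name `binom_pmf` and the values n = 10, π = 0.3 are hypothetical choices made here for illustration. The code builds the full probability table P(X = x) for x = 0, 1, …, n, then compares the mean and variance computed from the table against nπ and nπ(1 – π).

```python
from math import comb, isclose

def binom_pmf(x: int, n: int, pi: float) -> float:
    """P(X = x) for X ~ Bin(n, pi): C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)

# Illustrative values only (hypothetical, not taken from the notes).
n, pi = 10, 0.3

# Probability table for x = 0, 1, ..., n.
table = {x: binom_pmf(x, n, pi) for x in range(n + 1)}

# Mean and variance computed directly from the table.
total = sum(table.values())
mean = sum(x * p for x, p in table.items())
variance = sum((x - mean) ** 2 * p for x, p in table.items())

print(f"total probability = {total:.6f}")
print(f"mean     = {mean:.6f}   (n*pi        = {n * pi:.6f})")
print(f"variance = {variance:.6f}   (n*pi*(1-pi) = {n * pi * (1 - pi):.6f})")
```

A numerical match for one choice of n and π is a sanity check only, not a proof of the general formulas.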

As a quick example, suppose that a certain non-contagious⁴ medical condition affects 10% of a
population, so π = 0.10. Further suppose that a sample of n = 70 randomly chosen individuals
is to be selected for a study, so that X ~ Bin(70, 0.1). We wish to calculate the following:

• mean (i.e., “expected”) number of affected individuals: μ = nπ = (70)(0.1) = 7
• variance: σ² = nπ(1 – π) = (70)(0.1)(0.9) = 6.3
• standard deviation: σ = √6.3 ≈ 2.51 individuals
• probability of exactly 13 affected individuals in the sample:

$$P(X = 13) = \binom{70}{13} (0.1)^{13} (0.9)^{57} \approx 0.0117$$

Using R: dbinom(13, 70, 0.1)

³ The term $\binom{n}{x}$ is called a “combinatorial symbol” or “binomial coefficient,” usually defined using “factorials” as
n! / (x! (n – x)!), and easily evaluated with a calculator or computer. Working with these quantities is standard
pre-calculus fare. If you are a bit rusty with these ideas, read the review section on Permutations and Combinations
in the Appendix of the Lecture Notes.
⁴ The non-contagious nature of the condition is necessary to ensure that the affected status (Yes/No) of any chosen
individual is statistically independent of the affected status (Yes/No) of any other individual in the sample. If this
criterion is called into question, then so is the validity of using the Binomial Distribution to model this condition.
