

## POPULATIONS, DISTRIBUTIONS AND SAMPLES

## Terms you should learn:

- Target population
- Statistical population
- Sample population
- Underlying distribution
- Sample distribution
- Observations

## Concepts you should master:

- Generalizations from sample to statistic to target
- Frequencies, probabilities and events
- Random sampling
- Bias

The average person uses the word 'population' to mean a collection of individuals living together in a community. To the statistician, though, the word means much more than that. Formally, a statistical population is the set of all possible values (called observations) that could be obtained for a given attribute if all the test subjects were measured simultaneously.

Less formally, suppose you are interested in a population of hypertensive rats, and suppose you decide to measure one attribute that you think describes your rats, say blood pressure or heart rate. If you choose blood pressure, the entire range of all possible blood pressures makes up the statistical population. While the point is a subtle one, it deserves to be made. You want to describe a target population (hypertensive rats) by summarizing a set of measures (blood pressure) and then generalize from the measures back to the target. It is the population of blood pressure values which interests the statistician.

Let us consider other examples. Suppose you were measuring the density of igneous rock. Then the statistical population of interest is not all the igneous rocks in the world, but all their densities. The target population you want to describe is 'igneous rock', and you describe it by summarizing the attribute we call 'density'. Suppose you want to verify the quality of an assay run for you by an outside laboratory. The target population would be all the tests run for you by that laboratory, and the statistical population might be all hemoglobin measurements performed during January.

## The Biostatistics Cookbook

Care is needed, however. A target population and an attribute do not necessarily have anything to do with each other. For example, in the most absurd case, you could measure the tail lengths of hypertensive rats rather than their blood pressures. One must wonder why, but if you did do something so silly, why would you target hypertensives rather than normotensives? Do you really gain any insight into your target population that you would not have had anyway? What you really want to summarize (and then tell your colleagues about) is blood pressure. Maybe you want to describe new blood pressure lowering medicines, or maybe just the rat population itself. In either case, tail length will probably not suffice since it is not a 'surrogate' for blood pressure. Good statistics cannot help silly science and vice versa!

If we assume that you choose a statistical population that really represents your target, the next step is to build the link between your target and statistical populations, i.e. to define a mathematically descriptive relationship between your subjects and your statistical universe. If we could count the number of subjects in the entire universe that achieve a value between some predefined upper and lower limit, and if we let these intervals cover our entire universe, then we could calculate the frequency of observations within each interval. From that set of frequencies we would know exactly what the most frequently attained values are. The whole set of frequency-value pairs makes up what the statistician calls the underlying distribution of the statistical population.

Grouping the observations into predefined intervals, counting their frequencies and presenting them graphically results in a plot known as the histogram, which is covered in much greater detail below. Mathematically, the frequency distribution of the underlying population explicitly defines a probability space. That means that we now know the exact chance that a value drawn from any subject falls within a specified interval.

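
The grouping-and-counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frequency_distribution`, the integer interval limits and the ten blood pressure values are all invented for the example, and the short list stands in for an entire statistical population that in practice could never be measured completely.

```python
def frequency_distribution(values, lower, upper, width):
    """Group values into half-open intervals [lo, lo + width) and return
    a dict mapping each interval to its relative frequency.

    Assumes integer lower, upper and width (a simplification for this sketch).
    """
    if not values:
        raise ValueError("at least one observation is required")
    # One counter per interval; together the intervals cover [lower, upper).
    counts = {(lo, lo + width): 0 for lo in range(lower, upper, width)}
    for v in values:
        if not lower <= v < upper:
            raise ValueError(f"{v} lies outside the predefined limits")
        # Find the lower edge of the interval that contains v.
        lo = lower + int((v - lower) // width) * width
        counts[(lo, lo + width)] += 1
    total = len(values)
    return {interval: n / total for interval, n in counts.items()}


# Hypothetical mean blood pressures in mmHg, invented for illustration only
# and treated here as if they were the whole statistical population.
pressures = [132, 138, 141, 144, 146, 149, 153, 157, 161, 168]

freqs = frequency_distribution(pressures, lower=130, upper=170, width=10)
for (lo, hi), freq in freqs.items():
    print(f"{lo}-{hi} mmHg: relative frequency {freq:.2f}")

# The relative frequency of an interval is the chance that a value drawn
# at random from this toy population falls inside that interval.
chance_140_to_150 = freqs[(140, 150)]
print(f"Chance of a value between 140 and 150 mmHg: {chance_140_to_150:.2f}")
```

Because four of the ten invented values lie between 140 and 150 mmHg, the sketch reports a chance of 0.40 for that interval. Plotting the same interval-frequency pairs as bars would give the histogram mentioned above.
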
To carry our hypertensive rat example to its most extreme limits, we know that if 23% of all the hypertensive rats in the world registered mean blood pressures between 140 and 150 mmHg, the chance of observing any one rat with a measure in that range is 23/100. The frequency distribution therefore becomes a measure of probability in an event space where the events are 'blood pressure between . . .'. This linkage between the underlying frequency distribution and the probability of observing any particular event, e.g. blood pressure between 140 mmHg and 150 mmHg, forms the basis for the inferential statistics presented below. You have probably heard of terms such as the normal (or Gaussian) distribution, the chi-square distribution and the F-distribution. These are simply well-defined probability distributions which seem to describe the real world fairly well. Each is well established and well characterized. More
