
Using the Standard Normal Distribution to Statistically Analyze Data

Kenneth E. Osborn
Laboratory Sciences Specialist, Retired

Most of us live comfortably with some level of uncertainty. (Larry Gonick, The Cartoon Guide to Statistics)

Laboratories generate measurement-based information used in making regulatory and scientific decisions. Because all measurements include a component of error, a collection of measurements of an environmental parameter will have a distribution of values. Such distributions have three characteristics that distinguish one from another: central tendency, dispersion, and shape. The arithmetic mean, geometric mean, median, and mode are statistical measures representing the center of the distribution. Range, variance, and standard deviation represent the dispersion, or spread, of the distribution. The shape of a distribution lacks such readily identifiable measures, but is frequently described as Normal or skewed.

While the mean and standard deviation are important statistical parameters, there are times when we may be most interested in the shape of a distribution. For example, we may wish to know whether a process is in control, whether extreme results belong to the parent distribution, or whether some range of the distribution occurs with disproportionate frequency.

The standard normal distribution is a mathematical function. When plotted, the curve shows a symmetric spread of values centered at a mean of zero, with a standard deviation of one. The standard normal function has the formula:

$$f(z) = \frac{e^{-z^{2}/2}}{\sqrt{2\pi}} \tag{1}$$

Where:

$$z = \frac{x - \mu}{\sigma} \tag{2}$$
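Equations 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article: the function names are hypothetical, and the numbers in the example calls are invented rather than taken from the mercury data.

```python
import math

def standard_normal_density(z: float) -> float:
    """Equation 1: height of the standard normal curve at z."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def z_value(x: float, mean: float, std_dev: float) -> float:
    """Equation 2: distance of x from the mean, in standard deviations."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev

# Hypothetical numbers for illustration only (not from the article's data).
print(z_value(x=26.0, mean=20.0, std_dev=4.0))  # 1.5
print(standard_normal_density(0.0))             # peak of the curve, about 0.399
```

The curve is symmetric, so standard_normal_density(z) and standard_normal_density(-z) return the same height.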

When referring to the population of all possible measurements, x is any individual measurement, μ is the distribution mean, and σ is the standard deviation. When referring to measurements made on a sample of the population, the mean and standard deviation are represented by x-bar and lower-case s. This distinction is important because any sample of the population provides only an estimate of the population mean and standard deviation. The value of z (equation 2) for a given measurement is the distance of the measurement from the mean divided by the standard deviation. When a collection of measurements is adjusted by subtracting the mean from each value and then dividing the results by the standard deviation, the result is a normalized distribution with a mean of zero and a standard deviation of one.

As an example, figure 1 displays the uncensored [1] concentration of mercury over time for a wastewater discharge. The measurements range from a low of -11 to a high of 74 micrograms per liter. The negative values make it obvious that measurement error is present in this data set. That is expected: in an uncensored series of measurements near the analytical detection limit, it would be unusual for negative values to be absent. The more important question is whether the extremely high values should be included in calculating the distribution statistics. If these values are the consequence of some event that is outside the control of the process, they should not be used to calculate the mean and standard deviation of the represented distribution.

[1] Approximately 67% of the measurements are less than the method-specific detection limit of 20 ug/L.

FIG 1: EFFLUENT MERCURY (mercury concentration in ug/L versus time order)

FIG 2: EFFLUENT MERCURY (z-value versus ranked order)

When the data are transformed to the corresponding z-values and plotted as a ranked distribution from lowest value to highest, the shape of the data set is more readily seen as
roughly symmetrical around an average value of zero with a standard deviation of one, as indicated in figure 2.

By exchanging axes and recalculating the z-value rank order as a probability, the data can be compared to a standard normal curve, as in figure 3. The smooth thin line is a standard normal curve of 1000 values generated by computer with a Monte Carlo simulation program; that distribution has a mean of zero and a standard deviation of one. The fit of the environmental data to the standard normal curve is close, but can be improved. Note that one value is at the extreme end of the range and exceeds the mean by more than three standard deviations. When this outlier is removed and the z-values are recalculated, the results more closely approximate a normal distribution, as in figure 4. The center of the distribution of z-transformed mercury values has not changed, but the sigmoid curve has rotated around the center, bringing the extreme values closer to the standard normal curve. The conclusion is that the extreme value of 74 ug/L (z-value = 4.6 in figure 4) is appropriately removed from a statistical evaluation of the data set.

FIG 3: FIT TO NORMAL (probability versus z-value)

FIG 4: BETTER FIT (probability versus z-value)
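The procedure described above (standardize, rank, convert rank to a probability, compare with the standard normal curve, then screen extreme values) might be sketched as follows. This is an illustration under stated assumptions, not the author's spreadsheet: the function names are hypothetical, the sample values are invented (only the extremes echo the range quoted in the text), the plotting position (rank - 0.5)/n is one common convention that the article does not specify, and the exact normal curve from Python's standard library stands in for the 1000-value Monte Carlo curve.

```python
from statistics import NormalDist, mean, stdev

def z_values(data):
    """Subtract the sample mean from each value, then divide by the sample standard deviation."""
    m, s = mean(data), stdev(data)  # stdev uses the n - 1 (sample) denominator
    return [(x - m) / s for x in data]

def probability_points(data):
    """Return (z, observed %, expected %) for each value, ranked low to high.

    Observed % is the assumed plotting position 100 * (rank - 0.5) / n.
    Expected % is the standard normal cumulative probability at that z.
    """
    zs = sorted(z_values(data))
    n = len(zs)
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    points = []
    for rank, z in enumerate(zs, start=1):
        observed = 100.0 * (rank - 0.5) / n
        expected = 100.0 * standard_normal.cdf(z)
        points.append((z, observed, expected))
    return points

def split_extremes(data, limit=3.0):
    """Separate values more than `limit` standard deviations from the mean (one pass)."""
    m, s = mean(data), stdev(data)
    kept = [x for x in data if abs(x - m) / s <= limit]
    flagged = [x for x in data if abs(x - m) / s > limit]
    return kept, flagged

# Hypothetical concentrations in ug/L, invented for illustration only.
sample = [-11, -6, -3, 0, 2, 4, 6, 8, 9, 10,
          11, 12, 14, 16, 18, 20, 23, 26, 31, 74]

kept, flagged = split_extremes(sample)
print("flagged:", flagged)

# Recalculate the z-values without the flagged value and compare with the normal curve.
for z, observed, expected in probability_points(kept):
    print(f"z = {z:6.2f}   observed = {observed:5.1f}%   expected = {expected:5.1f}%")
```

Where the observed and expected columns track each other closely, the data follow the standard normal curve. The three-standard-deviation limit is only a screening aid here; as the text notes, whether a value is dropped should rest on whether it reflects an event outside the control of the process.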

Comparing environmental data to the standard normal curve allows statistical analyses of data sets using tools that are available to anyone with a spreadsheet program. This visual approach to the analysis of environmental data can frequently simplify the resolution of questions regarding equivalency of different data sets, process upsets, and suspect results.
