
Theory of Probability Estimation from PDFs

The probability of a value falling in a given range can be determined from the area under a probability distribution. See Theory of Random Variables and Probability Distributions for an introduction to probability distributions. The probability for a continuous range of occurrences can be determined from the Cumulative Distribution Function (CDF) for that distribution, or equivalently by measuring the area under the probability density curve between the limits of that range. Note that these plots assume a total area of 1 (since the probability of all possible values for a parameter or event must be 100%). The process of turning data into a standard normalized distribution (of area 1) is Normalization (described below). The full probability integrates from -∞ to +∞, but commonly an area symmetric about the mean is needed (a two-tailed result).
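Written out, using f(x) for the probability density, F(x) for its cumulative distribution, μ for the mean, and a for the half-width of the symmetric range (notation introduced here for illustration), the two-sided area is:

$P(\mu - a \le X \le \mu + a) = \int_{\mu-a}^{\mu+a} f(x)\,dx = F(\mu + a) - F(\mu - a)$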

This is the Two-sided Area Under The Curve. Often, the limits are given as whole numbers of standard deviations (e.g. ±1σ, ±2σ, ±3σ). Sometimes, the area under the tails is required instead; this is simply the complement of the two-sided area.
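With the same notation as above, the tail area is one minus the two-sided area:

$P(|X - \mu| > a) = 1 - \big(F(\mu + a) - F(\mu - a)\big)$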

Errors and Residuals


A Statistical Error is the difference between a sample value and the expected value for that sample (i.e. the mean). It can be caused by errors in measurement or by variability of the population. For example, the measured weights of a sample of men will almost never match the average weight, even if the measurement scale were perfect. The weight of a liter of skim milk will be much closer to the average, so errors could be due to both variability and measurement errors. An error in the weight of a precise scientific-grade kilogram standard will likely be due to measurement errors only. A Residual is an observed estimate of the statistical error and is the difference between a measured value and the average value of that total sample. So, if the average weight of a group of twenty men is 79 kg, the residual for one man from that group would be the difference between his weight and 79 kg, regardless of the actual expected value for men in general.
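As a minimal sketch of the distinction (the weights and the assumed population mean below are made-up illustrative numbers), residuals are measured against the sample mean, while statistical errors are measured against the usually unknown true mean:

import numpy as np

# Hypothetical measured weights (kg) of a small sample of men
weights = np.array([72.5, 81.0, 78.3, 85.1, 76.4])

sample_mean = weights.mean()        # estimate of the expected value
residuals = weights - sample_mean   # observed residuals (they sum to zero)

true_mean = 80.0                    # assumed population mean (normally unknowable)
errors = weights - true_mean        # statistical errors relative to the true mean

print(residuals)
print(errors)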

Standard Error
The standard error (SE) is the estimated standard deviation of the error of a method of measurement, and indicates the uncertainty in a value. The SE estimates the standard deviation of the difference between the measured values and the true (but usually unmeasurable) values of a population. The standard error for a parameter p measured from a sample of size n, with a sampled standard deviation SD_p for that parameter, is:

$SE_p = SD_p / \sqrt{n}$

The standard error of the mean of a sample is the "SEM", where

$SEM = s / \sqrt{n}$

and s is the sample standard deviation. Note that the true standard deviation of a population of size N is

$\sigma = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

where μ is the true population mean.
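A minimal numeric sketch of these relations (the measurement values are made up for illustration):

import numpy as np

x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])   # hypothetical repeat measurements
n = len(x)

s = x.std(ddof=1)        # sample standard deviation (n - 1 in the denominator)
sem = s / np.sqrt(n)     # standard error of the mean, SEM = s / sqrt(n)

print(f"mean = {x.mean():.3f}, s = {s:.3f}, SEM = {sem:.3f}")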

Confidence Intervals and Levels


A Confidence Interval (CI) is an estimated range of values with a given probability of containing a true parameter value. Usually, it refers to the probability that a range of values computed from a sample includes the true population mean. Note that this refers only to the sampling process, so the interval itself is not a probability value; the probability applies to intervals constructed this way from repeated samples. This probability is called the Confidence Level (CL). A confidence interval is shown as an area under a distribution curve, and the endpoints of the confidence interval are the confidence limits. A common CL is 95%, which means that 95% of intervals constructed from repeated samples would be expected to contain the true value, and 5% would be expected to miss it. Therefore, a higher CL means a wider interval (which has a greater chance of containing the value) and a lower CL means a narrower interval. For the same CL, a survey with a smaller CI is more reliable than a survey with a larger CI. CIs vary inversely with sample size, as a greater number of samples allows a smaller CI for the same confidence level.

Confidence intervals for normal distributions are often set in units of the standard deviation. For normal distributions, a 95% CL corresponds on a standard normal distribution to a value of 1.96 standard deviations; this is often rounded up to 2. Note that this is a curve with total area = 1. The term "Margin of Error" refers to one half of a confidence interval, as it is added to and subtracted from a sample result to give the interval range; therefore, larger samples have smaller margins of error. In classification, a confidence level shows the likely probability of a sample being in a specific class. For example, if a zone is estimated to be 60% likely to be gas-bearing, with a margin of error of 4% at a 95% confidence level, the confidence interval runs from 56% to 64%. The normalized standard deviation value that achieves a Confidence Level is called the Critical Value (Zc) and is used to determine the confidence limits.

Therefore, for a standard normal distribution with CDF Φ,

$\mathrm{CL} = \Phi(z_c) - \Phi(-z_c) = 2\,\Phi(z_c) - 1$

can be used with the inverse CDF to find $z_c$:

$z_c = \Phi^{-1}\!\left(\tfrac{1 + \mathrm{CL}}{2}\right)$

To determine the standard deviation range for a given percentage of area under the curve, we need the inverse CDF, which takes the probability as its input and returns the corresponding value.
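A minimal sketch of this calculation, using scipy's standard normal inverse CDF (norm.ppf) and made-up measurement values, first recovers the critical value for a 95% confidence level and then sets confidence limits on a sample mean:

import numpy as np
from scipy.stats import norm

CL = 0.95
zc = norm.ppf((1 + CL) / 2)      # inverse CDF of the standard normal: zc ≈ 1.96

x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])   # hypothetical measurements
sem = x.std(ddof=1) / np.sqrt(len(x))               # standard error of the mean

margin = zc * sem                # margin of error (half the interval width)
print(f"95% CI: {x.mean() - margin:.3f} to {x.mean() + margin:.3f}")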

Normalization
To turn a set of normally distributed data into a standard normal curve, with mean = 0, standard deviation = 1, and the area under the curve = 1, two steps are used.
1. Mean Shifting: move the curve so the mean is 0, which centers the curve around the 0 value. To do this, subtract the mean from every value.
2. Autoscaling: adjust the shape of the curve so the standard deviation = 1. To do this, divide the mean-shifted values by the standard deviation.
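In symbols, with x̄ the sample mean and s the sample standard deviation, the two steps combine into the familiar z-score transform:

$z_i = (x_i - \bar{x}) / s$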

Normalized Standard Error


This is used in cross plot regressions of two parameters. The normalization is done to make the variation for both parameters equal, which results in a slope of 1 (i.e. 45 degrees). The calculated standard error is divided by the data range in either the vertical or horizontal direction to normalize the error.

The equation is:

$s^2 = \tfrac{1}{n-2}\sum_{i=1}^{n} e_i^2, \qquad SE_{norm} = s / (\text{data range})$

where n = number of data points, $e_i$ is the error between the actual point and the fitted point, $s^2$ is the estimator for the variance in $e_i$ (n − 2 degrees of freedom, since a slope and an intercept are fitted), and the data range is taken in the vertical or horizontal direction depending on which offset method is used. A vertical offset method is used when the slope of the line is smaller than 1 (we measure straight up or down to the regression line), while a horizontal offset method is used when the slope is greater than 1 (we measure across).
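A minimal numerical sketch of this procedure, assuming a simple least-squares line, vertical offsets, and normalization by the vertical (y) data range (the data values and the n − 2 convention are illustrative assumptions):

import numpy as np

# Hypothetical cross-plot data (e.g. two log parameters)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Least-squares straight line and vertical-offset errors
slope, intercept = np.polyfit(x, y, 1)
e = y - (slope * x + intercept)          # vertical offsets to the fitted line

n = len(x)
s = np.sqrt(np.sum(e**2) / (n - 2))      # standard error of the estimate
se_norm = s / (y.max() - y.min())        # normalized by the vertical data range

print(f"slope = {slope:.2f}, SE = {s:.3f}, normalized SE = {se_norm:.3f}")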
