
Special Properties of Lifetime Data
Some features of lifetime data distinguish them from other types of data:
1. Lifetimes are always positive values, usually representing time.
2. Some lifetimes may not be observed exactly; they are known only to be larger than some value (such observations are censored).
3. The distributions and analysis techniques that are commonly used are fairly specific to lifetime data.

Ways of Looking at Distributions
Before we examine the distribution of the data, let's consider different ways of looking at a probability distribution:
1. A probability density function (PDF) indicates the relative probability of failure at different times.
2. A survivor function gives the probability of survival as a function of time; it is simply one minus the cumulative distribution function (1 − CDF).
3. The hazard rate gives the instantaneous probability of failure given survival up to a given time. It is the PDF divided by the survivor function. An increasing hazard rate means the items become more susceptible to failure as time passes (aging).
4. A probability plot is a re-scaled CDF, and is used to compare data to a fitted distribution.

There are various distributions for lifetime data; here we use the Weibull distribution, a common choice for modelling lifetime data.

Fitting a Weibull Distribution
The Weibull distribution is a generalization of the exponential distribution. If lifetimes follow an exponential distribution, then they have a constant hazard rate. This means that they do not age, in the sense that the probability of observing a failure in an interval, given survival to the start of that interval, does not depend on where the interval starts. A Weibull distribution has a hazard rate that may increase or decrease. The Weibull model also allows us to project beyond the end of the test and compute failure probabilities for later times. The empirical cumulative distribution function of the data shows the proportion failing up to each possible survival time.
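As a concrete illustration, the sketch below (Python with NumPy/SciPy) fits a two-parameter Weibull model and derives the survivor function and hazard rate from it. The sample and the parameter values are synthetic assumptions made only for this example, not data from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical lifetime sample; replace with real failure times.
lifetimes = stats.weibull_min.rvs(c=1.8, scale=100, size=200, random_state=rng)

# Fit a two-parameter Weibull by maximum likelihood (location fixed at 0).
shape, loc, scale = stats.weibull_min.fit(lifetimes, floc=0)

# Evaluate beyond the observed range to project failure probabilities.
t = np.linspace(1, lifetimes.max() * 1.5, 300)
pdf = stats.weibull_min.pdf(t, shape, loc, scale)
sf = stats.weibull_min.sf(t, shape, loc, scale)   # survivor function = 1 - CDF
hazard = pdf / sf                                 # instantaneous failure rate

# shape > 1 implies an increasing hazard (aging); shape = 1 reduces to the
# exponential distribution with a constant hazard rate.
print(f"fitted shape = {shape:.2f}, scale = {scale:.1f}")
print(f"P(failure by t = {t[-1]:.0f}) = {1 - sf[-1]:.3f}")
```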

Measures of Central Tendency
Measures of central tendency locate a distribution of data along an appropriate scale. The average is a simple and popular estimate of location. If the data sample comes from a normal distribution, then the sample mean is also the optimal estimator of the population mean. Unfortunately, outliers, data entry errors, or glitches exist in almost all real data, and the sample mean is sensitive to these problems: one bad data value can move the average away from the center of the rest of the data by an arbitrarily large distance. The median and trimmed mean are two measures that are resistant (robust) to outliers. The median is the 50th percentile of the sample, which changes only slightly if you add a large perturbation to any value. The idea behind the trimmed mean is to ignore a small percentage of the highest and lowest values of a sample when determining the center of the sample. The geometric mean and harmonic mean, like the average, are not robust to outliers. They are useful when the sample is distributed lognormally or is heavily skewed.
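The following sketch (Python with SciPy) compares these location estimates on an invented sample that includes one deliberate outlier; the values are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

# Invented sample; the last value is a deliberate outlier.
x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3, 3.6, 4.0, 150.0])

print("mean           :", np.mean(x))               # pulled toward the outlier
print("median         :", np.median(x))             # 50th percentile; robust
print("trimmed mean   :", stats.trim_mean(x, 0.1))  # ignores the top/bottom 10%
print("geometric mean :", stats.gmean(x))           # not robust; suited to lognormal data
print("harmonic mean  :", stats.hmean(x))           # not robust; requires positive values
```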

Measures of Dispersion
The purpose of measures of dispersion (measures of spread) is to find out how spread out the data values are on the number line. The range (the difference between the maximum and minimum values) is the simplest measure of spread, but if there is an outlier in the data, it will be the minimum or maximum value, so the range is not robust to outliers. The standard deviation and the variance are popular measures of spread that are optimal for normally distributed samples. The sample variance is an unbiased estimator of the normal parameter σ². The standard deviation is the square root of the variance and has the desirable property of being in the same units as the data; that is, if the data are in meters, the standard deviation is in meters as well, while the variance is in meters², which is more difficult to interpret. Neither the standard deviation nor the variance is robust to outliers: a data value that is separate from the body of the data can increase the value of these statistics by an arbitrarily large amount. The mean absolute deviation (MAD) is also sensitive to outliers, but it does not move quite as much as the standard deviation or variance in response to bad data. The interquartile range (IQR) is the difference between the 75th and 25th percentiles of the data. Since only the middle 50% of the data affects this measure, it is robust to outliers.
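A similar sketch (Python with SciPy) for the spread measures, again on an invented sample with a single outlier:

```python
import numpy as np
from scipy import stats

# Same kind of invented sample with a single outlier.
x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 3.3, 3.6, 4.0, 150.0])

print("range    :", np.ptp(x))                        # max - min; dominated by the outlier
print("variance :", np.var(x, ddof=1))                # sample variance
print("std dev  :", np.std(x, ddof=1))                # same units as the data
print("MAD      :", np.mean(np.abs(x - np.mean(x))))  # mean absolute deviation
print("IQR      :", stats.iqr(x))                     # 75th minus 25th percentile; robust
```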

Quantiles and Percentiles
Quantiles and percentiles provide information about the shape of data as well as its location and spread. The quantile of order p (0 ≤ p ≤ 1) is the smallest x value for which the cumulative distribution function equals or exceeds p. The n sorted data points are taken to be the 0.5/n, 1.5/n, ..., (n − 0.5)/n quantiles. Linear interpolation is used to compute intermediate quantiles; the data minimum or maximum is assigned to quantiles outside that range. Missing values are treated as NaN and removed from the data.

Percentiles are specified using percentages, from 0 to 100. For an n-element vector X, the sorted values in X are taken to be the 100(0.5/n), 100(1.5/n), ..., 100([n − 0.5]/n) percentiles. Linear interpolation is used to compute percentiles for percent values between 100(0.5/n) and 100([n − 0.5]/n). The minimum or maximum values in X are assigned to percentiles for percent values outside that range.
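The rule described above can be written out directly. The sketch below (pure NumPy; the helper name and the sample values are made up for illustration) places the sorted values at the 100(i − 0.5)/n percentiles, interpolates linearly in between, and assigns the sample minimum or maximum outside that range.

```python
import numpy as np

def percentile_midpoint(x, p):
    """Percentiles using the 100*(i - 0.5)/n positions described above.

    Hypothetical helper for illustration; p may be a scalar or a sequence
    of percent values in [0, 100].
    """
    x = np.asarray(x, dtype=float)
    x = np.sort(x[~np.isnan(x)])          # missing values (NaN) are removed
    n = x.size
    positions = 100.0 * (np.arange(1, n + 1) - 0.5) / n
    # np.interp interpolates linearly and clamps to the end values, which
    # assigns the sample min/max to percentiles outside the covered range.
    return np.interp(p, positions, x)

data = [6.0, 3.0, 9.0, 1.0, 7.0]
print(percentile_midpoint(data, [10, 25, 50, 90]))
```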

Normal Probability Plot
Normal probability plots are used to assess whether data come from a normal distribution. Many statistical procedures assume that the underlying distribution is normal, so normal probability plots can provide some assurance that this assumption is justified, or else provide a warning of problems with it. In a normal probability plot, if all the data points fall near the line, an assumption of normality is reasonable. Otherwise, the points will curve away from the line, and an assumption of normality is not justified.

Quantile-Quantile Plot
Quantile-quantile plots are used to determine whether two samples come from the same distribution family. They are scatter plots of quantiles computed from each sample, with a line drawn between the first and third quartiles. If the data fall near the line, it is reasonable to assume that the two samples come from the same distribution. The method is robust with respect to changes in the location and scale of either distribution. Even when the parameters and sample sizes differ, an approximately linear relationship suggests that the two samples may come from the same distribution family.

CDF Plot
An empirical cumulative distribution function (CDF) plot shows the proportion of data less than or equal to each x value, as a function of x. The scale on the y-axis is linear; in particular, it is not scaled to any particular distribution. Empirical CDF plots are used to compare data CDFs to the CDFs of particular distributions. In practice, the sampling distribution would be unknown and would be chosen to match the empirical CDF.

P-P Plot
A probability plot, like the normal probability plot, is just an empirical CDF plot scaled to a particular distribution. The y-axis values are probabilities from zero to one, but the scale is not linear: the distance between tick marks is the distance between quantiles of the distribution. In the plot, a line is drawn between the first and third quartiles of the data. If the data fall near the line, it is reasonable to choose the distribution as a model for the data.
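As an illustration, the sketch below (Python with SciPy and Matplotlib, on a synthetic sample generated for this example) produces two of the diagnostics described above: a normal probability plot and an empirical CDF plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=100)      # synthetic, roughly normal sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Normal probability plot: points close to the reference line support
# the assumption of normality.
stats.probplot(x, dist="norm", plot=ax1)

# Empirical CDF: proportion of data less than or equal to each x value,
# plotted on a linear y-axis.
xs = np.sort(x)
ecdf = np.arange(1, xs.size + 1) / xs.size
ax2.step(xs, ecdf, where="post")
ax2.set_xlabel("x")
ax2.set_ylabel("Proportion <= x")
ax2.set_title("Empirical CDF")

plt.tight_layout()
plt.show()
```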

Probability Distribution
A typical data sample is distributed over a range of values, with some values occurring more frequently than others. Some of the variability may be the result of measurement error or sampling effects. For large random samples, however, the distribution of the data typically reflects the variability of the source population and can be used to model the data-producing process. Statistics computed from data samples also vary from sample to sample. Modelling distributions of statistics is important for drawing inferences from statistical summaries of data.

Probability distributions are theoretical distributions, based on assumptions about a source population. They assign probability to the event that a random variable, such as a data value or a statistic, takes on a specific, discrete value, or falls within a specified range of continuous values. Choosing a model often means choosing a parametric family of probability distributions and then adjusting the parameters to fit the data. The choice of an appropriate distribution family may be based on a priori knowledge, such as matching the mechanism of a data-producing process to the theoretical assumptions underlying a particular family, or a posteriori knowledge, such as information provided by probability plots and distribution tests. Parameters can then be found that maximize the likelihood of producing the observed data.
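A minimal sketch of this workflow (Python with SciPy; the lognormal family and the synthetic data are assumptions chosen only for illustration) fits a parametric family by maximum likelihood and then checks the choice with a probability plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.lognormal(mean=1.0, sigma=0.5, size=500)   # hypothetical sample

# Maximum-likelihood fit of a lognormal model (location fixed at 0).
shape, loc, scale = stats.lognorm.fit(data, floc=0)
print(f"fitted sigma = {shape:.3f}, scale = exp(mu) = {scale:.3f}")

# A probability plot against the fitted family indicates goodness of fit:
# a correlation coefficient near 1 supports the chosen model.
(osm, osr), (slope, intercept, r) = stats.probplot(
    data, sparams=(shape, loc, scale), dist=stats.lognorm
)
print(f"probability-plot correlation r = {r:.4f}")
```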
