Statistics

Statistics in MATLAB
COMM2M Harry R. Erwin, PhD University of Sunderland
Resources
- http://www.mathworks.com/access/helpdesk/help/pdf_doc/stats/stats.pdf This can be found in the COMM2M Lectures folder as STATS.PDF.
- Higham and Higham, 2000, MATLAB Guide, SIAM.
- James E. Gentle, 2002, Elements of Computational Statistics, Springer.
- Wendy L. Martinez & Angel R. Martinez, 2002, Computational Statistics Handbook with MATLAB, Chapman & Hall/CRC.
- Michael J. Crawley, 2005, Statistics: An Introduction Using R, Wiley. Our Statistics Study Group is working through this.
Doing Computational Statistics

Usually you do computational statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model. A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.
Statistical Models
Often expressed as a set of equations relating data elements. Can include probability distributions for the elements. If this is the case, you have a stochastic model. The model should be free to evolve based on data mining.
Common Stochastic Models

Parameterized statistical distributions, such as the normal distribution, binomial distribution, or the chi-squared distribution. Sometimes more complicated, where you need to use simulation, resampling, and visualization to determine the parameters of the model.
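As an illustration of the resampling idea, here is a minimal sketch of a bootstrap estimate of the standard error of a mean. It is written in Python's standard library rather than MATLAB, which is an assumption of this example and not part of the lecture; the data values and the function name `bootstrap_se_of_mean` are invented for illustration.

```python
import random
import statistics


def bootstrap_se_of_mean(data, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the data, with replacement, and record their mean.
        resample = [rng.choice(data) for _ in range(n)]
        means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


# Hypothetical measurements, invented for this example only.
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7, 4.4, 5.2]
se_boot = bootstrap_se_of_mean(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {se_boot:.3f}, textbook SE: {se_formula:.3f}")
```

For a simple statistic such as the mean the two estimates should be close; resampling becomes useful when no textbook formula is available.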
Structure-in-the-data
Of most interest, for example:
- Modes
- Gaps
- Clusters
- Symmetry
- Shape
- Deviations from normality
Visualization
- Multiple views are necessary.
- Be able to zoom in on the data, as a few points can obscure the interesting structure.
- Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.
- Watch out for time-ordered or location-ordered data, particularly if time or location is not explicitly reported.
Plots
Use simple plots to start with. Watch for rounded data, shown by horizontal strata in the data. That often signals other problems.
Statistical Activities
- Data collection (ideally the statistician has a say in how the data are collected)
- Description of a dataset
  - Averages
  - Spreads
  - Extreme points
- Inference within a model or collection of models
- Model selection
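The description step (averages, spreads, extreme points) can be sketched as follows. This is a minimal illustration using Python's standard `statistics` module, not the MATLAB toolbox; the helper name `describe` and the data values are assumptions made for the example.

```python
import statistics


def describe(data):
    """Return basic averages, spreads, and extreme points for a dataset."""
    ordered = sorted(data)
    return {
        "mean": statistics.fmean(ordered),         # average
        "median": statistics.median(ordered),      # robust average
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
        "stdev": statistics.stdev(ordered),        # sample standard deviation
        "min": ordered[0],                         # extreme points
        "max": ordered[-1],
    }


# Hypothetical data, invented for this example only.
values = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
summary = describe(values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```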
How to Do It
Start by determining what sort of statistical analysis you should do. You need to know:
- Which variable is the response variable?
- Which are the explanatory variables?
- What kind are the explanatory variables?
- What kind of response variable do you have?
Basic Method of Analysis

- If all explanatory variables are continuous, plan on a regression analysis.
- If all explanatory variables are categorical, plan for an analysis of variance (ANOVA).
- If you have a mix, plan for an analysis of covariance (ANCOVA).
Effect of Response Variable

- If the response variable is continuous, plan on a normal regression, ANOVA, or ANCOVA.
- If the response variable is a proportion, do a logistic regression.
- If it is a count, you need a log-linear model.
- If it is binary, you need a binary logistic analysis.
- If it is time to an event or time at death, you will be doing a survival analysis.
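The two sets of rules above can be read as a small lookup. The sketch below encodes them in Python (chosen for this example; the lecture itself gives no code). The function name `choose_analysis` and the label strings such as "time-to-event" are hypothetical names introduced here.

```python
def choose_analysis(explanatory_kinds, response_kind):
    """Suggest an analysis from the kinds of variables involved.

    explanatory_kinds: iterable of "continuous" / "categorical" labels.
    response_kind: "continuous", "proportion", "count", "binary",
                   or "time-to-event" (hypothetical labels for this sketch).
    Returns (structure, response_model).
    """
    kinds = set(explanatory_kinds)
    if not kinds or not kinds <= {"continuous", "categorical"}:
        raise ValueError("explanatory kinds must be 'continuous' or 'categorical'")

    # Rule set 1: the explanatory variables decide the basic method.
    if kinds == {"continuous"}:
        structure = "regression"
    elif kinds == {"categorical"}:
        structure = "ANOVA"
    else:
        structure = "ANCOVA"

    # Rule set 2: the response variable decides the kind of model.
    response_rules = {
        "continuous": "normal " + structure,
        "proportion": "logistic regression",
        "count": "log-linear model",
        "binary": "binary logistic analysis",
        "time-to-event": "survival analysis",
    }
    if response_kind not in response_rules:
        raise ValueError("unknown response kind: " + repr(response_kind))
    return structure, response_rules[response_kind]


print(choose_analysis(["continuous", "categorical"], "count"))
```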
Variation
You want to understand how the response is dependent on variation in the explanatory variables, but you are also interested in lack of dependence. Design the simplest model that explains the data adequately.
Significance
You have to decide what the probability of a false alarm will be, that is, the probability that you will think something is significant when it really isn't. Typical values are 5%, 1%, and 0.1%. Don't test every hypothesis. Some will appear significant by chance alone.
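To see why testing every hypothesis is risky, the sketch below computes the chance of at least one false alarm across several tests. It assumes the tests are independent and that every null hypothesis is true; both are simplifying assumptions of this example, and the function name is invented. Python is used because the slides give no code.

```python
def chance_of_any_false_alarm(alpha, n_tests):
    """Probability of at least one false alarm in n_tests independent tests,
    each run at false-alarm probability alpha, when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    chance = chance_of_any_false_alarm(0.05, n_tests)
    print(f"{n_tests:2d} tests at 5%: {chance:.3f}")
```

Under these assumptions, twenty tests at the 5% level give roughly a 64% chance of at least one spurious "significant" result.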
Good and Bad Hypotheses

There are vultures in the local park. There are no vultures in the local park. Which is testable? The null hypothesis is testable. You test it by taking measurements and showing that if the null hypothesis is true, the chance of those measurements is nearly zero.
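A minimal sketch of this logic, using an invented example that is not from the lecture: the null hypothesis is that a coin is fair, and the measurement is 19 heads in 20 tosses. The code (Python, chosen for this illustration) computes how likely a result at least that extreme would be if the null hypothesis were true.

```python
from math import comb


def upper_tail_probability(successes, trials, p_null):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null ** k * (1.0 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical measurement: 19 heads in 20 tosses; null hypothesis: fair coin.
p_value = upper_tail_probability(19, 20, 0.5)
print(f"chance of 19 or more heads under the null: {p_value:.2e}")
```

The probability is about 2 in 100,000, so the measurements are very hard to reconcile with the null hypothesis and it is rejected.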
Experimental Design
Replication
Increases reliability, so be thorough. How many replicates do you need? Usually the answer is 30.
Randomization
- Reduces bias, so do it properly
- Almost never done properly
- Discuss
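One way to randomize properly is to let a random number generator, not the experimenter, assign treatments to experimental units. The following is a minimal Python sketch of a balanced random assignment; the function name, the unit numbering, and the treatment labels are all assumptions made for the example.

```python
import random


def randomize_treatments(units, treatments, seed=None):
    """Assign treatments to units at random, with equal replication."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    replicates = len(units) // len(treatments)
    # One label per unit, the same number of each treatment.
    labels = [t for t in treatments for _ in range(replicates)]
    rng = random.Random(seed)
    rng.shuffle(labels)  # the random order decides which unit gets which treatment
    return dict(zip(units, labels))


# Hypothetical experiment: 12 plots, a control and one treatment.
assignment = randomize_treatments(list(range(1, 13)), ["control", "treated"], seed=1)
for plot, treatment in assignment.items():
    print(plot, treatment)
```

Fixing the seed makes the assignment reproducible for the record; in a real experiment you would choose the seed before looking at the units.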
Controls
No controls, no conclusions. A control experiment is one where you don't apply the treatment or don't enable the part of your experiment that is supposed to produce the different outcome.
Replication
Replicates:
- Must be independent
- Must not be part of a time series
- Must not be grouped together in space
- Must be of an appropriate spatial scale
- Must cover the normal variation in initial conditions
Error Types
                          Null hypothesis     Null hypothesis
                          actually true       actually false
Accept null hypothesis    Correct             Type II (β) error
Reject null hypothesis    Type I (α) error    Correct
Typical α and β values

You usually want the probability of rejecting the null hypothesis (α) when it is true to be less than 5%. You usually want the probability of accepting the null hypothesis (β) when it is false to be less than 20%. The power of a test is 1 - β, or greater than 80% in this case. Rule of thumb: the number of replicates needed to reject the null hypothesis with probability 80% is about $8s^2/d^2$, where $s^2$ is the variance in the response and $d$ is the size of the difference to be detected in a single sample.
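The rule of thumb above is easy to turn into a small calculator. The sketch below is in Python (an assumption of this example, since the slides give no code); the function name and the numbers are invented, and the result is rounded up because replicates come in whole units.

```python
import math


def replicates_needed(variance, difference):
    """Rule-of-thumb replicate count: about 8 * variance / difference**2."""
    if variance < 0 or difference <= 0:
        raise ValueError("variance must be non-negative and difference positive")
    return math.ceil(8.0 * variance / difference ** 2)


# Hypothetical planning values: response variance 10, difference to detect 2.
print(replicates_needed(variance=10.0, difference=2.0))  # 8 * 10 / 4 = 20
```

Halving the difference to be detected quadruples the required replication, which is why small effects are expensive to demonstrate.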
Inference
- Strong inference
  - A clear hypothesis
  - An acceptable test
- Weak inference
  - Natural experiments
  - Conclusions from natural experiments are hypotheses.
How Long to Go On?

To stop the experiment as soon as a pleasing result is obtained? To keep going until the theoretically correct result is obtained? Discuss.
Statistics in MATLAB
MATLAB has some useful statistical tools you can use to do all this (although most computational statistics is done using FORTRAN, SAS, R, or S-Plus). It supports the usual range of statistical tasks, including both analysis and visualization. What follows is an overview of the capabilities of the MATLAB Statistics Toolbox.
Statistics Capabilities
- Probability distributions
- Descriptive statistics
- Linear and non-linear models
- Hypothesis testing
- Multivariate statistics
- Plotting
- Statistical process control
- Design of experiments
- Hidden Markov models
Random number generators

There are functions in the Statistics Toolbox that return random output. These allow the user to observe probability distributions, evaluate statistical tests, and use resampling techniques.
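As an example of using random output to evaluate a statistical test, the sketch below simulates many samples for which the null hypothesis is true and counts how often a two-sided z-test at the 5% level rejects it. It uses Python's standard library as a stand-in, not the MATLAB toolbox functions; the function name, sample size, and number of trials are assumptions of this example.

```python
import random
from statistics import NormalDist, fmean


def simulated_false_alarm_rate(n=20, alpha=0.05, n_trials=4000, seed=0):
    """Fraction of simulated null samples in which a z-test rejects the null.

    Each sample is n draws from a normal distribution with mean 0 and known
    standard deviation 1, so the null hypothesis (mean = 0) is true.
    """
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided cutoff
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # z statistic with known sigma = 1
        if abs(z) > z_critical:
            rejections += 1
    return rejections / n_trials


rate = simulated_false_alarm_rate()
print(f"simulated false-alarm rate: {rate:.3f} (nominal 0.05)")
```

If the test behaves as advertised, the simulated rate should sit close to the nominal 5%.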
Probability distributions
These are used to display possible probability distributions and create histograms. MATLAB provides the pdf, cdf, inverse cdf, a random number generator, and mean and variance estimators for each distribution.
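The same five operations can be illustrated for the normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module as a stand-in; it does not show the MATLAB toolbox functions, and the seed and sample size are arbitrary choices for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

density_at_zero = standard_normal.pdf(0.0)    # pdf: about 0.3989
prob_below = standard_normal.cdf(1.96)        # cdf: about 0.975
quantile = standard_normal.inv_cdf(0.975)     # inverse cdf: about 1.96

# Random number generation, then mean and spread estimated from the draws.
draws = standard_normal.samples(1000, seed=42)
fitted = NormalDist.from_samples(draws)

print(f"pdf(0) = {density_at_zero:.4f}")
print(f"cdf(1.96) = {prob_below:.4f}")
print(f"inv_cdf(0.975) = {quantile:.4f}")
print(f"estimated mean = {fitted.mean:.3f}, estimated sd = {fitted.stdev:.3f}")
```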
Continuous Distributions Provided

- Beta
- Exponential
- Extreme value
- Gamma
- Lognormal
- Normal
- Rayleigh
- Uniform
- Weibull
Continuous Statistical Distributions

- Chi-square
- Non-central chi-square
- F
- Non-central F
- t
- Non-central t
Discrete distributions
- Binomial
- Discrete uniform
- Geometric
- Hypergeometric
- Negative binomial
- Poisson
Descriptive statistics
- Mean
- Median
- Variance
- Standard deviation
- Grouped data
Linear and non-linear models

- ANOVA
- Covariance analysis (ANCOVA)
- Multiple linear regression
- Quadratic response surface models
- Stepwise regression
- Generalized linear models (GLM)
- Robust and nonparametric methods
- Nonlinear least squares
- Regression and Classification Trees (CART)
Hypothesis testing
- Null hypothesis
- Alternative hypotheses
- Significance level
- p-value
- Confidence intervals
- A number of tests are provided (this is a hard area)
Multivariate statistics
- Principal components analysis
- Factor analysis
- MANOVA
- Cluster analysis
- Multidimensional scaling
Plotting and Visualization

- Box plots
- Distribution plots
- Scatter plots
Statistical process control

Quality of manufactured goods
- Control charts
- Capability studies
Design of experiments
- Full factorial designs
- Fractional factorial designs
- Response surface designs
- D-optimal designs
Hidden Markov Models

- Concepts
- Markov chains
- Analysis of hidden Markov models (HMMs)
Conclusions
MATLAB provides a basic engineering toolkit for these statistical activities. It is not as broad as R or S-Plus, but it is compatible with data collected or generated by other toolkits. It supports all of the activities well. More specialized work (e.g., Bayesian analysis) requires either your own extensions or more specialized toolkits.
