Resources
http://www.mathworks.com/access/helpdesk/help/pdf_doc/stats/stats.pdf. This can be found in the COMM2M Lectures folder as STATS.PDF.
Higham and Higham, 2000, MATLAB Guide, SIAM.
James E. Gentle, 2002, Elements of Computational Statistics, Springer.
Wendy L. Martinez & Angel R. Martinez, 2002, Computational Statistics Handbook with MATLAB, Chapman & Hall/CRC.
Michael J. Crawley, 2005, Statistics: An Introduction Using R, Wiley. Our Statistics Study Group is working through this.
Statistical Models
Often expressed as a set of equations relating data elements. Can include probability distributions for the elements. If this is the case, you have a stochastic model. The model should be free to evolve based on data mining.
Structure-in-the-data
Of most interest, for example:
Modes, gaps, clusters, symmetry, shape, and deviations from normality.
Visualization
Multiple views are necessary. Be able to zoom in on the data, as a few points can obscure the interesting structure. Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure. Watch out for time-ordered or location-ordered data, particularly if time or location is not explicitly reported.
Plots
Use simple plots to start with. Watch for rounded data, shown by horizontal strata in the data. That often signals other problems.
Statistical Activities
Data collection (ideally the statistician has a say in how the data are collected).
Description of a dataset: averages, spreads, extreme points.
How to Do It
Start by determining what sort of statistical analysis you should do. You need to know:
Which variable is the response variable? Which are the explanatory variables? What kind are the explanatory variables? What kind of response variable do you have?
Variation
You want to understand how the response is dependent on variation in the explanatory variables, but you are also interested in lack of dependence. Design the simplest model that explains the data adequately.
Significance
You have to determine what the probability of a false alarm will be, that is, the probability that you will think something is significant when it really isn't. Typical values are 5%, 1%, and 0.1%. Don't test every hypothesis. Some will appear significant by chance.
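The warning about testing every hypothesis can be made concrete with a short calculation. The sketch below is in Python rather than MATLAB and assumes the tests are independent and that every null hypothesis is true; the function name `familywise_error_rate` is illustrative only.

```python
# Probability of at least one false alarm when running m independent tests,
# each at significance level alpha, when every null hypothesis is true.

def familywise_error_rate(alpha: float, m: int) -> float:
    """Return 1 - (1 - alpha)**m, assuming the m tests are independent."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, m)
    print(f"{m:3d} tests at alpha = 0.05: P(at least one false alarm) = {rate:.3f}")
```

With 20 tests at the 5% level, the chance of at least one spurious "significant" result is already well above one half under these assumptions.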
Experimental Design
Replication
Increases reliability, so be thorough. When asked how many replicates are needed, the usual rule-of-thumb answer is 30.
Randomization
Reduces bias, so do it properly. Almost never done properly. Discuss.
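One way to randomize properly is to let a random number generator, not the experimenter, decide which unit receives which treatment. The following is a minimal Python sketch of a balanced random assignment; the function name, the unit labels, and the treatment labels are all hypothetical.

```python
import random


def randomize_treatments(units, treatments, seed=None):
    """Assign treatments to units at random, with equal numbers per treatment.

    The number of units must be a multiple of the number of treatments.
    """
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of number of treatments")
    rng = random.Random(seed)
    replicates = len(units) // len(treatments)
    labels = list(treatments) * replicates   # balanced set of labels
    rng.shuffle(labels)                      # random order removes experimenter choice
    return dict(zip(units, labels))


# Hypothetical example: 12 plots split between a control and one treatment.
plots = [f"plot{i}" for i in range(1, 13)]
assignment = randomize_treatments(plots, ["control", "treatment"], seed=1)
for plot, label in assignment.items():
    print(plot, label)
```

Fixing the seed makes the assignment reproducible, which helps when the design has to be documented.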
Controls
No controls, no conclusions. A control experiment is one where you don't apply the treatment or don't enable the part of your experiment that is supposed to produce the different outcome.
Replication
Replicates must be independent, not part of a time series, not grouped together in space, of an appropriate spatial scale, and cover the normal variation in initial conditions.
Error Types
                          Null hypothesis     Null hypothesis
                          actually true       actually false
Accept null hypothesis    Correct             Type II (β) error
Reject null hypothesis    Type I (α) error    Correct
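The two error types can be estimated by simulation. The Python sketch below is an illustration under stated assumptions: normally distributed data with known standard deviation 1, a two-sided z-test at the 5% level, 30 replicates per experiment, and a true shift of 0.5 when the null hypothesis is false. All function names and parameter values are chosen for the example.

```python
import random
from statistics import NormalDist, fmean


def rejects_null(sample, mu0, sigma, alpha):
    """Two-sided z-test of H0: mean == mu0, with known standard deviation sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


def rejection_rate(true_mean, n=30, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which H0: mean == 0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        rejections += rejects_null(sample, 0.0, 1.0, alpha)
    return rejections / trials


# Null hypothesis actually true: every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=0.0)
# Null hypothesis actually false: every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=0.5)

print(f"Estimated Type I error rate:  {type_i_rate:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

The Type I rate should land near the chosen significance level, while the Type II rate depends on the size of the true effect and the number of replicates.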
Inference
Strong inference
A clear hypothesis and an acceptable test.
Weak inference
Natural experiments
Statistics in MATLAB
MATLAB has some useful statistical tools you can use to do all this (although most computational statistics is done using FORTRAN, SAS, R, or S-Plus). Supports the usual range of statistical tasks, including both analysis and visualization. Following is an overview of the capabilities of the MATLAB statistics toolbox.
Statistics Capabilities
Probability distributions, descriptive statistics, linear and non-linear models, hypothesis testing, multivariate statistics, plotting, statistical process control, design of experiments, and hidden Markov models.
Probability distributions
These are used to display possible probability distributions and create histograms. MATLAB provides the pdf, cdf, inverse cdf, a random number generator, and mean and variance estimators for each distribution.
Discrete distributions
Binomial, discrete uniform, geometric, hypergeometric, negative binomial, Poisson.
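To show what the per-distribution functions compute, here is a Python sketch for the binomial distribution, written from the textbook definitions using only the standard library. It is not the MATLAB toolbox interface, and the function names are invented for the example.

```python
import math
import random


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def binom_inv_cdf(q, n, p):
    """Smallest k such that P(X <= k) >= q."""
    total = 0.0
    for k in range(n + 1):
        total += binom_pmf(k, n, p)
        if total >= q:
            return k
    return n


def binom_random(n, p, size, rng):
    """Draw `size` binomial variates, each a count of successes in n Bernoulli trials."""
    return [sum(rng.random() < p for _ in range(n)) for _ in range(size)]


n, p = 10, 0.3
mean, variance = n * p, n * p * (1 - p)   # theoretical mean and variance
samples = binom_random(n, p, size=1000, rng=random.Random(0))

print("pmf(3)       =", round(binom_pmf(3, n, p), 4))
print("cdf(3)       =", round(binom_cdf(3, n, p), 4))
print("inv_cdf(0.5) =", binom_inv_cdf(0.5, n, p))
print("mean, var    =", mean, round(variance, 2))
print("sample mean  =", sum(samples) / len(samples))
```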
Descriptive statistics
Mean, median, variance, standard deviation, grouped data.
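As a small illustration of these summaries, the Python sketch below uses the standard `statistics` module on made-up data. The numbers and group names are hypothetical, and the grouped summary is simply the mean within each group.

```python
import statistics

# Hypothetical measurements, for illustration only.
data = [2.1, 3.4, 3.4, 4.0, 5.6, 7.2, 9.8]

summary = {
    "mean": statistics.fmean(data),
    "median": statistics.median(data),
    "variance": statistics.variance(data),   # sample variance (n - 1 denominator)
    "stdev": statistics.stdev(data),         # sample standard deviation
}
for name, value in summary.items():
    print(f"{name:9s} {value:.3f}")

# Grouped data: summarize each group separately.
groups = {"control": [4, 5, 6], "treated": [7, 9, 11]}
group_means = {name: statistics.fmean(values) for name, values in groups.items()}
print(group_means)
```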
Hypothesis testing
Null hypothesis, alternative hypotheses, significance level, p-value, confidence intervals. A number of tests are provided (this is a hard area).
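The relationship between null hypothesis, significance level, and p-value can be seen in one of the simplest tests, an exact two-sided binomial test. The Python sketch below assumes a hypothetical experiment of 20 coin flips with 15 heads and tests the null hypothesis that the coin is fair; the function name is illustrative.

```python
import math


def binomial_test_two_sided(k, n, p0=0.5):
    """Exact two-sided p-value for observing k successes in n trials under H0: p == p0.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one (a small tolerance guards against floating-point ties).
    """
    pmf = [math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


alpha = 0.05                                # significance level chosen in advance
p_value = binomial_test_two_sided(15, 20)   # hypothetical data: 15 heads in 20 flips
reject = p_value < alpha

print(f"p-value = {p_value:.4f}, reject null hypothesis at {alpha:.0%}: {reject}")
```

The same data would not lead to rejection at the 1% level, which is why the significance level has to be fixed before looking at the result.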
Multivariate statistics
Principal components analysis, factor analysis, MANOVA, cluster analysis, multidimensional scaling.
Design of experiments
Full factorial designs, fractional factorial designs, response surface designs, D-optimal designs.
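Of these, the full factorial design is the easiest to generate by hand: it runs every combination of factor levels. The Python sketch below covers only that case, and the factor names and levels are hypothetical.

```python
from itertools import product


def full_factorial(factors):
    """Return one run per combination of factor levels.

    `factors` maps each factor name to its list of levels.
    """
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical factors: 2 x 3 x 2 levels gives 12 runs.
factors = {
    "temperature": [20, 30],
    "pH": [6, 7, 8],
    "light": ["low", "high"],
}
design = full_factorial(factors)

print(len(design), "runs")
for run in design:
    print(run)
```

The number of runs is the product of the level counts, which is what motivates fractional designs once the number of factors grows.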
Conclusions
MATLAB provides a basic engineering toolkit for these statistical activities. It is not as broad as R or S-Plus, but it is compatible with data collected or generated by other toolkits. It supports all activities well. More specialized work (e.g., Bayesian analysis) requires either your own extensions or more specialized toolkits.