
Computational Statistics Handbook with MATLAB

Wendy L. Martinez
Angel R. Martinez

CHAPMAN & HALL/CRC


Boca Raton London New York Washington, D.C.

2002 by Chapman & Hall/CRC

Library of Congress Cataloging-in-Publication Data


Catalog record is available from the Library of Congress

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.

Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com


2002 by Chapman & Hall/CRC
No claim to original U.S. Government works
International Standard Book Number 1-58488-229-8
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper

To Edward J. Wegman
Teacher, Mentor and Friend


Table of Contents
Preface

Chapter 1
Introduction
1.1 What Is Computational Statistics?
1.2 An Overview of the Book
    Philosophy
    What Is Covered
    A Word About Notation
1.3 MATLAB Code
    Computational Statistics Toolbox
    Internet Resources
1.4 Further Reading

Chapter 2
Probability Concepts
2.1 Introduction
2.2 Probability Background
    Probability
    Axioms of Probability
2.3 Conditional Probability and Independence
    Conditional Probability
    Independence
    Bayes Theorem
2.4 Expectation
    Mean and Variance
    Skewness
    Kurtosis
2.5 Common Distributions
    Binomial
    Poisson
    Uniform
    Normal
    Exponential
    Gamma
    Chi-Square
    Weibull
    Beta
    Multivariate Normal
2.6 MATLAB Code
2.7 Further Reading
Exercises

Chapter 3
Sampling Concepts
3.1 Introduction
3.2 Sampling Terminology and Concepts
    Sample Mean and Sample Variance
    Sample Moments
    Covariance
3.3 Sampling Distributions
3.4 Parameter Estimation
    Bias
    Mean Squared Error
    Relative Efficiency
    Standard Error
    Maximum Likelihood Estimation
    Method of Moments
3.5 Empirical Distribution Function
    Quantiles
3.6 MATLAB Code
3.7 Further Reading
Exercises

Chapter 4
Generating Random Variables
4.1 Introduction
4.2 General Techniques for Generating Random Variables
    Uniform Random Numbers
    Inverse Transform Method
    Acceptance-Rejection Method
4.3 Generating Continuous Random Variables
    Normal Distribution
    Exponential Distribution
    Gamma
    Chi-Square
    Beta
    Multivariate Normal
    Generating Variates on a Sphere
4.4 Generating Discrete Random Variables
    Binomial
    Poisson
    Discrete Uniform
4.5 MATLAB Code
4.6 Further Reading
Exercises

Chapter 5
Exploratory Data Analysis
5.1 Introduction
5.2 Exploring Univariate Data
    Histograms
    Stem-and-Leaf
    Quantile-Based Plots - Continuous Distributions
    Q-Q Plot
    Quantile Plots
    Quantile Plots - Discrete Distributions
    Poissonness Plot
    Binomialness Plot
    Box Plots
5.3 Exploring Bivariate and Trivariate Data
    Scatterplots
    Surface Plots
    Contour Plots
    Bivariate Histogram
    3-D Scatterplot
5.4 Exploring Multi-Dimensional Data
    Scatterplot Matrix
    Slices and Isosurfaces
    Star Plots
    Andrews Curves
    Parallel Coordinates
    Projection Pursuit
    Projection Pursuit Index
    Finding the Structure
    Structure Removal
    Grand Tour
5.5 MATLAB Code
5.6 Further Reading
Exercises

Chapter 6
Monte Carlo Methods for Inferential Statistics
6.1 Introduction
6.2 Classical Inferential Statistics
    Hypothesis Testing
    Confidence Intervals
6.3 Monte Carlo Methods for Inferential Statistics
    Basic Monte Carlo Procedure
    Monte Carlo Hypothesis Testing
    Monte Carlo Assessment of Hypothesis Testing
6.4 Bootstrap Methods
    General Bootstrap Methodology
    Bootstrap Estimate of Standard Error
    Bootstrap Estimate of Bias
    Bootstrap Confidence Intervals
    Bootstrap Standard Confidence Interval
    Bootstrap-t Confidence Interval
    Bootstrap Percentile Interval
6.5 MATLAB Code
6.6 Further Reading
Exercises

Chapter 7
Data Partitioning
7.1 Introduction
7.2 Cross-Validation
7.3 Jackknife
7.4 Better Bootstrap Confidence Intervals
7.5 Jackknife-After-Bootstrap
7.6 MATLAB Code
7.7 Further Reading
Exercises

Chapter 8
Probability Density Estimation
8.1 Introduction
8.2 Histograms
    1-D Histograms
    Multivariate Histograms
    Frequency Polygons
    Averaged Shifted Histograms
8.3 Kernel Density Estimation
    Univariate Kernel Estimators
    Multivariate Kernel Estimators
8.4 Finite Mixtures
    Univariate Finite Mixtures
    Visualizing Finite Mixtures
    Multivariate Finite Mixtures
    EM Algorithm for Estimating the Parameters
    Adaptive Mixtures
8.5 Generating Random Variables
8.6 MATLAB Code
8.7 Further Reading
Exercises

Chapter 9
Statistical Pattern Recognition
9.1 Introduction
9.2 Bayes Decision Theory
    Estimating Class-Conditional Probabilities: Parametric Method
    Estimating Class-Conditional Probabilities: Nonparametric
    Bayes Decision Rule
    Likelihood Ratio Approach
9.3 Evaluating the Classifier
    Independent Test Sample
    Cross-Validation
    Receiver Operating Characteristic (ROC) Curve
9.4 Classification Trees
    Growing the Tree
    Pruning the Tree
    Choosing the Best Tree
    Selecting the Best Tree Using an Independent Test Sample
    Selecting the Best Tree Using Cross-Validation
9.5 Clustering
    Measures of Distance
    Hierarchical Clustering
    K-Means Clustering
9.6 MATLAB Code
9.7 Further Reading
Exercises

Chapter 10
Nonparametric Regression
10.1 Introduction
10.2 Smoothing
    Loess
    Robust Loess Smoothing
    Upper and Lower Smooths
10.3 Kernel Methods
    Nadaraya-Watson Estimator
    Local Linear Kernel Estimator
10.4 Regression Trees
    Growing a Regression Tree
    Pruning a Regression Tree
    Selecting a Tree
10.5 MATLAB Code
10.6 Further Reading
Exercises

Chapter 11
Markov Chain Monte Carlo Methods
11.1 Introduction
11.2 Background
    Bayesian Inference
    Monte Carlo Integration
    Markov Chains
    Analyzing the Output
11.3 Metropolis-Hastings Algorithms
    Metropolis-Hastings Sampler
    Metropolis Sampler
    Independence Sampler
    Autoregressive Generating Density
11.4 The Gibbs Sampler
11.5 Convergence Monitoring
    Gelman and Rubin Method
    Raftery and Lewis Method
11.6 MATLAB Code
11.7 Further Reading
Exercises

Chapter 12
Spatial Statistics
12.1 Introduction
    What Is Spatial Statistics?
    Types of Spatial Data
    Spatial Point Patterns
    Complete Spatial Randomness
12.2 Visualizing Spatial Point Processes
12.3 Exploring First-order and Second-order Properties
    Estimating the Intensity
    Estimating the Spatial Dependence
    Nearest Neighbor Distances - G and F Distributions
    K-Function
12.4 Modeling Spatial Point Processes
    Nearest Neighbor Distances
    K-Function
12.5 Simulating Spatial Point Processes
    Homogeneous Poisson Process
    Binomial Process
    Poisson Cluster Process
    Inhibition Process
    Strauss Process
12.6 MATLAB Code
12.7 Further Reading
Exercises

Appendix A
Introduction to MATLAB
A.1 What Is MATLAB?
A.2 Getting Help in MATLAB
A.3 File and Workspace Management
A.4 Punctuation in MATLAB
A.5 Arithmetic Operators
A.6 Data Constructs in MATLAB
    Basic Data Constructs
    Building Arrays
    Cell Arrays
A.7 Script Files and Functions
A.8 Control Flow
    For Loop
    While Loop
    If-Else Statements
    Switch Statement
A.9 Simple Plotting
A.10 Contact Information

Appendix B
Index of Notation

Appendix C
Projection Pursuit Indexes
C.1 Indexes
    Friedman-Tukey Index
    Entropy Index
    Moment Index
    Distances
C.2 MATLAB Source Code

Appendix D
MATLAB Code
D.1 Bootstrap Confidence Interval
D.2 Adaptive Mixtures Density Estimation
D.3 Classification Trees
D.4 Regression Trees


Appendix E
MATLAB Statistics Toolbox

Appendix F
Computational Statistics Toolbox

Appendix G
Data Sets
References


Preface
Computational statistics is a fascinating and relatively new field within statistics. While much of classical statistics relies on parameterized functions and related assumptions, the computational statistics approach is to let the data tell the story. The advent of computers with their number-crunching capability, as well as their power to show on the screen two- and three-dimensional structures, has made computational statistics available for any data analyst to use.

Computational statistics has a lot to offer the researcher faced with a file full of numbers. The methods of computational statistics can provide assistance ranging from preliminary exploratory data analysis to sophisticated probability density estimation techniques, Monte Carlo methods, and powerful multi-dimensional visualization. All of this power and these novel ways of looking at data are accessible to researchers in their daily data analysis tasks.

One purpose of this book is to facilitate the exploration of these methods and approaches and to provide the tools to make this not just a theoretical exploration, but a practical one. The two main goals of this book are to make computational statistics techniques available to a wide range of users, including engineers and scientists, and to promote the use of MATLAB by statisticians and other data analysts.

MATLAB and Handle Graphics are registered trademarks of The MathWorks, Inc.

There are wonderful books that cover many of the techniques in computational statistics and, in the course of this book, references will be made to many of them. However, there are very few books that have endeavored to forgo the theoretical underpinnings to present the methods and techniques in a manner immediately usable to the practitioner.
The approach we take in this book is to make computational statistics accessible to a wide range of users and to provide an understanding of statistics from a computational point of view via algorithms applied to real applications.

This book is intended for researchers in engineering, statistics, psychology, biostatistics, data mining and any other discipline that must deal with the analysis of raw data. Students at the senior undergraduate level or beginning graduate level in statistics or engineering can use the book to supplement course material. Exercises are included with each chapter, making it suitable as a textbook for a course in computational statistics and data analysis. Scientists who would like to know more about programming methods for analyzing data in MATLAB would also find it useful.

We assume that the reader has the following background:

Calculus: Since this book is computational in nature, the reader needs only a rudimentary knowledge of calculus. Knowing the definition of a derivative and an integral is all that is required.
Linear Algebra: Since MATLAB is an array-based computing language, we cast several of the algorithms in terms of matrix algebra. The reader should have a familiarity with the notation of linear algebra, array multiplication, inverses, determinants, an array transpose, etc.
Probability and Statistics: We assume that the reader has had introductory probability and statistics courses. However, we provide a brief overview of the relevant topics for those who might need a refresher.

We list below some of the major features of the book.

The focus is on implementation rather than theory, helping the reader understand the concepts without being burdened by the theory.

References that explain the theory are provided at the end of each chapter. Thus, those readers who need the theoretical underpinnings will know where to find the information.

Detailed step-by-step algorithms are provided to facilitate implementation in any computer programming language or appropriate software. This makes the book appropriate for computer users who do not know MATLAB.

MATLAB code in the form of a Computational Statistics Toolbox is provided. These functions are available for download at:
http://www.infinityassociates.com
http://lib.stat.cmu.edu
Please review the readme file for installation instructions and information on any changes.

Exercises are given at the end of each chapter. The reader is encouraged to go through these, because concepts are sometimes explored further in them. Exercises are computational in nature, which is in keeping with the philosophy of the book.
Many data sets are included with the book, so the reader can apply the methods to real problems and verify the results shown in the book. The data can also be downloaded separately from the toolbox at http://www.infinityassociates.com. The data are provided in MATLAB binary files (.mat) as well as text, for those who want to use them with other software.

Typing in all of the commands in the examples can be frustrating. So, MATLAB scripts containing the commands used in the examples are also available for download at http://www.infinityassociates.com.

A brief introduction to MATLAB is provided in Appendix A. Most of the constructs and syntax that are needed to understand the programming contained in the book are explained.

An index of notation is given in Appendix B. Definitions and page numbers are provided, so the user can find the corresponding explanation in the text.

Where appropriate, we provide references to internet resources for computer code implementing the algorithms described in the chapter. These include code for MATLAB, S-plus, Fortran, etc.

We would like to acknowledge the invaluable help of the reviewers: Noel Cressie, James Gentle, Thomas Holland, Tom Lane, David Marchette, Christian Posse, Carey Priebe, Adrian Raftery, David Scott, Jeffrey Solka, and Clifton Sutton. Their many helpful comments made this book a much better product. Any shortcomings are the sole responsibility of the authors. We owe a special thanks to Jeffrey Solka for some programming assistance with finite mixtures. We greatly appreciate the help and patience of those at CRC Press: Bob Stern, Joanne Blake, and Evelyn Meany. We also thank Harris Quesnell and James Yanchak for their help with resolving font problems. Finally, we are indebted to Naomi Fernandes and Tom Lane at The MathWorks, Inc. for their special assistance with MATLAB.

Disclaimers


1. Any MATLAB programs and data sets that are included with the book are provided in good faith. The authors, publishers or distributors do not guarantee their accuracy and are not responsible for the consequences of their use.
2. The views expressed in this book are those of the authors and do not necessarily represent the views of DoD or its components.

Wendy L. and Angel R. Martinez
August 2001
