
Modern Multivariate

Statistical Techniques
--Nonparametric Density Estimation
Xi Chen
Nov 6

Multivariate Analysis
Classical analysis:
Gives poor results for huge and complex data sets
The questions being asked have become different
Meanwhile, the computational cost of storing and processing data has gone down

We need Modern Multivariate Analysis Techniques

Modern Data
From exploratory data analysis (EDA, 1977) to data mining:
from simple quick-and-dirty techniques to big data.
Internet traffic data are described as ferocious.
The Human Genome Project has to deal with gigabytes (2^30 ≈ 10^9
bytes) of genetic information.
The earth sciences have terabytes (2^40 ≈ 10^12 bytes) and, soon,
petabytes (2^50 ≈ 10^15 bytes) of data for processing.
Etc.

What is Data Mining?


Descriptive data mining: Search massive data sets
and discover the locations of unexpected structures or
relationships, patterns, trends, clusters, and outliers in
the data.
Predictive data mining: Build models and procedures
for regression, classification, pattern recognition, or
machine learning tasks, and assess the predictive
accuracy of those models and procedures when applied
to fresh data.
In machine-learning terms, descriptive data mining is known
as unsupervised learning, whereas predictive data mining is
known as supervised learning.

Nonparametric Density Estimation (NPDE)
What makes NPDE techniques so appealing to the data
analyst is:
they make no specific distributional assumptions and,
thus, can be employed as an initial exploratory look at the
data.

Suppose we wish to estimate a continuous probability
density function p of a random r-vector variate X, where

p(x) ≥ 0 for all x ∈ R^r  and  ∫_{R^r} p(x) dx = 1.   (1)

Nonparametric Density Estimation (NPDE)
Any p that satisfies (1) is called a bona fide density.
Problem:
To estimate p without specifying a formal parametric structure.

Infinitely many densities are bona fide,
so no finite number of parameters suffices!
Is the density function smooth (e.g., continuous)? We hope so,
but in certain applications (e.g., X-ray transmission tomography)
the density is discontinuous.

Nonparametric Density Estimation (NPDE)
The earliest NPDE of a univariate density p was the
histogram. (Most of the time this is the first step in
analyzing data.)
Further methods (used in nonparametric discrimination and time series):
Kernel
Orthogonal series
Nearest neighbor methods

NPDE Example: Coronary Heart Disease

Statistical Properties of Density Estimators
i.e., why we can claim certain estimators are better than
others (since the estimators cannot be compared through a
finite set of parameters):
Unbiasedness
Consistency

The Histogram
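The histogram estimate of p at a point x is the proportion of observations falling in the bin containing x, divided by the bin width. A minimal sketch in Python (the function name, bin origin, and NumPy dependency are illustrative assumptions, not the slide's notation):

```python
import numpy as np

def histogram_density(x, data, origin, h):
    """Histogram estimate of p(x): the fraction of observations
    in the bin containing x, divided by the bin width h."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    # Index k of the bin [origin + k*h, origin + (k+1)*h) containing x
    k = np.floor((x - origin) / h)
    in_bin = np.floor((data - origin) / h) == k
    return in_bin.sum() / (n * h)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=1000)
est = histogram_density(0.0, sample, origin=-4.0, h=0.5)
```

The estimate is piecewise constant: any x in the same bin returns the same value, which is why the histogram is a rough (discontinuous) density estimate.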

Kernel Density Estimation


Given n iid univariate observations, x1, x2, ..., xn, drawn
from the density p, the kernel density estimator (2) of
p(x), x ∈ R, is used to obtain a smoother density estimate
than the histogram:

p̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − x_i)/h)   (2)

K is the kernel function, and the window width h
determines the smoothness of the density estimate.
If h is too small, the density estimate depends too strongly on
the sample values (undersmoothing).
If h is too large, the estimate oversmooths, removing interesting
features of p.
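Estimator (2) can be sketched directly with a standard Gaussian kernel (a minimal illustration; the function name and NumPy dependency are assumptions):

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate (2): (1/(n h)) * sum_i K((x - x_i)/h),
    with K the standard Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    u = (np.asarray(x, dtype=float)[..., None] - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=-1) / (len(data) * h)

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=500)
p_hat = kde(0.0, sample, h=0.4)  # estimate of p at x = 0
```

Shrinking h toward 0 makes the estimate spiky around the observed points; growing h flattens it out, matching the under/oversmoothing trade-off above.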

Kernel Density Estimation


Popular choices: the Gaussian kernel, with unbounded support,
and the polynomial kernels, with bounded support.
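Both kernel families can be sketched as follows (the Epanechnikov kernel is used here as a representative polynomial kernel, an assumption since the slide does not name a specific one):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel: positive for every real u (unbounded support)."""
    return np.exp(-0.5 * np.asarray(u, dtype=float)**2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov (quadratic polynomial) kernel: zero outside [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
```

Both are symmetric and integrate to 1, so either yields a bona fide density estimate when plugged into (2).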

Kernel Density Estimation

Kernel Density Estimation Example

Kernel Density Estimation Example

Estimating the Window Width


Automated methods for determining the optimal window
width for any given data set:
Rule-of-Thumb Method
Cross-Validation
Plug-in Methods

Rule-of-Thumb Method
We take p to be a Gaussian N(μ, σ²) density and K to be a
standard Gaussian kernel.
The optimal (ROT) window width for the above density
would be

ĥ_ROT = 1.06 S n^(−1/5)

where S is the usual estimate of σ.

The problem then becomes how to choose the estimate S.
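The rule-of-thumb formula above can be sketched as (illustrative function name; assumes NumPy and the sample standard deviation as the estimate S):

```python
import numpy as np

def rot_bandwidth(data):
    """Rule-of-thumb window width h = 1.06 * S * n^(-1/5),
    derived assuming p is Gaussian and K is the standard Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    S = data.std(ddof=1)  # the usual estimate of sigma
    return 1.06 * S * n ** (-0.2)

rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.0, size=1000)
h_rot = rot_bandwidth(sample)
```

Note the scaling behavior: doubling the spread of the data doubles the window width, while h shrinks slowly (as n^(−1/5)) with sample size.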

Cross-Validation
In the univariate case, the basic algorithm removes a single
value, say xi, from the sample, computes the appropriate density
estimate at xi from the remaining n − 1 sample values, and then
chooses h to optimize some given criterion involving all values
of i = 1, 2, ..., n.
The unbiased cross-validation choice of window width is the h
that minimizes

UCV(h) = ∫ p̂_h(x)² dx − (2/n) Σ_{i=1}^{n} p̂_{h,−i}(x_i)

where p̂_{h,−i} is the leave-one-out estimate computed without xi.
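A minimal sketch of unbiased cross-validation for a Gaussian-kernel estimator. For a Gaussian kernel, ∫ p̂_h² has the closed form (1/n²) Σ_{i,j} φ(x_i − x_j; h√2), which the code below uses; function names and the grid search are illustrative assumptions:

```python
import numpy as np

def phi(u, s):
    """Gaussian density with mean 0 and standard deviation s."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def ucv(h, data):
    """UCV(h) = integral of p_hat_h^2  -  (2/n) * sum_i p_hat_{h,-i}(x_i)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    d = data[:, None] - data[None, :]            # pairwise differences
    # Closed form for the integral of the squared estimate
    int_p2 = phi(d, h * np.sqrt(2.0)).sum() / n**2
    # Leave-one-out terms: drop the diagonal (j == i) contribution
    K = phi(d, h)
    loo = (K.sum(axis=1) - K[np.arange(n), np.arange(n)]) / (n - 1)
    return int_p2 - 2.0 * loo.mean()

def ucv_bandwidth(data, grid):
    """Pick the h on the grid that minimizes UCV."""
    scores = [ucv(h, data) for h in grid]
    return grid[int(np.argmin(scores))]

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, size=200)
grid = np.linspace(0.05, 1.0, 40)
h_ucv = ucv_bandwidth(sample, grid)
```

The pairwise-difference matrix makes this O(n²) per candidate h, which is fine for moderate n but motivates the plug-in methods mentioned above for large samples.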

The End
