Nadaraya-Watson
Kernel-weighted Average
N-W kernel weighted average:

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

$K$ is a kernel function:

$$K_\lambda(x_0, x) = D\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
Any smooth function $K$ such that

$$K(x) \ge 0, \quad \int K(x)\,dx = 1, \quad \int x K(x)\,dx = 0, \quad 0 < \int x^2 K(x)\,dx < \infty$$
Typically K is also symmetric about 0
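As a sketch of the N-W average above (not part of the slides; the function names are illustrative, and the Epanechnikov kernel with a fixed bandwidth is assumed):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, 0 otherwise."""
    t = np.abs(t)
    return np.where(t <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average of y at the query point x0."""
    w = epanechnikov((x - x0) / lam)   # K_lambda(x0, x_i)
    if w.sum() == 0:                   # no training point falls in the window
        return np.nan
    return np.sum(w * y) / np.sum(w)

x = np.linspace(0.0, 1.0, 101)
y = np.sin(4.0 * x)
fit = nadaraya_watson(0.5, x, y, lam=0.2)   # local average of sin(4x) near x0 = 0.5
```

Note the compact support: with a kernel like the Epanechnikov, a query point far from all training data gets zero total weight, so the estimate is undefined there.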
Some Points About Kernels
$h_\lambda(x_0)$ is a width function, also dependent on $\lambda$:
For the N-W kernel average, $h_\lambda(x_0) = \lambda$
For the k-nn average, $h_k(x_0) = |x_0 - x_{[k]}|$, where $x_{[k]}$ is the $k$-th closest $x_i$ to $x_0$
$\lambda$ determines the width of the local neighborhood and the degree of smoothness
$\lambda$ also controls the tradeoff between bias and variance
A larger $\lambda$ gives lower variance but higher bias (why?)
$\lambda$ is computed from the training data (how?)
Example Kernel functions
Epanechnikov quadratic kernel (used in N-W method)
tri-cube kernel
Gaussian kernel
Epanechnikov:
$$K_\lambda(x_0, x) = D\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases}$$

Tri-cube:
$$D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases}$$

Gaussian:
$$K_\lambda(x_0, x) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)$$
Kernel characteristics:
Compact: vanishes beyond a finite range (e.g. Epanechnikov, tri-cube)
Everywhere differentiable (e.g. Gaussian, tri-cube)
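The three example kernels can be written down directly; a minimal sketch (function names are my own):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) on |t| <= 1; compact support."""
    t = np.abs(t)
    return np.where(t <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 on |t| <= 1; compact support, smoother at the edge."""
    t = np.abs(t)
    return np.where(t <= 1, (1 - t**3)**3, 0.0)

def gaussian(t):
    """Standard normal density; everywhere differentiable, unbounded support."""
    return np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
```

For instance, `epanechnikov(1.5)` is exactly zero while `gaussian(1.5)` is small but positive, which is the compact vs. unbounded-support distinction listed above.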
Local Linear Regression
In the kernel-weighted average method, the estimated function value has high bias at the boundary
This high bias is a result of the asymmetry of the kernel at the boundary
The bias can also be present in the interior when the x values in the training set are not equally spaced
Fitting straight lines rather than constants locally helps us remove this bias (why?)
Locally Weighted Linear Regression
Least squares solution:
Note that the estimate is linear in the $y_i$
The weights $l_i(x_0)$ are sometimes referred to as the equivalent kernel
$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)\,x_i\right]^2$$

$$\hat{f}(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

where
$b(x)^T = (1, x)$: vector-valued function
$B$: $N \times 2$ regression matrix with $i$-th row $b(x_i)^T$
$W(x_0)$: $N \times N$ diagonal matrix with $i$-th diagonal element $K_\lambda(x_0, x_i)$
Ex.
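A minimal sketch of the weighted least squares solution above (illustrative names; Epanechnikov weights assumed):

```python
import numpy as np

def local_linear(x0, x, y, lam):
    """Weighted least squares fit of alpha + beta * x around x0,
    evaluated at x0: b(x0)^T (B^T W B)^{-1} B^T W y."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)     # Epanechnikov K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])        # i-th row b(x_i)^T = (1, x_i)
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y) # (alpha(x0), beta(x0))
    return np.array([1.0, x0]) @ coef                # b(x0)^T coef

x = np.linspace(0.0, 1.0, 51)
y = 2.0 * x + 1.0
fit = local_linear(0.0, x, y, lam=0.3)   # boundary point; exact for a linear truth
```

Unlike the kernel-weighted average, this fit reproduces any linear function exactly, even at the boundary $x_0 = 0$.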
Bias Reduction In Local Linear
Regression
Local linear regression automatically modifies the kernel to
correct the bias exactly to the first order
Write a Taylor series expansion of $f(x_i)$ about $x_0$:

$$E\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, f(x_i)$$

$$= f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N}(x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N}(x_i - x_0)^2\, l_i(x_0) + R$$

$$= f(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N}(x_i - x_0)^2\, l_i(x_0) + R$$

since for local linear regression $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N}(x_i - x_0)\, l_i(x_0) = 0$. Hence

$$\text{bias} = E\hat{f}(x_0) - f(x_0) = \frac{f''(x_0)}{2}\sum_{i=1}^{N}(x_i - x_0)^2\, l_i(x_0) + R$$
Ex. 6.2 in [HTF]
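The two moment conditions used in the derivation, $\sum_i l_i(x_0) = 1$ and $\sum_i (x_i - x_0)\, l_i(x_0) = 0$, hold exactly by the weighted least squares algebra and can be checked numerically; a sketch with illustrative names and Epanechnikov weights:

```python
import numpy as np

def equivalent_kernel(x0, x, lam):
    """Weights l_i(x0) such that f_hat(x0) = sum_i l_i(x0) y_i
    for local linear regression."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t**2), 0.0)   # Epanechnikov weights
    B = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    return np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)

x = np.sort(np.random.default_rng(0).uniform(0.0, 1.0, 100))   # unevenly spaced
l = equivalent_kernel(0.1, x, lam=0.25)
s0 = l.sum()                 # should be 1 (reproduces constants)
s1 = ((x - 0.1) * l).sum()   # should be 0 (reproduces linear functions)
```

This also explains the boundary bias correction: the first-order Taylor term is annihilated no matter how asymmetric the neighborhood is.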
Local Polynomial Regression
Why have a polynomial for the local fit? What would be
the rationale?
We will gain on bias; however we will pay the price in
terms of variance (why?)
$$\min_{\alpha(x_0),\,\beta_j(x_0),\, j=1,\dots,d} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j\right]^2$$

$$\hat{f}(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

where
$b(x)^T = (1, x, x^2, \dots, x^d)$: vector-valued function
$B$: $N \times (d+1)$ regression matrix with $i$-th row $b(x_i)^T$
$W(x_0)$: $N \times N$ diagonal matrix with $i$-th diagonal element $K_\lambda(x_0, x_i)$
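A sketch of the local polynomial fit (illustrative names; tri-cube weights and degree $d = 2$ assumed):

```python
import numpy as np

def local_poly(x0, x, y, lam, d=2):
    """Local degree-d polynomial fit at x0; b(x)^T = (1, x, ..., x^d)."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, (1 - t**3)**3, 0.0)    # tri-cube weights
    B = np.vander(x, d + 1, increasing=True)    # i-th row (1, x_i, ..., x_i^d)
    W = np.diag(w)
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    b0 = np.vander(np.array([x0]), d + 1, increasing=True)[0]
    return b0 @ coef

x = np.linspace(0.0, 1.0, 51)
y = x**2
fit = local_poly(0.5, x, y, lam=0.3, d=2)   # exact for a quadratic truth when d = 2
```

Just as local linear fits reproduce linear functions, the degree-$d$ fit reproduces polynomials up to degree $d$, which is the bias reduction paid for with extra variance.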
Bias and Variance Tradeoff
As the degree of local polynomial regression increases, bias
decreases and variance increases
Local linear fits can help reduce bias significantly at the boundaries
at a modest cost in variance
Local quadratic fits tend to be most helpful in reducing bias due to
curvature in the interior of the domain
So, would it be helpful to have a mixture of linear and quadratic local fits?
Local Regression in Higher
Dimensions
We can extend 1D local regression to higher dimensions
Standardize each coordinate in the kernel, because the Euclidean (squared) norm is affected by scaling
$$\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left(y_i - b(x_i)^T \beta(x_0)\right)^2, \qquad \hat{f}(x_0) = b(x_0)^T \hat{\beta}(x_0)$$

where
$b(X)$: vector-valued function of polynomial terms in $X \in \mathbb{R}^p$ of maximum degree $d$
$B$: regression matrix with $i$-th row $b(x_i)^T$
$W(x_0)$: diagonal matrix with $i$-th diagonal element $K_\lambda(x_0, x_i)$
$\hat{\beta}(x_0) = \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y$

Radial kernel in $p$ dimensions:
$$K_\lambda(x_0, x) = D\left(\frac{\|x - x_0\|}{\lambda}\right)$$

Structured kernel:
$$K_{\lambda, A}(x_0, x) = D\left(\frac{\sqrt{(x - x_0)^T A\,(x - x_0)}}{\lambda}\right)$$
Combating Dimensions: Low Order
Additive Models
ANOVA (analysis of variance) decomposition:

$$f(X_1, X_2, \dots, X_p) = \alpha + \sum_{j=1}^{p} g_j(X_j) + \sum_{k<l} g_{kl}(X_k, X_l) + \cdots$$

With only the first-order terms, one-dimensional local regression is all that is needed:

$$f(X_1, X_2, \dots, X_p) = \alpha + \sum_{j=1}^{p} g_j(X_j)$$
Probability Density Function
Estimation
In many classification or regression problems we want to estimate probability densities
So can we not estimate a probability density directly, given some samples from it?
Local methods of Density Estimation:
This estimate is typically bumpy, non-smooth (why?)
$$\hat{f}(x_0) = \frac{\#\{x_i \in \text{Nbhood}(x_0)\}}{N\lambda}$$

where $\lambda$ is the width of the neighborhood $\text{Nbhood}(x_0)$ around $x_0$
Smooth PDF Estimation using Kernels
Parzen method:

$$\hat{f}(x_0) = \frac{1}{N}\sum_{i=1}^{N} K_\lambda(x_0, x_i)$$

Gaussian kernel:

$$K_\lambda(x_0, x_i) = \frac{1}{\sqrt{2\pi}\,\lambda}\exp\left(-\frac{(x_i - x_0)^2}{2\lambda^2}\right)$$

In $p$ dimensions:

$$\hat{f}_X(x_0) = \frac{1}{N\,(2\lambda^2\pi)^{p/2}}\sum_{i=1}^{N} e^{-\frac{1}{2}\left(\|x_i - x_0\|/\lambda\right)^2}$$

(Figure: kernel density estimation)
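A sketch of the one-dimensional Parzen estimate with a Gaussian kernel (the names and the bandwidth $\lambda = 0.3$ are illustrative):

```python
import numpy as np

def parzen_gaussian(x0, x, lam):
    """f_hat(x0) = (1/N) sum_i K_lambda(x0, x_i) with the Gaussian kernel."""
    k = np.exp(-(x - x0)**2 / (2.0 * lam**2)) / np.sqrt(2.0 * np.pi * lam**2)
    return k.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)            # samples from N(0, 1)
est = parzen_gaussian(0.0, x, lam=0.3)    # roughly the N(0,1) density at 0 (about 0.4)
```

Because each sample contributes a smooth bump, the estimate is smooth, unlike the bumpy neighborhood-count estimate above.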
Using Kernel Density Estimates in
Classification
Posterior probability:

$$\hat{P}(G = j \mid X = x_0) = \frac{\hat{\pi}_j \hat{f}_j(x_0)}{\sum_{l=1}^{K} \hat{\pi}_l \hat{f}_l(x_0)}$$

In order to estimate this, we can estimate the class-conditional densities using the Parzen method, where $\hat{f}_j(x) = \hat{p}(x \mid G = j)$ is the $j$-th class-conditional density

Class-conditional densities give the ratio of posteriors:

$$\frac{\hat{P}(G = 1 \mid X = x)}{\hat{P}(G = 2 \mid X = x)} = \frac{\hat{\pi}_1 \hat{f}_1(x)}{\hat{\pi}_2 \hat{f}_2(x)}$$
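A sketch of the resulting classifier (illustrative names; one Gaussian Parzen estimate per class):

```python
import numpy as np

def parzen_gaussian(x0, x, lam):
    """One-dimensional Gaussian Parzen density estimate at x0."""
    return np.mean(np.exp(-(x - x0)**2 / (2.0 * lam**2))) / np.sqrt(2.0 * np.pi * lam**2)

def posterior(x0, samples_by_class, priors, lam=0.5):
    """P_hat(G = j | X = x0) proportional to pi_j * f_hat_j(x0)."""
    dens = np.array([parzen_gaussian(x0, s, lam) for s in samples_by_class])
    p = priors * dens
    return p / p.sum()

rng = np.random.default_rng(1)
classes = [rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500)]
post = posterior(-2.0, classes, priors=np.array([0.5, 0.5]))   # class 0 dominates at x = -2
```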
Naive Bayes Classifier
In Bayesian Classification we need to
estimate the class conditional densities:
What if the input space x is multi-
dimensional?
If we apply kernel density estimates, we
will run into the same problems that we
faced in high dimensions
To avoid these difficulties, assume that
the class conditional density factorizes:
In other words, we are assuming here that the features are conditionally independent given the class: the Naïve Bayes model
Advantages:
Each class density for each feature can
be estimated (low variance)
If some of the features are continuous and some are discrete, this method can seamlessly handle the situation
The Naïve Bayes classifier works surprisingly well for many problems (why?)
$$f_j(x) = p(x \mid G = j)$$

$$f_j(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid G = j)$$
The discriminant function (the log-posterior ratio) now has a generalized additive form
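A sketch of a Naïve Bayes posterior built from one-dimensional Parzen estimates, one per feature (illustrative names and synthetic data):

```python
import numpy as np

def kde_1d(x0, x, lam=0.5):
    """One-dimensional Gaussian Parzen density estimate."""
    return np.mean(np.exp(-(x - x0)**2 / (2.0 * lam**2))) / np.sqrt(2.0 * np.pi * lam**2)

def naive_bayes_posterior(x0, class_samples, priors):
    """Factorized class-conditional density: f_j(x) = prod_k f_jk(x_k)."""
    dens = np.array([
        np.prod([kde_1d(x0[k], S[:, k]) for k in range(len(x0))])
        for S in class_samples
    ])
    p = priors * dens
    return p / p.sum()

rng = np.random.default_rng(2)
A = rng.normal([0.0, 0.0], 1.0, size=(300, 2))   # class 0 samples
B = rng.normal([3.0, 3.0], 1.0, size=(300, 2))   # class 1 samples
post = naive_bayes_posterior(np.array([0.0, 0.0]), [A, B], np.array([0.5, 0.5]))
```

Each density estimate is one-dimensional, so the variance problems of high-dimensional kernel density estimation are avoided, at the cost of the independence assumption.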
Key Points
Local assumption
Usually, bandwidth ($\lambda$) selection is more important than kernel function selection
Low bias, low variance usually not guaranteed in high
dimensions
Little training and high online computational complexity
Use sparingly: only when really required, as in the high-confusion zone
Use when model may not be used again: No need for the
training phase