
The Elements of Statistical Learning – Chapter 6

oxstar@SJTU
January 6, 2011

Ex. 6.2 Show that $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$ for local linear regression. Define $b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j\, l_i(x_0)$. Show that $b_0(x_0) = 1$ for local polynomial regression of any degree (including local constants). Show that $b_j(x_0) = 0$ for all $j \in \{1, 2, \ldots, k\}$ for local polynomial regression of degree $k$. What are the implications of this on the bias?

Proof By the definition of the vector-valued function, $b(x)^T = (1, x)$ and $\mathbf{B} = [\mathbf{1}, \mathbf{x}]$, so we have
\[
  b(x_0)^T = b(x_0)^T (\mathbf{B}^T \mathbf{W}(x_0) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_0) \mathbf{B}
\]
\[
  (1, x_0) = b(x_0)^T (\mathbf{B}^T \mathbf{W}(x_0) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_0) [\mathbf{1}, \mathbf{x}]
\]
\[
  \begin{cases}
    b(x_0)^T (\mathbf{B}^T \mathbf{W}(x_0) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{1} = \sum_{i=1}^N l_i(x_0) = 1 \\[4pt]
    b(x_0)^T (\mathbf{B}^T \mathbf{W}(x_0) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{x} = \sum_{i=1}^N l_i(x_0)\, x_i = x_0
  \end{cases}
  \tag{1}
\]
Therefore
\[
  \sum_{i=1}^N (x_i - x_0)\, l_i(x_0)
    = \sum_{i=1}^N l_i(x_0)\, x_i - x_0 \sum_{i=1}^N l_i(x_0)
    = x_0 - x_0 \cdot 1 = 0. \qquad \blacksquare
\]

From (1), we have
\[
  b_0(x_0) = \sum_{i=1}^N (x_i - x_0)^0\, l_i(x_0) = \sum_{i=1}^N l_i(x_0) = 1. \qquad \blacksquare
\]

When $j \in \{1, 2, \ldots, k\}$, we have the vector-valued function $b(x)^T = (1, x, x^2, \ldots, x^k)$ and $\mathbf{B} = [\mathbf{1}, \mathbf{x}, \mathbf{x}^2, \ldots, \mathbf{x}^k]$.

From (1), we similarly have
\[
  x_0^j = \sum_{i=1}^N l_i(x_0)\, x_i^j \tag{2}
\]

Expanding $(x_i - x_0)^j$ without combining similar terms, each term can be written as $(-1)^b x_i^a x_0^b$, where $a + b = j$. Obviously, the number of positive terms equals the number of negative terms, i.e. $\sum_{n=1}^{2^j} (-1)^{b_n} = 0$. So each term of $b_j(x_0)$ can be written as
\[
  \sum_{i=1}^N (-1)^b x_i^a x_0^b\, l_i(x_0)
    = (-1)^b x_0^b \sum_{i=1}^N l_i(x_0)\, x_i^a
    = (-1)^b x_0^b x_0^a
    = (-1)^b x_0^j \qquad \text{// by (2)}
\]
and therefore
\[
  b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j\, l_i(x_0)
    = \sum_{n=1}^{2^j} (-1)^{b_n} x_0^j
    = x_0^j \sum_{n=1}^{2^j} (-1)^{b_n} = 0. \qquad \blacksquare
\]
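As a quick numerical sanity check of these moment conditions, the sketch below (my own toy setup with an assumed Gaussian kernel, not part of the original solution) builds the equivalent-kernel weights $l(x_0)^T = b(x_0)^T(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}(x_0)$ for a degree-$k$ fit and evaluates $b_j(x_0)$ directly.

\begin{verbatim}
# Numerical sketch (toy data and Gaussian kernel assumed, not from the text):
# verify b_0(x0) ~ 1 and b_j(x0) ~ 0 for j = 1..k for degree-k local
# polynomial regression, using the equivalent-kernel weights
# l(x0)^T = b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0).
import numpy as np

def equivalent_kernel(x, x0, k, lam=0.3):
    """Weights l_i(x0) for degree-k local polynomial regression."""
    B = np.vander(x, N=k + 1, increasing=True)          # columns 1, x, ..., x^k
    W = np.diag(np.exp(-0.5 * ((x - x0) / lam) ** 2))   # kernel weights K(x0, x_i)
    b0 = x0 ** np.arange(k + 1)                         # b(x0)^T = (1, x0, ..., x0^k)
    return b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)   # l(x0)^T, length N

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 50))
x0, k = 0.4, 3
l = equivalent_kernel(x, x0, k)
for j in range(k + 1):
    print(f"b_{j}(x0) = {np.sum((x - x0) ** j * l):+.2e}")  # ~1, then ~0
\end{verbatim}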

Hence we have the bias
\begin{align*}
  \mathrm{E}\hat f(x_0) - f(x_0)
  &= \sum_{i=1}^N l_i(x_0)\, f(x_i) - f(x_0) \\
  &= \Big[ f(x_0) \sum_{i=1}^N l_i(x_0) - f(x_0) \Big]
     + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0)
     + k_2 f''(x_0) \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \ldots \\
  &\quad + k_{k+1} f^{(k+1)}(x_0) \sum_{i=1}^N (x_i - x_0)^{k+1}\, l_i(x_0) + \ldots \\
  &= k_{k+1} f^{(k+1)}(x_0) \sum_{i=1}^N (x_i - x_0)^{k+1}\, l_i(x_0) + \ldots
\end{align*}
where $k_n = 1/n!$ is the coefficient of the $n$th term of the Taylor expansion of $f(x_i)$ around $x_0$. For local polynomial regression of degree $k$, all terms up to order $k$ vanish because $b_0(x_0) = 1$ and $b_j(x_0) = 0$ for $1 \le j \le k$.

Now we see that the bias depends only on the $(k+1)$th-degree and higher-order terms in the expansion of $f$. $\blacksquare$

Ex. 6.3 Show that $\|l(x)\|$ (Section 6.1.2) increases with the degree of the local polynomial.

Proof Preliminary: $\mathbf{B}$ is an $N \times (d+1)$ regression matrix (with $N > d+1$), so $\mathbf{B}$ is not invertible, while $\mathbf{B}\mathbf{B}^T$ is invertible.

From

\[
  \hat f(x_j) = b(x_j)^T (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_j)\, \mathbf{y}
             = \sum_{i=1}^N l_i(x_j)\, y_i = l(x_j)^T \mathbf{y}
\]

we have

\begin{align*}
  l(x_j)^T &= b(x_j)^T (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_j) \\
  l(x_j)   &= \mathbf{W}(x_j)\, \mathbf{B}\, (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} b(x_j) \\
  \|l(x_j)\|^2 &= l(x_j)^T l(x_j)
    = \big[ b(x_j)^T (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_j) \big]
      \big[ \mathbf{W}(x_j)\, \mathbf{B}\, (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} b(x_j) \big] \\
    &= b(x_j)^T (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} \mathbf{B}^T \mathbf{W}(x_j)
       \big[ \mathbf{B}\mathbf{B}^T (\mathbf{B}\mathbf{B}^T)^{-1} (\mathbf{B}\mathbf{B}^T)^{-1} \mathbf{B}\mathbf{B}^T \big]
       \mathbf{W}(x_j)\, \mathbf{B}\, (\mathbf{B}^T \mathbf{W}(x_j) \mathbf{B})^{-1} b(x_j) \\
    &= b(x_j)^T \mathbf{B}^T (\mathbf{B}\mathbf{B}^T)^{-1} (\mathbf{B}\mathbf{B}^T)^{-1} \mathbf{B}\, b(x_j) \\
  \|l(x)\|^2 &= \sum_{j=1}^{d+1} \|l(x_j)\|^2
    = \sum_{j=1}^{d+1} b(x_j)^T \mathbf{B}^T (\mathbf{B}\mathbf{B}^T)^{-1} (\mathbf{B}\mathbf{B}^T)^{-1} \mathbf{B}\, b(x_j) \\
    &= \mathrm{trace}\big( \mathbf{B}\mathbf{B}^T (\mathbf{B}\mathbf{B}^T)^{-1} (\mathbf{B}\mathbf{B}^T)^{-1} \mathbf{B}\mathbf{B}^T \big)
       \qquad \text{// Prof. Zhang has proved it} \\
    &= \mathrm{trace}(\mathbf{I}_{d+1}) = d + 1
\end{align*}

Hence $\|l(x)\|^2 = d + 1$, i.e. $\|l(x)\| = \sqrt{d+1}$, which increases with the degree $d$ of the local polynomial.
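The claim can also be illustrated numerically. The sketch below (again a toy setup of my own, with an assumed Gaussian kernel) computes $\|l(x_0)\|$ from $l(x_0)^T = b(x_0)^T(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}(x_0)$ for increasing polynomial degree $d$:

\begin{verbatim}
# Numerical sketch (assumed toy data and Gaussian kernel): ||l(x0)|| grows
# with the degree d of the local polynomial.
import numpy as np

def l_norm(x, x0, d, lam=0.3):
    B = np.vander(x, N=d + 1, increasing=True)          # columns 1, x, ..., x^d
    W = np.diag(np.exp(-0.5 * ((x - x0) / lam) ** 2))   # kernel weights
    b0 = x0 ** np.arange(d + 1)                         # b(x0)^T
    return np.linalg.norm(b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 100))
for d in range(4):   # local constant, linear, quadratic, cubic
    print(f"d = {d}:  ||l(x0)|| = {l_norm(x, 0.5, d):.4f}")  # increases with d
\end{verbatim}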

Ex. 6.4 Suppose that the $p$ predictors $X$ arise from sampling relatively smooth analog curves at $p$ uniformly spaced abscissa values. Denote by $\mathrm{Cov}(X|Y) = \Sigma$ the conditional covariance matrix of the predictors, and assume this does not change much with $Y$. Discuss the nature of the Mahalanobis choice $\mathbf{A} = \Sigma^{-1}$ for the metric in (6.14). How does this compare with $\mathbf{A} = \mathbf{I}$? How might you construct a kernel $\mathbf{A}$ that (a) downweights high-frequency components in the distance metric; (b) ignores them completely?

Answer $D = \sqrt{(x - x_0)^T \Sigma^{-1} (x - x_0)}$ is called the Mahalanobis distance from the point $x$ to $x_0$. It takes the correlations of the data set into account: if the predictors are highly correlated, the Mahalanobis distance is much more accurate than the Euclidean distance.
If $\mathbf{A} = \mathbf{I}$, then $d = \sqrt{(x - x_0)^T (x - x_0)}$ is the Euclidean distance from the point $x$ to $x_0$. Prior to smoothing, we should standardize each predictor, for example

\[
  x_i' = \frac{x_i - \mathrm{E}(x_i)}{\sqrt{\mathrm{Var}(x_i)}}
\]

When comparing it with $\Sigma^{-1}$ and using the standardized predictors, we have

\begin{align*}
  \mathrm{Cov}(x_i', x_j')
    &= \mathrm{E}\big[(x_i' - \mathrm{E}(x_i'))(x_j' - \mathrm{E}(x_j'))\big] = \mathrm{E}(x_i' x_j') \\
    &= \mathrm{E}\left[ \left( \frac{x_i - \mathrm{E}(x_i)}{\sqrt{\mathrm{Var}(x_i)}} \right)
                        \left( \frac{x_j - \mathrm{E}(x_j)}{\sqrt{\mathrm{Var}(x_j)}} \right) \right] \\
    &= \frac{\mathrm{E}\big[(x_i - \mathrm{E}(x_i))(x_j - \mathrm{E}(x_j))\big]}
            {\sqrt{\mathrm{Var}(x_i)}\,\sqrt{\mathrm{Var}(x_j)}} \\
    &= \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)}\,\sqrt{\mathrm{Var}(x_j)}}
     = \rho(x_i, x_j)
\end{align*}

Then the covariance matrix $\Sigma$ becomes its standardized version (the correlation matrix). Hence $\mathbf{A} = \mathbf{I}$ means $\forall i \ne j,\ \rho(x_i, x_j) = 0$, i.e., the dimensions of $x$ are uncorrelated.
(a) If we want to construct a kernel $\mathbf{A}$ that downweights high-frequency components (the $x_i$'s) in the distance metric, we can simply decrease $\mathrm{Cov}(x_i, x_j)$ or $\rho(x_i, x_j)$ for those components in order to suppress their influence.
(b) If we want to construct a kernel $\mathbf{A}$ that ignores them completely, we can set $\mathrm{Cov}(x_i, x_j)$ or $\rho(x_i, x_j)$ to 0.
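A minimal numerical sketch of the discussion above (toy two-predictor covariance and points of my own choosing, not from the book): it compares the metric with $\mathbf{A} = \Sigma^{-1}$ against $\mathbf{A} = \mathbf{I}$, and checks that standardizing each predictor turns $\Sigma$ into the correlation matrix.

\begin{verbatim}
# Minimal sketch (toy covariance and points assumed, not from the text):
# Mahalanobis metric A = Sigma^{-1} vs. Euclidean metric A = I, and the
# effect of standardizing each predictor.
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])                   # toy Cov(X|Y)
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=2000)

def dist(x, x0, A):
    d = x - x0
    return float(np.sqrt(d @ A @ d))

x, x0 = np.array([1.5, 0.5]), np.array([0.0, 0.0])
print("Mahalanobis:", dist(x, x0, np.linalg.inv(np.cov(X.T))))
print("Euclidean  :", dist(x, x0, np.eye(2)))

# Standardize: x' = (x - E x) / sqrt(Var x); its covariance is ~ the
# correlation matrix, as derived above.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.cov(Xs.T).round(3))                     # ~ np.corrcoef(X.T)
\end{verbatim}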
