JIANPING SHEN
University of Waterloo
Problem description:
Perform a k-nearest-neighbour (KNN) regression on a simulated data set and select the best K
using cross-validation. The data follow Y=X+e, where e~Normal(0,gamma) and X is uniformly
distributed on the interval [0,1]. The experiment varies three factors:
(1) Training set size: 30 or 100
(2) Gamma: 0.1 or 0.2
(3) Cross-validation: 10-fold or 2-fold
Generate a set of 500 observations to provide a final evaluation.
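The simulation setup above can be sketched as follows. This is a minimal illustration rather than the author's script: it uses the base functions rand and randn instead of the toolbox functions used in the appendix, and it treats gamma as the standard deviation of the noise, an assumption that matches the normrnd(0,gamma,...) call in the appendix. The name nEval is chosen here for illustration only.

```matlab
% Sketch: simulate one training set and one evaluation set for Y = X + e.
n     = 30;                     % training set size: 30 or 100
gamma = 0.1;                    % noise level: 0.1 or 0.2 (assumed to be a std. dev.)
nEval = 500;                    % size of the final evaluation set (hypothetical name)

X  = rand(n, 1);                % X ~ Uniform[0,1]
Y  = X + gamma * randn(n, 1);   % e ~ Normal(0, gamma)

Xo = rand(nEval, 1);            % evaluation inputs, same distribution as X
Yo = Xo + gamma * randn(nEval, 1);

fprintf('training points: %d, evaluation points: %d\n', numel(X), numel(Xo));
```

The evaluation set is drawn independently of the training set, so it plays no part in choosing K.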
KNN Modeling:
Cross Validation:
Each observation is used for both training and validation, but never for both at the same time.
The learning sample is divided into V roughly equal subsets (folds), with 1<V<=n.
For v from 1 to V:
    Use the v-th fold as the test set and the rest of the data as the training set to build a
    sequence of models corresponding to the values of some tuning parameter α. Compute the
    prediction error of each candidate model on the test set.
The prediction error averaged over the V folds for each value of α is called the
“cross-validated” error. The value of α that gives the smallest cross-validated error is
then used to fit the final model on the whole learning data set for future use. Here the
tuning parameter α is K, the number of neighbours.
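The procedure just described can be sketched for one-dimensional KNN regression as below. This is an illustrative script written for this explanation, not the appendix code: the variable names (fold, Kmax, cvSSE, Kbest) are hypothetical, the fold labels are assigned at random, and gamma is assumed to be the noise standard deviation.

```matlab
% Sketch: choose K for 1-D KNN regression by V-fold cross-validation.
n = 30;  V = 10;  gamma = 0.1;          % one of the factor combinations
X = rand(n, 1);                         % X ~ Uniform[0,1]
Y = X + gamma * randn(n, 1);            % gamma taken as the noise std. dev.

fold  = mod(randperm(n)', V) + 1;       % random fold labels 1..V, roughly equal sizes
Kmax  = n - max(accumarray(fold, 1));   % size of the smallest training set
cvSSE = zeros(1, Kmax);                 % squared error summed over folds, per K

for v = 1:V
    te  = (fold == v);                  % the v-th fold is the test set
    Xtr = X(~te);  Ytr = Y(~te);        % the rest is the training set
    D   = abs(bsxfun(@minus, X(te), Xtr'));   % nTest-by-nTrain distances
    [~, idx] = sort(D, 2);                    % neighbours, nearest first
    Ynb  = reshape(Ytr(idx), size(idx));      % responses of those neighbours
    % Column K of pred holds the mean response of the K nearest neighbours.
    pred = bsxfun(@rdivide, cumsum(Ynb, 2), 1:size(Ynb, 2));
    err  = bsxfun(@minus, pred(:, 1:Kmax), Y(te));
    cvSSE = cvSSE + sum(err.^2, 1);
end

cvErr = cvSSE / n;                      % cross-validated MSE for K = 1..Kmax
[~, Kbest] = min(cvErr);                % K with the smallest cross-validated error
fprintf('Selected K = %d (cross-validated MSE %.4f)\n', Kbest, cvErr(Kbest));
```

Because every observation lands in exactly one test fold, dividing the summed squared error by n gives the error averaged over all held-out points.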
1) Generate a training set (for example, 30 or 100 points): (x1,y1), (x2,y2), …, (xn,yn),
where X is uniformly distributed on [0,1] and Y=X+e, with e~Normal(0,gamma).
2) Divide the training set into v folds (v=2 or v=10).
3) For each fold, use that fold as the test set and the remaining folds as the training set,
and compute the Euclidean distances between the test points and the training points.
4) Find the K with the minimum MSE in the cross-validation above. This K and the
training set (X,Y) are then used as the predictor for a new test set (Xnew,Ynew),
which has the same distribution as the training set generated in 1).
5) Generate the predictions Yknn for Xnew by averaging the responses of the K nearest
training points (the regression counterpart of a majority vote).
6) Repeat 1)-5) 10 times to get CV_MSE and the mean K.
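As a small worked example of step 5), the prediction at a single new point is the mean response of its K nearest training points. The five training points and the query point below are made up for illustration and do not come from the report.

```matlab
% Sketch: KNN regression prediction at one new point (made-up data).
Xtrain = [0.1; 0.2; 0.4; 0.7; 0.9];
Ytrain = [0.1; 0.3; 0.5; 0.6; 1.0];
xnew   = 0.35;
K      = 3;

d = abs(Xtrain - xnew);          % 1-D Euclidean distances to the new point
[~, idx] = sort(d);              % training points ordered from nearest to farthest
nearest  = idx(1:K);             % here: points 3, 2 and 1
yknn     = mean(Ytrain(nearest));   % (0.5 + 0.3 + 0.1) / 3 = 0.3

fprintf('KNN prediction at x = %.2f with K = %d: %.4f\n', xnew, K, yknn);
```

Averaging replaces the majority vote used in KNN classification, which is why the response can be continuous.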
Implementation:
The KNN implementation is written in MATLAB. It allows continuous responses, the specification of
the distance used to find the nearest neighbours (Euclidean here), the aggregation method used
to summarize the response (majority class, mean, SSE, MSE, etc.) and the method of
handling ties (all, random selection, odd K, etc.).
Notes:
Gamma=0.1
[Figures: sample points and cross-validated KNN predictions plotted against x on [0,1]. One panel is titled "K Nearest Neighbours for (K=7 at 2 folds)", with legend entries "sample pts" and "cv knn pts".]
Gamma=0.2
[Figures: sample points and cross-validated KNN predictions plotted against x on [0,1]. One panel is titled "K Nearest Neighbours for (K=11 at 2 folds)", with legend entries "sample pts" and "cv knn pts".]
As the graphs show, the KNN predictions fit the sample points well.
Behavior of training and test (validation)
Sample error (test_MSE) vs model complexity plot
[Figures: two plots of test_MSE (y axis) against K (x axis).]
The test_MSE values computed from the training and test sets above are plotted
against K, the tuning parameter that controls model complexity (a larger K gives a
smoother fit). The plot suggests the following conclusions.
With fewer sample points (dots), the selected K tends to be low and the test MSE varies
more; with more sample points (circles), the selected K tends to be higher and the test
MSE varies less. Fewer folds (blue) tend to select a lower K, while more folds (red) tend
to select a higher K. However, for large K (K>=15), the test MSE varies more in the runs
with more folds.
The same pattern holds for gamma=0.2, and running the KNN MATLAB code for more
repetitions produces similar results.
[Figure: test_MSE against K from the additional runs.]
The above results of KNN are plotted by the MATLAB code below:
close all
% T=Tset_MSE;
% T1, K1: test MSE and selected K for (30 pts, 2 folds)
% T2, K2: test MSE and selected K for (30 pts, 10 folds)
% T3, K3: test MSE and selected K for (100 pts, 2 folds)
% T4, K4: test MSE and selected K for (100 pts, 10 folds)
%
% Earlier runs, kept commented out (presumably gamma=0.1):
% K=[1 5 1 1 5 3 1 3 5 1 1 7 3 11 1 5 5 7 1 9 7 5 7 9 5 15 9 7 5 3 7 3 ...
%    21 15 11 3 3 27 9 51];
%
% Tset_MSE=[.0238 .0120 .0233 .0180 .0127 .0153 .0226 .0169 .0145 .0254 ...
%    .0210 .0128 .0133 .0162 .0154 .0140 .0145 .0129 .0177 .0128 ...
%    .0111 .0103 .0128 .0116 .0121 .0106 .0112 .0111 .0130 .0130 ...
%    .0117 .0148 .0120 .0116 .0103 .0133 .0124 .0135 .0117 .0264];
%
% K1=[7 7 3 1 3 3 1 3 1 3];
% K2=[5 1 3 1 1 15 13 7 1 17];
% K3=[9 7 7 9 5 5 5 7 5 19];
% K4=[5 27 7 3 39 1 5 17 13 11];
%
% T=[.0151 .0128 .0157 .0198 .0138 .0124 .0210 .0130 .0183 .0139 ...
%    .0109 .0251 .0141 .0150 .0195 .0236 .0154 .0137 .0204 .0271 ...
%    .0114 .0113 .0099 .0126 .0117 .0110 .0134 .0123 .0129 .0120 ...
%    .0117 .0146 .0114 .0136 .0132 .0204 .0138 .0114 .0122 .0118];
%
% gamma=0.2
K1=[5 5 7 1 3 9 5 5 7 3];
K2=[1 3 1 15 5 11 7 1 15 21];
K3=[7 13 15 9 13 5 11 23 13 9];
K4=[19 19 21 23 17 3 25 5 3 27];
T1=[.0633 .0499 .0605 .0864 .0556 .0476 .0463 .0480 .0530 .0485];
T2=[.0615 .0545 .0730 .0531 .0557 .0466 .0426 .0691 .0464 .1058];
T3=[.0490 .0444 .0436 .0437 .0460 .0481 .0382 .0417 .0452 .0477];
T4=[.0417 .0429 .0450 .0486 .0434 .0528 .0408 .0446 .0534 .0427];
figure(1)
% 2 folds in blue, 10 folds in red; dots for 30 pts, circles for 100 pts
plot(K1,T1,'b.',K3,T3,'bo')
hold on
plot(K2,T2,'r.',K4,T4,'ro')
% 'Location','NorthWest' is the upper-left corner (the old numeric code 2)
legend('2 folds with 30 pts','2 folds with 100 pts','10 folds with 30 pts',...
    '10 folds with 100 pts','Location','NorthWest')
title('test-MSE vs K compared with diff sample pts at gamma=0.2')
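To put numbers on the comparison between configurations, the gamma=0.2 results listed in the plotting code can be summarized by their mean selected K and the mean and standard deviation of the test MSE. This is a small sketch added for illustration. The mapping of K1..K4 and T1..T4 to (training size, folds) is an assumption taken from the legend of the plot, and the names labels, Ks and Ts are introduced here.

```matlab
% Sketch: summarize the gamma=0.2 results from the plotting code.
K1=[5 5 7 1 3 9 5 5 7 3];
K2=[1 3 1 15 5 11 7 1 15 21];
K3=[7 13 15 9 13 5 11 23 13 9];
K4=[19 19 21 23 17 3 25 5 3 27];
T1=[.0633 .0499 .0605 .0864 .0556 .0476 .0463 .0480 .0530 .0485];
T2=[.0615 .0545 .0730 .0531 .0557 .0466 .0426 .0691 .0464 .1058];
T3=[.0490 .0444 .0436 .0437 .0460 .0481 .0382 .0417 .0452 .0477];
T4=[.0417 .0429 .0450 .0486 .0434 .0528 .0408 .0446 .0534 .0427];

% Assumed mapping, following the plot legend (hypothetical labels).
labels = {'30 pts, 2 folds', '30 pts, 10 folds', ...
          '100 pts, 2 folds', '100 pts, 10 folds'};
Ks = [K1; K2; K3; K4];           % one row per configuration, one column per run
Ts = [T1; T2; T3; T4];

meanK = mean(Ks, 2);             % mean selected K per configuration
meanT = mean(Ts, 2);             % mean test MSE per configuration
stdT  = std(Ts, 0, 2);           % spread of the test MSE over the 10 runs

for i = 1:4
    fprintf('%-18s mean K = %5.1f, mean test MSE = %.4f, std = %.4f\n', ...
        labels{i}, meanK(i), meanT(i), stdT(i));
end
```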
Appendix: the MATLAB code of KNN Regression Model:
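knndist.m calculates the distances used for the K-Nearest Neighbor (KNN) predictions and
returns them as an m1-by-m2 matrix (symmetric when both inputs are the same set of points).
function distmat = knndist(mat1, mat2)
% Function header and size queries reconstructed from the call
% knndist(Xr,Xt) in knn.m; rows are points, columns are coordinates.
if nargin == 1,
    mat2 = mat1;
end
[m1, n1] = size(mat1);
[m2, n2] = size(mat2);
if n1 ~= n2,
    error('Matrices mismatch!');
end
distmat = zeros(m1, m2);
if n1 == 1,
    % one-dimensional case: absolute differences
    distmat = abs(mat1*ones(1,m2)-ones(m1,1)*mat2');
elseif m2 >= m1,
    % loop over the smaller set, one row of distances at a time
    for i = 1:m1,
        distmat(i,:) = sqrt(sum(((ones(m2,1)*mat1(i,:)-mat2)').^2));
    end
else
    for i = 1:m2,
        distmat(:,i) = sqrt(sum(((mat1-ones(m1,1)*mat2(i,:))').^2))';
    end
end
end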
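The distance calculation can be checked on a tiny example. The sketch below is not part of the report's code: it computes the same kind of matrix with bsxfun, once with the one-dimensional shortcut and once with a general Euclidean formula, on made-up points x1 and x2.

```matlab
% Sketch: pairwise distance matrix between two small point sets (made-up data).
x1 = [0; 0.5; 1];                % three points, one coordinate each
x2 = [0.2; 0.9];                 % two points

% One-dimensional shortcut: absolute differences.
D1 = abs(bsxfun(@minus, x1, x2'));            % 3-by-2

% General Euclidean form, valid for any number of columns:
% ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a*b'
sq = bsxfun(@plus, sum(x1.^2, 2), sum(x2.^2, 2)') - 2*(x1*x2');
D2 = sqrt(max(sq, 0));                        % clip tiny negative rounding errors

disp(D1)                         % rows: points of x1, columns: points of x2
```

Entry (i,j) is the distance from the i-th point of x1 to the j-th point of x2, so sorting each column orders the points of x1 by their distance to one point of x2.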
kcv.m runs v-fold cross-validation, splitting the sample into training and test folds, and
returns the best K, i.e. the one with the minimum SSE.
function K=kcv(mrow,folds,X,Y)
% X and Y are mrow-by-folds matrices; column j holds the j-th fold.
md=mrow*folds-mrow;        % number of training points for each fold
% Column j of Xd holds fold j in its own rows and zeros elsewhere.
Xc=num2cell(X,1);
Xd=blkdiag(Xc{:});         % works for any number of folds
X=X(:);
Y=Y(:);
bX=repmat(X,1,folds);
Xm=bX-Xd;                  % column j holds every point except fold j
ikv=zeros(folds,1);        % best K found on each fold
sev=zeros(folds,1);        % SSE reached by that K
for j=1:folds
    % X is drawn from (0,1), so a positive entry marks membership.
    tidx=find(Xd(:,j)>0);  % test indices (fold j)
    ridx=find(Xm(:,j)>0);  % training indices (all other folds)
    Xt=X(tidx);
    Xr=X(ridx);
    % Pass only the training responses so that they line up with Xr.
    Ybar=knn(Xr,Xt,Y(ridx));
    % SSE on the test fold for each candidate K (one column per K)
    SSE=sum((Ybar-Y(tidx)*ones(1,md)).^2,1);
    [se, k]=min(SSE);
    ikv(j)=k;
    sev(j)=se;
end
% Note: this returns the K of the single fold with the smallest SSE;
% the SSE is not averaged over the folds.
[minse,id]=min(sev);
K=ikv(id);
end
knn.m calculates the matrix Ybar of predicted Y values: its first column is the 1NN
prediction, the second column the 2NN prediction, the third column the 3NN prediction, and so on.
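function Yhat=knn(Xr,Xt,Y)
% Xr: training inputs, Xt: test inputs, Y: responses belonging to Xr.
% Yhat(i,k) is the mean response of the k nearest training points to Xt(i).
eqt=isequal(Xr,Xt);        % true when predicting the training set itself
D=knndist(Xr,Xt);          % mr-by-mt distances (training points x test points)
[mr,mt]=size(D);
if(eqt)
    D(1:mr+1:mr*mr)=inf;   % a point must not count as its own neighbour
end
[junk,index]=sort(D);      % sort each column: nearest training points first
Yc=Y(index)';              % mt-by-mr responses ordered by distance
Yhat=zeros(mt,mr);
Yhat(:,1)=Yc(:,1);
for jd=2:mr
    Yhat(:,jd)=Yhat(:,jd-1)+Yc(:,jd);   % running sum over the jd nearest
end
for jd=1:mr
    Yhat(:,jd)=Yhat(:,jd)/jd;           % running sum to running mean
end
if(eqt)
    Yhat(:,mr)=Yc(:,mr);   % last column in Ybar is junk; replace by true Y
end
end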
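The two loops in knn.m build a running mean over the neighbours sorted by distance, which yields the prediction for every K at once. The same idea can be written with cumsum, as in the sketch below. The matrix Yc here is made up for illustration: each row stands for one test point and lists the responses of its neighbours from nearest to farthest.

```matlab
% Sketch: predictions for all K at once from distance-ordered responses.
Yc = [0.5 0.3 0.1 0.6 1.0;      % test point 1: neighbour responses, nearest first
      1.0 3.0 2.0 6.0 3.0];     % test point 2 (made-up values)

nNb  = size(Yc, 2);
Yhat = bsxfun(@rdivide, cumsum(Yc, 2), 1:nNb);   % column K = mean of the K nearest

disp(Yhat)
% Row 1: 0.5  0.4  0.3  0.375  0.5
% Row 2: 1    2    2    3      3
```

Column K of Yhat is then compared with the true responses to obtain the SSE for that K, which is how both kcv.m and simple_knn.m use the output of knn.m.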
simple_knn.m implements the KNN (v-fold) algorithm described above.
%function [k,CV_MSE,MSE]=simple_knn(train_pts,vfolds)
clear all
close all
clc
% sz=10;
% Kmv=zeros(sz,1);
% cv_msev=zeros(sz,1);
% for jr=1:sz
gamma=0.2;             % or 0.1
vfolds=10;             % or 2
train_num=100;         % or 30
mx=train_num/vfolds;   % points per fold
count=0;
Km=0;
kdm=0;
cv_mse=inf;
kmv=zeros(10,1);
t_mse=zeros(10,1);
while(count<10)
    kodd=1;
    while(kodd)
        % draw a training set, one fold per column
        Xs=unifrnd(0,1,[mx,vfolds]);
        Ys=Xs+normrnd(0,gamma,[mx,vfolds]);
        K=kcv(mx,vfolds,Xs,Ys);
        kodd=mod(K+1,2);   % redraw until K is odd (avoids tied votes)
    end
    Xs=Xs(:);
    Ys=Ys(:);
    ls=length(Xs);         % equal to train_num
    obs=500;
    % figure (1)
    % [junk1,junk2,Xo,Yo]=scv(obs,gamma);
    % independent evaluation set
    Xo=unifrnd(0,1,[obs,1]);
    lxo=length(Xo);        % equal to obs
    Yo=Xo+normrnd(0,gamma,[obs,1]);
    % Predict at Xo from the training set (Xs,Ys), not the other way round:
    %Ybar=knn(Xo,Xs,Yo);   % 100x500 (wrong direction)
    Ybar=knn(Xs,Xo,Ys);    % 500x100, column k is the kNN prediction
    SSE=sum((Ybar-Yo*ones(1,length(Xs))).^2);
    test_mse=SSE(K)/(lxo-1);
    mse=test_mse;
    Yp=Ybar(:,K);          % predictions with the cross-validated K
    %Yp=Ybar(:,1:10:30);
    [se,kd]=min(SSE);      % K that would be best on the evaluation set
    Yd=Ybar(:,kd);
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    disp([mse,K])
    if(mse<cv_mse)
        % keep and plot the run with the smallest test MSE so far
        cv_mse=mse;
        %figure (2)
        plot(Xs,Ys,'rp',Xo,Yp,'.')
        %
        % legend('sample pts','cv knn pts','minsse pts',2)
        % title(sprintf('K Nearest Neighbours for (K=%d vs kd=%d at %d folds)',...
        %     K,kd,vfolds));
        legend('sample pts','cv knn pts','Location','NorthWest')
        title(sprintf('K Nearest Neighbours for (K=%d at %d folds)',...
            K,vfolds));
    end
    Km=Km+K;
    kdm=kdm+kd;
    count=count+1;
    kmv(count)=K;
    t_mse(count)=test_mse;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%
Km=Km/count;               % mean selected K over the repetitions
mtestmse=mean(t_mse);
testmstd=std(t_mse);
disp('------------------------------------')
disp([cv_mse,Km,mean(t_mse),testmstd])
disp('------------------------------------')
% Kmv(jr)=Km;
% cv_msev(jr)=cv_mse;
% end
% disp([mean(Kmv),std(Kmv),mean(cv_msev),std(cv_msev)])