
Appendix C

Projection Pursuit Indexes


In this appendix, we list several indexes for projection pursuit [Posse, 1995b], and we also provide the M-file source code for the functions included in the Computational Statistics Toolbox.

C.1 Indexes
Since structure is considered to be departures from normality, these indexes are developed to detect non-normality in the projected data. There are some criteria that we can use to assess the usefulness of projection indexes. These include affine invariance [Huber, 1985], speed of computation, and sensitivity to departure from normality in the core of the distribution rather than the tails. The last criterion ensures that we are pursuing structure and not just outliers.

Friedman-Tukey Index


This projection pursuit index [Friedman and Tukey, 1974] is based on interpoint distances and is calculated using the following
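$$
PI_{FT}(\alpha,\beta) = \sum_{i=1}^{n}\sum_{j=1}^{n}\left(R^2 - r_{ij}^2\right)^3 \mathbf{1}\left(R^2 - r_{ij}^2\right),
$$

where $R = 2.29\,n^{-1/5}$, $r_{ij}^2 = \left(z_i^{\alpha} - z_j^{\alpha}\right)^2 + \left(z_i^{\beta} - z_j^{\beta}\right)^2$, and $\mathbf{1}(\cdot)$ is the indicator function for positive values,

$$
\mathbf{1}(x) =
\begin{cases}
1; & x > 0 \\
0; & x \le 0.
\end{cases}
$$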

2002 by Chapman & Hall/CRC


Computational Statistics Handbook with MATLAB

This index has been revised from the original to be affine invariant [Swayne, Cook and Buja, 1991] and has computational order $O(n^2)$.
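The double sum in the Friedman-Tukey index can be sketched in a few vectorized lines. This is only an illustration, not part of the toolbox: the variable names `z` and `ppi_ft` and the small data set are assumptions, and `z` is taken to hold the already-projected data, with the two projections in its columns.

```matlab
% Sketch only: Friedman-Tukey index for projected data.
% ASSUMPTION: z is n x 2, column 1 = projection onto alpha,
% column 2 = projection onto beta. Data are made up.
z = [0 0; 1 0; 0 1; 3 3];
n = size(z,1);
R = 2.29*n^(-1/5);
% all pairwise squared interpoint distances (n x n matrix)
d1 = bsxfun(@minus, z(:,1), z(:,1)');
d2 = bsxfun(@minus, z(:,2), z(:,2)');
r2 = d1.^2 + d2.^2;
% (R^2 - r_ij^2)^3 counted only where R^2 - r_ij^2 > 0
t = R^2 - r2;
ppi_ft = sum(sum((t.^3).*(t > 0)));
```

The n-by-n distance matrix makes the $O(n^2)$ cost of this index explicit.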

Entropy Index


This projection pursuit index [Jones and Sibson, 1987] is based on the entropy and is given by
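$$
PI_E(\alpha,\beta) = \frac{1}{n}\sum_{i=1}^{n}\log\left[\frac{1}{n h_{\alpha} h_{\beta}}\sum_{j=1}^{n}\phi_2\left(\frac{z_i^{\alpha} - z_j^{\alpha}}{h_{\alpha}}, \frac{z_i^{\beta} - z_j^{\beta}}{h_{\beta}}\right)\right] + \log(2\pi e),
$$

where $\phi_2$ is the bivariate standard normal density. The bandwidths $h_{\gamma}$, $\gamma = \alpha, \beta$, are obtained from

$$
h_{\gamma} = 1.06\,n^{-1/5}\left[\frac{\sum_{i=1}^{n}\left(z_i^{\gamma} - \sum_{j=1}^{n} z_j^{\gamma}/n\right)^2}{n-1}\right]^{1/2}.
$$

This index is also $O(n^2)$.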
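As a rough illustration of the entropy index, the sketch below evaluates the kernel density estimate at each projected point and averages its logarithm. The names `z`, `h`, and `ppi_e` and the data values are assumptions made for this example; the bandwidth line relies on `std` using the $n-1$ normalization by default, which matches the bandwidth formula above.

```matlab
% Sketch only: entropy index for projected data.
% ASSUMPTION: z is n x 2 with the two projections in its columns.
z = [-1.2 0.3; -0.4 -0.9; 0.1 1.1; 0.6 -0.2; 0.9 -0.3];
n = size(z,1);
% bandwidths: 1.06 * n^(-1/5) * sample standard deviation
h = 1.06*n^(-1/5)*std(z);          % 1 x 2, one per projection
% scaled pairwise differences
u = bsxfun(@minus, z(:,1), z(:,1)')/h(1);
v = bsxfun(@minus, z(:,2), z(:,2)')/h(2);
% bivariate standard normal density at each scaled difference
phi2 = exp(-0.5*(u.^2 + v.^2))/(2*pi);
% kernel density estimate at each point, then the index
fhat = sum(phi2,2)/(n*h(1)*h(2));
ppi_e = mean(log(fhat)) + log(2*pi*exp(1));
```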

Moment Index


This index was developed in Jones and Sibson [1987] and is based on bivariate third and fourth moments. This is very fast to compute, so it is useful for large data sets. However, a problem with this index is that it tends to locate structure in the tails of the distribution. It is given by
$$
PI_M(\alpha,\beta) = \frac{1}{12}\left[\kappa_{30}^2 + 3\kappa_{21}^2 + 3\kappa_{12}^2 + \kappa_{03}^2 + \frac{1}{4}\left(\kappa_{40}^2 + 4\kappa_{31}^2 + 6\kappa_{22}^2 + 4\kappa_{13}^2 + \kappa_{04}^2\right)\right],
$$

where

$$
\begin{aligned}
\kappa_{30} &= \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(z_i^{\alpha}\right)^3, &
\kappa_{03} &= \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(z_i^{\beta}\right)^3, \\
\kappa_{31} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(z_i^{\alpha}\right)^3 z_i^{\beta}, &
\kappa_{13} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(z_i^{\beta}\right)^3 z_i^{\alpha}, \\
\kappa_{04} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)}\left[\sum_{i=1}^{n}\left(z_i^{\beta}\right)^4 - \frac{3(n-1)^3}{n(n+1)}\right], &
\kappa_{40} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)}\left[\sum_{i=1}^{n}\left(z_i^{\alpha}\right)^4 - \frac{3(n-1)^3}{n(n+1)}\right], \\
\kappa_{22} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)}\left[\sum_{i=1}^{n}\left(z_i^{\alpha}\right)^2\left(z_i^{\beta}\right)^2 - \frac{(n-1)^3}{n(n+1)}\right], \\
\kappa_{21} &= \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(z_i^{\alpha}\right)^2 z_i^{\beta}, &
\kappa_{12} &= \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(z_i^{\beta}\right)^2 z_i^{\alpha}.
\end{aligned}
$$
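Because the moment index needs only a handful of sums, it is easy to sketch. The following is a minimal illustration, assuming the projected data are held in two column vectors with the hypothetical names `za` and `zb`; the five data points are made up and were chosen to have zero mean, unit sample variance, and zero sample covariance.

```matlab
% Sketch only: moment index for projected, sphered data.
% ASSUMPTION: za and zb hold the projections onto alpha and beta.
za = [-1; -1; 0;  1; 1];
zb = [ 1; -1; 0; -1; 1];
n = numel(za);
% leading constants for the third- and fourth-order terms
c3 = n/((n-1)*(n-2));
c4 = n*(n+1)/((n-1)*(n-2)*(n-3));
adj = (n-1)^3/(n*(n+1));
% third-order terms
k30 = c3*sum(za.^3);
k03 = c3*sum(zb.^3);
k21 = c3*sum(za.^2.*zb);
k12 = c3*sum(zb.^2.*za);
% fourth-order terms
k31 = c4*sum(za.^3.*zb);
k13 = c4*sum(zb.^3.*za);
k40 = c4*(sum(za.^4) - 3*adj);
k04 = c4*(sum(zb.^4) - 3*adj);
k22 = c4*(sum(za.^2.*zb.^2) - adj);
% the index itself
ppi_m = (k30^2 + 3*k21^2 + 3*k12^2 + k03^2 + ...
    (k40^2 + 4*k31^2 + 6*k22^2 + 4*k13^2 + k04^2)/4)/12;
```

Each term is a single pass over the data, which is why this index is much cheaper than the two $O(n^2)$ indexes above.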

$L^2$ Distances

Several indexes estimate the $L^2$ distance between the density of the projected data and a bivariate standard normal density. The $L^2$ projection indexes use orthonormal polynomial expansions to estimate the marginal densities of the projected data. One of these, proposed by Friedman [1987], uses Legendre polynomials with $J$ terms. Note that MATLAB has a function for obtaining these polynomials called legendre.

$$
\begin{aligned}
PI_{Leg}(\alpha,\beta) = \frac{1}{4}\Bigg[ &\sum_{j=1}^{J}(2j+1)\left(\frac{1}{n}\sum_{i=1}^{n}P_j\left(y_i^{\alpha}\right)\right)^2 + \sum_{k=1}^{J}(2k+1)\left(\frac{1}{n}\sum_{i=1}^{n}P_k\left(y_i^{\beta}\right)\right)^2 \\
&+ \sum_{j=1}^{J}\sum_{k=1}^{J-j}(2j+1)(2k+1)\left(\frac{1}{n}\sum_{i=1}^{n}P_j\left(y_i^{\alpha}\right)P_k\left(y_i^{\beta}\right)\right)^2\Bigg],
\end{aligned}
$$


where $P_a(\cdot)$ is the Legendre polynomial of order $a$. This index is not affine invariant, so Morton [1989] proposed the following revised index. This is based on a conversion to polar coordinates as follows

$$
\rho = \left(z^{\alpha}\right)^2 + \left(z^{\beta}\right)^2, \qquad \psi = \operatorname{atan}\left(\frac{z^{\beta}}{z^{\alpha}}\right).
$$

We then have the following index where Fourier series and Laguerre polynomials are used:
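$$
\begin{aligned}
PI_{LF}(\alpha,\beta) = {} &\frac{1}{\pi}\sum_{l=0}^{L}\sum_{k=1}^{K}\left[\left(\frac{1}{n}\sum_{i=1}^{n}L_l(\rho_i)\exp(-\rho_i/2)\cos(k\psi_i)\right)^2 + \left(\frac{1}{n}\sum_{i=1}^{n}L_l(\rho_i)\exp(-\rho_i/2)\sin(k\psi_i)\right)^2\right] \\
&+ \frac{1}{2\pi}\sum_{l=0}^{L}\left(\frac{1}{n}\sum_{i=1}^{n}L_l(\rho_i)\exp(-\rho_i/2)\right)^2 - \frac{1}{2\pi n}\sum_{i=1}^{n}\exp(-\rho_i/2) + \frac{1}{8\pi},
\end{aligned}
$$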

where $L_a$ represents the Laguerre polynomial of order $a$. Two more indexes based on the $L^2$ distance using expansions in Hermite polynomials are given in Posse [1995b].
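To make the Laguerre-Fourier index more concrete, here is a rough sketch under several assumptions. The names `z`, `Lmax`, `K`, `Lag`, and `ppi_lf` are hypothetical; the Laguerre polynomials are generated with the standard three-term recurrence $(l+1)L_{l+1}(x) = (2l+1-x)L_l(x) - lL_{l-1}(x)$ rather than a toolbox call; and `atan2` is used in place of a plain arctangent so that the angle covers all four quadrants, which is a choice made for this example rather than something stated in the text.

```matlab
% Sketch only: Laguerre-Fourier index for projected data.
% ASSUMPTIONS: z is n x 2 (projections in columns); Lmax and K are
% the expansion orders; atan2 replaces atan to get the full angle.
z = [1 0; -1 0; 0 1; 0 -1];
Lmax = 2;
K = 2;
n = size(z,1);
rho = z(:,1).^2 + z(:,2).^2;
psi = atan2(z(:,2), z(:,1));
% Laguerre polynomials L_0..L_Lmax evaluated at rho (column l+1 = L_l)
Lag = ones(n, Lmax+1);
if Lmax >= 1
    Lag(:,2) = 1 - rho;
end
for l = 1:(Lmax-1)
    Lag(:,l+2) = ((2*l+1-rho).*Lag(:,l+1) - l*Lag(:,l))/(l+1);
end
w = exp(-rho/2);
ppi_lf = 0;
for l = 0:Lmax
    g = Lag(:,l+1).*w;
    for k = 1:K
        % cosine and sine Fourier terms
        ppi_lf = ppi_lf + (mean(g.*cos(k*psi))^2 + ...
            mean(g.*sin(k*psi))^2)/pi;
    end
    % term without angular dependence
    ppi_lf = ppi_lf + mean(g)^2/(2*pi);
end
ppi_lf = ppi_lf - mean(w)/(2*pi) + 1/(8*pi);
```

For the four symmetric points used here, every Fourier term cancels, so only the last three terms of the index contribute.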

C.2 MATLAB Source Code


The first function we look at is the one to calculate the chi-square projection pursuit index.

    function ppi = csppind(x,a,b,n,ck)
    % x is the data, a and b are the projection vectors,
    % n is the number of data points, and ck is the value
    % of the standard normal bivariate cdf for the boxes.
    z = zeros(n,2);
    ppi = 0;
    pk = zeros(1,48);
    eta = pi*(0:8)/36;
    delang = 45*pi/180;
    delr = sqrt(2*log(6))/5;
    angles = 0:delang:(2*pi);
    rd = 0:delr:5*delr;
    nr = length(rd);
    na = length(angles);
    for j = 1:9
        % find rotated plane
        aj = a*cos(eta(j))-b*sin(eta(j));
        bj = a*sin(eta(j))+b*cos(eta(j));
        % project data onto this plane
        z(:,1) = x*aj;
        z(:,2) = x*bj;
        % convert to polar coordinates
        [th,r] = cart2pol(z(:,1),z(:,2));
        % find all of the angles that are negative
        ind = find(th<0);
        th(ind) = th(ind)+2*pi;
        % find # points in each box
        for i = 1:(nr-1)       % loop over each ring
            for k = 1:(na-1)   % loop over each wedge
                ind = ...
                    find(r>rd(i) & r<rd(i+1) & ...
                    th>angles(k) & th<angles(k+1));
                pk((i-1)*8+k) = ...
                    (length(ind)/n-ck((i-1)*8+k))^2 ...
                    /ck((i-1)*8+k);
            end
        end
        % find the number in the outer line of boxes
        for k = 1:(na-1)
            ind = ...
                find(r>rd(nr) & th>angles(k) & ...
                th<angles(k+1));
            pk(40+k) = (length(ind)/n-(1/48))^2/(1/48);
        end
        ppi = ppi+sum(pk);
    end
    ppi = ppi/9;

Any of the other indexes can be coded in an M-file function and called by the csppeda function given below. You would call your function instead of csppind.

    function [as,bs,ppm] = csppeda(Z,c,half,m)
    % Z is the sphered data.
    % get the necessary constants
    [n,p] = size(Z);
    maxiter = 1500;
    cs = c;
    cstop = 0.00001;
    cstop = 0.01;
    as = zeros(p,1);   % storage for the information
    bs = zeros(p,1);
    ppm = realmin;
    % find the probability of bivariate standard normal
    % over each radial box.
    % NOTE: the user could put the values in to ck to
    % prevent re-calculating each time. We thought the
    % reader would be interested in seeing how we did
    % it.
    % NOTE: MATLAB 5 users should use the function
    % quad8 instead of quadl.
    fnr = inline('r.*exp(-0.5*r.^2)','r');
    ck = ones(1,40);
    ck(1:8) = quadl(fnr,0,sqrt(2*log(6))/5)/8;
    ck(9:16) = quadl(fnr,sqrt(2*log(6))/5,...
        2*sqrt(2*log(6))/5)/8;
    ck(17:24) = quadl(fnr,2*sqrt(2*log(6))/5,...
        3*sqrt(2*log(6))/5)/8;
    ck(25:32) = quadl(fnr,3*sqrt(2*log(6))/5,...
        4*sqrt(2*log(6))/5)/8;
    ck(33:40) = quadl(fnr,4*sqrt(2*log(6))/5,...
        5*sqrt(2*log(6))/5)/8;
    for i = 1:m
        % generate a random starting plane
        % this will be the current best plane
        a = randn(p,1);
        mag = sqrt(sum(a.^2));
        astar = a/mag;
        b = randn(p,1);
        bb = b-(astar'*b)*astar;
        mag = sqrt(sum(bb.^2));
        bstar = bb/mag;
        clear a mag b bb
        % find the projection index for this plane
        % this will be the initial value of the index
        ppimax = csppind(Z,astar,bstar,n,ck);
        % keep repeating this search until the value
        % c becomes less than cstop or until the
        % number of iterations exceeds maxiter
        mi = 0;
        % number of iterations without increase in index
        h = 0;
        c = cs;
        while (mi < maxiter) & (c > cstop)
            % generate a p-vector on the unit sphere
            v = randn(p,1);
            mag = sqrt(sum(v.^2));
            v1 = v/mag;
            % find the a1,b1 and a2,b2 planes
            t = astar+c*v1;
            mag = sqrt(sum(t.^2));
            a1 = t/mag;
            t = astar-c*v1;
            mag = sqrt(sum(t.^2));
            a2 = t/mag;
            t = bstar-(a1'*bstar)*a1;
            mag = sqrt(sum(t.^2));
            b1 = t/mag;
            t = bstar-(a2'*bstar)*a2;
            mag = sqrt(sum(t.^2));
            b2 = t/mag;
            ppi1 = csppind(Z,a1,b1,n,ck);
            ppi2 = csppind(Z,a2,b2,n,ck);
            [mp,ip] = max([ppi1,ppi2]);
            if mp > ppimax
                % then reset plane and index to this value
                eval(['astar=a' int2str(ip) ';']);
                eval(['bstar=b' int2str(ip) ';']);
                eval(['ppimax=ppi' int2str(ip) ';']);
            else
                h = h+1;   % no increase
            end
            mi = mi+1;
            if h==half     % then decrease the neighborhood
                c = c*.5;
                h = 0;
            end
        end
        if ppimax > ppm
            % save the current projection as a best plane
            as = astar;
            bs = bstar;
            ppm = ppimax;
        end
    end
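The box probabilities that csppeda obtains by numerical quadrature also have a closed form, since the integral of $r\exp(-r^2/2)$ from $a$ to $b$ is $\exp(-a^2/2) - \exp(-b^2/2)$. The sketch below is an illustration of that identity and not the toolbox's method; the names `ring` and `outer` are introduced here for the example. It also shows where the constant 1/48 in csppind comes from: the probability beyond the outermost radius $\sqrt{2\log 6}$ is $1/6$, split over 8 wedges.

```matlab
% Sketch only: closed-form box probabilities for the chi-square index.
% Ring radii match csppind and csppeda: multiples of sqrt(2*log(6))/5.
delr = sqrt(2*log(6))/5;
rd = (0:5)*delr;
% probability mass of each of the 5 rings under the bivariate
% standard normal: exp(-a^2/2) - exp(-b^2/2)
ring = exp(-0.5*rd(1:5).^2) - exp(-0.5*rd(2:6).^2);
% each ring is split into 8 equal wedges, giving 40 boxes in the
% same order used by csppind (8 wedges per ring, inner ring first)
ck = kron(ring, ones(1,8))/8;
% mass beyond the last radius, also split into 8 wedges
outer = exp(-0.5*rd(6)^2)/8;
```

With these values, the 40 inner boxes and the 8 outer boxes together account for all of the probability mass.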

Finally, we provide the following function for removing the structure from a projection found using PPEDA.

    function X = csppstrtrem(Z,a,b)
    % maximum number of iterations allowed
    maxiter = 5;
    [n,d] = size(Z);
    % find the orthonormal matrix needed via Gram-Schmidt
    U = eye(d,d);
    U(1,:) = a';   % vector for best plane
    U(2,:) = b';
    for i = 3:d
        for j = 1:(i-1)
            U(i,:) = U(i,:)-(U(j,:)*U(i,:)')*U(j,:);
        end
        U(i,:) = U(i,:)/sqrt(sum(U(i,:).^2));
    end
    % Transform data using the matrix U.
    % To match Friedman's treatment: T is d x n.
    T = U*Z';
    % These should be the 2-d projection that is 'best'.
    x1 = T(1,:);
    x2 = T(2,:);
    % Gaussianize the first two rows of T.
    % set the vector of angles
    gam = [0,pi/4, pi/8, 3*pi/8];
    rnk1 = zeros(1,n);
    rnk2 = zeros(1,n);
    for m = 1:maxiter
        % gaussianize the data
        for i = 1:4
            % rotate about origin
            xp1 = x1*cos(gam(i))+x2*sin(gam(i));
            xp2 = x2*cos(gam(i))-x1*sin(gam(i));
            % Transform to normality.
            % sort returns the ordering, not the ranks,
            % so invert the ordering to get the ranks
            [tmp,idx1] = sort(xp1);
            [tmp,idx2] = sort(xp2);
            rnk1(idx1) = 1:n;
            rnk2(idx2) = 1:n;
            arg1 = (rnk1-0.5)/n;   % get the arguments
            arg2 = (rnk2-0.5)/n;
            x1 = norminv(arg1,0,1);   % transform to normality
            x2 = norminv(arg2,0,1);
        end
    end
    % Set the first two rows of T to the
    % Gaussianized values.
    T(1,:) = x1;
    T(2,:) = x2;
    X = (U'*T)';
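The Gaussianizing step in the structure removal maps each value to a normal quantile through its rank. A small sketch of that step on one made-up vector follows; the names `x`, `idx`, `rnk`, and `g` are assumptions for this example. Note that the second output of `sort` is the ordering, so the ranks are obtained by inverting it. To avoid depending on the Statistics Toolbox, the sketch uses the identity $\Phi^{-1}(p) = -\sqrt{2}\,\mathrm{erfcinv}(2p)$ in place of norminv.

```matlab
% Sketch only: rank-based transformation to normality.
% ASSUMPTION: x is a row vector of projected values (made up here).
x = [0.7 -1.3 2.1 0.2 -0.4];
n = numel(x);
% idx(k) is the position in x of the k-th smallest value
[srt, idx] = sort(x);
% invert the ordering: rnk(i) is the rank of x(i)
rnk = zeros(1,n);
rnk(idx) = 1:n;
% arguments in (0,1), then standard normal quantiles
p = (rnk - 0.5)/n;
g = -sqrt(2)*erfcinv(2*p);   % same value as norminv(p,0,1)
```

The transformed values keep the ordering of the original data, so the smallest observation maps to the most negative quantile.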
