C.1 Indexes
Because structure is taken to mean departure from normality, these indexes are designed to detect non-normality in the projected data. Several criteria can be used to assess the usefulness of a projection index: affine invariance [Huber, 1985], speed of computation, and sensitivity to departures from normality in the core of the distribution rather than in the tails. The last criterion ensures that we are pursuing structure and not just outliers.
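$$
PI_{FT}(\alpha, \beta) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( R^2 - r_{ij}^2 \right)^3 \, 1\!\left( R^2 - r_{ij}^2 \right),
$$

where $R = 2.29\, n^{-1/5}$, $r_{ij}^2 = (z_i^\alpha - z_j^\alpha)^2 + (z_i^\beta - z_j^\beta)^2$ is the squared distance between projected observations $i$ and $j$, and $1(\cdot)$ is the indicator function for positive values,

$$
1(x) = \begin{cases} 1; & x > 0 \\ 0; & x \le 0. \end{cases}
$$

Here $z_i^\alpha$ and $z_i^\beta$ denote the coordinates of the $i$-th sphered observation projected onto the directions $\alpha$ and $\beta$.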
This index has been revised from the original to be affine invariant [Swayne, Cook and Buja, 1991] and has computational order $O(n^2)$.
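As a rough illustration of the interpoint-distance index, the following MATLAB sketch evaluates the double sum for a handful of already-projected points. The toy data `z`, the variable names, and the cutoff `R = 2.29*n^(-1/5)` are assumptions made for this example and are not taken from the listings in this appendix; `bsxfun` is used so that the sketch does not rely on implicit expansion.

```matlab
% Toy projected data: each row is one observation in the projection plane.
z = [0 0; 1 0; 0 1; 5 5];
n = size(z,1);
R = 2.29*n^(-1/5);            % assumed cutoff radius

% Squared interpoint distances r_ij^2 as an n-by-n matrix.
d1 = bsxfun(@minus, z(:,1), z(:,1)');
d2 = bsxfun(@minus, z(:,2), z(:,2)');
r2 = d1.^2 + d2.^2;

% Only pairs closer than R contribute (the indicator function).
w = R^2 - r2;
ppi_ft = sum(sum((w.^3).*(w > 0)));
```

The n-by-n distance matrix makes the $O(n^2)$ cost explicit. In this toy set the isolated point (5,5) contributes only through its own diagonal term, since every other point lies farther away than `R`.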
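$$
PI_E(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^{n} \log \left[ \frac{1}{n h_\alpha h_\beta} \sum_{j=1}^{n} \phi_2 \left( \frac{z_i^\alpha - z_j^\alpha}{h_\alpha}, \frac{z_i^\beta - z_j^\beta}{h_\beta} \right) \right] + \log(2 \pi e),
$$

where $\phi_2$ is the bivariate standard normal density. The bandwidths $h_\gamma$, $\gamma = \alpha, \beta$, are obtained from

$$
h_\gamma = 1.06\, n^{-1/5} \left[ \frac{1}{n-1} \sum_{i=1}^{n} \left( z_i^\gamma - \frac{1}{n} \sum_{j=1}^{n} z_j^\gamma \right)^2 \right]^{1/2}.
$$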
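The entropy-type index can be sketched in a few vectorized MATLAB lines. This is a minimal sketch under stated assumptions: the toy data `z` (which is not sphered), the variable names, the normal-reference bandwidth constant 1.06, and the additive constant `log(2*pi*e)` are all choices made for this example and not code from this appendix.

```matlab
% Toy projected data (rows are observations, columns the two directions).
z = [1 2; 3 5; -1 0; 0 -2];
n = size(z,1);

% One bandwidth per direction; std uses the n-1 normalization.
h = 1.06*n^(-1/5)*std(z);

% Scaled pairwise differences in each direction.
u1 = bsxfun(@minus, z(:,1), z(:,1)')/h(1);
u2 = bsxfun(@minus, z(:,2), z(:,2)')/h(2);

% Bivariate standard normal kernel and the density estimate at each point.
phi2 = exp(-0.5*(u1.^2 + u2.^2))/(2*pi);
fhat = sum(phi2, 2)/(n*h(1)*h(2));

% Average log-density plus the assumed constant term.
ppi_e = mean(log(fhat)) + log(2*pi*exp(1));
```

The kernel sum over all pairs is again an $O(n^2)$ computation, which is one reason this index is slower to evaluate than moment-based alternatives.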
The third-order sample cumulants of the projected data are

$$
\kappa_{30} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( z_i^\alpha \right)^3, \qquad
\kappa_{03} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( z_i^\beta \right)^3,
$$

$$
\kappa_{21} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( z_i^\alpha \right)^2 z_i^\beta, \qquad
\kappa_{12} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( z_i^\beta \right)^2 z_i^\alpha .
$$
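The four third-order quantities above translate directly into MATLAB. The snippet below is only a sketch: the toy data `z` and the names `k30`, `k03`, `k21`, `k12` are hypothetical, and the values were chosen so the sums can be checked by hand. The formulas presuppose centered (sphered) projections and need at least three observations.

```matlab
% Toy projected data; column 1 is the alpha direction, column 2 the beta direction.
z = [1 0; -1 2; 2 1];
n = size(z,1);

% Common normalizing factor n/((n-1)(n-2)); requires n >= 3.
c = n/((n-1)*(n-2));

k30 = c*sum(z(:,1).^3);
k03 = c*sum(z(:,2).^3);
k21 = c*sum(z(:,1).^2 .* z(:,2));
k12 = c*sum(z(:,2).^2 .* z(:,1));
```

Each quantity is a single pass over the data, so this family of statistics costs only $O(n)$ per projection.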
$L^2$ Distances
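Several indexes estimate the $L^2$ distance between the density of the projected data and a bivariate standard normal density. The $L^2$ projection indexes use orthonormal polynomial expansions to estimate the marginal densities of the projected data. One of these, proposed by Friedman [1987], uses Legendre polynomials with $J$ terms. Note that MATLAB has a function called legendre for obtaining these polynomials; for a given degree $n$ it returns the associated Legendre functions of all orders $m = 0, \dots, n$, and the $m = 0$ row is the Legendre polynomial itself.

$$
PI_{Leg}(\alpha, \beta) = \frac{1}{4} \left\{
\sum_{j=1}^{J} (2j+1) \left[ \frac{1}{n} \sum_{i=1}^{n} P_j\!\left( y_i^\alpha \right) \right]^2
+ \sum_{k=1}^{J} (2k+1) \left[ \frac{1}{n} \sum_{i=1}^{n} P_k\!\left( y_i^\beta \right) \right]^2
+ \sum_{j=1}^{J} \sum_{k=1}^{J-j} (2j+1)(2k+1) \left[ \frac{1}{n} \sum_{i=1}^{n} P_j\!\left( y_i^\alpha \right) P_k\!\left( y_i^\beta \right) \right]^2
\right\},
$$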
where $P_a(\cdot)$ is the Legendre polynomial of order $a$. This index is not affine invariant, so Morton [1989] proposed the following revised index. It is based on a conversion to polar coordinates:

$$
\rho = \left( z^\alpha \right)^2 + \left( z^\beta \right)^2, \qquad
\theta = \operatorname{atan}\!\left( \frac{z^\beta}{z^\alpha} \right).
$$
We then have the following index where Fourier series and Laguerre polynomials are used:
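$$
PI_{LF}(\alpha, \beta) = \frac{1}{\pi} \sum_{l=0}^{L} \sum_{k=1}^{K} \left\{
\left[ \frac{1}{n} \sum_{i=1}^{n} L_l(\rho_i) \exp(-\rho_i / 2) \cos(k \theta_i) \right]^2
+ \left[ \frac{1}{n} \sum_{i=1}^{n} L_l(\rho_i) \exp(-\rho_i / 2) \sin(k \theta_i) \right]^2
\right\}
+ \frac{1}{2\pi} \sum_{l=0}^{L} \left[ \frac{1}{n} \sum_{i=1}^{n} L_l(\rho_i) \exp(-\rho_i / 2) \right]^2
- \frac{1}{2 \pi n} \sum_{i=1}^{n} \exp(-\rho_i / 2) + \frac{1}{8\pi},
$$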
where $L_a$ represents the Laguerre polynomial of order $a$. Two more indexes based on the $L^2$ distance using expansions in Hermite polynomials are given in Posse [1995b].
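To make the Laguerre and Fourier expansion concrete, here is a minimal MATLAB sketch of such an index for a few projected points. Several things are assumptions of this example and not taken from this appendix: the toy data `z`, the truncation orders `L = 2` and `K = 3`, the use of the four-quadrant `atan2` for the angle, the three-term recurrence used to generate the Laguerre polynomials, and the two trailing terms that complete the squared $L^2$ distance.

```matlab
% Toy projected data and hypothetical truncation orders.
z = [0.5 0.2; -1 0.3; 0.1 -0.7];
n = size(z,1);
L = 2;  K = 3;

% Polar-type coordinates: rho is the squared radius.
rho   = z(:,1).^2 + z(:,2).^2;
theta = atan2(z(:,2), z(:,1));     % assumption: four-quadrant angle

% Laguerre polynomials L_0..L_L at rho via
% (l+1) L_{l+1}(x) = (2l+1-x) L_l(x) - l L_{l-1}(x).
Lag = ones(n, L+1);
if L >= 1
    Lag(:,2) = 1 - rho;
end
for l = 1:(L-1)
    Lag(:,l+2) = ((2*l+1-rho).*Lag(:,l+1) - l*Lag(:,l))/(l+1);
end

w = exp(-rho/2);
ppi_lf = 0;
for l = 0:L
    g = Lag(:,l+1).*w;
    for k = 1:K
        ppi_lf = ppi_lf + (mean(g.*cos(k*theta))^2 + ...
                           mean(g.*sin(k*theta))^2)/pi;
    end
    ppi_lf = ppi_lf + mean(g)^2/(2*pi);
end
ppi_lf = ppi_lf - mean(w)/(2*pi) + 1/(8*pi);
```

Generating the polynomials by recurrence avoids any toolbox dependency, and every term is an average over the data, so the cost grows linearly in $n$ for fixed `L` and `K`.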
Appendix C: Projection Pursuit Indexes

    delr = sqrt(2*log(6))/5;
    angles = 0:delang:(2*pi);
    rd = 0:delr:5*delr;
    nr = length(rd);
    na = length(angles);
    for j = 1:9
        % find rotated plane
        aj = a*cos(eta(j))-b*sin(eta(j));
        bj = a*sin(eta(j))+b*cos(eta(j));
        % project data onto this plane
        z(:,1) = x*aj;
        z(:,2) = x*bj;
        % convert to polar coordinates
        [th,r] = cart2pol(z(:,1),z(:,2));
        % find all of the angles that are negative
        ind = find(th<0);
        th(ind) = th(ind)+2*pi;
        % find # points in each box
        for i=1:(nr-1)      % loop over each ring
            for k=1:(na-1)  % loop over each wedge
                ind = ...
                    find(r>rd(i) & r<rd(i+1) & ...
                    th>angles(k) & th<angles(k+1));
                pk((i-1)*8+k)=...
                    (length(ind)/n-ck((i-1)*8+k))^2...
                    /ck((i-1)*8+k);
            end
        end
        % find the number in the outer line of boxes
        for k=1:(na-1)
            ind=...
                find(r>rd(nr) & th>angles(k) & ...
                th<angles(k+1));
            pk(40+k)=(length(ind)/n-(1/48))^2/(1/48);
        end
        ppi = ppi+sum(pk);
    end
    ppi = ppi/9;
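The box-counting step in the listing can also be written without the nested loops. The sketch below is one possible vectorized variant, not the authors' code: it assumes the same geometry as the listing (five rings of width `sqrt(2*log(6))/5`, one outer ring, and eight 45-degree wedges), and the toy data `z` and the names `ring`, `wedge`, `box`, and `counts` are hypothetical.

```matlab
% Toy projected data, one row per observation.
z = [0.2 0.1; -1 0.5; 3 -2; 0.1 -0.5];
n = size(z,1);

% Polar coordinates with angles mapped into [0, 2*pi).
[th, r] = cart2pol(z(:,1), z(:,2));
th = mod(th, 2*pi);

% Ring index 1..6 (6 is the outer ring) and wedge index 1..8.
delr  = sqrt(2*log(6))/5;
ring  = min(floor(r/delr), 5) + 1;
wedge = min(floor(th/(pi/4)), 7) + 1;

% Box number laid out as in the listing: (ring-1)*8 + wedge.
box = (ring-1)*8 + wedge;

% Number of points falling in each of the 48 boxes.
counts = accumarray(box, 1, [48 1]);
```

Unlike the strict inequalities in the listing, this version assigns points lying exactly on a ring or wedge boundary to a box, so the counts always sum to `n`.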
Computational Statistics Handbook with MATLAB

Any of the other indexes can be coded in an M-file function and called by the csppeda function given below. You would call your function instead of csppind.

    function [as,bs,ppm] = csppeda(Z,c,half,m)
    % Z is the sphered data.
    % get the necessary constants
    [n,p] = size(Z);
    maxiter = 1500;
    cs = c;
    cstop = 0.01;
    as = zeros(p,1);   % storage for the information
    bs = zeros(p,1);
    ppm = realmin;
    % find the probability of bivariate standard normal
    % over each radial box.
    % NOTE: the user could put the values in to ck to
    % prevent re-calculating each time. We thought the
    % reader would be interested in seeing how we did
    % it.
    % NOTE: MATLAB 5 users should use the function
    % quad8 instead of quadl.
    fnr = inline('r.*exp(-0.5*r.^2)','r');
    ck = ones(1,40);
    ck(1:8) = quadl(fnr,0,sqrt(2*log(6))/5)/8;
    ck(9:16) = quadl(fnr,sqrt(2*log(6))/5,...
        2*sqrt(2*log(6))/5)/8;
    ck(17:24) = quadl(fnr,2*sqrt(2*log(6))/5,...
        3*sqrt(2*log(6))/5)/8;
    ck(25:32) = quadl(fnr,3*sqrt(2*log(6))/5,...
        4*sqrt(2*log(6))/5)/8;
    ck(33:40) = quadl(fnr,4*sqrt(2*log(6))/5,...
        5*sqrt(2*log(6))/5)/8;
    for i=1:m
        % generate a random starting plane
        % this will be the current best plane
        a = randn(p,1);
        mag = sqrt(sum(a.^2));
        astar = a/mag;
        b = randn(p,1);
        bb = b-(astar'*b)*astar;
        mag = sqrt(sum(bb.^2));
        bstar = bb/mag;
        clear a mag b bb
        % find the projection index for this plane
        % this will be the initial value of the index
        ppimax = csppind(Z,astar,bstar,n,ck);
        % keep repeating this search until the value
        % c becomes less than cstop or until the
        % number of iterations exceeds maxiter
        mi = 0;
        % number of iterations without increase in index
        h = 0;
        c = cs;
        while (mi < maxiter) & (c > cstop)
            % generate a p-vector on the unit sphere
            v = randn(p,1);
            mag = sqrt(sum(v.^2));
            v1 = v/mag;
            % find the a1,b1 and a2,b2 planes
            t = astar+c*v1;
            mag = sqrt(sum(t.^2));
            a1 = t/mag;
            t = astar-c*v1;
            mag = sqrt(sum(t.^2));
            a2 = t/mag;
            t = bstar-(a1'*bstar)*a1;
            mag = sqrt(sum(t.^2));
            b1 = t/mag;
            t = bstar-(a2'*bstar)*a2;
            mag = sqrt(sum(t.^2));
            b2 = t/mag;
            ppi1 = csppind(Z,a1,b1,n,ck);
            ppi2 = csppind(Z,a2,b2,n,ck);
            [mp,ip] = max([ppi1,ppi2]);
            if mp > ppimax
                % then reset plane and index to this value
                eval(['astar=a' int2str(ip) ';']);
                eval(['bstar=b' int2str(ip) ';']);
                eval(['ppimax=ppi' int2str(ip) ';']);
            else
                h = h+1;   % no increase
            end
            mi = mi+1;
            if h==half     % then decrease the neighborhood
                c = c*.5;
                h = 0;
            end
        end
        if ppimax > ppm
            % save the current projection as a best plane
            as = astar;
            bs = bstar;
            ppm = ppimax;
        end
    end
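The box probabilities stored in ck can be checked against a closed form, because the integrand `r.*exp(-0.5*r.^2)` has the antiderivative `-exp(-0.5*r.^2)`. The following sketch, with hypothetical variable names `edges`, `pring`, and `pout`, computes the same 40 values without quadrature and confirms why the outer boxes use the constant 1/48: the probability mass beyond radius `sqrt(2*log(6))` is `exp(-log(6)) = 1/6`, shared among eight wedges.

```matlab
% Ring edges: five rings of equal width out to sqrt(2*log(6)).
delr  = sqrt(2*log(6))/5;
edges = (0:5)*delr;

% Probability of each ring under the bivariate standard normal:
% the integral of r*exp(-r^2/2) from a to b is exp(-a^2/2) - exp(-b^2/2).
pring = exp(-0.5*edges(1:5).^2) - exp(-0.5*edges(2:6).^2);

% Split each ring evenly over 8 wedges, in the same order as ck.
ck = kron(pring, ones(1,8))/8;

% Probability of one outer box (beyond the last ring edge).
pout = exp(-0.5*edges(6)^2)/8;

% All 48 boxes together should carry probability one.
total = sum(ck) + 8*pout;
```

Precomputing ck this way is one option for the shortcut mentioned in the comments of the listing, where the user may supply the values to avoid recalculating them.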
Finally, we provide the following function for removing the structure from a projection found using PPEDA.

    function X = csppstrtrem(Z,a,b)
    % maximum number of iterations allowed
    maxiter = 5;
    [n,d] = size(Z);
    % find the orthonormal matrix needed via Gram-Schmidt
    U = eye(d,d);
    U(1,:) = a';   % vector for best plane
    U(2,:) = b';
    for i = 3:d
        for j = 1:(i-1)
            U(i,:) = U(i,:)-(U(j,:)*U(i,:)')*U(j,:);
        end
        U(i,:) = U(i,:)/sqrt(sum(U(i,:).^2));
    end
    % Transform data using the matrix U.
    % To match Friedman's treatment: T is d x n.
    T = U*Z';
    % These should be the 2-d projection that is 'best'.
    x1 = T(1,:);
    x2 = T(2,:);
    % Gaussianize the first two rows of T.
    % set of vector of angles
    gam = [0,pi/4, pi/8, 3*pi/8];
    for m = 1:maxiter
        % gaussianize the data
        for i=1:4
            % rotate about origin
            xp1 = x1*cos(gam(i))+x2*sin(gam(i));
            xp2 = x2*cos(gam(i))-x1*sin(gam(i));
            % Transform to normality
            % sort returns the ordering, so invert it to get the ranks
            [srt,ind1] = sort(xp1);
            [srt,ind2] = sort(xp2);
            rnk1(ind1) = 1:n;
            rnk2(ind2) = 1:n;
            arg1 = (rnk1-0.5)/n;   % get the arguments
            arg2 = (rnk2-0.5)/n;
            x1 = norminv(arg1,0,1);   % transform to normality
            x2 = norminv(arg2,0,1);
        end
    end
    % Set the first two rows of T to the
    % Gaussianized values.
    T(1,:) = x1;
    T(2,:) = x2;
    X = (U'*T)';
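The rank-based transform to normality at the heart of the Gaussianization step can be tried in isolation. The sketch below is illustrative only: the data vector `x` and the variable names are hypothetical, and the normal quantile is computed with `erfinv` (an assumption made here to avoid a toolbox dependency) through the identity `norminv(p) = sqrt(2)*erfinv(2*p-1)`. It also shows that the second output of `sort` is the ordering permutation, which must be inverted to obtain ranks.

```matlab
% Toy data: one row of values to be mapped to normal scores.
x = [3.2 -1.0 0.5 10.0 0.4];
n = numel(x);

% idx(k) is the position in x of the k-th smallest value.
[srt, idx] = sort(x);

% Invert the permutation: rnk(i) is the rank of x(i).
rnk = zeros(1,n);
rnk(idx) = 1:n;

% Map ranks to probabilities strictly inside (0,1).
p = (rnk - 0.5)/n;

% Standard normal quantiles of those probabilities.
g = sqrt(2)*erfinv(2*p - 1);
```

The transform preserves the ordering of the data while replacing the values with normal scores, so the extreme value 10.0 is pulled in to the largest score and the median value 0.5 maps to zero.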