
Appendix to Fast Cross-Validation via Sequential Analysis

Tammo Krueger, Danny Panknin, Mikio Braun Technische Universitaet Berlin Machine Learning Group 10587 Berlin t.krueger@tu-berlin.de, {panknin|mikio}@cs.tu-berlin.de

1 Selection of Meta-Parameters for the Fast Cross-Validation

The algorithm has a number of free parameters, as can be seen from the pseudo-code in Algorithm 1: maxSteps, the number of subsample sizes to consider; α, the significance level for the binarization of the test errors; αl and βl, the significance levels for the sequential analysis test; and earlyStoppingWindow, the number of steps to look back in the early stopping procedure. While we give an in-depth treatment of the selection of π0, π1 and the maxSteps parameter in the following sections, we here give some suggestions for the other parameters. The parameter α controls the significance level in each step of the test for similar behavior. We suggest to set this to the usual level of α = 0.05. Furthermore, βl and αl control the significance levels for accepting H0 (configuration is a loser) and H1 (configuration is a winner), respectively. We suggest an asymmetric setup, setting βl = 0.1, since we want to drop loser configurations relatively fast, and αl = 0.01, since we want to be really sure when we accept a configuration as overall winner. Finally, we set earlyStoppingWindow to 3 for maxSteps = 10 and to 6 for maxSteps = 20, as we have observed that this choice works well in practice.
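For concreteness, the suggested defaults can be collected in a small configuration object. This is only an illustrative sketch; the class name and field names are ours, not part of the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class FastCVParams:
    """Suggested meta-parameters for fast cross-validation (illustrative only)."""
    max_steps: int = 10             # number of subsample sizes to consider
    alpha: float = 0.05             # per-step significance level for the top/flop binarization
    alpha_l: float = 0.01           # be conservative when declaring a winner (accepting H1)
    beta_l: float = 0.1             # drop loser configurations quickly (accepting H0)
    early_stopping_window: int = 3  # use 6 when max_steps = 20

params = FastCVParams()
```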

1.1 Choosing the Optimal Sequential Test Parameters

As outlined in the main part of the paper, we want to use the sequential testing framework to eliminate underperforming configurations as fast as possible while postponing the decision for a winner as long as possible. Using the parameters of the sequential testing framework, we have to choose π0 and π1 such that the area of acceptance for H0 (the region H0(π0, π1, αl, βl), denoted by LOSER in the overview figure) is maximized, while the earliest point of acceptance of H1 (Sa(π0, π1, αl, βl) in the overview figure) is postponed until the procedure has run at least maxSteps steps:

\[
(\pi_0, \pi_1) \;=\; \operatorname*{argmax}_{\pi_0,\,\pi_1} \; \mathrm{H0}(\pi_0, \pi_1, \alpha_l, \beta_l)
\quad \text{s.t.} \quad S_a(\pi_0, \pi_1, \alpha_l, \beta_l) \in (\text{maxSteps}-1,\ \text{maxSteps}]
\tag{1}
\]

It turns out that the global optimization in Equation (1) can be approximated by

\[
\pi_0 = 0.5, \qquad \pi_1 = \min\{\pi_1 : \mathrm{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) \le \text{maxSteps}\}
\tag{2}
\]

where ASN(π0, π1 | π = 1.0) (Average Sample Number) is the expected number of steps until the given test will yield a decision if the real π = 1.0. For details of the sequential analysis please consult [1]. Note that sequential analysis formally requires i.i.d. variables, which is clearly not the case in our setting. However, we focus on loser configurations, which are always zero (ergo deterministic) and therefore i.i.d. by construction. Also note that the true distribution of the trace matrix is complex and in general unknown. Our method should therefore be considered a first approximation, with more refined methods being the topic of future work.
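A literal way to solve Equation (2) is a simple grid search over π1 using Wald's ASN approximation (restated in the proof section below). The following sketch and its helper names are ours, not the paper's implementation.

```python
import math

def asn_pi_equals_one(pi0, pi1, alpha_l, beta_l):
    """Wald's ASN approximation under pi = 1.0 (every step is a 'top'):
    expected number of steps until H1 is accepted."""
    return math.log((1.0 - beta_l) / alpha_l) / math.log(pi1 / pi0)

def choose_pi1(max_steps, pi0=0.5, alpha_l=0.01, beta_l=0.1, grid=10_000):
    """Smallest pi1 on a grid with ASN(pi0, pi1 | pi = 1.0) <= maxSteps (Equation (2))."""
    for k in range(1, grid):
        pi1 = pi0 + (1.0 - pi0) * k / grid
        if asn_pi_equals_one(pi0, pi1, alpha_l, beta_l) <= max_steps:
            return pi1
    return 1.0

print(choose_pi1(10))  # roughly 0.78 for the suggested alpha_l, beta_l
print(choose_pi1(20))  # roughly 0.63
```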

Algorithm 1 Fast Cross-Validation
 1: function FASTCV(data, maxSteps, configurations, α, αl, βl, earlyStoppingWindow)
 2:   Δ ← N/maxSteps; modelSize ← Δ; test ← getTest(maxSteps, αl, βl)
 3:   ∀s ∈ {1, . . . , maxSteps}, c ∈ configurations : traces[c, s] ← performance[c, s] ← 0
 4:   ∀c ∈ configurations : remainingModel[c] ← true
 5:   for s ← 1 to maxSteps do
 6:     pointwisePerformance ← calcPerformance(data, modelSize, remainingModel)
 7:     performance[remainingModel, s] ← averagePerformance(pointwisePerformance)
 8:     traces[bestPerformingConfigurations(pointwisePerformance, α), s] ← 1
 9:     remainingModel[loserConfigurations(test, traces[remainingModel, 1:s])] ← false
10:     if similarPerformance(traces[remainingModel, (s − earlyStoppingWindow):s], α) then
11:       break
12:     modelSize ← modelSize + Δ
13:   return selectWinner(performance, remainingModel)
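The following Python skeleton mirrors the control flow of Algorithm 1. The statistical helpers (performance evaluation, top/flop binarization, the sequential loser test, and the early-stopping check) are passed in as callables; they are placeholders for the tests described in this appendix, not the authors' implementation.

```python
import numpy as np

def fast_cv(data, max_steps, configurations, alpha, alpha_l, beta_l,
            early_stopping_window, calc_performance, best_performing,
            loser_configurations, similar_performance):
    """Structural sketch of Algorithm 1; all statistical helpers are injected."""
    n = len(data)
    delta = n // max_steps                   # subsample size increment
    model_size = delta
    n_conf = len(configurations)
    traces = np.zeros((n_conf, max_steps), dtype=int)
    performance = np.zeros((n_conf, max_steps))
    remaining = np.ones(n_conf, dtype=bool)

    for s in range(max_steps):
        # pointwise: (n_conf, n_test_points) array of per-point accuracies
        pointwise = calc_performance(data, model_size, remaining)
        performance[remaining, s] = pointwise[remaining].mean(axis=1)
        # mark the statistically top-performing configurations with a 1
        traces[best_performing(pointwise, remaining, alpha), s] = 1
        # drop configurations flagged as losers by the sequential test
        remaining[loser_configurations(traces[:, :s + 1], remaining, alpha_l, beta_l)] = False
        # early stopping: remaining configurations behaved similarly in the recent past
        lo = max(0, s - early_stopping_window)
        if similar_performance(traces[remaining, lo:s + 1], alpha):
            break
        model_size += delta

    winner = np.argmax(performance[:, s] * remaining)  # best remaining configuration
    return configurations[winner]
```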

[Figure 1 graphic: left panel plots Relative Speedup against Steps (one curve per experiment: easy, medium, hard); right panel plots the False Negative Rate against the Change Point, faceted by steps = 10 and steps = 20, with one curve per π_before ∈ {0.1, . . . , 0.5}.]
Figure 1: Left: Relative speed gain of fast CV compared to full CV. We assume that training time is cubic in the number of samples. Shown are simulated runtimes for 10-fold CV on different problem classes with different loser/winner ratios (easy: 3:1, medium: 1:1, hard: 1:3) over 100 resamples. Right: False negatives generated for non-stationary configurations, i.e., at the given change point the Bernoulli variable changes its parameter from the indicated value π_before to 1.0.

1.2 Determine the Number of Steps

In this section we consider the maxSteps parameter. In principle, a larger number of steps leads to more robust estimates, but also to an increase in computation time. We study the effect of different choices of this parameter in a simulation. For the sake of simplicity we assume that the binary top-or-flop scheme consists of independent Bernoulli variables with π_winner ∈ [0.9, 1.0] and π_loser ∈ [0.0, 0.1]. Figure 1 shows the resulting simulated runtimes for different settings. We see that the largest speed-up can be expected for 10 ≤ maxSteps ≤ 20. The speed gain rapidly decreases afterwards and becomes negligible between 40 steps for the hard setup and 100 steps for the easy setup. These simplified findings suggest that all following experiments should be carried out with either 10 or 20 steps.
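A much simplified version of such a runtime simulation is sketched below. To stay self-contained it replaces the sequential test by a plain "n_drop flops in a row" rule and ignores early stopping, so it only illustrates the setup and will not reproduce the exact curves of Figure 1; all names are ours.

```python
import numpy as np

def simulated_speedup(n, max_steps, n_winner, n_loser, rng,
                      pi_winner=0.95, pi_loser=0.05, n_drop=3, folds=10):
    """Crude runtime simulation assuming cubic training cost in the sample size.
    A configuration is dropped after n_drop consecutive 'flop' entries instead of
    running the actual Wald test, and early stopping is ignored."""
    delta = n // max_steps
    n_conf = n_winner + n_loser
    pis = np.array([pi_winner] * n_winner + [pi_loser] * n_loser)

    full_cost = folds * n_conf * (0.9 * n) ** 3        # classical 10-fold CV on all data
    fast_cost, flop_streak = 0.0, np.zeros(n_conf)
    active = np.ones(n_conf, dtype=bool)
    for s in range(1, max_steps + 1):
        fast_cost += active.sum() * (s * delta) ** 3   # one training per remaining configuration
        tops = rng.random(n_conf) < pis                # Bernoulli top-or-flop outcomes
        flop_streak = np.where(tops, 0.0, flop_streak + 1)
        active &= flop_streak < n_drop                 # drop configurations that keep flopping
    return full_cost / fast_cost

rng = np.random.default_rng(0)
speedups = [simulated_speedup(1000, 10, 5, 15, rng) for _ in range(100)]  # 'easy' 3:1 ratio
print(np.mean(speedups))
```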

2 False Negative Rate

The types of errors we must be most concerned with in our procedure are false negatives: configurations which are eliminated although they are among the top configurations on the full sample. In the following we study the false negative rate: we prove a bound on the number of times a configuration can be a loser before it is eliminated, and we study the general effect in simulations. Assume that there exists a change point cp such that a winning configuration loses for the first cp iterations.

From the properties of our algorithm we can prove a security zone in which the fast cross-validation has a false negative rate (FNR) of zero (see the next section for details): as long as

\[
cp \;\le\; \frac{\log\frac{1-\alpha_l}{\beta_l}}{\log\frac{1-\pi_0}{1-\pi_1}},
\qquad \text{with } \pi_1 \text{ chosen according to Equation (2) and }
\text{maxSteps} \ge \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log 2},
\]

the probability of an FNR larger than zero is zero. For instance, for αl = 0.01 and βl = 0.1 we can start a fast cross-validation run with at least 7 steps (log(0.9/0.01)/log 2 ≈ 6.49), since there is no suitable test available for a smaller number of steps. For maxSteps = 10 steps, the security zone amounts to 0.27 · 10 = 2.7, meaning that if the change point for all switching configurations occurs at step one or two, the fast cross-validation procedure does not suffer from false negatives. Similarly, for maxSteps = 20 the security zone is 0.39 · 20 = 7.8. To illustrate the false negative rate further, we simulate such switching configurations by independent Bernoulli variables which change their parameter from a chosen π_before ∈ {0.1, 0.2, . . . , 0.5} to a constant 1.0 at a given change point. The false negative rates of these configurations for 10 and 20 steps are plotted in Figure 1, right panel, for different change points. As stated by our theoretical result above, the FNR is zero for sufficiently small change points. After that, there is an increasing probability that the configuration will be removed. Nevertheless, as our experiments show, we see consistently good performance of the fast cross-validation procedure, indicating that the change points are sufficiently small for real data sets.
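The following Monte Carlo sketch mimics this simulation: lower_line implements the Wald boundary L0 derived in the next section, and π1 is set via the closed form of Equation (2) under Wald's ASN approximation. All helper names are ours and the sketch is for illustration only.

```python
import math
import numpy as np

def lower_line(n, pi0, pi1, alpha_l, beta_l):
    """Wald's lower acceptance line L0(n) for H0 (derived in the proof section)."""
    slope_num = math.log((1 - pi0) / (1 - pi1))
    denom = math.log(pi1 / pi0) + slope_num
    return (math.log(beta_l / (1 - alpha_l)) + n * slope_num) / denom

def fnr_of_switching_config(change_point, max_steps, pi_before,
                            pi0=0.5, alpha_l=0.01, beta_l=0.1, reps=10_000, seed=0):
    """Fraction of simulated runs in which a configuration that switches from
    pi_before to 1.0 at the change point is dropped (i.e., a false negative)."""
    # closed form of Equation (2) under Wald's ASN approximation
    pi1 = pi0 * ((1 - beta_l) / alpha_l) ** (1.0 / max_steps)
    rng = np.random.default_rng(seed)
    dropped = 0
    for _ in range(reps):
        successes = 0
        for s in range(1, max_steps + 1):
            p = pi_before if s <= change_point else 1.0
            successes += rng.random() < p
            if successes <= lower_line(s, pi0, pi1, alpha_l, beta_l):
                dropped += 1
                break
    return dropped / reps

print(fnr_of_switching_config(change_point=2, max_steps=10, pi_before=0.1))  # ~0, inside the security zone
print(fnr_of_switching_config(change_point=5, max_steps=10, pi_before=0.1))  # clearly above zero
```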

3 Proof of Security Zone Bound

In this section we prove the security zone bound of the previous section. We follow the notation and treatment of sequential analysis as found in the original publication of Wald [1], Sections 5.3 to 5.5. First of all, Wald proves in Equation 5:27 that the following approximation holds:

\[
\mathrm{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) \;=\; \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log\frac{\pi_1}{\pi_0}}.
\]

The minimal ASN(π0, π1 | π = 1.0) is therefore attained if log(π1/π0) is maximal, which is clearly the case for π1 = 1.0 and π0 = 0.5, the latter holding by construction. So we get the lower bound on maxSteps for a given significance level αl, βl:

\[
\text{maxSteps} \;\ge\; \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log 2}.
\]

The lower line L0 of the graphical sequential analysis test, as exemplified in the overview figure of the paper, is defined as follows (see Equations 5:13–5:15):

\[
L_0(n) \;=\; \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}}
\;+\; n\,\frac{\log\frac{1-\pi_0}{1-\pi_1}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}}.
\]

Setting L0 = 0, we can get the intersection of the lower test line with the x-axis and therefore the earliest step ndrop at which the procedure will drop a constant loser configuration. This yields

\[
n_{\mathrm{drop}} \;=\; -\,\frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{1-\pi_0}{1-\pi_1}}
\;=\; \frac{\log\frac{1-\alpha_l}{\beta_l}}{\log\frac{1-\pi_0}{1-\pi_1}}.
\]
Setting ndrop in relation to ASN(π0, π1 | π = 1.0) yields the security zone bound of the previous section.
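To make the bound concrete, the following small sketch (helper names are ours) evaluates ndrop/maxSteps for the suggested significance levels; it reproduces the security zone fractions 0.27 and 0.39 quoted in the previous section.

```python
import math

def security_zone(max_steps, pi0=0.5, alpha_l=0.01, beta_l=0.1):
    """Security zone as a fraction of maxSteps: n_drop / maxSteps, where pi1 is the
    closed-form solution of ASN(pi0, pi1 | pi = 1.0) = maxSteps."""
    pi1 = pi0 * ((1 - beta_l) / alpha_l) ** (1.0 / max_steps)
    n_drop = math.log((1 - alpha_l) / beta_l) / math.log((1 - pi0) / (1 - pi1))
    return n_drop / max_steps

print(round(security_zone(10), 2))  # about 0.27
print(round(security_zone(20), 2))  # about 0.39
```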

4 Error Rates on Benchmark Data

The following table shows the mean absolute difference of test error (fast versus full cross-validation) in percentage points and 95% confidence intervals (standard error, 100 repetitions) for various setups. The fast setup runs with maxSteps = 10 steps while the slow setup is executed with 20 steps. Each setup is employed once with and once without the early stopping rule.

Dataset         fast/early        fast              slow/early        slow
banana           0.20% ± 0.18      0.11% ± 0.15      0.32% ± 0.22      0.07% ± 0.10
breastCancer     2.00% ± 1.85      2.09% ± 1.64     -0.38% ± 2.91      1.46% ± 1.95
diabetis         0.56% ± 0.88      0.80% ± 0.82      0.68% ± 0.81     -0.00% ± 0.71
flareSolar       1.44% ± 2.95      2.53% ± 3.31      1.39% ± 1.77     -0.11% ± 1.86
german           0.45% ± 0.70      0.92% ± 0.58      1.14% ± 0.53      0.86% ± 0.62
image            0.19% ± 0.19      0.22% ± 0.20      0.46% ± 0.26      0.41% ± 0.24
ringnorm         0.03% ± 0.03      0.00% ± 0.04      0.05% ± 0.04      0.03% ± 0.04
splice           0.25% ± 0.19      0.32% ± 0.18      0.15% ± 0.19      0.14% ± 0.15
thyroid          0.39% ± 0.53     -0.13% ± 0.47     -0.06% ± 0.56     -0.38% ± 0.44
twonorm         -0.02% ± 0.03     -0.03% ± 0.04      0.00% ± 0.05      0.00% ± 0.03
waveform         0.27% ± 0.12      0.21% ± 0.17      0.33% ± 0.15      0.21% ± 0.15
covertype        0.78% ± 0.21      0.89% ± 0.19      0.65% ± 0.19      0.88% ± 0.20

5 Example Run of Fast Cross-Validation

In this section we give an example of the whole fast cross-validation procedure on a toy data set of n = 1,000 data points, which is based on a sine wave y = sin(x) + ε, x ∈ [0, 2πd], with ε being Gaussian noise (μ = 0, σ = 0.25). The parameter d = 50 controls the inherent complexity of the data, and the sign of y is taken as the class membership. The fast cross-validation is executed with maxSteps = 10 and earlyStoppingWindow = 3. We use a ν-SVM [2] and test a parameter grid of log10(σ) ∈ {−1, −0.5, 0, 0.5, 1} for the kernel width and ν ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. The procedure runs for 4 steps, after which the early stopping rule takes effect. This yields the following traces matrix (only remaining configurations are shown):

Configuration                  modelSize=100   modelSize=200   modelSize=300   modelSize=400
log10(σ) = 0,   ν = 0.1              1               1               0               0
log10(σ) = 0,   ν = 0.2              1               1               0               0
log10(σ) = 0,   ν = 0.3              1               1               0               0
log10(σ) = 0,   ν = 0.4              1               1               0               0
log10(σ) = 0,   ν = 0.5              1               1               0               0
log10(σ) = 0.5, ν = 0.1              1               1               1               1
log10(σ) = 0.5, ν = 0.2              1               1               1               1
log10(σ) = 0.5, ν = 0.3              1               1               1               0
log10(σ) = 0.5, ν = 0.4              1               1               1               0
log10(σ) = 0.5, ν = 0.5              1               1               0               0
log10(σ) = 1,   ν = 0.1              1               1               1               1
log10(σ) = 1,   ν = 0.2              1               1               1               1
log10(σ) = 1,   ν = 0.3              1               1               1               1
log10(σ) = 1,   ν = 0.4              0               1               1               0

The corresponding performances (prediction accuracy) are as follows, from which the procedure chooses log10(σ) = 1, ν = 0.2 as the final winning configuration:

Configuration                  modelSize=100   modelSize=200   modelSize=300   modelSize=400
log10(σ) = 0,   ν = 0.1            0.659           0.760           0.824           0.858
log10(σ) = 0,   ν = 0.2            0.659           0.759           0.826           0.855
log10(σ) = 0,   ν = 0.3            0.659           0.759           0.824           0.857
log10(σ) = 0,   ν = 0.4            0.659           0.759           0.827           0.857
log10(σ) = 0,   ν = 0.5            0.659           0.760           0.824           0.853
log10(σ) = 0.5, ν = 0.1            0.657           0.757           0.841           0.873
log10(σ) = 0.5, ν = 0.2            0.657           0.759           0.853           0.872
log10(σ) = 0.5, ν = 0.3            0.657           0.762           0.851           0.867
log10(σ) = 0.5, ν = 0.4            0.658           0.762           0.850           0.865
log10(σ) = 0.5, ν = 0.5            0.658           0.756           0.837           0.857
log10(σ) = 1,   ν = 0.1            0.652           0.743           0.847           0.878
log10(σ) = 1,   ν = 0.2            0.648           0.746           0.866           0.895
log10(σ) = 1,   ν = 0.3            0.646           0.766           0.861           0.883
log10(σ) = 1,   ν = 0.4            0.624           0.745           0.861           0.860
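The toy data set and a single grid point can be reproduced with a few lines. Note that scikit-learn's NuSVC and the translation of the kernel width into gamma = 1/(2σ²) are our assumptions for illustration, not the setup used by the authors.

```python
import numpy as np
from sklearn.svm import NuSVC

# Toy data as described above: noisy sine wave, class label = sign of y.
rng = np.random.default_rng(0)
d, n, noise = 50, 1000, 0.25
x = rng.uniform(0.0, 2 * np.pi * d, size=n)
y = np.sin(x) + rng.normal(0.0, noise, size=n)
labels = np.sign(y).astype(int)

# One grid point of the nu-SVM configuration grid; gamma = 1 / (2 * sigma^2) is our
# translation of the kernel-width parameter and only an assumption.
sigma, nu = 10.0, 0.2
clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0 / (2 * sigma ** 2))
clf.fit(x[:400].reshape(-1, 1), labels[:400])           # "modelSize = 400" subsample
print(clf.score(x[400:].reshape(-1, 1), labels[400:]))  # held-out accuracy
```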

6 Non-Parametric Tests

The tests used in the fast cross-validation procedure are common tools in the field of statistical data analysis. Here we give a short summary based on the Dataplot Manual [3]. Both methods deal with a data matrix of c experimental treatments with observations arranged in r blocks:

Block    Treatment 1   Treatment 2   ...   Treatment c
1        x11           x12           ...   x1c
2        x21           x22           ...   x2c
3        x31           x32           ...   x3c
...      ...           ...           ...   ...
r        xr1           xr2           ...   xrc

Both tests treat similar questions (do the c treatments have identical effects?) but are designed for different kinds of data: the Cochran Q test is tuned for binary xij while the Friedman test acts on continuous values. In the context of the fast cross-validation procedure the tests are used for two different tasks:

1. Determine whether a set of configurations are the top performing ones (the corresponding step in the overview figure and the function bestPerformingConfigurations in Algorithm 1).

2. Check whether the remaining configurations behaved similarly in the past (the corresponding step in the overview figure and the function similarPerformance in Algorithm 1).

In both cases, the configurations act as treatments on either the samples (Point 1 above) or on the last earlyStoppingWindow traces (Point 2 above) of the remaining configurations. Depending on the learning problem, either the Friedman test for regression tasks or the Cochran Q test for classification tasks is used in Point 1. In both cases the hypotheses for the tests are as follows:

H0: All treatments are equally effective (no effect).
H1: There is a difference in the effectiveness among the treatments, i.e., there is at least one treatment showing a significant effect.

6.1 Cochran Q Test

The test statistic is calculated as follows:

\[
T \;=\; c\,(c-1)\,\frac{\sum_{i=1}^{c}\bigl(C_i - \tfrac{N}{c}\bigr)^{2}}{\sum_{i=1}^{r} R_i\,(c - R_i)}
\]

with Ci denoting the column total for the i-th treatment, Ri the row total for the i-th block, and N the total number of values. We reject H0 if T > χ²(1 − α, c − 1), with χ²(1 − α, c − 1) denoting the (1 − α)-quantile of the χ² distribution with c − 1 degrees of freedom and α the significance level.
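For illustration, a direct implementation of this statistic (our own helper, not part of the paper's implementation); scipy.stats.chi2 provides the critical value.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(x, alpha=0.05):
    """Cochran Q test on a binary (r blocks x c treatments) matrix,
    implemented directly from the formula above."""
    x = np.asarray(x, dtype=float)
    r, c = x.shape
    col_totals = x.sum(axis=0)           # C_i
    row_totals = x.sum(axis=1)           # R_i
    n_total = x.sum()                    # N
    t = c * (c - 1) * np.sum((col_totals - n_total / c) ** 2) \
        / np.sum(row_totals * (c - row_totals))
    critical = chi2.ppf(1 - alpha, c - 1)
    return t, t > critical               # statistic and whether H0 is rejected

# Example: 6 blocks, 3 treatments; the third treatment fails far more often.
traces = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [1, 1, 0],
                   [1, 1, 1],
                   [1, 1, 0]])
print(cochran_q(traces))  # T = 8.4 > 5.99, so H0 is rejected
```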

6.2 Friedman Test

Let R(xij) be the rank assigned to xij within block i (i.e., ranks within a given row). Average ranks are used in the case of ties. The ranks are summed to obtain

\[
R_j \;=\; \sum_{i=1}^{r} R(x_{ij}).
\]

The test statistic is then calculated as follows:

\[
T \;=\; \frac{12}{r\,c\,(c+1)} \sum_{j=1}^{c} \Bigl(R_j - \frac{r\,(c+1)}{2}\Bigr)^{2}.
\]

We reject H0 if T > χ²(1 − α, c − 1), with χ²(1 − α, c − 1) denoting the (1 − α)-quantile of the χ² distribution with c − 1 degrees of freedom and α the significance level.
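Analogously, a sketch of the Friedman statistic computed directly from the formula above; SciPy's friedmanchisquare is used only as a cross-check (the two should agree here since the example has no ties). The helper names are ours.

```python
import numpy as np
from scipy.stats import chi2, friedmanchisquare, rankdata

def friedman(x, alpha=0.05):
    """Friedman test on an (r blocks x c treatments) matrix of continuous values,
    implemented directly from the formula above (average ranks for ties)."""
    x = np.asarray(x, dtype=float)
    r, c = x.shape
    ranks = np.apply_along_axis(rankdata, 1, x)   # rank within each block (row)
    rj = ranks.sum(axis=0)                        # R_j
    t = 12.0 / (r * c * (c + 1)) * np.sum((rj - r * (c + 1) / 2.0) ** 2)
    return t, t > chi2.ppf(1 - alpha, c - 1)

rng = np.random.default_rng(0)
perf = rng.normal(size=(8, 4))
perf[:, 0] += 1.0                                 # one clearly better treatment
print(friedman(perf))
print(friedmanchisquare(*perf.T))                 # SciPy's implementation as a cross-check
```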

References
[1] Abraham Wald. Sequential Analysis. Wiley, 1947.
[2] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, May 2000.
[3] James J. Filliben and Alan Heckert. Dataplot Reference Manual, Volume 1: Commands. Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology.
