(1)   \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i} \xi_i

under the condition

(2)   y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0,
where C is the so-called capacity constant or cost, acting as a
penalty for misclassification of the training data. The
transformation in (2) is based on a so-called kernel function
K:
K(x_i, x_j) = \phi(x_i)^T \phi(x_j).
In the analysis we have used the usual radial kernel:
(3)   K(x_i, x_j) = \exp\{ -\gamma \| x_i - x_j \|^2 \}, \quad \gamma > 0.
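As a minimal illustration (ours, not from the paper), the radial kernel in (3) can be computed as follows; the default `gamma` of 1/4 matches the value tuned later in Step 4:

```python
import math

def rbf_kernel(x, y, gamma=0.25):
    """Radial (RBF) kernel of (3): K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Note that K(x, x) = 1 for any x, and the kernel decays with the squared distance between the two vectors.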
Although the method was originally designed for
distinguishing two classes, variants for multiclass problems
like ours are now available as well. The method based on
logistic regression is able to quantify the membership in a
given class by assigning probabilities to each observation and
class (these are the so-called membership weights). We are
looking for strong membership weights, as these hopefully
reveal analogies between the pairs of stock indices.
III. THE DATA AND THE ALGORITHM
The stock index data was downloaded from Yahoo! Finance [7].
We used data from March 2003 to November 2011,
altogether 2200 observations for each series. We analyzed
73 such series from the New York Stock Exchange for which
there were no missing data in the time interval under investigation.
The data included the components of the Dow Jones industrial
average and quite a few other important stocks. The stability of
the GARCH coefficients is crucial to the success of the
classification, which was taken into account by a suitable
weighting.
International Journal of Computer Information Systems, Vol. 4, No. 2, 2012 (February Issue, ISSN 2229 5208)
The algorithm can be outlined as follows.
1. A window width was defined as n=250 observations,
corresponding approximately to one year. This amount of data
is the absolute minimum for GARCH parameter estimation; in
practice one may choose an even longer period for better
results. A new model was fitted to every 10th window of 250
subsequent observations. This step size was chosen for
practical reasons: the estimation of the GARCH parameters
needs an iterative procedure, which is quick in itself, but in
our case it has to be repeated several thousand times, so we
aimed at a faster algorithm. We observed that replacing a
single observation did not cause a visible change in the
parameters in most cases, so not much information was lost
by this procedure.
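The window bookkeeping of Step 1 can be sketched as follows (an illustration of ours, not the authors' R code); the GARCH fit itself is abstracted into a callable `fit`:

```python
def window_starts(n_obs, width=250, step=10):
    """Start indices of the fitted windows: every 10th window of
    `width` subsequent observations (0-based indexing)."""
    return list(range(0, n_obs - width + 1, step))

def rolling_fit(series, fit, width=250, step=10):
    """Apply the estimator `fit` (here a stand-in for the iterative
    GARCH maximum-likelihood procedure) to each selected window."""
    return [fit(series[s:s + width]) for s in window_starts(len(series), width, step)]
```

For the 2200 observations per series this yields 196 fitted models per (perturbed) series instead of the 1951 a step size of 1 would require, which is what makes the computation tractable.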
2. We have generated 10 perturbed series for each data set.
As we aimed at perturbed values which could have been the
closing value on a given day, they were chosen uniformly
distributed over the interval

(4)   \left( X_{close} - (X_{close} - X_{min})/4 \; ; \; X_{close} + (X_{max} - X_{close})/4 \right).
This method might be refined by using some kind of
bootstrap, but the choice of the resampling methodology is
far from straightforward. The parameter 1/4 could also be
changed, but it turned out to be a reasonable compromise: the
change in the parameter values is visible, but in most cases
the parameters of the perturbed sequence are classified into
the same class as those of the original observation.
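A sketch of the perturbation step, assuming (as the text suggests) that X_min and X_max are the day's low and high; the helper names are ours:

```python
import random

def perturb_close(x_close, x_min, x_max, rng):
    """One perturbed closing value, uniform on the interval (4):
    (X_close - (X_close - X_min)/4 ; X_close + (X_max - X_close)/4)."""
    lo = x_close - (x_close - x_min) / 4.0
    hi = x_close + (x_max - x_close) / 4.0
    return rng.uniform(lo, hi)

def perturbed_series(closes, mins, maxs, n_series=10, seed=0):
    """The 10 perturbed copies of one closing-price series (Step 2)."""
    rng = random.Random(seed)
    return [[perturb_close(c, lo, hi, rng) for c, lo, hi in zip(closes, mins, maxs)]
            for _ in range(n_series)]
```

Each perturbed value stays within a quarter of the distance from the close to the day's extremes, so it remains a plausible closing price.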
3. The duration of the learning set was chosen as 3 years,
which resulted in l=74 runs from March 2004 (based on
observations from 2003) till March 2007. Summarizing these
runs, we got a parameter set P_{ij} (a matrix with l rows and
6 columns) for stock i and perturbation number j, where
i = 1, ..., 73 and j = 1, ..., 10. The stability of the parameters
varied (as measured by the correlation between the estimators
for the original and the perturbed series), see Table I.
4. Having completed the calculation of the GARCH
coefficients, we turned our attention to the classification of
the results. We let the SVM run for each subcolumn
P_{ij}(k,q), consisting of the n=25 subsequent elements of
column (parameter) q of P_{ij} from k to k+n-1 (q = 1, ..., 6
and k = 1, ..., l-n+1). This means that we considered the
coordinates separately. First we optimized the cost and kernel
parameters in equations (1) and (3), as proposed by Hsu et
al. [4]. The parameters C=4 and γ=1/4 were suitable to get
a 100% correct classification of the original data for all
coordinates. As a result we got 6*50 models, as l-n+1 = 50.
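The subcolumn bookkeeping of Step 4 can be sketched as follows (our illustration; the SVM training over each subcolumn is omitted). Indices are 0-based here:

```python
def subcolumns(P, n=25, n_params=6):
    """All length-n runs P_ij(k, q) of each parameter column q of the
    l-by-6 parameter matrix P, for k = 0, ..., l-n."""
    l = len(P)
    return {(k, q): [P[k + t][q] for t in range(n)]
            for q in range(n_params) for k in range(l - n + 1)}
```

With l=74 and n=25 this gives the 6*50 = 300 windows on which the 6*50 models are trained.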
TABLE I. THE EMPIRICAL CORRELATION BETWEEN THE ESTIMATES FOR
THE ORIGINAL AND THE PERTURBED SERIES

parameter       | μ     | ω     | α     | β     | skewness | shape
correlation (r) | 0.983 | 0.611 | 0.801 | 0.631 | 0.899    | 0.832
5. Let us choose a later time period, for which the analogies
will be searched. In our case this estimation period started in
March, 2008 and contained as usual 25 estimators
(corresponding to a complete observation period of two years,
starting just after the last time point of the training set),
resulting in a 25 times 6 parameter matrix Q. In practical cases
one may just investigate one favorite stock, but here we
searched the possible analogies for all stocks in our database.
Having completed the tuning and SVM modeling of Step 4,
we searched for the stock i and run k for which the estimated
25-dimensional vector most resembled Q^{(q)}, for all GARCH
parameters q = 1, ..., 6. As we had an automatic selection
procedure in mind, we weighted the coordinates by the squared
correlation coefficients r_q^2 between the pairs of
perturbations (see Table I):

p_{i,k}^{(q)} = r_q^2 \, \Pr\{ Q^{(q)} \text{ belongs to the class of stock } i \text{ in run } k \},
and chose the stock i and run k for which the sum of the
probabilities of belonging to the class of this stock is
maximal:

\arg\max_{i,k} \sum_{q=1}^{6} p_{i,k}^{(q)}.
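The selection rule can be sketched as follows, assuming the per-parameter membership probabilities have already been produced by the SVM models (the data layout here is our own convention):

```python
def best_analogy(prob, r_squared):
    """Pick the (stock i, run k) maximising sum_q r_q^2 * p_{i,k}^{(q)}.
    prob[q] maps (i, k) -> Pr{Q^(q) belongs to the class of stock i in run k};
    r_squared[q] is the squared correlation weight from Table I."""
    candidates = prob[0].keys()
    def score(ik):
        # Weighted sum of membership probabilities over the 6 parameters.
        return sum(r2 * p[ik] for r2, p in zip(r_squared, prob))
    return max(candidates, key=score)
```

The squared-correlation weights downweight the less stable parameters, so a spurious match on a noisy coordinate cannot dominate the choice.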
In Figures 4 and 5 we show first the fitted pairs, then the
same diagrams for the 2-year period (Dell, Sept 2003 - Aug 2005
vs. Boeing, March 2007 - March 2009). It has to be mentioned
that both stock series were standardized to mean 0 and
variance 1 in order to allow the comparison.
Figure 4. The standardised time series (left panel) and log returns
(right panel) for the March 2007- March 2009 Boeing data (black) and the
best fitted data (Dell, Sept 2003- Aug 2005, yellow).
Figure 5. The fitted parameters (left panel) and skewness
(right panel) for the March 2007 - March 2009 Boeing data (thick black) and
the same parameters for the perturbed versions of the best fitted data (Dell,
Sept 2003 - Aug 2005, thin rainbow-coloured).
The left panel of Figure 4 shows a very good fit, and also
Figure 5 shows that there is indeed some similarity between the
two parameter sequences, especially the ranges are similar. It
should be noted that no scaling was applied in the SVM, as the
parameter vectors should be taken into consideration as they
are.
6. The next step was the forecast. Formally this means
that the vector of observations over the interval [Y_t : Y_{t+24}]
might be estimated by [X_{i,k} : X_{i,k+24}] (k+24 < t), where
[X_{i,k-25} : X_{i,k-1}] is the best estimator of [Y_{t-25} : Y_{t-1}]
in the sense of Step 5. This method ensures that we have a
complete set of estimated values even before the start of the
time period in which we are interested.
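Once the best match (i, k) has been found, the forecast of Step 6 is just the continuation of the matched stock's window; a one-line sketch, where `X[i]` denotes stock i's sequence of values:

```python
def analogy_forecast(X, i, k, horizon=25):
    """Forecast [Y_t : Y_{t+horizon-1}] by [X_{i,k} : X_{i,k+horizon-1}],
    the continuation of the best-matching window [X_{i,k-horizon} : X_{i,k-1}]."""
    return X[i][k:k + horizon]
```

Because k + horizon - 1 < t, all the values used for the forecast are available before the forecast period begins.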
IV. RESULTS
Figure 6 gives the continuation of our example from Figure
4, showing the observed and the forecasted series. It shows a
somewhat looser relation compared to Figure 4, but still the
results show a satisfactory coincidence between the two series.
Of course the price series had to be scaled here as well, as the
units for the different assets are not related to each other. These
observations become especially clear when comparing the
results with other, randomly chosen sequences.
Of course, the forecast was not equally good for all shares.
The average squared distance between the scaled baseline
series and the forecast over all 73 stocks is 1.4, which is 70%
of the value 2 expected under independence, the expected
squared error of a completely random estimator: if X and Y
are independent, standardized random variables, then
E(X-Y)^2 = 2.
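The benchmark value 2 is immediate: for independent, standardized X and Y (mean 0, variance 1),

```latex
E(X-Y)^2 = \operatorname{Var}(X-Y) + \big( E(X-Y) \big)^2
         = \operatorname{Var}(X) + \operatorname{Var}(Y) + 0
         = 1 + 1 = 2.
```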
We have carried out a control study, where the forecasts
were based on the simple squared distances: those periods and
stocks were chosen, where this distance was the smallest for
the reference period. The forecast was then based on the next
period, analogously to our method. These results were
disappointing, however: the average squared distance was even
higher than 2.
Of course, this is just a preliminary analysis, where a
limited number of reference vectors were used. With more such
data, the quality of the forecast should be improved as well.
Figure 6. The standardised time series (left panel) and log returns
(right panel) for the March 2009- March 2010 Boeing data (thick, black) and
the continuation of the data shown in Figure 4 (Dell, Sept 2005- Aug 2006,
thin, red).
V. DISCUSSION
There is a vast amount of research focusing on forecasting
stock indices. Most of it claims that the proposed method
ensures considerable gains for the user. However,
mathematicians have not become richer from these results, so
some skepticism is advised when applying them. Among these
works there are some data mining applications, too. However,
we are not aware of any publications focusing on SVM with
the aim of finding analogies between different stocks. We have
shown that this plan is feasible, and even our limited database
provided some evidence of its usefulness. Considered as an
additional tool for economists with professional knowledge of
the companies' backgrounds, it may become even more
powerful. There are papers where the amount of processed
data made it necessary to use R*-trees for finding the best fit
to a given series (see Rafiei and Mendelzon [6] as an
example); in our case this was not needed, as the whole
optimization took just a few hours of computing time.
There is, however, a need for further work, in which the
effect of different parameter values (window size, deviation of
the perturbations, etc.) should be investigated. An economic
investigation of the reasons behind the results found would be
very interesting. Economists may also help by grouping stocks
with similar properties, which would also help reduce the
computing time needed, especially for larger-scale
investigations.
ACKNOWLEDGMENT
The European Union and the European Social Fund have
provided financial support to the project under the grant
agreement no. TÁMOP 4.2.1/B-09/KMR-2010-0003.
REFERENCES
[1] Bollerslev, T.: Generalized autoregressive conditional heteroscedasticity.
Journal of Econometrics 31, 307-327 (1986)
[2] Cortes, C. and Vapnik, V.: Support-vector networks. Machine Learning
20, 273-297 (1995)
[3] Francq, C. and Zakoian, J-M.: GARCH Models: Structure, Statistical
Inference, and Financial Applications. Wiley, New York (2010)
[4] Hsu, C-W., Chang, C-C. and Lin, C-J.: A Practical Guide to Support
Vector Classification. http://www.csie.ntu.edu.tw/~cjlin (as on
01.11.2011)
[5] R, http://www.cran.r-project.org
[6] Rafiei, D. and Mendelzon, A.: Similarity-based queries for time series
data. Proceedings of the 1997 ACM SIGMOD International Conference
on Management of Data (1997)
[7] Yahoo, http://finance.yahoo.com/q/hp?s=UCG.MI+Historical+Prices (as
on 01.11.2011)
AUTHORS' PROFILES
Dr. András Zempléni received his Ph.D. degree from
Eötvös Loránd University, Budapest, Hungary, in 1989.
His main field of interest is multivariate extreme value
modeling. He is also interested in the analysis of financial
time series. He is an associate professor at the Department
of Probability Theory and Statistics of Eötvös Loránd
University.
Dr. Csilla Hajas received her Ph.D. degree from
Kossuth Lajos University, Debrecen, Hungary, in 1997.
Her main fields of interest are database theory, data
mining, statistical process control and industrial statistics.
She is an assistant professor at the Department of
Information Systems of Eötvös Loránd University, Budapest.
Dr. Tibor Nikovits received an MSc degree in
mathematics from Eötvös Loránd University, Budapest,
in 1988 and an MSc degree in economics from
Budapest Economical University in 2003. His main
fields of interest are database theory, data mining, and
financial and business information systems. He is a
senior lecturer at the Department of Information
Systems of Eötvös Loránd University, Budapest.