International Journal of Computer Information Systems,

Vol. 4, No. 2, 2012


Mining the Stock Indices, using SVM

András Zempléni
Dept of Prob. Theory and Statistics
Eötvös Loránd University
Budapest, Hungary
e-mail: zempleni@math.elte.hu
Csilla Hajas
Dept of Information Systems
Eötvös Loránd University
Budapest, Hungary
e-mail: sila@inf.elte.hu
Tibor Nikovits
Dept of Information Systems
Eötvös Loránd University
Budapest, Hungary
e-mail: nikovits@inf.elte.h


Abstract: In this paper a new application of the Support Vector
Machines (SVM) is given. We search for analogies between an
actually observed daily stock price series and similar sequences in
the past. The time series characteristics are taken into account, as
the analogies are sought by the GARCH coefficients for 1-year
moving windows for the daily log returns. The potential
randomness of the series is emphasized by using perturbed
versions as well. The classification is based on the SVM
methodology. The results are promising, as the continuation of
the past analogy has indeed some resemblance to the future of the
actually observed stock price series.
Keywords-GARCH models; forecasting; stock prices; SVM
I. INTRODUCTION
Our aim was to find analogies between recent time series of
the log-returns of daily closing prices of some stock indices of
the New York Stock Exchange and the past of the daily log
returns of some other stock. With such an investigation it might
be possible to establish estimates for the future behavior of the
given share, by considering the development of the analogous
stock. As the reference is a past series, its present is already
known.
There are different characteristics of interest to a broker,
including the short-term expectations, but the long-term
forecasts are no less important. The short-term fluctuations are
mostly determined by recent news and global trends. These are
taken into consideration by fundamental analysis.
Another approach is the so-called technical analysis,
where the recent behavior of the daily closing prices is taken
into consideration. For the long term, the analogies we propose
may also be considered as a real alternative.
There are other approaches in the literature. Some authors
use either the Fourier- or the wavelet-transforms for finding
analogies (see [4], for example). However, in our case the
series can be well modeled by the GARCH coefficients, which
gives a compact description and at the same time avoids some
problems, such as the number of used Fourier coefficients or
the choice of the kernel in case of the wavelets.
The paper is organized as follows: first we introduce the
methods needed for processing the data. Besides some basic
statistical procedures, these include the GARCH models
and our main tool in the investigation, the popular Support
Vector Machine methodology, which provides a framework for
detecting the analogies.
Section 3 deals with the data and describes the algorithm
used. It includes the simulation of perturbations of the past
observations (creating the training data set). The SVM
methodology is used to create a model, based on these
perturbed sequences. All the programming was carried out in
the free statistical software R, where both the GARCH
parameter estimation and the SVM methodology were readily
available (in the packages rugarch and e1071, respectively).
For details see its homepage (R, 2011 [5]).
Section 4 contains the results. We conclude with a
discussion.
II. DATA PROCESSING TOOLS
A. GARCH models
We fitted the GARCH (Generalized AutoRegressive
Conditional Heteroscedastic) (1,1) model to the daily log
returns of the stock indices X(t):
$$\Delta(t) = \log\bigl(X(t)/X(t-1)\bigr).$$
These processes were explicitly introduced for modeling
the development of asset prices in time (Bollerslev, [1]). This
became a popular approach, as this model accounts for the
prominent volatility changes and other stylized facts,
observable in the time series of asset daily log returns (see for
example the book of Francq and Zakoian, [3]). The actual
GARCH(1,1) model for the volatility development is the
following:
$$\sigma^2(t) = \omega + \alpha\,\varepsilon^2(t-1) + \beta\,\sigma^2(t-1).$$
Here
$$\varepsilon(t) = \sigma(t)\,e(t)$$
is the model for the innovations, where e(t) are independent
and identically distributed random variables with zero mean
and unit variance and the actual return series is
$$\Delta(t) = \mu + \varepsilon(t).$$
There are two popular choices for the distribution of e(t):
normal and Student (t) distributions. We have chosen the
Student one as in this case we had the flexibility of two extra
parameters: the skewness and the degree of freedom (called
shape in R). As our aim is to find analogies, we did not go into
the details of model identification and selection in this paper,
just used the GARCH(1,1) model and its coefficients for our
February Issue Page 1 of 62 ISSN 2229 5208
purposes. The estimation is done via the so-called quasi-
maximum likelihood method, which works quite reasonably in
most cases (Francq and Zakoian, [3]).
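The model above can be illustrated with a minimal Python sketch. It is not the R/rugarch code used in the paper: it only computes log returns and simulates the GARCH(1,1) recursion. It assumes the usual parameterization in which alpha multiplies the squared innovation and beta the previous variance, and it uses standard normal e(t) for simplicity rather than the skewed Student distribution chosen in the paper. The function names and parameter values are hypothetical.

```python
import math
import random


def log_returns(prices):
    """Daily log returns: Delta(t) = log(X(t) / X(t-1))."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def simulate_garch11(n, mu, omega, alpha, beta, seed=0):
    """Simulate n returns Delta(t) = mu + eps(t) with eps(t) = sigma(t) * e(t).

    Assumption: e(t) is standard normal (the paper uses a skewed Student
    distribution) and alpha + beta < 1, so a stationary variance exists.
    """
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the stationary variance
    returns, variances = [], []
    for _ in range(n):
        variances.append(sigma2)
        eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        returns.append(mu + eps)
        # GARCH(1,1) update for the next conditional variance
        sigma2 = omega + alpha * eps ** 2 + beta * sigma2
    return returns, variances


# Illustrative (made-up) parameter values, one trading year of 250 days
returns, variances = simulate_garch11(250, 0.0, 1e-5, 0.1, 0.85, seed=1)
print(len(returns), min(variances) > 0)
```

Fitting the coefficients to observed returns requires a numerical likelihood maximization, which is what the rugarch package provides in R.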
As the stability of the GARCH coefficients is a crucial
question, we used perturbations of the daily closing values.
Here we allowed for a possible deviation from the actual
observed daily closing value of the stock. The upper bound for
this deviation was the expression $(X_{\mathrm{max}} - X_{\mathrm{close}})/4$ and
analogously the lower bound was $-(X_{\mathrm{close}} - X_{\mathrm{min}})/4$ (see (4)),
ensuring that the perturbed values remain well in the range of
actually observed values on the given day. See Figure 1 for an
example.
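A minimal Python sketch of this perturbation is given here for illustration. The function names, the `fraction` argument and the sample prices are hypothetical; the paper's own implementation was written in R. Each closing value is replaced by a uniform draw from the interval bounded by a quarter of the distance to the daily minimum and maximum.

```python
import random


def perturb_close(close, low, high, rng, fraction=0.25):
    """Draw one perturbed closing value, uniform between the two bounds."""
    lower = close - fraction * (close - low)   # X_close - (X_close - X_min)/4
    upper = close + fraction * (high - close)  # X_close + (X_max - X_close)/4
    return rng.uniform(lower, upper)


def perturbed_series(closes, lows, highs, seed=0):
    """Perturb a whole series of daily closing values, day by day."""
    rng = random.Random(seed)
    return [perturb_close(c, l, h, rng) for c, l, h in zip(closes, lows, highs)]


# Made-up daily close / low / high values, for illustration only
closes = [25.1, 25.4, 24.9]
lows = [24.8, 25.0, 24.5]
highs = [25.6, 25.7, 25.3]
print(perturbed_series(closes, lows, highs, seed=3))
```

Calling `perturbed_series` with ten different seeds gives ten perturbed versions of the same stock, each staying inside the daily trading range.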
Figure 1. The observed series of the AT&T share price. Original
(black) and perturbed (yellow, broken) series, April 2003 - March 2005 (left)
and May, 2003 (right)
Figure 2. The estimated series of the GARCH coefficients
for the original (black) and perturbed (red, broken) series of the AT&T
daily log returns, April 2004 - March 2005

Figure 3. The estimated series of the skewness and shape for the
GARCH innovations for the original (black) and perturbed (red, broken)
series of the AT&T daily log returns, April 2004 - March 2005
Figures 1, 2 and 3 show that, while the data and thus
the log return series are quite close to their perturbed versions,
there are visible differences between the estimated parameters
of the GARCH model. This is useful in our case, as we get a
whole set of possible parameter series for a given stock. It
also means that the classification approach we propose
in this paper has its merits, as there is a substantial number of
vectors to work with (see Section 3).
B. Support Vector Machines
As the next step, the popular Support Vector Machine
(SVM) approach was used to classify the perturbed series. We
have used the so-called C-classification technique, based on a
set of training vectors for each reference stock. The essence is
that the training set of points is mapped to a - possibly high-
dimensional - feature space, where a separating hyperplane
between the classes is found. Mathematically, let a training set
of k observation-label pairs $(x_i, y_i)$, $i=1,\dots,k$, be given,
where $x_i$ is a vector and in the usual, dichotomous case
$y_i \in \{-1, 1\}$. To get
the SVMs (see for example Cortes and Vapnik, [2]) the
solution of the following optimization problem is needed:
$$\min_{w,\,b,\,\xi}\ \Bigl\{ \tfrac{1}{2}\, w^T w + C \sum_{i=1}^{k} \xi_i \Bigr\} \qquad (1)$$
under the condition
$$y_i\bigl(w^T \phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad (2)$$
where C is the so-called capacity constant or cost, acting as a
penalty for misclassification of the training data. The
transformation $\phi$ in (2) is based on a so-called kernel function
K:
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$
In the analysis we have used the usual radial kernel:
$$K(x_i, x_j) = \exp\{-\gamma\,\|x_i - x_j\|^2\}, \quad (\gamma > 0). \qquad (3)$$

Although the method was originally designed for
distinguishing two classes, variants are now available for
multiclass problems like ours as well. The method based on
logistic regression is able to quantify the membership in a
given class by assigning probabilities to each observation and
class (these are the so-called membership weights). We are
looking for strong membership weights, as these hopefully
indicate analogies between the pairs of stock indices.
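The radial kernel in (3) is simple to evaluate directly. The following Python sketch is an illustration only (the paper relies on the e1071 package in R); the function names and the sample vectors are hypothetical, and the default gamma = 1/4 is the value reported later in the paper.

```python
import math


def rbf_kernel(x, y, gamma=0.25):
    """Radial kernel K(x, y) = exp(-gamma * ||x - y||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(vectors, gamma=0.25):
    """Matrix of all pairwise kernel values, as used by an SVM solver."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]


# Three made-up two-dimensional feature vectors
vectors = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
for row in gram_matrix(vectors):
    print([round(value, 4) for value in row])
```

A larger gamma makes the kernel decay faster with distance, so each training vector influences a smaller neighbourhood; together with the cost C it is one of the two quantities tuned in the classification step.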
III. THE DATA AND THE ALGORITHM
The stock index data was downloaded from Yahoo (2011, [7]).
We used data from March 2003 to November 2011,
altogether 2200 observations for each series. We have analyzed
73 such series from the New York Stock Exchange, for which there
were no missing data in the time interval under investigation.
The data included the components of the Dow Jones Industrial
Average and quite a few other important stocks. The stability of
the GARCH coefficients is crucial in the success of the
classification, which was taken into account by a suitable
weighting.
The algorithm can be outlined as follows.
1. A window width was defined as n=250 observations,
corresponding approximately to one year. This amount of data
is the absolute minimum for GARCH parameter estimation. In
practice one may choose an even longer period for better
results. A new model was fitted to every 10th window of 250
subsequent observations. This step size was chosen for practical
reasons: the estimation of the GARCH parameters needs an
iterative procedure, which is rather quick, but in our case it has
to be repeated several thousand times, so our aim was a quicker
algorithm. It was observed that the replacement of only one
single observation did not cause a visible change in the
parameters in most of the cases, so we have not lost much
information by this procedure.
2. We have generated 10 perturbed series for each data set.
As we aimed at perturbed values which could have been a
closing value on a given day, they were chosen as uniformly
distributed over the interval
$$\bigl(X_{\mathrm{close}} - (X_{\mathrm{close}} - X_{\mathrm{min}})/4;\ X_{\mathrm{close}} + (X_{\mathrm{max}} - X_{\mathrm{close}})/4\bigr). \qquad (4)$$
This method may be refined by using some kind of
bootstrap, but the choice of the resampling methodology is
far from straightforward. The parameter 1/4 can also be changed,
but again this turned out to be a reasonable compromise: the
change in the parameter values is visible, but in most of the
cases the parameters of the perturbed sequence are classified to
the same class as those of the original observation.
3. The duration for the learning set was chosen as 3 years,
which resulted in l=74 runs from March 2004 (based on
observations from 2003) till March 2007. Summarizing these
runs, we got a parameter set $P_{ij}$ (a matrix with l rows and 6
columns) for stock i and perturbation number j, where
$i=1,\dots,73$ and $j=1,\dots,10$. The stability differed from
parameter to parameter (as measured by the correlation between the
estimates for the original and the perturbed series), see Table 1.
4. Having completed the calculations of the GARCH
coefficients, we turned our attention to the classification of the
results. We let the SVM run for each subcolumn $P_{ij}^{(k,q)}$
consisting of n=25 subsequent elements of column (parameter)
q of $P_{ij}$ from k to k+n-1 ($q=1,\dots,6$ and $k=1,\dots,l-n+1$). This
means that we considered the coordinates separately. First we
optimized the cost C and the parameter $\gamma$ in equations (1) and (3), as
proposed in Hsu et al. (2010) [4]. The parameters C=4 and $\gamma$=1/4
were suitable to get a 100% proper classification of the original
data for all coordinates. As a result we have got 6*50 models,
as l-n+1=50.
TABLE I. THE EMPIRICAL CORRELATION BETWEEN THE ESTIMATES FOR
THE ORIGINAL AND THE PERTURBED SERIES
parameter skewness shape
correlation (r) 0.983 0.611 0.801 0.631 0.899 0.832

5. Let us choose a later time period, for which the analogies
will be searched. In our case this estimation period started in
March, 2008 and contained as usual 25 estimators
(corresponding to a complete observation period of two years,
starting just after the last time point of the training set),
resulting in a 25 times 6 parameter matrix Q. In practical cases
one may just investigate one favorite stock, but here we
searched the possible analogies for all stocks in our database.
Having completed the tuning and SVM modeling of Step 4,
we searched for the stock i and run k for which the estimated
25-dimensional vector most resembled $Q^{(q)}$, for all GARCH
parameters $q=1,\dots,6$. As we have an automatic selection
procedure in mind, we weighted the coordinates according to
their squared correlation coefficients between pairs of
perturbations (see Table 1)
$$p^{(q)}_{i,k} = r_q^2 \,\Pr\{Q^{(q)} \text{ belongs to class } i \text{ in run } k\}$$
and chose the stock i and run k for which the sum of these weighted
probabilities of belonging to the class of this stock is maximal:
$$\arg\max_{i,k} \sum_{q=1}^{6} p^{(q)}_{i,k}.$$
Figures 4 and 5 show the best fitted pair over the 2 year period
(Dell Sept 2003- Aug 2005 vs. Boeing March 2007- March 2009):
first the series themselves, then the fitted parameters. It has to be
mentioned that both stock series were standardized to mean 0
and variance 1 in order to allow for the comparison.
Figure 4. The standardised time series (left panel) and log returns
(right panel) for the March 2007- March 2009 Boeing data (black) and the
best fitted data (Dell, Sept 2003- Aug 2005, yellow).
Figure 5. The fitted parameters (left panel) and skewness
(right panel) for the March 2007- March 2009 Boeing data (thick black) and
the same parameters for the perturbed versions of the best fitted data (Dell,
Sept 2003- Aug 2005, thin rainbow-coloured).
The left panel of Figure 4 shows a very good fit, and also
Figure 5 shows that there is indeed some similarity between the
two parameter sequences, especially the ranges are similar. It
should be noted that no scaling was applied in the SVM, as the
parameter vectors should be taken into consideration as they
are.
6. The next step was the forecast. Formally this means that
the vector of observations over the interval $[Y_t : Y_{t+24}]$ might be
estimated by $[X_{i,k} : X_{i,k+24}]$ $(k+24 < t)$, where
$[X_{i,k-25} : X_{i,k-1}]$ is the
best estimator of $[Y_{t-25} : Y_{t-1}]$ in the sense of Step 5. This method
ensures that we have a complete set of estimated values even
before the start of the time period in which we are interested.
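The bookkeeping in Steps 1 to 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R code: the function names and the dictionary layout for the class-membership probabilities are hypothetical, and the probabilities themselves would come from the fitted SVM models. The correlations in the usage example are those of Table 1; the probabilities are made up.

```python
def window_starts(n_obs, width=250, step=10):
    """Start indices of the estimation windows (every 10th window of 250 days)."""
    return list(range(0, n_obs - width + 1, step))


def subcolumns(column, length=25):
    """All blocks of `length` consecutive estimates of one parameter."""
    return [column[k:k + length] for k in range(len(column) - length + 1)]


def best_analogy(probabilities, correlations):
    """Pick the (stock, run) key with the largest weighted probability sum.

    probabilities: dict mapping (stock, run) to six class-membership
        probabilities, one per GARCH parameter (hypothetical layout).
    correlations: the six correlation coefficients r_q; the weights are r_q^2.
    """
    def score(key):
        return sum(r * r * p for r, p in zip(correlations, probabilities[key]))
    return max(probabilities, key=score)


def continuation(reference, k, horizon=25):
    """Values that followed the matched block reference[k - horizon:k]."""
    return reference[k:k + horizon]


# 980 observations give 74 windows; 74 estimates give 50 blocks of length 25
print(len(window_starts(980)), len(subcolumns(list(range(74)))))

r = [0.983, 0.611, 0.801, 0.631, 0.899, 0.832]   # correlations from Table 1
probs = {                                        # made-up probabilities
    ("A", 3): [0.9, 0.2, 0.2, 0.2, 0.2, 0.2],
    ("B", 7): [0.2, 0.6, 0.2, 0.6, 0.2, 0.2],
}
print(best_analogy(probs, r))
```

With these made-up numbers the unweighted sum would favour the second candidate, while the squared-correlation weights favour the first, because its high probability sits on the most stable parameter.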
IV. RESULTS
Figure 6 gives the continuation of our example from Figure
4, showing the observed and the forecasted series. It shows a
somewhat looser relation compared to Figure 4, but still the
results show a satisfactory coincidence between the two series.
Of course the price series had to be scaled here as well, as the
units for the different assets are not related to each other. These
observations become especially clear when comparing the
results with other, randomly chosen sequences.
Of course, the forecast was not as good for all shares. The
average squared distance between the scaled baseline series and
the forecast for all 73 stocks is 1.4, which is 70% of the value 2
obtained in the independence case, the expected squared error for a
completely random estimator: if X and Y are independent, standardized
random variables, then $E(X-Y)^2 = 2$.
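The benchmark value 2 follows in one line, assuming only that X and Y are independent with mean 0 and variance 1:

$$E(X-Y)^2 = E X^2 - 2\,E(XY) + E Y^2 = 1 - 2\,E X \cdot E Y + 1 = 1 - 0 + 1 = 2.$$

The observed average of 1.4 therefore corresponds to a ratio of 1.4/2 = 0.7 relative to a forecast that carries no information about the baseline series.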
We have carried out a control study, where the forecasts
were based on the simple squared distances: those periods and
stocks were chosen, where this distance was the smallest for
the reference period. The forecast was then based on the next
period, analogously to our method. These results were
disappointing, however: the average squared distance was even
higher than 2.
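The distance used in this comparison can be sketched in Python as follows. The function names and the sample series are hypothetical, and the population variance (division by the series length) is an assumption; the paper does not state which variance estimator was used for the standardization.

```python
import math


def standardize(series):
    """Scale a series to mean 0 and (population) variance 1."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / sd for v in series]


def mean_squared_distance(a, b):
    """Average squared difference between two series of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def nearest_by_distance(target, candidates):
    """Name of the candidate closest to the target after standardization."""
    z = standardize(target)
    return min(
        candidates,
        key=lambda name: mean_squared_distance(z, standardize(candidates[name])),
    )


# Made-up series: "up" has the same shape as the target, "down" is reversed
target = [1.0, 2.0, 3.0, 4.0]
candidates = {"up": [10.0, 20.0, 30.0, 40.0], "down": [4.0, 3.0, 2.0, 1.0]}
print(nearest_by_distance(target, candidates))
```

After standardization the distance is 0 for series of identical shape and 4 for a series and its mirror image, so the value 2 of the independence case lies halfway between the two extremes.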
Of course, this is just a preliminary analysis, where a
limited number of reference vectors were used. With more such
data, the quality of the forecast should be improved as well.
Figure 6. The standardised time series (left panel) and log returns
(right panel) for the March 2009- March 2010 Boeing data (thick, black) and
the continuation of the data shown in Figure 4 (Dell, Sept 2005- Aug 2006,
thin, red).
V. DISCUSSION
There is a vast amount of research focusing on forecasting
stock indices. Most of these works claim that their method ensures
considerable gains for the user. However, the mathematicians
did not become richer based on these results, so some skepticism
is advised when applying them. Among these works there are
some data mining applications, too. However, we are not aware
of any publications focusing on SVM with the aim of finding
analogies between different stocks. We have shown that this
plan is feasible, and it provided some evidence for its usefulness
even based on our limited database. Considered as an
additional tool for economists who have professional
knowledge about the backgrounds of the companies, it may
become even more powerful. There are papers where the
amount of processed data made it necessary to use R* trees
for finding the best fit to a given series (see Rafiei and
Mendelzon, [6] as an example). In our case this was not
needed, as the whole optimization needed just a few hours of
computing time.
There is, however, a need for further work, where the effect
of different parameter values (window size, deviation of the
perturbations etc.) should be taken into account. An economic
investigation of the reasons behind the results found would be
very interesting. Economists may also help by grouping
the stocks with similar properties, which would also help to
reduce the computing time needed, especially for larger scale
investigations.
ACKNOWLEDGMENT
The European Union and the European Social Fund have
provided financial support to the project under the grant
agreement no. TÁMOP 4.2.1/B-09/KMR-2010-0003.
REFERENCES
[1] Bollerslev, T.: Generalized autoregressive conditional heteroscedasticity.
Journal of Econometrics 31, 307-327 (1986)
[2] Cortes, C. and Vapnik, V.: Support-vector networks. Machine Learning
20, 273-297 (1995)
[3] Francq, C. and Zakoian, J-M.: GARCH models: structure, statistical
inference, and financial applications. Wiley, New York (2010)
[4] Hsu, C-W., Chang, C-C. and Lin, C-J.: A Practical Guide to Support
Vector Classification. http://www.csie.ntu.edu.tw/~cjlin (as on
01.11.2011)
[5] R, http://www.cran.r-project.org
[6] Rafiei, D. and Mendelzon, A.: Similarity-based queries for time series
data. Proceedings of the 1997 ACM SIGMOD International Conference
on Management of Data (1997)
[7] Yahoo, http://finance.yahoo.com/q/hp?s=UCG.MI+Historical+Prices (as
on 01.11.2011)

AUTHORS PROFILE

Dr. András Zempléni received his Ph.D degree from
Eötvös Loránd University Budapest, Hungary, in 1989.
His main field of interest is multivariate extreme value
modeling. He is also interested in analysis of financial
time series. He is associate professor at the Department
of Probability Theory and Statistics of the Eötvös
Loránd University.

Dr. Csilla Hajas received her Ph.D degree from
Kossuth Lajos University Debrecen, Hungary, in 1997.
Her main fields of interest are database theory, data
mining, statistical process and industrial statistics. She
is assistant professor at the Department of Information
Systems of Eötvös Loránd University Budapest.
Dr. Tibor Nikovits received an MSc degree in
mathematics from Eötvös Loránd University Budapest
in 1988 and an MSc degree in economics from
Budapest Economical University in 2003. His main
fields of interest are database theory, data mining,
financial and business information systems. He is a
senior lecturer at the Department of Information
Systems of Eötvös Loránd University Budapest.