
Mining the Stock Indices, using SVM

András Zempléni
Dept. of Probability Theory and Statistics
Eötvös Loránd University
Budapest, Hungary
e-mail: zempleni@math.elte.hu

Csilla Hajas
Dept. of Information Systems
Eötvös Loránd University
Budapest, Hungary
e-mail: sila@inf.elte.hu

Tibor Nikovits
Dept. of Information Systems
Eötvös Loránd University
Budapest, Hungary
e-mail: nikovits@inf.elte.hu

Abstract: In this paper a new application of Support Vector Machines (SVM) is presented. We search for analogies between a currently observed daily stock price series and similar sequences in the past. The time series characteristics are taken into account: the analogies are sought via the GARCH coefficients estimated over 1-year moving windows of the daily log returns. The potential randomness of the series is accounted for by using perturbed versions of the series as well. The classification is based on the SVM methodology. The results are promising, as the continuation of the past analogue indeed bears some resemblance to the future of the currently observed stock price series.

Keywords: GARCH models; forecasting; stock prices; SVM

I. INTRODUCTION

Our aim was to find analogies between recent time series of the log returns of daily closing prices of some stock indices of the New York Stock Exchange and the past daily log returns of some other stock. With such an investigation it might be possible to establish estimates for the future behavior of the given share by considering the development of the analogous stock. As the reference is a past series, its "present" is already known.

There are different characteristics of interest for a broker, including the short-term expectations, but the long-term forecasts are no less important. The short-term fluctuations are mostly determined by recent news and global trends; these are taken into consideration by fundamental analysis. Another approach is the so-called technical investigation, where the recent behavior of the daily closing prices is taken into consideration. For the long term, the analogies we propose may also be considered a real alternative.

There are other approaches in the literature. Some authors use either the Fourier or the wavelet transform for finding analogies (see [4], for example). In our case, however, the series can be well modeled by the GARCH coefficients, which gives a compact description and at the same time avoids some problems, such as the choice of the number of Fourier coefficients to use, or the choice of the kernel in the case of wavelets.

The paper is organized as follows: first we introduce the methods needed for processing the data. Besides some basic statistical procedures, these include the GARCH models and our main tool in the investigation, the popular Support Vector Machine methodology, which provides a framework for detecting the analogies.

Section 3 deals with the data and describes the algorithm used. It includes the simulation of perturbations of the past observations (creating the training data set). The SVM methodology is used to create a model based on these perturbed sequences. All programming was carried out in the free statistical software R, where both the GARCH parameter estimation and the SVM methodology were readily available (in the packages rugarch and e1071, respectively). For details see its homepage (R, 2011, [5]).

Section 4 contains the results. We conclude with a discussion.

II. DATA PROCESSING TOOLS

A. GARCH models

We fitted the GARCH(1,1) (Generalized AutoRegressive Conditional Heteroscedastic) model to the daily log returns of the stock indices X(t):

Delta(t) = log X(t) - log X(t-1).
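The log-return transformation above is a one-liner; the sketch below uses numpy with hypothetical prices (the paper's computations were done in R):

```python
import numpy as np

# Hypothetical closing prices; the paper uses NYSE daily closes.
X = np.array([100.0, 101.5, 99.8, 100.2, 102.0])

# Daily log returns: Delta(t) = log X(t) - log X(t-1)
delta = np.diff(np.log(X))
```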

These processes were introduced explicitly for modeling the development of asset prices in time (Bollerslev, [1]). The approach became popular, as this model accounts for the prominent volatility changes and other stylized facts observable in the time series of daily asset log returns (see for example the book of Francq and Zakoian, [3]). The actual GARCH(1,1) model for the volatility development is the following:

sigma^2(t) = omega + alpha * eps^2(t-1) + beta * sigma^2(t-1).

The model for the innovations is

eps(t) = sigma(t) e(t),

where the e(t) are independent and identically distributed random variables with zero mean and unit variance, and the actual return series is

Delta(t) = mu + eps(t).
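The recursion above can be illustrated by simulating a GARCH(1,1) path. This is a minimal numpy sketch with hypothetical parameter values and normal innovations for simplicity (the paper uses Student-t innovations and estimates the parameters with rugarch in R):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GARCH(1,1) parameters; alpha + beta < 1 ensures stationarity.
omega, alpha, beta = 1e-6, 0.08, 0.90
mu = 0.0
n = 1000

sigma2 = np.empty(n)
eps = np.empty(n)
sigma2[0] = omega / (1.0 - alpha - beta)   # start from the unconditional variance
e = rng.standard_normal(n)                 # i.i.d. innovations, zero mean, unit variance

for t in range(n):
    if t > 0:
        # sigma^2(t) = omega + alpha * eps^2(t-1) + beta * sigma^2(t-1)
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * e[t]     # eps(t) = sigma(t) e(t)

delta = mu + eps                           # Delta(t) = mu + eps(t)
```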

There are two popular choices for the distribution of e(t): the normal and the Student t distribution. We have chosen the Student t, as in this case we had the flexibility of two extra parameters: the skewness and the degrees of freedom (called shape in R). As our aim is to find analogies, we did not go into the details of model identification and selection in this paper, but simply used the GARCH(1,1) model and its coefficients for our purposes. The estimation is done via the so-called quasi-maximum likelihood method, which works quite reasonably in most cases (Francq and Zakoian, [3]).

International Journal of Computer Information Systems, Vol. 4, No. 2, 2012, February Issue, ISSN 2229 5208

As the stability of the GARCH coefficients is a crucial question, we used perturbations of the daily closing values. Here we allowed for a possible deviation from the actually observed daily closing value of the stock. The upper bound for this deviation was (X_max - X_close)/4, and analogously the lower bound was -(X_close - X_min)/4 (see (4)), ensuring that the perturbed values remain well within the range of values actually observed on the given day. See Figure 1 for an example.

Figure 1. The observed series of the AT&T share price: original (black) and perturbed (yellow, broken) series, April 2003 - March 2005 (left) and May 2003 (right).

Figure 2. The estimated series of the GARCH coefficients for the original (black) and perturbed (red, broken) series of the AT&T daily log returns, April 2004 - March 2005.

Figure 3. The estimated series of the skewness and shape of the GARCH innovations for the original (black) and perturbed (red, broken) series of the AT&T daily log returns, April 2004 - March 2005.

We see from Figures 1, 2 and 3 that while the data, and thus the log return series, are quite close to their perturbed versions, there are visible differences between the estimated parameters of the GARCH model. This is useful in our case, as we get a whole set of possible parameter series for a given stock. It also means that the classification approach we propose in this paper has its merits, as there is a substantial number of vectors to work with (see Section 3).

B. Support Vector Machines

As the next step, the popular Support Vector Machine (SVM) approach was used to classify the perturbed series. We have used the so-called C-classification technique, based on a set of training vectors for each reference stock. The essence is that the set of training points is mapped into a (possibly high-dimensional) feature space, where a separating hyperplane between the classes is found. Mathematically, let a training set of k observation-label pairs (x_i, y_i), i = 1, ..., k, be given, where x_i is a vector and, in the usual dichotomous case, y_i in {-1, 1}. To get the SVM (see for example Cortes and Vapnik, [2]), the solution of the following optimization problem is needed:

min over w, b, xi of  (1/2) w^T w + C * sum_{i=1}^{k} xi_i        (1)

under the conditions

y_i (w^T phi(x_i) + b) >= 1 - xi_i,  xi_i >= 0,                   (2)

where C is the so-called capacity constant, or cost, acting as a penalty for the misclassification of training data. The transformation phi in (2) is accessed through a so-called kernel function K:

K(x_i, x_j) = phi(x_i)^T phi(x_j).

In the analysis we have used the usual radial kernel:

K(x_i, x_j) = exp{ -gamma ||x_i - x_j||^2 }   (gamma > 0).        (3)

Although the method was originally designed for distinguishing two classes, variants are now available for multiclass problems like ours as well. The variant based on logistic regression is able to quantify the membership in a given class by assigning probabilities to each observation and class (the so-called membership weights). We are looking for strong membership weights, as these hopefully indicate analogies between pairs of stock indices.

III. THE DATA AND THE ALGORITHM

The stock index data was downloaded from the Yahoo site (2011, [7]). We used data from March 2003 to November 2011, altogether 2200 observations for each series. We have analyzed 73 such series from the New York Stock Exchange for which there were no missing data in the time interval under investigation. The data included the components of the Dow Jones Industrial Average and quite a few other important stocks. The stability of the GARCH coefficients is crucial for the success of the classification, which was taken into account by a suitable weighting.

The algorithm can be outlined as follows.

1. A window width was defined as n = 250 observations, corresponding approximately to one year. This amount of data is the absolute minimum for GARCH parameter estimation; in practice one may choose an even longer period for better results. A new model was fitted to every 10th window of 250 subsequent observations. This step size was chosen for practical reasons: the estimation of the GARCH parameters needs an iterative procedure, which is rather quick, but in our case it has to be repeated several thousand times, so our aim was a quicker algorithm. It was observed that the replacement of a single observation did not cause a visible change in the parameters in most cases, so we did not lose much information by this procedure.
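The window bookkeeping of Step 1 can be sketched as follows (the indices are illustrative; the paper's computations were done in R):

```python
# A new GARCH model is fitted to every 10th window of 250 subsequent
# observations, over 2200 observations per series.
n_obs, width, step = 2200, 250, 10

starts = range(0, n_obs - width + 1, step)
windows = [(s, s + width) for s in starts]  # half-open index ranges [s, s+width)
```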

2. We generated 10 perturbed series for each data set. As we aimed at perturbed values that could have been the closing value on the given day, they were chosen uniformly distributed over the interval

(X_close - (X_close - X_min)/4;  X_close + (X_max - X_close)/4).        (4)

This method could be refined by some kind of bootstrap, but the choice of the resampling methodology is far from straightforward. The parameter 1/4 can also be changed, but it turned out to be a reasonable compromise: the change in the parameter values is visible, but in most cases the parameters of the perturbed sequence are classified into the same class as those of the original observation.
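The uniform draw over the interval (4) can be sketched in a few lines of numpy (the function name and example values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb_close(x_close, x_min, x_max, frac=0.25, rng=rng):
    """Uniform draw over (X_close - frac*(X_close - X_min),
    X_close + frac*(X_max - X_close)); frac = 1/4 is the paper's choice (eq. (4))."""
    lo = x_close - frac * (x_close - x_min)
    hi = x_close + frac * (x_max - x_close)
    return rng.uniform(lo, hi)
```

By construction the perturbed value stays within the day's observed low-high range, e.g. perturb_close(100.0, 95.0, 110.0) lies in (98.75, 102.5).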

3. The duration of the learning set was chosen as 3 years, which resulted in l = 74 runs from March 2004 (based on observations from 2003) till March 2007. Summarizing these runs, we got a parameter set P_ij (a matrix with l rows and 6 columns) for stock i and perturbation number j, where i = 1, ..., 73 and j = 1, ..., 10. The stability of the parameters differed (as measured by the correlation between the estimators for the original and the perturbed series); see Table 1.

4. Having completed the calculation of the GARCH coefficients, we turned our attention to the classification of the results. We let the SVM run for each subcolumn P_ij(k, q) consisting of n = 25 subsequent elements of column (parameter) q of P_ij, from k to k+n-1 (q = 1, ..., 6 and k = 1, ..., l-n+1). This means that we considered the coordinates separately. First we optimized the cost C and the kernel parameter gamma in equations (1) and (3), as proposed in Hsu et al. [4]. The parameters C = 4 and gamma = 1/4 were suitable to get a 100% correct classification of the original data for all coordinates. As a result we got 6*50 models, as l-n+1 = 50.

TABLE I. THE EMPIRICAL CORRELATION BETWEEN THE ESTIMATES FOR THE ORIGINAL AND THE PERTURBED SERIES

parameter        mu     omega  alpha  beta   skewness  shape
correlation (r)  0.983  0.611  0.801  0.631  0.899     0.832

5. Let us choose a later time period for which the analogies will be searched. In our case this estimation period started in March 2008 and, as usual, contained 25 estimators (corresponding to a complete observation period of two years, starting just after the last time point of the training set), resulting in a 25-by-6 parameter matrix Q. In practical cases one may just investigate one favorite stock, but here we searched for the possible analogies for all stocks in our database. Having completed the tuning and SVM modeling of Step 4, we searched for the stock i and run k for which the estimated 25-dimensional vector most resembled Q^(q), for all GARCH parameters q = 1, ..., 6. As we have an automatic selection procedure in mind, we weighted the coordinates according to the squared correlation coefficients between pairs of perturbations (see Table 1):

p_{i,k}^(q) = r_q^2 * Pr{ Q^(q) belongs to class i in run k }

and chose the stock i and run k for which the sum of the probabilities of belonging to the class of this stock is maximal:

(i, k) = argmax over i, k of  sum_{q=1}^{6} p_{i,k}^(q).
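The weighting and selection step can be sketched in numpy. The membership probabilities below are random placeholders standing in for the SVM outputs; only the weighting by r_q^2 and the argmax follow the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder membership probabilities Pr{Q^(q) belongs to class i in run k}
# for 3 stocks, 2 runs and the 6 GARCH parameters (in reality: SVM outputs).
prob = rng.uniform(size=(3, 2, 6))

# Squared correlations r_q^2 (values from Table 1) used as coordinate weights.
r = np.array([0.983, 0.611, 0.801, 0.631, 0.899, 0.832])
p = (r ** 2) * prob               # p_{i,k}^(q) = r_q^2 * Pr{...}

score = p.sum(axis=2)             # sum over the six parameters q
i_best, k_best = np.unravel_index(np.argmax(score), score.shape)
```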

In Figures 4 and 5, first the fitted pairs, then the same diagrams for the 2-year period (Dell, Sept 2003 - Aug 2005 vs. Boeing, March 2007 - March 2009) are shown. It has to be mentioned that both stock data series were standardized to mean 0 and variance 1 in order to allow for the comparison.

Figure 4. The standardised time series (left panel) and log returns (right panel) for the March 2007 - March 2009 Boeing data (black) and the best-fitting data (Dell, Sept 2003 - Aug 2005, yellow).

Figure 5. The fitted parameters (left panel) and skewness (right panel) for the March 2007 - March 2009 Boeing data (thick black) and the same parameters for the perturbed versions of the best-fitting data (Dell, Sept 2003 - Aug 2005, thin, rainbow-coloured).

The left panel of Figure 4 shows a very good fit, and Figure 5 shows that there is indeed some similarity between the two parameter sequences; in particular, the ranges are similar. It should be noted that no scaling was applied in the SVM, as the parameter vectors should be taken into consideration as they are.

6. The next step was the forecast. Formally this means that the vector of observations over the interval [Y_t : Y_{t+24}] might be estimated by [X_{i,k} : X_{i,k+24}] (k+24 < t), where [X_{i,k-25} : X_{i,k-1}] is the best estimator of [Y_{t-25} : Y_{t-1}] in the sense of Step 5. This method ensures that we have a complete set of estimated values even before the start of the time period in which we are interested.

IV. RESULTS

Figure 6 gives the continuation of our example from Figure 4, showing the observed and the forecasted series. It shows a somewhat looser relation than Figure 4, but the results still show a satisfactory coincidence between the two series. Of course the price series had to be scaled here as well, as the units of the different assets are not related to each other. These observations become especially clear when the results are compared with other, randomly chosen sequences.

Of course, the forecast was not as good for all shares. The average squared distance between the scaled baseline series and the forecast over all 73 stocks is 1.4, i.e. 70% of the value 2 of the independence case, which is the expected squared error of a completely random estimator: if X and Y are independent, standardized random variables, then E(X-Y)^2 = 2.
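The benchmark value 2 follows from E(X-Y)^2 = Var X + Var Y for independent, standardized X and Y, and can also be checked by a quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two independent standardized samples: E(X-Y)^2 = Var X + Var Y = 2.
x = rng.standard_normal(100_000)
y = rng.standard_normal(100_000)
mse = np.mean((x - y) ** 2)   # close to 2 for large samples
```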

We have carried out a control study in which the forecasts were based on simple squared distances: those periods and stocks were chosen for which this distance was smallest over the reference period. The forecast was then based on the next period, analogously to our method. These results were disappointing, however: the average squared distance was even higher than 2.

Of course, this is just a preliminary analysis, in which a limited number of reference vectors were used. With more such data, the quality of the forecast should improve as well.

Figure 6. The standardised time series (left panel) and log returns (right panel) for the March 2009 - March 2010 Boeing data (thick, black) and the continuation of the data shown in Figure 4 (Dell, Sept 2005 - Aug 2006, thin, red).

V. DISCUSSION

There is a vast amount of research focusing on forecasting stock indices. Most of it claims that the proposed method ensures considerable gains for the user. However, mathematicians have not become richer on the basis of these results, so some skepticism is advised when applying them. Among these works there are some data mining applications, too. However, we are not aware of any publication focusing on SVM with the aim of finding analogies between different stocks. We have shown that this plan is feasible, and it provided some evidence of its usefulness even on our limited database. Considered as an additional tool for economists, who have professional knowledge of the backgrounds of the companies, it may become even more powerful. There are papers in which the amount of processed data made it necessary to use R* trees for finding the best fit to a given series (see Rafiei and Mendelzon, [6], as an example); in our case this was not needed, as the whole optimization required just a few hours of computing time.

There is, however, a need for further work, in which the effect of different parameter values (window size, deviation of the perturbations, etc.) should be investigated. An economic investigation of the reasons behind the results found would be very interesting. Economists may also help by grouping stocks with similar properties, as this would also help in reducing the required computing time, especially for larger-scale investigations.

ACKNOWLEDGMENT

The European Union and the European Social Fund have provided financial support to the project under grant agreement no. TÁMOP 4.2.1/B-09/KMR-2010-0003.

REFERENCES

[1] Bollerslev, T.: Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics 31, 307-327 (1986)

[2] Cortes, C. and Vapnik, V.: Support-vector networks. Machine Learning 20, 273-297 (1995)

[3] Francq, C. and Zakoian, J-M.: GARCH models: structure, statistical inference, and financial applications. Wiley, New York (2010)

[4] Hsu, C-W., Chang, C-C. and Lin, C-J.: A Practical Guide to Support Vector Classification. http://www.csie.ntu.edu.tw/~cjlin (as on 01.11.2011)

[5] R, http://www.cran.r-project.org

[6] Rafiei, D. and Mendelzon, A.: Similarity-based queries for time series data. Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (1997)

[7] Yahoo, http://finance.yahoo.com/q/hp?s=UCG.MI+Historical+Prices (as on 01.11.2011)

AUTHORS' PROFILES

Dr. András Zempléni received his Ph.D. degree from Eötvös Loránd University, Budapest, Hungary, in 1989. His main field of interest is multivariate extreme value modeling. He is also interested in the analysis of financial time series. He is an associate professor at the Department of Probability Theory and Statistics of Eötvös Loránd University.

Dr. Csilla Hajas received her Ph.D. degree from Kossuth Lajos University, Debrecen, Hungary, in 1997. Her main fields of interest are database theory, data mining, statistical processes and industrial statistics. She is an assistant professor at the Department of Information Systems of Eötvös Loránd University, Budapest.

Dr. Tibor Nikovits received an MSc degree in mathematics from Eötvös Loránd University, Budapest, in 1988 and an MSc degree in economics from Budapest Economical University in 2003. His main fields of interest are database theory, data mining, and financial and business information systems. He is a senior lecturer at the Department of Information Systems of Eötvös Loránd University, Budapest.
