
Stock Market Prediction

Rodrigo Barbosa de Santis

May 21, 2019

1 Introduction
The objective of this work is to develop a machine learning model able to predict the market
value of an important market index, composed of the 500 top listed companies in the US.

2 Materials and methods

2.1 Multi-layered perceptron

The multi-layered feedforward network, also known as the multi-layered perceptron (MLP),
provides a general artificial neural network representation, as shown in Figure 1.
An MLP has one or more non-linear layers – known as hidden layers – and can learn a non-
linear function f (·) : Rm → Ro , where m is the number of input dimensions and o is the
number of output dimensions, from a given set of features X = x1 , x2 , ..., xm and a target y
for regression (Pedregosa et al., 2011).

Figure 1 – General MLP with a single hidden layer (Pedregosa et al., 2011)

The input layer is composed of a set of neurons {xi | x1 , x2 , ..., xm }. Each hidden layer and
the output values are generated by applying a non-linear activation function g(·) : R → R set
by the user – such as the logistic or hyperbolic tangent function – to the weighted linear
summation w1 x1 + w2 x2 + ... + wm xm .
The main advantage of MLPs is their capability to learn non-linear models, including in real-
time (on-line) settings. On the other hand, drawbacks of this kind of network include a non-convex
loss function with more than one local minimum, sensitivity to feature scaling, and the need for
hyper-parameter tuning.
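The hidden-layer computation described above can be sketched in NumPy. This is a minimal illustration only; the weights, biases, and input values below are made up for the example, not taken from the trained model in this work.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer_output(x, W, b):
    """One hidden layer: apply g(.) to the weighted linear summation W x + b."""
    return logistic(W @ x + b)

# Toy example: 3 inputs (m = 3), 2 hidden neurons; values are illustrative only
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.2, -0.3],
              [0.4, -0.5, 0.6]])
b = np.array([0.0, 0.1])
print(hidden_layer_output(x, W, b))
```

Stacking several such layers, with a final linear output layer, yields the regression MLP used in this work.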

2.2 SP500 dataset

The S&P 500 is a free-float, capitalization-weighted index of the top 500 publicly listed stocks in
the US. The dataset includes a list of all the stocks contained therein and associated key financial
metrics such as price, market capitalization, earnings, price/earnings ratio, price-to-book ratio, etc.
The dataset attributes are the following:
– x1 : 1-Year Bill Yield
– x2 : Earnings per Share
– x3 : Dividend per Share
– x4 : Current S&P 500
– y: Next Week S&P 500
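Assembling the attributes above into a feature matrix X and target vector y could look like the following sketch. The column layout and the numeric rows are hypothetical placeholders, since the original dataset file is not reproduced here.

```python
import numpy as np

# Hypothetical weekly rows, in the attribute order listed above:
# x1 (1-Year Bill Yield), x2 (EPS), x3 (DPS), x4 (Current S&P 500), y (Next Week S&P 500)
data = np.array([
    [2.3, 28.1, 13.5, 2850.0, 2862.0],
    [2.4, 28.3, 13.5, 2862.0, 2871.0],
    [2.2, 28.0, 13.6, 2871.0, 2859.0],
])

X = data[:, :4]   # inputs x1..x4
y = data[:, 4]    # target: next-week index value
```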

2.3 Cross-validation

Cross-validation is an important technique for estimating how well a model generalizes to a new
set of data, avoiding a common problem known as overfitting, in which a particular model
becomes excessively complex and fails on any dataset other than the training one.
The k-fold cross-validation, one of the most widely applied techniques along with grid search,
randomly splits a dataset D into k mutually exclusive subsets (or folds) of approximately equal
size. Figure 2 exemplifies the iterative procedure of cross-validation through data division into
k subsets.

Figure 2 – Division of the dataset into K = 4 folds (in each iteration k = 1, ..., 4, one fold is the
validation set and the remaining folds form the training set)
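The fold division of Figure 2 can be reproduced with scikit-learn's `KFold` (the library used in this work); the toy array and `random_state` below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation samples")
```

Each sample appears in the validation set exactly once across the four folds, so every data point contributes to both training and validation.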

2.4 Performance metrics

To assess the prediction performance of the methods, the mean squared error (M SE) evaluation
metric was applied, which computes the average squared distance between the predicted and
desired values (Pedregosa et al., 2011):

M SE(y, ŷ) = (1/N) ∑ from i=0 to N−1 of (yi − ŷi)²

where N is the number of samples, ŷi is the estimated target output, and yi is the corresponding
(correct) target output.
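The metric above is a one-liner in NumPy; the sample values are illustrative.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of the squared residuals (y_i - yhat_i)^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

print(mse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # → 0.375
```

This matches `sklearn.metrics.mean_squared_error`, which could be used instead.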

3 Development
Two different approaches are adopted to split the data into training and test sets. In the first,
the data are split into two groups, one with indices 1, 3, 5, 7, ..., 505 and the other with indices
2, 4, 6, ..., 504. In the second approach, the training set consists of the historical series of the
first 300 weeks, and the subsequent 204 weeks are used for validation and testing.
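The two splitting schemes can be sketched with index slicing; the sample count below is an assumption based on the week indices quoted above.

```python
import numpy as np

n_weeks = 504  # assumed total number of weekly observations
idx = np.arange(n_weeks)

# Approach 1: interleaved split (alternate weeks go to each group)
train_a, test_a = idx[0::2], idx[1::2]

# Approach 2: chronological split (first 300 weeks train, remaining weeks test)
train_b, test_b = idx[:300], idx[300:]
```

The interleaved split mixes all market regimes into both sets, while the chronological split forces the model to extrapolate beyond the period it was trained on, which explains the very different test errors reported in Section 4.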
The networks were trained using the K-fold cross validation technique, with K = 4, adopting
the parameters in Table 1 for the estimator.

Table 1 – Parameters adopted for the learning classifier

Estimator   Parameter         Value
MLP         Alpha (α)         10⁻⁵
            Activation (ρ)    logistic
            Hidden layers     [15]
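With scikit-learn, the estimator of Table 1 is configured as follows; only `random_state` is an added assumption, for reproducibility.

```python
from sklearn.neural_network import MLPRegressor

# Estimator configured with the parameters from Table 1
model = MLPRegressor(
    hidden_layer_sizes=(15,),   # single hidden layer with 15 neurons
    activation="logistic",      # logistic sigmoid activation
    alpha=1e-5,                 # L2 regularization strength (alpha)
    random_state=0,             # assumed, for reproducibility
)
```

The model is then fitted with `model.fit(X_train, y_train)` inside each cross-validation fold.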

The algorithm is implemented using Python 2.7.8 (Van Rossum, 1998), with the following
libraries:
1. Matplotlib (http://matplotlib.org/) – a library that provides a group of 2D chart and image
functions;
2. NumPy (http://www.numpy.org/) – a large set of functions for array manipulation;
3. Scikit-learn (http://scikit-learn.org/) – a machine learning library in Python.

4 Results
The mean squared errors achieved in the training and test sets for the first data partition were
49.828 and 37.319, respectively. For the second data partition, the same metric was 46.972 for
training and 2,769.970 for testing. Figure 3 shows a scatter chart of desired against predicted
values.
Figure 4 shows the predicted values for long-range unknown data, obtained with the second data
separation scheme. The values predicted by the model were lower than those achieved by the
index, since the market uptrend could not be correctly perceived from the historical data seen by
the network during its training.

(a) First data partition approach (b) Second data partition approach

Figure 3 – Mean Squared Error M SE results

Figure 4 – Historic data plot (in blue) comparison with the predicted by the model (in green).

5 Conclusion
Although the MLP regressor presented good accuracy for predicting the market value in the
short term, the model returned considerably worse results when applied to long-term predictions.
Future work to improve the model includes considering other attributes related to the market
value, such as economic and population indices, or incorporating specialist judgments as inputs
for long-term prediction.

References
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, vol. 12, pp. 2825–2830.
Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (1999). Neural and adaptive systems: funda-
mentals through simulations with CD-ROM. John Wiley & Sons, Inc.

Van Rossum, G. (1998). Python: a computer language. Version 2.7.8. Amsterdam, Stichting
Mathematisch Centrum. (http://www.python.org).
