

TIME SERIES FORECASTING


BY USING

WAVELET KERNEL SUPPORT VECTOR MACHINES

LSE Time Series Reading Group

By Ali Habibnia

(a.habibnia@lse.ac.uk)

31 Oct 2012

Outline

- Introduction to Statistical Learning and SVM
- SVM & SVR formulation
- Wavelets as a kernel function
- Study 1: Forecasting volatility based on wavelet support vector machine, by Ling-Bing Tang, Ling-Xiao Tang and Huan-Ye Sheng
- Study 2: Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, by Ali Habibnia
- Suggestion for further research + Q&A

SVMs History

- Work on Statistical Learning Theory was started in the 1960s by Vladimir Vapnik, who is well known as a founder of the theory (together with Professor Alexey Chervonenkis).
- In 1992 he also developed the theory of support vector machines (for linear and nonlinear input-output knowledge discovery) within the framework of statistical learning theory.
- Prof. Vapnik was awarded the 2012 Benjamin Franklin Medal in Computer and Cognitive Science by the Franklin Institute.

History and motivation

- SVMs (sometimes presented as a novel kind of ANN) are a supervised learning algorithm for:
  - Pattern recognition
  - Non-parametric regression estimation (applications to function estimation started around 1995, under the name Support Vector Regression)

- Remarkable characteristics of SVMs:
  - Good generalization performance: SVMs implement the Structural Risk Minimization principle, which seeks to minimize an upper bound on the generalization error rather than only the training error.
  - Absence of local minima: training an SVM is equivalent to solving a linearly constrained quadratic programming problem, so the solution is unique and globally optimal.
  - A simple geometrical interpretation in a high-dimensional feature space that is nonlinearly related to the input space.

The Advantages of SVM(R)

- Based on a strong and elegant theory:
  - In contrast to earlier black-box learning approaches, SVMs allow for some intuition and human understanding.
- Training is relatively easy:
  - No local optima, unlike in neural networks.
  - Training time does not depend on the dimensionality of the feature space, only on the fixed input space, thanks to the kernel trick.
- Generally avoids over-fitting:
  - The trade-off between complexity and error can be controlled explicitly.
- Generalizes well even in high-dimensional spaces under small training set conditions, and is robust to noise.

Linear Classifiers

- g(x) is a linear function: $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$
- It defines a hyperplane in the feature space: points with $\mathbf{w}^T \mathbf{x} + b > 0$ lie on one side, points with $\mathbf{w}^T \mathbf{x} + b < 0$ on the other.
- The (unit-length) normal vector of the hyperplane is $\mathbf{n} = \mathbf{w} / \lVert \mathbf{w} \rVert$.

[Figure: a separating hyperplane in the (x1, x2) plane, with the half-spaces $\mathbf{w}^T \mathbf{x} + b > 0$ and $\mathbf{w}^T \mathbf{x} + b < 0$.]
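The decision rule above is easy to state in code. The following is a minimal sketch (not from the slides) that evaluates $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ for a few points and classifies them by the sign of $g(\mathbf{x})$; the values of w and b are arbitrary illustrative choices.

```python
# Hedged sketch (illustrative only): evaluating a linear decision function
# g(x) = w^T x + b and classifying points by the sign of g(x).
import numpy as np

w = np.array([1.0, -2.0])          # normal vector of the hyperplane
b = 0.5                            # bias term

X = np.array([[2.0, 0.0],          # a few test points in the (x1, x2) plane
              [0.0, 1.0],
              [-1.0, -1.0]])

g = X @ w + b                      # signed value of the decision function
labels = np.sign(g)                # +1 on one side of the hyperplane, -1 on the other

n = w / np.linalg.norm(w)          # unit-length normal vector of the hyperplane
print(g, labels, n)
```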

Linear Classifiers

- How would you classify these points (one symbol denotes +1, the other denotes -1) using a linear discriminant function so as to minimize the error rate?
- There are an infinite number of answers!

[Figure: linearly separable +1 and -1 points in the (x1, x2) plane with one candidate separating line.]


Linear Classifiers

- Which one is the best?

[Figure: the same +1 and -1 points with several candidate separating lines.]

Large Margin Linear Classifier

- The linear discriminant function (classifier) with the maximum margin is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point.
- Why is it the best? Because it is robust to outliers and therefore has strong generalization ability.

[Figure: the maximum-margin separating line, with the margin ("safe zone") shaded between the two classes in the (x1, x2) plane.]

Large Margin Linear Classifier

- Given a set of data points $\{(\mathbf{x}_i, y_i)\},\ i = 1, 2, \ldots, n$:

  For $y_i = +1$:  $\mathbf{w}^T \mathbf{x}_i + b > 0$
  For $y_i = -1$:  $\mathbf{w}^T \mathbf{x}_i + b < 0$

- With a scale transformation on both $\mathbf{w}$ and $b$, the above is equivalent to:

  For $y_i = +1$:  $\mathbf{w}^T \mathbf{x}_i + b \ge 1$
  For $y_i = -1$:  $\mathbf{w}^T \mathbf{x}_i + b \le -1$

[Figure: the margin ("safe zone") with the support vectors lying on its boundary.]

Large Margin Linear Classifier

- For the positive and negative support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ we know that

  $\mathbf{w}^T \mathbf{x}^+ + b = 1$
  $\mathbf{w}^T \mathbf{x}^- + b = -1$

- The margin width is therefore

  $M = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\lVert \mathbf{w} \rVert} = \frac{2}{\lVert \mathbf{w} \rVert}$

[Figure: the margin M between the two support hyperplanes.]

Large Margin Linear Classifier

- Formulation:

  maximize  $\frac{2}{\lVert \mathbf{w} \rVert}$

  such that
  For $y_i = +1$:  $\mathbf{w}^T \mathbf{x}_i + b \ge 1$
  For $y_i = -1$:  $\mathbf{w}^T \mathbf{x}_i + b \le -1$

- This is the simplest kind of SVM (called a linear SVM, or LSVM). Equivalently:

  minimize  $\frac{1}{2} \lVert \mathbf{w} \rVert^2$

  such that  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

[Figure: the margin, "safe zone" and support vectors.]

The Optimization Problem Solution

- This is a quadratic programming problem with linear constraints:

  minimize  $\frac{1}{2} \lVert \mathbf{w} \rVert^2$
  s.t.  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

- Introducing Lagrange multipliers $\alpha_i$ gives the Lagrangian function:

  minimize  $L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)$
  s.t.  $\alpha_i \ge 0$

The Optimization Problem Solution

  minimize  $L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)$
  s.t.  $\alpha_i \ge 0$

- Setting the partial derivatives to zero:

  $\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$

  $\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$

The Optimization Problem Solution

- Substituting these conditions back into $L_p$ gives the dual problem: maximize over $\alpha_i$

  $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^T \mathbf{x}_j$

  s.t.  $\alpha_i \ge 0$  and  $\sum_{i=1}^{n} \alpha_i y_i = 0$

- Note that the data enter only through the dot products $\mathbf{x}_i^T \mathbf{x}_j$.

The Optimization Problem Solution

- From the KKT condition we know:

  $\alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) = 0$

- Thus only the support vectors (the points lying on the margin) have $\alpha_i \neq 0$.
- The solution has the form:

  $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$

- Get $b$ from $y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 = 0$, where $\mathbf{x}_i$ is a support vector.

[Figure: the support vectors x+ and x- on the margin boundaries in the (x1, x2) plane.]
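As an illustration of the KKT result, the sketch below (not from the slides; it assumes scikit-learn's `SVC`) fits a linear SVM on toy data and inspects which training points end up as support vectors; `dual_coef_` holds the products $y_i \alpha_i$ for those points, and all other multipliers are zero.

```python
# Hedged sketch (not from the slides): fitting a linear SVM with scikit-learn and
# checking that only the support vectors carry non-zero Lagrange multipliers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))   # toy +1 class
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))   # toy -1 class
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C ~ hard margin

# Only a handful of points become support vectors; all other alpha_i are zero.
print("support vector indices:", clf.support_)
print("signed multipliers (y_i * alpha_i):", clf.dual_coef_)

# Reconstruct w = sum_i alpha_i y_i x_i and b, as in the KKT solution above.
w = clf.dual_coef_ @ clf.support_vectors_
print("w:", w.ravel(), "b:", clf.intercept_)
```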

Soft Margin Classification

- Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
- What should our quadratic optimization criterion be? Minimize

  $\frac{1}{2} \mathbf{w} \cdot \mathbf{w} + C \sum_{k=1}^{R} \xi_k$

[Figure: two points on the wrong side of the margin, with their slack distances $\xi_1$ and $\xi_2$ marked.]

Hard Margin vs. Soft Margin

- The old (hard-margin) formulation:

  Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \mathbf{w}^T\mathbf{w}$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$:  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

- The new formulation incorporating slack variables:

  Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \mathbf{w}^T\mathbf{w} + C \sum_i \xi_i$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$:  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$  with  $\xi_i \ge 0$ for all $i$

- The parameter $C$ can be viewed as a way to control over-fitting.
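A quick way to see the role of C is to refit the same data with different values and count the support vectors. This is a hedged sketch using scikit-learn's `SVC` and `make_blobs`; the data and the C values are arbitrary illustrative choices.

```python
# Hedged sketch (illustrative only): the soft-margin parameter C trades off margin
# width against training errors; smaller C tolerates more margin violations.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Looser margins (small C) keep many support vectors; tighter margins keep few.
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"training accuracy={clf.score(X, y):.3f}")
```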

Non-linear SVMs: Feature Spaces

- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

  $\Phi: \mathbf{x} \mapsto \phi(\mathbf{x})$
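For a concrete (and tiny) example of such a mapping, the sketch below, which is illustrative rather than taken from the slides, uses the explicit degree-2 feature map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ in 2-D and checks that $\phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ equals the polynomial kernel $(\mathbf{x} \cdot \mathbf{z})^2$.

```python
# Hedged sketch (not from the slides): an explicit degree-2 feature map reproduces
# the homogeneous polynomial kernel (x.z)^2, so the mapping never has to be
# computed explicitly -- the idea behind the kernel trick on the next slide.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)        # inner product in the feature space
rhs = (x @ z) ** 2           # kernel evaluated in the input space
print(lhs, rhs)              # both equal 1.0 -> same value, no explicit mapping needed
```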

The Kernel Trick

- The linear classifier relies on the dot product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$.
- If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \mapsto \phi(\mathbf{x})$, the dot product becomes $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
- A kernel function is a function that corresponds to an inner product in some expanded feature space.
- Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data become more easily separable or better structured.

The Kernel Trick

- This mapping function, however, hardly ever needs to be computed, because of a tool called the kernel trick.
- The kernel trick is a mathematical tool that can be applied to any algorithm which depends solely on the dot product between two vectors: wherever a dot product is used, it is replaced by a kernel function.

1. Linear Kernel
2. Polynomial Kernel
3. Gaussian Kernel
4. Exponential Kernel
5. Laplacian Kernel
6. ANOVA Kernel
7. Hyperbolic Tangent (Sigmoid) Kernel
8. Rational Quadratic Kernel
9. Multiquadric Kernel
10. Inverse Multiquadric Kernel
11. Circular Kernel
12. Spherical Kernel
13. Wave Kernel
14. Power Kernel
15. Log Kernel
16. Spline Kernel
17. B-Spline Kernel
18. Bessel Kernel
19. Cauchy Kernel
20. Chi-Square Kernel
21. Histogram Intersection Kernel
22. Generalized Histogram Intersection Kernel
23. Generalized T-Student Kernel
24. Bayesian Kernel
25. Wavelet Kernel

What Functions are Kernels?

- For some functions $K(\mathbf{x}_i, \mathbf{x}_j)$, checking directly that $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ can be cumbersome.
- Mercer's theorem: every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

  $K = \begin{pmatrix}
  K(\mathbf{x}_1,\mathbf{x}_1) & K(\mathbf{x}_1,\mathbf{x}_2) & K(\mathbf{x}_1,\mathbf{x}_3) & \cdots & K(\mathbf{x}_1,\mathbf{x}_N) \\
  K(\mathbf{x}_2,\mathbf{x}_1) & K(\mathbf{x}_2,\mathbf{x}_2) & K(\mathbf{x}_2,\mathbf{x}_3) & \cdots & K(\mathbf{x}_2,\mathbf{x}_N) \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  K(\mathbf{x}_N,\mathbf{x}_1) & K(\mathbf{x}_N,\mathbf{x}_2) & K(\mathbf{x}_N,\mathbf{x}_3) & \cdots & K(\mathbf{x}_N,\mathbf{x}_N)
  \end{pmatrix}$
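For a finite sample this condition can be checked numerically: build the Gram matrix and look at its eigenvalues. A minimal sketch, assuming a Gaussian kernel and random sample points:

```python
# Hedged sketch (illustrative only): Mercer's condition can be checked numerically
# for a finite sample by testing whether the Gram matrix is positive semi-definite.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                       # 30 sample points in R^4

# Build the Gram matrix K[i, j] = K(x_i, x_j).
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(K)                # symmetric -> real eigenvalues
print("smallest eigenvalue:", eigenvalues.min())   # >= 0 (up to round-off) for a valid kernel
```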

Examples of Kernel Functions

- Linear kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
- Polynomial kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^p$
- Gaussian (Radial Basis Function, RBF) kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2} \right)$
- Sigmoid kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\!\left( \beta_0\, \mathbf{x}_i^T \mathbf{x}_j + \beta_1 \right)$
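These four kernels are straightforward to implement directly; the sketch below (illustrative only, with parameter names following the slide notation) evaluates each of them on a pair of vectors.

```python
# Hedged sketch (illustrative only): direct NumPy implementations of the kernels
# listed above; the parameters p, sigma, beta0 and beta1 match the slide notation.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    # Note: the sigmoid kernel satisfies Mercer's condition only for some parameter choices.
    return np.tanh(beta0 * (x @ z) + beta1)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, z))
```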

Nonlinear SVM: Optimization

- With a kernel, the dual problem is unchanged except that the dot product is replaced by the kernel function: maximize

  $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$

  s.t.  $0 \le \alpha_i \le C$  and  $\sum_{i=1}^{n} \alpha_i y_i = 0$

Some Issues

- Choice of kernel:
  - a Gaussian or polynomial kernel is the default
  - if these prove ineffective, more elaborate kernels are needed
- Choice of kernel parameters:
  - e.g. $\sigma$ in the Gaussian kernel; $\sigma$ is the distance between the closest points with different classifications
  - in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters
- Optimization criterion, hard margin vs. soft margin:
  - typically settled by a lengthy series of experiments in which various parameter values are tested
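In practice the validation-set/cross-validation approach mentioned above is usually automated. A hedged sketch using scikit-learn's `GridSearchCV` to tune C and the RBF width on synthetic data (the grid values are arbitrary):

```python
# Hedged sketch (illustrative only): tuning C and the RBF width gamma by
# cross-validated grid search, as suggested on the slide above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # soft-margin penalty
    "gamma": [1e-3, 1e-2, 1e-1, 1],  # RBF kernel width (gamma = 1 / (2 * sigma^2))
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```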

Support Vector Regression

- The maximum-margin hyperplane only applies to classification.
- However, the ideas of support vectors and kernel functions can also be used for regression.
- The basic aim is the same as in linear regression: minimize the error.
- Difference A: errors smaller than $\varepsilon$ are ignored, and absolute error is used instead of squared error.
- Difference B: we simultaneously aim to maximize the flatness of the function.
- The user-specified parameter $\varepsilon$ defines a tube around the regression function.
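A minimal ε-SVR example, assuming scikit-learn's `SVR` on a noisy sine curve, shows both differences at work: points inside the ε-tube incur no loss, and only points on or outside the tube become support vectors.

```python
# Hedged sketch (illustrative only): epsilon-SVR on a noisy sine curve; points
# inside the epsilon-tube contribute no loss, points outside become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)   # noisy target

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
y_hat = svr.predict(X)

print("support vectors used:", len(svr.support_))    # only points on/outside the tube
print("mean absolute error:", np.mean(np.abs(y - y_hat)))
```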

Ordinary Least Squares (OLS)

- Model:  $f(x) = wx + b$
- Loss:  $\text{Loss} = \lVert (Xw + b) - Y \rVert^2$
- Solution:  $\frac{d\,\text{Loss}}{dw} = 0 \;\Rightarrow\; (X^T X)\, w = X^T Y$

[Figure: a regression line f(x) fitted through the data points.]

Support Vector Regression (SVR)

- Model:  $f(x) = wx + b$
- Minimize:  $\frac{1}{2} \mathbf{w}^T \mathbf{w}$
- Constraints:

  $y_i - \mathbf{w}^T \mathbf{x}_i - b \le \varepsilon$
  $\mathbf{w}^T \mathbf{x}_i + b - y_i \le \varepsilon$

[Figure: the $\varepsilon$-tube (from $-\varepsilon$ to $+\varepsilon$) around the fitted function f(x).]

Support Vector Regression (SVR)

- Minimize:  $\frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$
- Constraints:

  $y_i - \mathbf{w}^T \mathbf{x}_i - b \le \varepsilon + \xi_i$
  $\mathbf{w}^T \mathbf{x}_i + b - y_i \le \varepsilon + \xi_i^*$
  $\xi_i, \xi_i^* \ge 0$

[Figure: points outside the $\varepsilon$-tube, with slack distances $\xi$ and $\xi^*$ above and below the tube.]

Lagrange Optimisation

- Target:

  $L = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)
       - \sum_{i=1}^{N} \alpha_i \left( \varepsilon + \xi_i - y_i + \mathbf{w}^T \mathbf{x}_i + b \right)
       - \sum_{i=1}^{N} \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - \mathbf{w}^T \mathbf{x}_i - b \right)
       - \sum_{i=1}^{N} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right)$

- Regression:

  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \mathbf{x}_i, \mathbf{x} \rangle + b$

- Constraints:  $\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0$,  $\alpha_i, \alpha_i^* \in [0, C]$

Nonlinear Regression

[Figure: on the left, data that a linear $\varepsilon$-tube around f(x) cannot follow in the input space; on the right, the same data after the mapping $\phi(x)$, where a linear $\varepsilon$-tube (from $-\varepsilon$ to $+\varepsilon$) fits.]

Regression Formulas

- Linear:  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \mathbf{x}_i, \mathbf{x} \rangle + b$
- Nonlinear:  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle + b$
- General:  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b$
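The general formula can be evaluated by hand from a fitted model. The sketch below is illustrative (it assumes scikit-learn's `SVR` and `rbf_kernel`): `dual_coef_` stores the differences $\alpha_i - \alpha_i^*$ for the support vectors, so the manual sum reproduces `predict`.

```python
# Hedged sketch (illustrative only): evaluating y(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b
# directly from a fitted scikit-learn SVR, summing over support vectors only.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.tanh(X).ravel() + 0.05 * rng.normal(size=80)

gamma = 0.5
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.05).fit(X, y)

x_new = np.array([[0.7]])
K = rbf_kernel(svr.support_vectors_, x_new, gamma=gamma)   # K(x_i, x) for support vectors
manual = (svr.dual_coef_ @ K).ravel() + svr.intercept_     # dual_coef_ = alpha_i - alpha_i*

print(manual, svr.predict(x_new))   # the two values agree
```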

Kernel Types

- Linear:  $K(\mathbf{x}, \mathbf{x}_i) = \langle \mathbf{x}, \mathbf{x}_i \rangle$
- Polynomial:  $K(\mathbf{x}, \mathbf{x}_i) = \langle \mathbf{x}, \mathbf{x}_i \rangle^d$
- Radial basis function:  $K(\mathbf{x}, \mathbf{x}_i) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert^2}{2\sigma^2} \right)$
- Exponential RBF:  $K(\mathbf{x}, \mathbf{x}_i) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert}{2\sigma^2} \right)$

Wavelet Kernel

- The wavelet kernel (Zhang et al., 2004) comes from wavelet theory and is given as:

  $K(\mathbf{x}, \mathbf{x}') = \prod_{i=1}^{N} h\!\left( \frac{x_i - c_i}{a} \right) h\!\left( \frac{x_i' - c_i'}{a} \right)$

  where $a$ and $c$ are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification). A translation-invariant version of this kernel can be given as:

  $K(\mathbf{x}, \mathbf{x}') = \prod_{i=1}^{N} h\!\left( \frac{x_i - x_i'}{a} \right)$

  where in both cases $h(x)$ denotes a mother wavelet function, e.g. the Morlet-type wavelet used by Zhang et al.:

  $h(x) = \cos(1.75 x)\, \exp\!\left( -\frac{x^2}{2} \right)$
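A direct implementation of the translation-invariant form is short. The sketch below is not the authors' code; it assumes the Morlet-type mother wavelet $h(x) = \cos(1.75x)\exp(-x^2/2)$ given above, treats the dilation $a$ as a free hyperparameter, and plugs the resulting Gram-matrix function into scikit-learn's `SVR` as a callable kernel.

```python
# Hedged sketch (not the authors' code): translation-invariant wavelet kernel
# K(x, x') = prod_d h((x_d - x'_d) / a), used as a custom kernel in scikit-learn's SVR.
import numpy as np
from sklearn.svm import SVR

def mother_wavelet(x):
    """Morlet-type mother wavelet h(x) = cos(1.75 x) * exp(-x^2 / 2)."""
    return np.cos(1.75 * x) * np.exp(-0.5 * x**2)

def wavelet_kernel(X, Z, a=1.0):
    """Gram matrix K[i, j] = prod_d h((X[i, d] - Z[j, d]) / a)."""
    diff = (X[:, None, :] - Z[None, :, :]) / a       # shape (n, m, d)
    return np.prod(mother_wavelet(diff), axis=2)     # shape (n, m)

# Toy usage: regress a noisy signal with the wavelet-kernel SVR.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-4, 4, size=(120, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=120)

svr = SVR(kernel=lambda A, B: wavelet_kernel(A, B, a=1.0), C=10.0, epsilon=0.05)
svr.fit(X, y)
print("in-sample MAE:", np.mean(np.abs(y - svr.predict(X))))
```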

A Simple View of Wavelet Theory

- Fourier analysis breaks down a signal into constituent sinusoids of different frequencies.
- In other words: it transforms the view of the signal from a time base to a frequency base.

A Simple View of Wavelet Theory

- What's wrong with Fourier?
  - By using the Fourier transform, we lose the time information: WHEN did a particular event take place?
  - The FT cannot locate drift, trends, abrupt changes, beginnings and ends of events, etc.
  - The calculation uses complex numbers.

Wavelets vs. Fourier Transform

- In the Fourier transform (FT) we represent a signal in terms of sinusoids.
- The FT provides a representation which is localized only in the frequency domain.
- It does not give any information about the signal in the time domain.

Wavelets vs. Fourier Transform

- The basis functions of the wavelet transform (WT) are small waves located at different times.
- They are obtained by scaling and translation of a scaling function and a wavelet function.
- Therefore, the WT is localized in both time and frequency.

Wavelets vs. Fourier Transform

- If a signal has a discontinuity, the FT produces many coefficients with large magnitude (significant coefficients).
- The WT, in contrast, generates only a few significant coefficients around the discontinuity.
- Nonlinear approximation is a method for benchmarking the approximation power of a transform.

Wavelets vs. Fourier Transform

- In nonlinear approximation we keep only a few significant coefficients of a signal and set the rest to zero.
- We then reconstruct the signal using only the significant coefficients.
- The WT produces few significant coefficients for signals with discontinuities.
- Thus, WT nonlinear approximation gives better results than the FT.

Wavelets vs. Fourier Transform

- Most natural signals are smooth with a few discontinuities (i.e. they are piecewise smooth).
- Speech and natural images are such signals.
- Hence, the WT represents these signals better than the FT.
- Good nonlinear approximation translates into efficiency in several applications, such as compression and denoising.
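The compression/denoising point can be illustrated with a small nonlinear-approximation experiment. The sketch below is illustrative only and assumes the PyWavelets package (`pywt`): a piecewise-smooth signal with one jump is reconstructed from its 30 largest Fourier coefficients and from its 30 largest wavelet coefficients, and the wavelet reconstruction error is typically much smaller.

```python
# Hedged sketch (illustrative only, assumes PyWavelets): nonlinear approximation of a
# piecewise-smooth signal, keeping only the 30 largest transform coefficients.
import numpy as np
import pywt

n, keep = 1024, 30
t = np.linspace(0, 1, n)
signal = np.sin(8 * np.pi * t)
signal[n // 2:] += 1.0                              # a jump discontinuity

def keep_largest(c, k):
    out = np.zeros_like(c)
    idx = np.argsort(np.abs(c))[-k:]                # indices of the k largest coefficients
    out[idx] = c[idx]
    return out

# Fourier: keep the k largest complex coefficients, then invert.
F = np.fft.fft(signal)
fourier_rec = np.real(np.fft.ifft(keep_largest(F, keep)))

# Wavelet: keep the k largest wavelet coefficients, then invert.
coeffs = pywt.wavedec(signal, "db4")
flat, slices = pywt.coeffs_to_array(coeffs)
wavelet_rec = pywt.waverec(
    pywt.array_to_coeffs(keep_largest(flat, keep), slices, output_format="wavedec"),
    "db4",
)[:n]

print("Fourier reconstruction error :", np.linalg.norm(signal - fourier_rec))
print("Wavelet reconstruction error :", np.linalg.norm(signal - wavelet_rec))
```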

Study 1: Forecasting volatility based on wavelet support vector machine, by Ling-Bing Tang, Ling-Xiao Tang and Huan-Ye Sheng

- The paper combines SVM with wavelet theory to construct a multidimensional wavelet kernel function for predicting the conditional volatility of stock market returns based on a GARCH model.
- A general-purpose kernel function in an SVM cannot capture the clustering feature of volatility accurately.
- The wavelet function yields features that describe the volatility time series both at various locations and at varying time granularities.
- The prediction performance of an SVM depends greatly on the selection of the kernel function.
- In the paper's wavelet kernel, $j$ and $k$ denote the dilation and translation parameters, respectively.


Study 2: Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, by Ali Habibnia

- There is no structured way to choose the free parameters of the SVR and of the kernel function; these parameters are usually set by the researcher through a trial-and-error procedure (grid search with cross-validation), which is not optimal.
- In this study a novel method, named GA-assisted SVR, is introduced, in which a genetic algorithm simultaneously searches for the SVR's optimal parameters and the kernel parameter (in this study, of a radial basis function (RBF) kernel).
- The SVM(R) tries to get the best fit to the data without relying on any prior knowledge, and it concentrates only on minimizing the prediction error for a given machine complexity.
- Data: FTSE 100 index from 04 Jan 2005 to 29 Jun 2012.
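A bare-bones version of such a search can be written in a few lines. The sketch below is not the author's implementation: it assumes scikit-learn's `SVR` with an RBF kernel, uses synthetic stand-in data, and runs a toy genetic algorithm (selection plus Gaussian mutation) over (C, ε, γ) scored by cross-validated mean squared error; population size, number of generations and mutation scale are arbitrary choices.

```python
# Hedged sketch (not the author's code): a toy genetic algorithm searching jointly
# over the SVR penalty C, tube width epsilon and RBF width gamma.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # stand-in features (e.g. lagged returns)
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)         # stand-in volatility proxy

def fitness(log_params):
    C, eps, gamma = np.exp(log_params)                # search on a log scale
    model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma)
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()

pop = rng.uniform(-3, 3, size=(20, 3))                # initial population of log-parameters
for generation in range(15):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the fittest half (elitism)
    children = parents[rng.integers(0, 10, 10)] + 0.3 * rng.normal(size=(10, 3))
    pop = np.vstack([parents, children])              # parents + mutated offspring

best = pop[np.argmax([fitness(p) for p in pop])]
print("best (C, epsilon, gamma):", np.exp(best))
```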


Suggestion for further research + Q&A


Thanks for your patience ;)

LSE Time Series Reading Group

By Ali Habibnia
a.habibnia@lse.ac.uk
31 Oct 2012
