

TIME SERIES FORECASTING


BY USING

WAVELET KERNEL SUPPORT VECTOR MACHINES

LSE Time Series Reading Group

By Ali Habibnia

(a.habibnia@lse.ac.uk)

31 Oct 2012

Outline

- Introduction to Statistical Learning and SVM
- SVM & SVR formulation
- Wavelets as a kernel function
- Study 1: Forecasting volatility based on wavelet support vector machine, by Ling-Bing Tang, Ling-Xiao Tang and Huan-Ye Sheng
- Study 2: Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, by Ali Habibnia
- Suggestion for further research + Q&A

SVMs History

- Work on Statistical Learning Theory was started in the 1960s by Vladimir Vapnik, who is well known as a founder of the theory (together with Professor Alexey Chervonenkis).
- In 1992 he also developed the theory of support vector machines (for linear and nonlinear input-output knowledge discovery) within the framework of statistical learning theory.
- Prof. Vapnik was awarded the 2012 Benjamin Franklin Medal in Computer and Cognitive Science by the Franklin Institute.

History and motivation

- SVMs (sometimes presented as a novel kind of ANN) are a supervised learning algorithm for:
  - Pattern recognition
  - Non-parametric regression estimation (applications to function estimation started around 1995, under the name Support Vector Regression)

- Remarkable characteristics of SVMs:
  - Good generalization performance: SVMs implement the Structural Risk Minimization principle, which seeks to minimize an upper bound on the generalization error rather than only the training error.
  - Absence of local minima: training an SVM is equivalent to solving a linearly constrained quadratic programming problem, so the solution is unique and globally optimal.
  - A simple geometrical interpretation in a high-dimensional feature space that is nonlinearly related to the input space.

The Advantages of SVM(R)

- Based on a strong and elegant theory:
  - In contrast to earlier black-box learning approaches, SVMs allow for some intuition and human understanding.
- Training is relatively easy:
  - No local optima, unlike in neural networks.
  - Training time does not depend on the dimensionality of the feature space, only on the fixed input space, thanks to the kernel trick.
- Generally avoids over-fitting:
  - The trade-off between complexity and error can be controlled explicitly.
- Generalizes well even in high-dimensional spaces under small training set conditions, and is robust to noise.

Linear Classifiers

- g(x) is a linear function: $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$
- It defines a hyperplane in the feature space: points with $\mathbf{w}^T \mathbf{x} + b > 0$ lie on one side, points with $\mathbf{w}^T \mathbf{x} + b < 0$ on the other.
- The (unit-length) normal vector of the hyperplane is $\mathbf{n} = \mathbf{w} / \lVert \mathbf{w} \rVert$.

[Figure: a separating hyperplane in the (x1, x2) plane, with the half-spaces $\mathbf{w}^T \mathbf{x} + b > 0$ and $\mathbf{w}^T \mathbf{x} + b < 0$.]
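The decision rule above is easy to state in code. The following is a minimal sketch (not from the slides) that evaluates $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ for a few points and classifies them by the sign of $g(\mathbf{x})$; the values of w and b are arbitrary illustrative choices.

```python
# Hedged sketch (illustrative only): evaluating a linear decision function
# g(x) = w^T x + b and classifying points by the sign of g(x).
import numpy as np

w = np.array([1.0, -2.0])          # normal vector of the hyperplane
b = 0.5                            # bias term

X = np.array([[2.0, 0.0],          # a few test points in the (x1, x2) plane
              [0.0, 1.0],
              [-1.0, -1.0]])

g = X @ w + b                      # signed value of the decision function
labels = np.sign(g)                # +1 on one side of the hyperplane, -1 on the other

n = w / np.linalg.norm(w)          # unit-length normal vector of the hyperplane
print(g, labels, n)
```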

Linear Classifiers

- How would you classify these points (one symbol denotes +1, the other denotes -1) using a linear discriminant function so as to minimize the error rate?
- There are an infinite number of answers!

[Figure: linearly separable +1 and -1 points in the (x1, x2) plane with one candidate separating line.]


Linear Classifiers

- Which one is the best?

[Figure: the same +1 and -1 points with several candidate separating lines.]

Large Margin Linear Classifier

- The linear discriminant function (classifier) with the maximum margin is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point.
- Why is it the best? Because it is robust to outliers and therefore has strong generalization ability.

[Figure: the maximum-margin separating line, with the margin ("safe zone") shaded between the two classes in the (x1, x2) plane.]

Large Margin Linear Classifier

- Given a set of data points $\{(\mathbf{x}_i, y_i)\},\ i = 1, 2, \ldots, n$:

  For $y_i = +1$:  $\mathbf{w}^T \mathbf{x}_i + b > 0$
  For $y_i = -1$:  $\mathbf{w}^T \mathbf{x}_i + b < 0$

- With a scale transformation on both $\mathbf{w}$ and $b$, the above is equivalent to:

  For $y_i = +1$:  $\mathbf{w}^T \mathbf{x}_i + b \ge 1$
  For $y_i = -1$:  $\mathbf{w}^T \mathbf{x}_i + b \le -1$

[Figure: the margin ("safe zone") with the support vectors lying on its boundary.]

Large Margin Linear Classifier

- For the positive and negative support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ we know that

  $\mathbf{w}^T \mathbf{x}^+ + b = 1$
  $\mathbf{w}^T \mathbf{x}^- + b = -1$

- The margin width is therefore

  $M = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\lVert \mathbf{w} \rVert} = \frac{2}{\lVert \mathbf{w} \rVert}$

[Figure: the margin M between the two support hyperplanes.]

Large Margin Linear Classifier

- Formulation:

  maximize  $\frac{2}{\lVert \mathbf{w} \rVert}$

  such that
  For $y_i = +1$:  $\mathbf{w}^T \mathbf{x}_i + b \ge 1$
  For $y_i = -1$:  $\mathbf{w}^T \mathbf{x}_i + b \le -1$

- This is the simplest kind of SVM (called a linear SVM, or LSVM). Equivalently:

  minimize  $\frac{1}{2} \lVert \mathbf{w} \rVert^2$

  such that  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

[Figure: the margin, "safe zone" and support vectors.]

The Optimization Problem Solution

- This is a quadratic programming problem with linear constraints:

  minimize  $\frac{1}{2} \lVert \mathbf{w} \rVert^2$
  s.t.  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

- Introducing Lagrange multipliers $\alpha_i$ gives the Lagrangian function:

  minimize  $L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)$
  s.t.  $\alpha_i \ge 0$

The Optimization Problem Solution

  minimize  $L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)$
  s.t.  $\alpha_i \ge 0$

- Setting the partial derivatives to zero:

  $\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$

  $\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$

The Optimization Problem Solution

- Substituting these conditions back into $L_p$ gives the dual problem: maximize over $\alpha_i$

  $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^T \mathbf{x}_j$

  s.t.  $\alpha_i \ge 0$  and  $\sum_{i=1}^{n} \alpha_i y_i = 0$

- Note that the data enter only through the dot products $\mathbf{x}_i^T \mathbf{x}_j$.

The Optimization Problem Solution

- From the KKT condition we know:

  $\alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) = 0$

- Thus only the support vectors (the points lying on the margin) have $\alpha_i \neq 0$.
- The solution has the form:

  $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$

- Get $b$ from $y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 = 0$, where $\mathbf{x}_i$ is a support vector.

[Figure: the support vectors x+ and x- on the margin boundaries in the (x1, x2) plane.]
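As an illustration of the KKT result, the sketch below (not from the slides; it assumes scikit-learn's `SVC`) fits a linear SVM on toy data and inspects which training points end up as support vectors; `dual_coef_` holds the products $y_i \alpha_i$ for those points, and all other multipliers are zero.

```python
# Hedged sketch (not from the slides): fitting a linear SVM with scikit-learn and
# checking that only the support vectors carry non-zero Lagrange multipliers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))   # toy +1 class
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))   # toy -1 class
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C ~ hard margin

# Only a handful of points become support vectors; all other alpha_i are zero.
print("support vector indices:", clf.support_)
print("signed multipliers (y_i * alpha_i):", clf.dual_coef_)

# Reconstruct w = sum_i alpha_i y_i x_i and b, as in the KKT solution above.
w = clf.dual_coef_ @ clf.support_vectors_
print("w:", w.ravel(), "b:", clf.intercept_)
```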

Soft Margin Classification

- Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
- What should our quadratic optimization criterion be? Minimize

  $\frac{1}{2} \mathbf{w} \cdot \mathbf{w} + C \sum_{k=1}^{R} \xi_k$

[Figure: two points on the wrong side of the margin, with their slack distances $\xi_1$ and $\xi_2$ marked.]

Hard Margin vs. Soft Margin

- The old (hard-margin) formulation:

  Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \mathbf{w}^T\mathbf{w}$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$:  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

- The new formulation incorporating slack variables:

  Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \mathbf{w}^T\mathbf{w} + C \sum_i \xi_i$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$:  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$  with  $\xi_i \ge 0$ for all $i$

- The parameter $C$ can be viewed as a way to control over-fitting.
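A quick way to see the role of C is to refit the same data with different values and count the support vectors. This is a hedged sketch using scikit-learn's `SVC` and `make_blobs`; the data and the C values are arbitrary illustrative choices.

```python
# Hedged sketch (illustrative only): the soft-margin parameter C trades off margin
# width against training errors; smaller C tolerates more margin violations.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Looser margins (small C) keep many support vectors; tighter margins keep few.
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"training accuracy={clf.score(X, y):.3f}")
```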

Non-linear SVMs: Feature Spaces

- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

  $\Phi: \mathbf{x} \mapsto \phi(\mathbf{x})$
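For a concrete (and tiny) example of such a mapping, the sketch below, which is illustrative rather than taken from the slides, uses the explicit degree-2 feature map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ in 2-D and checks that $\phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ equals the polynomial kernel $(\mathbf{x} \cdot \mathbf{z})^2$.

```python
# Hedged sketch (not from the slides): an explicit degree-2 feature map reproduces
# the homogeneous polynomial kernel (x.z)^2, so the mapping never has to be
# computed explicitly -- the idea behind the kernel trick on the next slide.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)        # inner product in the feature space
rhs = (x @ z) ** 2           # kernel evaluated in the input space
print(lhs, rhs)              # both equal 1.0 -> same value, no explicit mapping needed
```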

The Kernel Trick

- The linear classifier relies on the dot product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$.
- If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \mapsto \phi(\mathbf{x})$, the dot product becomes $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
- A kernel function is a function that corresponds to an inner product in some expanded feature space.
- Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data become more easily separable or better structured.

The Kernel Trick

- This mapping function, however, hardly ever needs to be computed, because of a tool called the kernel trick.
- The kernel trick is a mathematical tool that can be applied to any algorithm which depends solely on the dot product between two vectors: wherever a dot product is used, it is replaced by a kernel function.

1. Linear Kernel
2. Polynomial Kernel
3. Gaussian Kernel
4. Exponential Kernel
5. Laplacian Kernel
6. ANOVA Kernel
7. Hyperbolic Tangent (Sigmoid) Kernel
8. Rational Quadratic Kernel
9. Multiquadric Kernel
10. Inverse Multiquadric Kernel
11. Circular Kernel
12. Spherical Kernel
13. Wave Kernel
14. Power Kernel
15. Log Kernel
16. Spline Kernel
17. B-Spline Kernel
18. Bessel Kernel
19. Cauchy Kernel
20. Chi-Square Kernel
21. Histogram Intersection Kernel
22. Generalized Histogram Intersection Kernel
23. Generalized T-Student Kernel
24. Bayesian Kernel
25. Wavelet Kernel

What Functions are Kernels?

- For some functions $K(\mathbf{x}_i, \mathbf{x}_j)$, checking directly that $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ can be cumbersome.
- Mercer's theorem: every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

  $K = \begin{pmatrix}
  K(\mathbf{x}_1,\mathbf{x}_1) & K(\mathbf{x}_1,\mathbf{x}_2) & K(\mathbf{x}_1,\mathbf{x}_3) & \cdots & K(\mathbf{x}_1,\mathbf{x}_N) \\
  K(\mathbf{x}_2,\mathbf{x}_1) & K(\mathbf{x}_2,\mathbf{x}_2) & K(\mathbf{x}_2,\mathbf{x}_3) & \cdots & K(\mathbf{x}_2,\mathbf{x}_N) \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  K(\mathbf{x}_N,\mathbf{x}_1) & K(\mathbf{x}_N,\mathbf{x}_2) & K(\mathbf{x}_N,\mathbf{x}_3) & \cdots & K(\mathbf{x}_N,\mathbf{x}_N)
  \end{pmatrix}$
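For a finite sample this condition can be checked numerically: build the Gram matrix and look at its eigenvalues. A minimal sketch, assuming a Gaussian kernel and random sample points:

```python
# Hedged sketch (illustrative only): Mercer's condition can be checked numerically
# for a finite sample by testing whether the Gram matrix is positive semi-definite.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                       # 30 sample points in R^4

# Build the Gram matrix K[i, j] = K(x_i, x_j).
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(K)                # symmetric -> real eigenvalues
print("smallest eigenvalue:", eigenvalues.min())   # >= 0 (up to round-off) for a valid kernel
```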

Examples of Kernel Functions

- Linear kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
- Polynomial kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^p$
- Gaussian (Radial Basis Function, RBF) kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2} \right)$
- Sigmoid kernel:  $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\!\left( \beta_0\, \mathbf{x}_i^T \mathbf{x}_j + \beta_1 \right)$
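These four kernels are straightforward to implement directly; the sketch below (illustrative only, with parameter names following the slide notation) evaluates each of them on a pair of vectors.

```python
# Hedged sketch (illustrative only): direct NumPy implementations of the kernels
# listed above; the parameters p, sigma, beta0 and beta1 match the slide notation.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    # Note: the sigmoid kernel satisfies Mercer's condition only for some parameter choices.
    return np.tanh(beta0 * (x @ z) + beta1)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, z))
```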

Nonlinear SVM: Optimization

- With a kernel, the dual problem is unchanged except that the dot product is replaced by the kernel function: maximize

  $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$

  s.t.  $0 \le \alpha_i \le C$  and  $\sum_{i=1}^{n} \alpha_i y_i = 0$

Some Issues

- Choice of kernel:
  - a Gaussian or polynomial kernel is the default
  - if these prove ineffective, more elaborate kernels are needed
- Choice of kernel parameters:
  - e.g. $\sigma$ in the Gaussian kernel; $\sigma$ is the distance between the closest points with different classifications
  - in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters
- Optimization criterion, hard margin vs. soft margin:
  - typically settled by a lengthy series of experiments in which various parameter values are tested
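In practice the validation-set/cross-validation approach mentioned above is usually automated. A hedged sketch using scikit-learn's `GridSearchCV` to tune C and the RBF width on synthetic data (the grid values are arbitrary):

```python
# Hedged sketch (illustrative only): tuning C and the RBF width gamma by
# cross-validated grid search, as suggested on the slide above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # soft-margin penalty
    "gamma": [1e-3, 1e-2, 1e-1, 1],  # RBF kernel width (gamma = 1 / (2 * sigma^2))
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```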

Support Vector Regression

- The maximum-margin hyperplane only applies to classification.
- However, the ideas of support vectors and kernel functions can also be used for regression.
- The basic aim is the same as in linear regression: minimize the error.
- Difference A: errors smaller than $\varepsilon$ are ignored, and absolute error is used instead of squared error.
- Difference B: we simultaneously aim to maximize the flatness of the function.
- The user-specified parameter $\varepsilon$ defines a tube around the regression function.
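A minimal ε-SVR example, assuming scikit-learn's `SVR` on a noisy sine curve, shows both differences at work: points inside the ε-tube incur no loss, and only points on or outside the tube become support vectors.

```python
# Hedged sketch (illustrative only): epsilon-SVR on a noisy sine curve; points
# inside the epsilon-tube contribute no loss, points outside become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)   # noisy target

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
y_hat = svr.predict(X)

print("support vectors used:", len(svr.support_))    # only points on/outside the tube
print("mean absolute error:", np.mean(np.abs(y - y_hat)))
```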

Ordinary Least Squares (OLS)

- Model:  $f(x) = wx + b$
- Loss:  $\text{Loss} = \lVert (Xw + b) - Y \rVert^2$
- Solution:  $\frac{d\,\text{Loss}}{dw} = 0 \;\Rightarrow\; (X^T X)\, w = X^T Y$

[Figure: a regression line f(x) fitted through the data points.]

Support Vector Regression (SVR)

- Model:  $f(x) = wx + b$
- Minimize:  $\frac{1}{2} \mathbf{w}^T \mathbf{w}$
- Constraints:

  $y_i - \mathbf{w}^T \mathbf{x}_i - b \le \varepsilon$
  $\mathbf{w}^T \mathbf{x}_i + b - y_i \le \varepsilon$

[Figure: the $\varepsilon$-tube (from $-\varepsilon$ to $+\varepsilon$) around the fitted function f(x).]

Support Vector Regression (SVR)

- Minimize:  $\frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$
- Constraints:

  $y_i - \mathbf{w}^T \mathbf{x}_i - b \le \varepsilon + \xi_i$
  $\mathbf{w}^T \mathbf{x}_i + b - y_i \le \varepsilon + \xi_i^*$
  $\xi_i, \xi_i^* \ge 0$

[Figure: points outside the $\varepsilon$-tube, with slack distances $\xi$ and $\xi^*$ above and below the tube.]

Lagrange Optimisation

- Target:

  $L = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)
       - \sum_{i=1}^{N} \alpha_i \left( \varepsilon + \xi_i - y_i + \mathbf{w}^T \mathbf{x}_i + b \right)
       - \sum_{i=1}^{N} \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - \mathbf{w}^T \mathbf{x}_i - b \right)
       - \sum_{i=1}^{N} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right)$

- Regression:

  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \mathbf{x}_i, \mathbf{x} \rangle + b$

- Constraints:  $\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0$,  $\alpha_i, \alpha_i^* \in [0, C]$

Nonlinear Regression

[Figure: on the left, data that a linear $\varepsilon$-tube around f(x) cannot follow in the input space; on the right, the same data after the mapping $\phi(x)$, where a linear $\varepsilon$-tube (from $-\varepsilon$ to $+\varepsilon$) fits.]

Regression Formulas

- Linear:  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \mathbf{x}_i, \mathbf{x} \rangle + b$
- Nonlinear:  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle + b$
- General:  $y(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b$
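The general formula can be evaluated by hand from a fitted model. The sketch below is illustrative (it assumes scikit-learn's `SVR` and `rbf_kernel`): `dual_coef_` stores the differences $\alpha_i - \alpha_i^*$ for the support vectors, so the manual sum reproduces `predict`.

```python
# Hedged sketch (illustrative only): evaluating y(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b
# directly from a fitted scikit-learn SVR, summing over support vectors only.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.tanh(X).ravel() + 0.05 * rng.normal(size=80)

gamma = 0.5
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.05).fit(X, y)

x_new = np.array([[0.7]])
K = rbf_kernel(svr.support_vectors_, x_new, gamma=gamma)   # K(x_i, x) for support vectors
manual = (svr.dual_coef_ @ K).ravel() + svr.intercept_     # dual_coef_ = alpha_i - alpha_i*

print(manual, svr.predict(x_new))   # the two values agree
```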

Kernel Types

- Linear:  $K(\mathbf{x}, \mathbf{x}_i) = \langle \mathbf{x}, \mathbf{x}_i \rangle$
- Polynomial:  $K(\mathbf{x}, \mathbf{x}_i) = \langle \mathbf{x}, \mathbf{x}_i \rangle^d$
- Radial basis function:  $K(\mathbf{x}, \mathbf{x}_i) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert^2}{2\sigma^2} \right)$
- Exponential RBF:  $K(\mathbf{x}, \mathbf{x}_i) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert}{2\sigma^2} \right)$

Wavelet Kernel

- The wavelet kernel (Zhang et al., 2004) comes from wavelet theory and is given as:

  $K(\mathbf{x}, \mathbf{x}') = \prod_{i=1}^{N} h\!\left( \frac{x_i - c_i}{a} \right) h\!\left( \frac{x_i' - c_i'}{a} \right)$

  where $a$ and $c$ are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification). A translation-invariant version of this kernel can be given as:

  $K(\mathbf{x}, \mathbf{x}') = \prod_{i=1}^{N} h\!\left( \frac{x_i - x_i'}{a} \right)$

  where in both cases $h(x)$ denotes a mother wavelet function, e.g. the Morlet-type wavelet used by Zhang et al.:

  $h(x) = \cos(1.75 x)\, \exp\!\left( -\frac{x^2}{2} \right)$
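A direct implementation of the translation-invariant form is short. The sketch below is not the authors' code; it assumes the Morlet-type mother wavelet $h(x) = \cos(1.75x)\exp(-x^2/2)$ given above, treats the dilation $a$ as a free hyperparameter, and plugs the resulting Gram-matrix function into scikit-learn's `SVR` as a callable kernel.

```python
# Hedged sketch (not the authors' code): translation-invariant wavelet kernel
# K(x, x') = prod_d h((x_d - x'_d) / a), used as a custom kernel in scikit-learn's SVR.
import numpy as np
from sklearn.svm import SVR

def mother_wavelet(x):
    """Morlet-type mother wavelet h(x) = cos(1.75 x) * exp(-x^2 / 2)."""
    return np.cos(1.75 * x) * np.exp(-0.5 * x**2)

def wavelet_kernel(X, Z, a=1.0):
    """Gram matrix K[i, j] = prod_d h((X[i, d] - Z[j, d]) / a)."""
    diff = (X[:, None, :] - Z[None, :, :]) / a       # shape (n, m, d)
    return np.prod(mother_wavelet(diff), axis=2)     # shape (n, m)

# Toy usage: regress a noisy signal with the wavelet-kernel SVR.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-4, 4, size=(120, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=120)

svr = SVR(kernel=lambda A, B: wavelet_kernel(A, B, a=1.0), C=10.0, epsilon=0.05)
svr.fit(X, y)
print("in-sample MAE:", np.mean(np.abs(y - svr.predict(X))))
```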

A Simple View of Wavelet Theory

- Fourier analysis breaks down a signal into constituent sinusoids of different frequencies.
- In other words: it transforms the view of the signal from a time base to a frequency base.

A Simple View of Wavelet Theory

- What's wrong with Fourier?
  - By using the Fourier transform, we lose the time information: WHEN did a particular event take place?
  - The FT cannot locate drift, trends, abrupt changes, beginnings and ends of events, etc.
  - The calculation uses complex numbers.

Wavelets vs. Fourier Transform

- In the Fourier transform (FT) we represent a signal in terms of sinusoids.
- The FT provides a representation which is localized only in the frequency domain.
- It does not give any information about the signal in the time domain.

Wavelets vs. Fourier Transform

- The basis functions of the wavelet transform (WT) are small waves located at different times.
- They are obtained by scaling and translation of a scaling function and a wavelet function.
- Therefore, the WT is localized in both time and frequency.

Wavelets vs. Fourier Transform

- If a signal has a discontinuity, the FT produces many coefficients with large magnitude (significant coefficients).
- The WT, in contrast, generates only a few significant coefficients around the discontinuity.
- Nonlinear approximation is a method for benchmarking the approximation power of a transform.

Wavelets vs. Fourier Transform

- In nonlinear approximation we keep only a few significant coefficients of a signal and set the rest to zero.
- We then reconstruct the signal using only the significant coefficients.
- The WT produces few significant coefficients for signals with discontinuities.
- Thus, WT nonlinear approximation gives better results than the FT.

Wavelets vs. Fourier Transform

- Most natural signals are smooth with a few discontinuities (i.e. they are piecewise smooth).
- Speech and natural images are such signals.
- Hence, the WT represents these signals better than the FT.
- Good nonlinear approximation translates into efficiency in several applications, such as compression and denoising.
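The compression/denoising point can be illustrated with a small nonlinear-approximation experiment. The sketch below is illustrative only and assumes the PyWavelets package (`pywt`): a piecewise-smooth signal with one jump is reconstructed from its 30 largest Fourier coefficients and from its 30 largest wavelet coefficients, and the wavelet reconstruction error is typically much smaller.

```python
# Hedged sketch (illustrative only, assumes PyWavelets): nonlinear approximation of a
# piecewise-smooth signal, keeping only the 30 largest transform coefficients.
import numpy as np
import pywt

n, keep = 1024, 30
t = np.linspace(0, 1, n)
signal = np.sin(8 * np.pi * t)
signal[n // 2:] += 1.0                              # a jump discontinuity

def keep_largest(c, k):
    out = np.zeros_like(c)
    idx = np.argsort(np.abs(c))[-k:]                # indices of the k largest coefficients
    out[idx] = c[idx]
    return out

# Fourier: keep the k largest complex coefficients, then invert.
F = np.fft.fft(signal)
fourier_rec = np.real(np.fft.ifft(keep_largest(F, keep)))

# Wavelet: keep the k largest wavelet coefficients, then invert.
coeffs = pywt.wavedec(signal, "db4")
flat, slices = pywt.coeffs_to_array(coeffs)
wavelet_rec = pywt.waverec(
    pywt.array_to_coeffs(keep_largest(flat, keep), slices, output_format="wavedec"),
    "db4",
)[:n]

print("Fourier reconstruction error :", np.linalg.norm(signal - fourier_rec))
print("Wavelet reconstruction error :", np.linalg.norm(signal - wavelet_rec))
```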

Study 1: Forecasting volatility based on wavelet support vector machine, by Ling-Bing Tang, Ling-Xiao Tang and Huan-Ye Sheng

- The paper combines SVM with wavelet theory to construct a multidimensional wavelet kernel function for predicting the conditional volatility of stock market returns based on a GARCH model.
- A general-purpose kernel function in an SVM cannot capture the clustering feature of volatility accurately.
- The wavelet function yields features that describe the volatility time series both at various locations and at varying time granularities.
- The prediction performance of an SVM depends greatly on the selection of the kernel function.
- In the paper's wavelet kernel, $j$ and $k$ denote the dilation and translation parameters, respectively.


Study 2: Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, by Ali Habibnia

- There is no structured way to choose the free parameters of the SVR and of the kernel function; these parameters are usually set by the researcher through a trial-and-error procedure (grid search with cross-validation), which is not optimal.
- In this study a novel method, named GA-assisted SVR, is introduced, in which a genetic algorithm simultaneously searches for the SVR's optimal parameters and the kernel parameter (in this study, of a radial basis function (RBF) kernel).
- The SVM(R) tries to get the best fit to the data without relying on any prior knowledge, and it concentrates only on minimizing the prediction error for a given machine complexity.
- Data: FTSE 100 index from 04 Jan 2005 to 29 Jun 2012.
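A bare-bones version of such a search can be written in a few lines. The sketch below is not the author's implementation: it assumes scikit-learn's `SVR` with an RBF kernel, uses synthetic stand-in data, and runs a toy genetic algorithm (selection plus Gaussian mutation) over (C, ε, γ) scored by cross-validated mean squared error; population size, number of generations and mutation scale are arbitrary choices.

```python
# Hedged sketch (not the author's code): a toy genetic algorithm searching jointly
# over the SVR penalty C, tube width epsilon and RBF width gamma.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # stand-in features (e.g. lagged returns)
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)         # stand-in volatility proxy

def fitness(log_params):
    C, eps, gamma = np.exp(log_params)                # search on a log scale
    model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma)
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()

pop = rng.uniform(-3, 3, size=(20, 3))                # initial population of log-parameters
for generation in range(15):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the fittest half (elitism)
    children = parents[rng.integers(0, 10, 10)] + 0.3 * rng.normal(size=(10, 3))
    pop = np.vstack([parents, children])              # parents + mutated offspring

best = pop[np.argmax([fitness(p) for p in pop])]
print("best (C, epsilon, gamma):", np.exp(best))
```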


Suggestion for further research + Q&A


Thanks for your patience ;)

LSE Time Series Reading Group

By Ali Habibnia
a.habibnia@lse.ac.uk
31 Oct 2012
