
Linear classifiers, a perceptron family

Václav Hlaváč
Czech Technical University in Prague
Faculty of Electrical Engineering, Department of Cybernetics
Center for Machine Perception
http://cmp.felk.cvut.cz/hlavac, hlavac@fel.cvut.cz
Courtesy: M.I. Schlesinger, V. Franc.

Outline of the talk:
• A classifier, dichotomy, a multi-class classifier.
• A linear discriminant function.
• Learning a linear classifier.
• Perceptron and its learning.
• ε-solution.
• Learning for infinite training sets.
2/22
A classier
The analyzed object is represented by:
• X … a space of observations, a vector space of dimension n.
• Y … a set of hidden states.
The aim of the classification is to determine a relation between X and Y, i.e. to
find a discriminant function f: X → Y.
A classifier q: X → J maps observations from X to the set of class indices J,
J = {1, …, |Y|}.
Mutual exclusion of classes is required:
  X = X_1 ∪ X_2 ∪ … ∪ X_|Y|,   X_i ∩ X_j = ∅,  i ≠ j,  i, j = 1, …, |Y|.
3/22
Classifier, an illustration

• A classifier partitions the observation space X into class-labelled regions X_i,
  i = 1, …, |Y|.
• Classification determines to which region X_i an observed feature vector x
  belongs.
• Borders between regions are called decision boundaries.

[Figure: several possible arrangements of the classes X_1, …, X_4 in the
observation space; in one arrangement X_4 forms the background.]
4/22
A multi-class decision strategy
• Ideally, the discriminant functions f_i(x) should have the property
    f_i(x) > f_j(x)  for x ∈ class i,  i ≠ j.

[Figure: the observation x is fed to the discriminant functions f_1(x), f_2(x), …,
f_|Y|(x); a max selector outputs the decision y.]

• Strategy: j = argmax_j f_j(x).
• However, it is not easy to find such a discriminant function. Most good
  classifiers are dichotomic (e.g., the perceptron, SVM).
• The usual solution: the One-against-All classifier or the One-against-One
  classifier.
5/22
Linear discriminant function q(x)
• f_j(x) = ⟨w_j, x⟩ + b_j, where ⟨·,·⟩ denotes a scalar product.
• The strategy j = argmax_j f_j(x) divides X into |Y| convex regions.

[Figure: the observation space divided into six convex regions labelled
y = 1, …, y = 6.]
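As a concrete illustration (not part of the original slides), a minimal NumPy sketch of the argmax strategy over linear discriminant functions; the weight matrix W, the offsets b, and the toy values below are my own assumptions for the example:

```python
import numpy as np

def linear_multiclass_predict(W, b, x):
    """Return the class index j = argmax_j <w_j, x> + b_j.

    W : (|Y|, n) array, one weight vector w_j per class.
    b : (|Y|,)  array of offsets b_j.
    x : (n,)    observation.
    """
    scores = W @ x + b          # f_j(x) = <w_j, x> + b_j for all j at once
    return int(np.argmax(scores))

# Toy usage with made-up parameters (purely illustrative):
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
print(linear_multiclass_predict(W, b, np.array([2.0, 1.0])))  # -> 0
```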
6/22
Dichotomy, two classes only
|Y| = 2, i.e. two hidden states (typically also two classes).

  q(x) = y = 1, if ⟨w, x⟩ + b ≥ 0,
         y = 2, if ⟨w, x⟩ + b < 0.

[Figure: the perceptron unit: inputs x_1, …, x_n are multiplied by weights
w_1, …, w_n, summed together with the threshold b, and passed through an
activation function to produce the output y. Perceptron by F. Rosenblatt, 1957.]
7/22
Learning linear classifiers

The aim of learning is to estimate the classifier parameters w_i, b_i for all i.
The learning algorithms differ by:

• The character of the training set:
  1. A finite set consisting of individual observations and hidden states, i.e.,
     {(x_1, y_1), …, (x_L, y_L)}.
  2. Infinite sets described by Gaussian distributions.
• The learning task formulation.
8/22
Learning task formulations
For finite training sets

Empirical risk minimization:
• The true risk is approximated by the empirical risk
    R_emp(q(x, Θ)) = (1/L) ∑_{i=1}^{L} W(q(x_i, Θ), y_i),
  where W is a penalty (loss) function.
• Learning is based on the empirical risk minimization principle
    Θ* = argmin_Θ R_emp(q(x, Θ)).
• Examples of learning algorithms: Perceptron, back-propagation.

Structural risk minimization:
• The true risk is approximated by a guaranteed risk (a regularizer securing an
  upper bound of the risk is added to the empirical risk; Vapnik-Chervonenkis
  theory of learning).
• Example: Support Vector Machine (SVM).
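A small sketch (my addition, not from the slides) of the empirical risk with a 0/1 penalty W, assuming the classifier is available as a Python callable:

```python
def empirical_risk(classifier, X, y):
    """R_emp = (1/L) * sum_i W(q(x_i), y_i) with the 0/1 penalty
    W(q, y) = 0 if q == y else 1."""
    L = len(X)
    losses = [0 if classifier(x_i) == y_i else 1 for x_i, y_i in zip(X, y)]
    return sum(losses) / L
```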
9/22
Perceptron learning: Task formulation
Input: T = {(x_1, y_1), …, (x_L, y_L)}, y_i ∈ {1, 2}, i = 1, …, L, dim(x_i) = n.
Output: a weight vector w and an offset b satisfying, for all j ∈ {1, …, L}:
  ⟨w, x_j⟩ + b ≥ 0 for y_j = 1,
  ⟨w, x_j⟩ + b < 0 for y_j = 2.

[Figure: the hyperplane ⟨w, x⟩ + b = 0 separating the two classes in X^n.]

The task can be formally transcribed to a single inequality ⟨w′, x′_j⟩ ≥ 0 by
embedding it into an (n+1)-dimensional space, where w′ = [w b] and
  x′ =  [x 1]  for y = 1,
       −[x 1]  for y = 2.
We drop the primes and return to the ⟨w, x⟩ notation.

[Figure: the homogeneous formulation with the hyperplane ⟨w, x⟩ = 0 in X^{n+1}.]
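A minimal sketch (my addition) of this embedding, assuming observations stored as NumPy arrays and labels in {1, 2}:

```python
import numpy as np

def embed(x, y):
    """Map (x, y) to x' in R^(n+1) so that the task becomes <w', x'> >= 0.

    x' =  [x, 1] for y = 1,
    x' = -[x, 1] for y = 2.
    """
    x_prime = np.append(x, 1.0)
    return x_prime if y == 1 else -x_prime

# Usage: build the transformed training set once, then learn a single vector w'.
# T = [embed(x_i, y_i) for x_i, y_i in zip(X, Y)]
```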
10/22
Perceptron learning: the algorithm 1957
Input: T = {x_1, x_2, …, x_L}.
Output: a weight vector w.

The Perceptron algorithm (F. Rosenblatt):
1. w_1 = 0.
2. A wrongly classified observation x_j is sought, i.e., ⟨w_t, x_j⟩ < 0,
   j ∈ {1, …, L}.
3. If there is no misclassified observation then the algorithm terminates,
   otherwise w_{t+1} = w_t + x_j.
4. Go to 2.

[Figure: the perceptron update rule; the misclassified x_t is added to w_t,
rotating the hyperplane ⟨w_t, x⟩ = 0 towards it.]
11/22
Novikoff theorem, 1962

• Proves that the Perceptron algorithm converges in a finite number of steps
  if a solution exists.
• Let X̄ denote the convex hull of the points (the set of observations) X.
• Let D = max_i |x_i| and m = min_{x ∈ X̄} |x| > 0.

Novikoff theorem:
If the data are linearly separable then there exists a step number t* ≤ D²/m²
such that the vector w_{t*} satisfies
  ⟨w_{t*}, x_j⟩ > 0,  ∀ j ∈ {1, …, L}.

• What if the data are not separable?
• How to terminate the perceptron learning?

[Figure: the convex hull of the observations; D is the radius of the ball
enclosing the data, m the distance from the origin to the convex hull.]
12/22
Idea of the Novikoff theorem proof

Let us express bounds for |w_t|².

Upper bound:
  |w_{t+1}|² = |w_t + x_t|² = |w_t|² + 2⟨x_t, w_t⟩ + |x_t|²
             ≤ |w_t|² + |x_t|²        (since ⟨x_t, w_t⟩ ≤ 0 for a misclassified x_t)
             ≤ |w_t|² + D².
  |w_0|² = 0,  |w_1|² ≤ D²,  |w_2|² ≤ 2D²,  …,  |w_{t+1}|² ≤ t·D²,  …

Lower bound (derived analogously):
  |w_{t+1}|² > t²·m².

Solution:
  t²·m² ≤ t·D²   ⟹   t ≤ D²/m².

[Figure: the quadratic lower bound and the linear upper bound on |w_t|² as
functions of t; they cross at t = D²/m².]
13/22
An alternative training algorithm
Kozinec (1973)

Input: T = {x_1, x_2, …, x_L}.
Output: a weight vector w*.

1. w_1 = x_j, i.e., any observation.
2. A wrongly classified observation x_t is sought, i.e., ⟨w_t, x_j⟩ < b, j ∈ J.
3. If there is no wrongly classified observation then the algorithm finishes,
   otherwise
     w_{t+1} = (1 − k)·w_t + x_t·k,  k ∈ ℝ,
   where
     k = argmin_k |(1 − k)·w_t + x_t·k|.
4. Go to 2.

[Figure: the Kozinec update; w_{t+1} is the point of the segment between w_t and
the misclassified x_t closest to the origin, relative to the hyperplane
⟨w_t, x⟩ = 0.]
14/22
Perceptron learning
as an optimization problem (1)

Perceptron algorithm, batch version, handling non-separability, another
perspective:

• Input: T = {x_1, x_2, …, x_L}.
• Output: a weight vector w minimizing
    J(w) = |{x ∈ X : ⟨w, x⟩ < 0}|,
  or, equivalently,
    J(w) = ∑_{x ∈ X : ⟨w, x⟩ < 0} 1.

What would the most common optimization method, gradient descent,
  w_{t+1} = w_t − α·∇J(w),
perform? The gradient of J(w) is either 0 or undefined. The gradient
minimization cannot proceed.
15/22
Perceptron learning
as an optimization problem (2)

Let us redefine the cost function:
  J_p(w) = ∑_{x ∈ X : ⟨w, x⟩ < 0} −⟨w, x⟩ ,
  ∇J_p(w) = ∂J_p/∂w = ∑_{x ∈ X : ⟨w, x⟩ < 0} −x .

• The Perceptron algorithm is a gradient descent method for J_p(w).
• Learning by empirical risk minimization is just an instance of an
  optimization problem.
• Either gradient minimization (back-propagation in neural networks) or convex
  (quadratic) minimization (called convex programming in the mathematical
  literature) is used.
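A brief sketch (my addition) of the perceptron criterion J_p and its gradient; with unit step size, one descent step reproduces the batch perceptron update w ← w + ∑ x over the misclassified x:

```python
import numpy as np

def perceptron_criterion(w, X):
    """J_p(w) = sum over misclassified x of -<w, x> (non-negative, piecewise linear)."""
    scores = X @ w                      # <w, x> for every row x of X
    mis = scores < 0                    # misclassified observations
    return -scores[mis].sum()

def perceptron_criterion_grad(w, X):
    """grad J_p(w) = sum over misclassified x of -x."""
    mis = (X @ w) < 0
    return -X[mis].sum(axis=0)

def gradient_step(w, X, alpha=1.0):
    """One descent step; with alpha = 1 this is the batch perceptron update."""
    return w - alpha * perceptron_criterion_grad(w, X)
```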
16/22
Perceptron algorithm
Classifier learning, non-separable case, batch version

Input: T = {x_1, x_2, …, x_L}.
Output: a weight vector w*.

1. w_1 = 0, E = |T| = L, w* = 0.
2. Find all misclassified observations X′ = {x ∈ X : ⟨w_t, x⟩ < 0}.
3. If |X′| < E then E = |X′|; w* = w_t, t_lu = t.
4. If tc(w*, t, t_lu) then terminate, else w_{t+1} = w_t + η_t ∑_{x ∈ X′} x.
5. Go to 2.

• The algorithm converges with probability 1 to the optimal solution.
• The convergence rate is not known.
• The termination condition tc(·) is a complex function of the quality of the
  best solution so far, the time since the last update t − t_lu, and the
  requirements on the solution.
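A runnable sketch of this batch variant (my addition). The slides leave the step size and the termination condition tc(·) open, so a fixed unit step and a simple "no improvement for `patience` iterations" rule are used as stand-ins:

```python
import numpy as np

def batch_perceptron(X, max_iter=10_000, patience=500, eta=1.0):
    """Batch perceptron, non-separable case, keeping the best w seen so far.

    X : (L, n) array of (already embedded) observations.
    Returns (w_best, n_errors_best).
    """
    L, n = X.shape
    w = np.zeros(n)
    w_best, E = np.zeros(n), L                 # step 1
    t_lu = 0                                   # iteration of the last improvement
    for t in range(max_iter):
        mis = X[(X @ w) < 0]                   # step 2: all misclassified observations
        if len(mis) < E:                       # step 3: new best solution so far
            E, w_best, t_lu = len(mis), w.copy(), t
        if E == 0 or t - t_lu > patience:      # step 4: stand-in for tc(.)
            break
        w = w + eta * mis.sum(axis=0)          # step 4: batch update
    return w_best, E
```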
17/22
The optimal separating plane
and the closest point to the convex hull

The problem of the optimal separation by a hyperplane,
  w* = argmax_w min_j ⟨ w/|w| , x_j ⟩ ,        (1)
can be converted to the search for the point of the convex hull X̄ (denoted by
the overline) closest to the origin,
  x* = argmin_{x ∈ X̄} |x| .
It holds that x* also solves problem (1).

Recall that the classifier that maximizes the separation minimizes the
structural risk R_str.
18/22
The convex hull, an illustration
[Figure: the convex hull X̄ of the observations; the optimal w* attains the
margin m.]

  min_j ⟨ w/|w| , x_j ⟩  ≤  m  ≤  |w| ,   w ∈ X̄ .
  (lower bound)              (upper bound)
19/22
ε-solution

• The aim is to speed up the algorithm.
• An allowed uncertainty ε is introduced: the vector w_t is accepted as an
  ε-solution if
    |w_t| − min_j ⟨ w_t/|w_t| , x_j ⟩  ≤  ε .

[Figure: the Kozinec update w_{t+1} = (1 − γ)·w_t + γ·x_t towards the
ε-solution.]
20/22
Kozinec and the ε-solution

The second step of the Kozinec algorithm is modified to:
A wrongly classified observation x_t is sought, i.e.,
  |w_t| − min_j ⟨ w_t/|w_t| , x_j ⟩  >  ε .

[Figure: the lower bound min_j ⟨w_t/|w_t|, x_j⟩, the margin m, and the upper
bound |w_t|; the algorithm stops once the gap shrinks below ε.]
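A small sketch of the ε-check (my addition), usable as the stopping test inside the Kozinec loop shown earlier:

```python
import numpy as np

def is_epsilon_solution(w, X, eps):
    """True if |w| - min_j <w/|w|, x_j> <= eps, i.e. w is accepted as an eps-solution.

    X : (L, n) array of (embedded) observations, w : (n,) weight vector."""
    norm_w = np.linalg.norm(w)
    lower_bound = np.min(X @ (w / norm_w))   # min_j <w/|w|, x_j>
    return norm_w - lower_bound <= eps
```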
21/22
Learning task formulation
for infinite training sets

The generalization of the Anderson task by M.I. Schlesinger (1972) solves a
quadratic optimization task.

• It solves the learning problem for a linear classifier and two hidden states
  only.
• It is assumed that the class-conditional distributions p_{X|Y}(x|y)
  corresponding to the two hidden states are multi-dimensional Gaussian
  distributions.
• The mathematical expectation μ_y and the covariance matrix Σ_y, y = 1, 2, of
  these probability distributions are not known.
• The Generalized Anderson task (abbreviated GAndersonT) is an extension of the
  Anderson-Bahadur task (1962), which solved the problem when each of the two
  classes is modelled by a single Gaussian.
22/22
GAndersonT illustrated in the 2D space
Illustration of the statistical model, i.e., a mixture of Gaussians.

[Figure: five Gaussian components with means μ_1, …, μ_5, assigned to the
classes y = 1 and y = 2 and separated by the linear decision boundary q.]

• The parameters of the individual Gaussians μ_i, Σ_i, i = 1, 2, …, are known.
• The weights of the Gaussian components are unknown.

Vous aimerez peut-être aussi