Václav Hlaváč
Czech Technical University in Prague
Faculty of Electrical Engineering, Department of Cybernetics
Center for Machine Perception
http://cmp.felk.cvut.cz/hlavac, hlavac@fel.cvut.cz
Courtesy: M.I. Schlesinger, V. Franc.
Outline of the talk:
A classifier, dichotomy, a multi-class classifier.
A linear discriminant function.
Learning a linear classifier.
Perceptron and its learning.
ε-solution.
Learning for infinite training sets.
A classifier

The analyzed object is represented by:

X – a space of observations, a vector space of dimension n.
Y – a set of hidden states.

The aim of the classification is to determine a relation between X and Y, i.e., to find a discriminant function f : X → Y.

A classifier q : X → J maps the observations X onto a set of class indices J, J = {1, . . . , |Y|}.

Mutual exclusion of classes is required:

$X = X_1 \cup X_2 \cup \dots \cup X_{|Y|}, \qquad X_i \cap X_j = \emptyset,\ i \neq j,\ i, j = 1, \dots, |Y|.$
Classifier, an illustration

Discriminant functions $f_i(x)$ should ideally have the property:

$f_i(x) > f_j(x)$ for $x \in$ class $i$, $i \neq j$.

[Figure: the observation x is fed to discriminant functions $f_1(x), f_2(x), \dots, f_{|Y|}(x)$; a maximum selector outputs the decision y.]

Strategy: $j = \operatorname{argmax}_j f_j(x)$, where $f_j(x) = \langle w_j, x \rangle + b_j$ and $\langle \cdot, \cdot \rangle$ denotes a scalar product.

The strategy $j = \operatorname{argmax}_j f_j(x)$ divides X into |Y| convex regions.

[Figure: the observation space divided into |Y| = 6 convex regions, labeled y = 1, . . . , y = 6.]
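To make the argmax strategy concrete, here is a minimal sketch in Python/NumPy; it is not from the slides, and the function name and toy numbers are made up:

```python
import numpy as np

def linear_classifier(x, W, b):
    """Multi-class linear classifier: return the index j of the
    largest discriminant f_j(x) = <w_j, x> + b_j.

    W : (|Y|, n) array, one weight vector w_j per row
    b : (|Y|,) array of biases b_j
    """
    scores = W @ x + b               # all f_j(x) at once
    return int(np.argmax(scores))    # the argmax strategy

# Toy usage with |Y| = 3 classes in a 2-D observation space.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
print(linear_classifier(np.array([2.0, -0.5]), W, b))   # prints 0
```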
Dichotomy, two classes only

|Y| = 2, i.e., two hidden states (typically also two classes):

$q(x) = \begin{cases} y = 1, & \text{if } \langle w, x \rangle + b \geq 0, \\ y = 2, & \text{if } \langle w, x \rangle + b < 0. \end{cases}$

[Figure: the perceptron. Inputs $x_1, x_2, \dots, x_n$ are weighted by $w_1, w_2, \dots, w_n$, summed together with the threshold b, and passed through an activation function to give the output y.]

The perceptron, F. Rosenblatt, 1957.
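For completeness, the two-class rule as code; a one-line sketch consistent with the classifier above (assuming NumPy arrays, not the lecture's own code):

```python
def dichotomy(x, w, b):
    """Two-class linear rule: class 1 iff <w, x> + b >= 0."""
    return 1 if float(w @ x + b) >= 0.0 else 2
```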
Learning linear classifiers

The aim of learning is to estimate the classifier parameters $w_i$, $b_i$ for all i.

Learning algorithms differ in the penalty they minimize, e.g., the empirical risk

$R_{\mathrm{emp}}(q(x, \Theta)) = \frac{1}{L} \sum_{i=1}^{L} W(q(x_i, \Theta), y_i),$

where W is a penalty function, and learning seeks

$\Theta^* = \operatorname{argmin}_{\Theta} R_{\mathrm{emp}}(q(x, \Theta)).$

The two-class condition $\langle w, x \rangle + b \geq 0$ can be rewritten as $\langle w', x' \rangle \geq 0$ by embedding the task into an (n+1)-dimensional space, where $w' = [w\ b]$ and

$x' = \begin{cases} [x\ 1] & \text{for } y = 1, \\ [-x\ {-1}] & \text{for } y = 2. \end{cases}$

We drop the primes and go back to the w, x notation.

[Figure: in the embedded space $X^{n+1}$, the decision boundary is the hyperplane $\langle w, x \rangle = 0$ passing through the origin.]
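The signs in the embedding are easy to get wrong, so here is a small sketch (assuming NumPy arrays and labels in {1, 2}; the helper name is made up):

```python
import numpy as np

def embed(X, y):
    """Map observations into the (n+1)-dim space so the two-class
    condition becomes a single one: <w', x'> >= 0.

    X : (L, n) observations, y : (L,) labels in {1, 2}.
    Returns X' of shape (L, n+1).
    """
    Xp = np.hstack([X, np.ones((len(X), 1))])   # x' = [x 1]
    Xp[y == 2] *= -1                            # x' = [-x -1] for y = 2
    return Xp
```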
Perceptron learning: the algorithm, 1957

Input: $T = \{x_1, x_2, \dots, x_L\}$.
Output: a weight vector w.

The perceptron algorithm (F. Rosenblatt):

1. $w_1 = 0$.
2. A wrongly classified observation $x_j$ is sought, i.e., $\langle w_t, x_j \rangle < 0$, $j \in \{1, \dots, L\}$.
3. If there is no misclassified observation then the algorithm terminates; otherwise
$w_{t+1} = w_t + x_j.$
4. Go to 2.

[Figure: the perceptron update rule. Adding the misclassified $x_t$ to $w_t$ rotates the hyperplane $\langle w_t, x \rangle = 0$ towards classifying $x_t$ correctly.]
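A direct transcription of the algorithm in Python/NumPy, operating on embedded data as above. The iteration cap is my addition so a non-separable input cannot loop forever, and $\langle w, x \rangle = 0$ is counted as an error so that the initial $w_1 = 0$ gets updated:

```python
import numpy as np

def perceptron(X, max_iter=10_000):
    """Rosenblatt's perceptron on embedded data X of shape (L, n+1).
    Seeks w with <w, x_j> > 0 for every row x_j of X.
    Returns w, or None if max_iter passes were not enough.
    """
    w = np.zeros(X.shape[1])              # step 1: w_1 = 0
    for _ in range(max_iter):
        wrong = X[X @ w <= 0]             # step 2: misclassified x_j
        if len(wrong) == 0:               # step 3: nothing left, done
            return w
        w = w + wrong[0]                  # step 3: w_{t+1} = w_t + x_j
    return None                           # step 4 is the loop itself
```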
Novikoff theorem, 1962

Let $D = \max_j |x_j|$ and $m = \min_{x \in \overline{X}} |x| > 0$, where $\overline{X}$ denotes the convex hull of the training observations.

Novikoff theorem: If the data are linearly separable, then there exists a number $t \leq \frac{D^2}{m^2}$ such that the vector $w_t$ satisfies

$\langle w_t, x_j \rangle > 0, \quad \forall j \in \{1, \dots, L\}.$
Proof sketch. Upper bound on the growth of $|w_t|$:

$|w_{t+1}|^2 = |w_t|^2 + 2\underbrace{\langle w_t, x_t \rangle}_{\leq\, 0} + |x_t|^2 \leq |w_t|^2 + |x_t|^2 \leq |w_t|^2 + D^2.$

Hence $|w_0|^2 = 0$, $|w_1|^2 \leq D^2$, $|w_2|^2 \leq 2D^2$, . . . , $|w_t|^2 \leq t\, D^2$, . . .

Lower bound, given analogously (each update adds at least m to the projection of $w_t$ onto the optimal direction, so the norm grows at least linearly in t):

$|w_t|^2 > t^2 m^2.$

Solution: $t^2 m^2 \leq t\, D^2$ implies $t \leq \frac{D^2}{m^2}$.

[Figure: $|w_t|^2$ versus t; the quadratic lower bound eventually exceeds the linear upper bound, so the algorithm must stop before $t = D^2/m^2$.]
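A quick numeric sanity check of the bound, with made-up separable data; the margin of the generating direction u stands in for m, which can only loosen the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 2.0, 0.5])           # a known separating direction
X = rng.normal(size=(200, 3))
X[X @ u < 0] *= -1                      # reflect so <u, x> > 0 for all x

D = np.linalg.norm(X, axis=1).max()     # D = max_j |x_j|
m = (X @ u / np.linalg.norm(u)).min()   # margin of u <= best margin

w, t = np.zeros(3), 0
while (wrong := X[X @ w <= 0]).size:    # perceptron updates
    w, t = w + wrong[0], t + 1
print(t, "updates; bound D^2/m^2 =", D**2 / m**2)
```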
An alternative training algorithm, Kozinec (1973)

Input: $T = \{x_1, x_2, \dots, x_L\}$.
Output: a weight vector $w^*$.

1. $w_1 = x_j$, i.e., any observation.
2. A wrongly classified observation $x_t$ is sought, i.e., one with $\langle w_t, x_t \rangle < b$.
3. If there is no wrongly classified observation then the algorithm finishes; otherwise
$w_{t+1} = (1 - k)\, w_t + x_t\, k$, $k \in \mathbb{R}$, where $k = \operatorname{argmin}_k |(1 - k)\, w_t + x_t\, k|.$
4. Go to 2.

[Figure: the Kozinec update. $w_{t+1}$ is the point on the line segment between $w_t$ and $x_t$ closest to the origin; the hyperplane $\langle w_t, x \rangle = 0$ and the threshold b are shown.]
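A sketch of Kozinec's algorithm in the same setting (embedded data, with threshold b = 0 for simplicity, which is my simplification); the optimal k has the closed form used below, clipped to the segment [0, 1]:

```python
import numpy as np

def kozinec(X, max_iter=100_000):
    """Kozinec's algorithm on embedded data X (L, n+1),
    seeking w with <w, x_j> > 0 for all rows."""
    w = X[0].astype(float)                # step 1: start from any x_j
    for _ in range(max_iter):
        wrong = X[X @ w <= 0]             # step 2: misclassified rows
        if len(wrong) == 0:               # step 3: done
            return w
        x = wrong[0]
        d = x - w
        # argmin_k |(1-k) w + k x| = |w + k d| in closed form
        k = np.clip(-(w @ d) / (d @ d), 0.0, 1.0)
        w = (1.0 - k) * w + k * x         # step 3: the Kozinec update
    return None
```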
Perceptron learning as an optimization problem (1)

The perceptron algorithm, batch version, handling non-separability, from another perspective:

Input: $T = \{x_1, x_2, \dots, x_L\}$.

The quality of a candidate w can be measured by the number of misclassified observations,

$J(w) = \sum_{x \in X:\ \langle w, x \rangle < 0} 1.$

What would the most common optimization method, gradient descent, perform?

$w_{t+1} = w_t - \gamma\, \nabla J(w).$

The gradient of $J(w)$ is either 0 or undefined, so the gradient minimization cannot proceed.
Perceptron learning as an optimization problem (2)

Let us redefine the cost function:

$J_p(w) = \sum_{x \in X:\ \langle w, x \rangle < 0} -\langle w, x \rangle.$

$\nabla J_p(w) = \frac{\partial J_p}{\partial w} = \sum_{x \in X:\ \langle w, x \rangle < 0} -x.$

The batch algorithm, keeping the best weight vector seen so far:

1. $w_1 = 0$, $E = |T| = L$, $w^* = 0$.
2. Find all misclassified observations $\hat{X} = \{x \in X : \langle w_t, x \rangle < 0\}$.
3. If $|\hat{X}| < E$ then $E = |\hat{X}|$, $w^* = w_t$, $t_{lu} = t$ (the step of the last improvement).
4. If the termination condition $tc(w^*, t, t_{lu})$ holds then terminate; otherwise $w_{t+1} = w_t + \gamma_t \sum_{x \in \hat{X}} x.$
5. Go to 2.
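A sketch of this batch version; the step size $\gamma_t$ and the condition tc are not fixed by the slides, so here γ is constant and tc is "no improvement for `patience` iterations" (one plausible choice):

```python
import numpy as np

def batch_perceptron(X, gamma=0.1, patience=100, max_iter=10_000):
    """Batch perceptron minimizing J_p(w) by gradient descent,
    remembering the best w seen so far (fewest errors E)."""
    w = np.zeros(X.shape[1])                   # step 1
    w_best, E, t_lu = w.copy(), len(X), 0
    for t in range(max_iter):
        wrong = X[X @ w < 0]                   # step 2: the set X-hat
        if len(wrong) < E:                     # step 3: new best
            E, w_best, t_lu = len(wrong), w.copy(), t
        if E == 0 or t - t_lu > patience:      # step 4: condition tc
            return w_best, E
        w = w + gamma * wrong.sum(axis=0)      # step 4: w - gamma grad J_p
    return w_best, E
```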
The maximal margin task

$w^* = \operatorname{argmax}_w \min_j \left\langle \frac{w}{|w|},\, x_j \right\rangle \qquad (1)$

can be converted to a search for the point of the convex hull (denoted by the overline) closest to the origin:

$x^* = \operatorname{argmin}_{x \in \overline{X}} |x|.$
It holds that $w^* = x^*$.
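Since the Kozinec update is exactly a line search toward a hull vertex, the same step can be reused to approximate $x^*$; this is a Gilbert-style iteration, a sketch with a made-up function name:

```python
import numpy as np

def nearest_hull_point(X, iters=1_000):
    """Approximate x* = argmin over conv(X) of |x| by repeatedly
    line-searching from the current point w toward the vertex x_j
    minimizing <w, x_j> (the same closed-form step as in Kozinec)."""
    w = X[0].astype(float)
    for _ in range(iters):
        x = X[np.argmin(X @ w)]        # most violating hull vertex
        d = x - w
        if d @ d == 0.0:
            break
        k = np.clip(-(w @ d) / (d @ d), 0.0, 1.0)
        if k == 0.0:                   # no further descent possible
            break
        w = w + k * d
    return w
```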
Learning task formulation for infinite training sets

The generalization of Anderson's task by M.I. Schlesinger (1972) solves a quadratic optimization task.

It solves the learning problem for a linear classifier and two hidden states only.