
Artificial Intelligence: Representation and Problem Solving
15-381
January 16, 2007

Neural Networks
Michael S. Lewicki, Carnegie Mellon
Topics
- decision boundaries
- linear discriminants
- perceptron
- gradient learning
- neural networks
The Iris dataset with decision tree boundaries

[Figure: scatter plot of the Iris classes, petal length (cm) vs. petal width (cm), with decision tree boundaries overlaid]
The optimal decision boundary for C2 vs. C3

[Figure: Iris scatter plot, petal length (cm) vs. petal width (cm), with the class-conditional densities p(petal length | C2) and p(petal length | C3) overlaid]

- the optimal decision boundary is determined from the statistical distribution of the classes
- it is optimal only if the model is correct
- it assigns a precise degree of uncertainty to the classification
Optimal decision boundary

[Figure: the class-conditional densities p(petal length | C2) and p(petal length | C3) plotted together with the posteriors p(C2 | petal length) and p(C3 | petal length)]
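As a concrete illustration of how a posterior such as p(C2 | petal length) is obtained from the class-conditional densities, here is a hedged sketch using Bayes' rule with Gaussian class-conditionals and equal priors (the means and standard deviations are made-up values, not fitted to the Iris data):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def posterior_C2(x, prior_C2=0.5):
    """p(C2 | x) by Bayes' rule; the Bayes decision boundary is where this crosses 0.5."""
    num = gaussian_pdf(x, mean=4.3, std=0.5) * prior_C2          # p(x | C2) p(C2)
    den = num + gaussian_pdf(x, mean=5.6, std=0.6) * (1 - prior_C2)  # + p(x | C3) p(C3)
    return num / den

print(posterior_C2(4.0), posterior_C2(5.0), posterior_C2(6.0))
```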
Can we do better?

[Figure: the same class-conditional densities p(petal length | C2) and p(petal length | C3) overlaid on the Iris scatter plot]

- The only way is to use more information.
- Decision trees use both petal width and petal length.
Arbitrary decision boundaries would be more powerful

[Figure: the Iris scatter plot, petal length (cm) vs. petal width (cm), with a non-linear boundary sketched in: decision boundaries could be non-linear]
Defining a decision boundary

[Figure: two classes in the (x1, x2) plane separated by a linear decision boundary]

- consider just two classes
- we want points on one side of the line to be in class 1, otherwise class 2
- 2D linear discriminant function, in terms of scalars:
  y = m^T x + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b
- this defines a 2D plane, which leads to the decision:
  class 1 if y >= 0, class 2 if y < 0
- the decision boundary is where y = m^T x + b = 0, i.e.
  m_1 x_1 + m_2 x_2 = -b, or x_2 = -(m_1 x_1 + b) / m_2
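To make the decision rule concrete, here is a minimal sketch in Python (the weights m, bias b, and test point are illustrative values, not taken from the slides):

```python
import numpy as np

def classify(x, m, b):
    """Linear discriminant: class 1 if m.x + b >= 0, else class 2."""
    y = np.dot(m, x) + b
    return 1 if y >= 0 else 2

# Hypothetical boundary m1*x1 + m2*x2 + b = 0
m = np.array([1.0, -2.0])
b = -1.5
print(classify(np.array([5.0, 1.5]), m, b))  # evaluates the sign of m.x + b
```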
Linear separability

[Figure: the Iris classes again; one pair of classes is labeled "linearly separable", the other "not linearly separable"]

- Two classes are linearly separable if they can be separated by a linear combination of attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane
Diagramming the classifier as a neural network

[Figure: a feedforward network with input units x_1, ..., x_M, weights w_1, ..., w_M, a bias b, and one output unit y; in the equivalent form the bias is a weight w_0 on a dummy input x_0 = 1]

- The feedforward neural network is specified by weights w_i and bias b:
  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b
- It can be written equivalently as
  y = w^T x = \sum_{i=0}^{M} w_i x_i
  where w_0 = b is the bias and x_0 is a dummy input that is always 1.
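A small sketch of the bias trick described above (the numbers are illustrative):

```python
import numpy as np

x = np.array([0.5, 2.0])          # inputs x_1, x_2
w = np.array([1.0, -0.5])         # weights w_1, w_2
b = 0.25                          # bias

y1 = w @ x + b                    # y = w^T x + b

# Equivalent form: absorb the bias as w_0 on a dummy input x_0 = 1
x_aug = np.concatenate(([1.0], x))   # x_0 = 1
w_aug = np.concatenate(([b], w))     # w_0 = b
y2 = w_aug @ x_aug                   # y = w^T x, with the sum running from i = 0 to M

assert np.isclose(y1, y2)
```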
Determining (i.e., learning) the optimal linear discriminant

- First we must define an objective function, i.e., the goal of learning.
- Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
- Objective: minimize the sum-squared error over all patterns x_n:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
- Note the notation: x_n = {x_1, ..., x_M}_n defines a pattern vector.
- We define the desired class as:
  c_n = 0 if x_n is in class 1, c_n = 1 if x_n is in class 2.
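This objective translates directly into a few lines of NumPy (a sketch; X is assumed to hold one augmented pattern x_n per row and c the 0/1 class labels):

```python
import numpy as np

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2 over all patterns."""
    y = X @ w                      # one output y(x_n) per pattern
    return 0.5 * np.sum((y - c) ** 2)
```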
We've seen this before: curve fitting

[Figure: the curve-fitting example from Bishop (2006), Pattern Recognition and Machine Learning: training points (x_n, t_n) generated as t = sin(2*pi*x) plus noise, together with a fitted curve y(x_n, w)]
Neural networks compared to polynomial curve fitting

[Figure: polynomial fits of different order M to the same data; example from Bishop (2006), Pattern Recognition and Machine Learning]

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j

E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

For the linear network, M = 1 and there are multiple input dimensions.
General form of a linear network

[Figure: a one-layer linear network: inputs x_0 = 1, x_1, ..., x_M, weight matrix W, and outputs y_1, ..., y_K]

- A linear neural network is simply a linear transformation of the input:
  y_j = \sum_{i=0}^{M} w_{i,j} x_i
- Or, in matrix-vector form:
  y = W x
- Multiple outputs correspond to multivariate regression.
Training the network: Optimization by gradient descent

[Figure: successive gradient steps w_1, w_2, w_3, w_4 descending an error surface over the weights]

- We can adjust the weights incrementally to minimize the objective function.
- This is called gradient descent (or gradient ascent, if we're maximizing).
- The gradient descent rule for weight w_i is:
  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}
- Or in vector form:
  w^{t+1} = w^t - \epsilon \frac{\partial E}{\partial w}
- For gradient ascent, the sign of the gradient step changes.
Computing the gradient

- Idea: minimize the error by gradient descent.
- Take the derivative of the objective function with respect to the weights:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
  \frac{\partial E}{\partial w_i} = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n) \, x_{i,n}
                                  = \sum_{n=1}^{N} (w^T x_n - c_n) \, x_{i,n}
- And in vector form:
  \frac{\partial E}{\partial w} = \sum_{n=1}^{N} (w^T x_n - c_n) \, x_n
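The vector-form gradient maps directly onto a couple of lines of NumPy (a sketch; X is assumed to hold one augmented pattern x_n per row and c the 0/1 class labels):

```python
import numpy as np

def gradient(w, X, c):
    """dE/dw = sum_n (w^T x_n - c_n) x_n, computed over all patterns at once."""
    errors = X @ w - c      # shape (N,): w^T x_n - c_n for each pattern
    return X.T @ errors     # shape (M+1,): sum_n errors_n * x_n
```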
Simulation: learning the decision boundary

[Figure: left, the two classes and the current linear decision boundary in the (x1, x2) plane; right, the learning curve (error vs. iteration)]

- Each iteration updates the weights using the gradient:
  \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (w^T x_n - c_n) \, x_{i,n}
  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}
- Epsilon is a small value: \epsilon = 0.1 / N
- Epsilon too large: learning diverges.
- Epsilon too small: convergence is slow.
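Putting the pieces together, a minimal sketch of the batch training loop described here (the arrays X and c are placeholders for the augmented patterns and class labels; the iteration count is illustrative):

```python
import numpy as np

def train(X, c, iterations=15):
    """Batch gradient descent on the sum-squared error, with eps = 0.1 / N."""
    N, M1 = X.shape
    w = np.zeros(M1)
    eps = 0.1 / N
    errors = []
    for t in range(iterations):
        grad = X.T @ (X @ w - c)     # dE/dw = sum_n (w^T x_n - c_n) x_n
        w = w - eps * grad           # gradient descent step
        errors.append(0.5 * np.sum((X @ w - c) ** 2))
    return w, errors                 # errors traces out the learning curve
```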
Simulation: learning the decision boundary

[Figure: the converged decision boundary and the corresponding learning curve (error vs. iteration)]

- Learning converges onto the solution that minimizes the error.
- For linear networks, this is guaranteed to converge to the minimum.
- It is also possible to derive a closed-form solution (covered later).
Learning is slow when epsilon is too small

[Figure: error as a function of a weight w, with small gradient steps]

- Here, larger step sizes would converge more quickly to the minimum.
Divergence when epsilon is too large

[Figure: error as a function of a weight w, with steps that overshoot the minimum]

- If the step size is too large, learning can oscillate between different sides of the minimum.
Multi-layer networks

[Figure: a two-layer linear network, x -> y -> z, with weight matrices W and V]

- Can we extend our network to multiple layers? We have:
  y_j = \sum_i w_{i,j} x_i
  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i
- Or in matrix form:
  z = V y = V W x
- Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW. It is not more powerful.
- How do we address this?
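A quick numerical check of this equivalence (random matrices; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # first layer: 5 inputs -> 3 units
V = rng.normal(size=(2, 3))    # second layer: 3 units -> 2 outputs
x = rng.normal(size=5)

z_two_layer = V @ (W @ x)      # z = V y = V W x
U = V @ W                      # equivalent single-layer weights
z_one_layer = U @ x

assert np.allclose(z_two_layer, z_one_layer)
```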
Non-linear neural networks

- Idea: introduce a non-linearity f:
  y_j = f(\sum_i w_{i,j} x_i)
  z_k = f(\sum_j v_{j,k} y_j) = f(\sum_j v_{j,k} f(\sum_i w_{i,j} x_i))
- Now, multiple layers are not equivalent to a single layer.
- Common nonlinearities:
  - threshold: y = 0 if x < 0, y = 1 if x >= 0
  - sigmoid: y = 1 / (1 + exp(-x))

[Figure: plots of the threshold and sigmoid functions f(x)]
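The two nonlinearities, written out as functions (a minimal sketch):

```python
import numpy as np

def threshold(x):
    """0 for x < 0, 1 for x >= 0."""
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    """1 / (1 + exp(-x)), a smooth version of the threshold."""
    return 1.0 / (1.0 + np.exp(-x))
```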
Modeling logical operators

[Figure: the truth tables of x1 AND x2, x1 OR x2, and x1 XOR x2 plotted in the (x1, x2) plane]

- A one-layer binary-threshold network, y_j = f(\sum_i w_{i,j} x_i) with f the threshold function (0 if x < 0, 1 if x >= 0), can implement the logical operators AND and OR, but not XOR.
- Why not?
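To make this concrete, here is a hedged sketch: single threshold units with hand-picked weights realize AND and OR, while XOR uses two layers (one standard construction; the specific weights are illustrative and not from the slides):

```python
import numpy as np

def threshold_unit(x, w, b):
    """Binary threshold unit: 1 if w.x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

def AND(x1, x2):   # fires only when both inputs are 1
    return threshold_unit([x1, x2], w=[1, 1], b=-1.5)

def OR(x1, x2):    # fires when at least one input is 1
    return threshold_unit([x1, x2], w=[1, 1], b=-0.5)

def XOR(x1, x2):   # needs two layers: x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2)
    return threshold_unit([OR(x1, x2), AND(x1, x2)], w=[1, -2], b=-0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2), XOR(x1, x2))
```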
Posterior odds interpretation of a sigmoid
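The usual derivation behind this interpretation (a sketch from standard sources, not recovered from the slide itself): for two classes, Bayes' rule writes the posterior as a sigmoid of the log posterior odds a:

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
  = \frac{1}{1 + e^{-a}} = \sigma(a),
\qquad
a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} .
```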
The general classification/regression problem

Data: D = {x_1, ..., x_T}, where each x_i = {x_1, ..., x_N}_i
- the input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous)

Desired output: y = {y_1, ..., y_K}
- for classification: y_i = 1 if x is in class C_i, and 0 otherwise
- regression for arbitrary y

Model: \theta = {\theta_1, ..., \theta_M}
- the model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters

Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.

A general multi-layer neural network

[Figure: a multi-layer network with inputs x_1, ..., x_M (plus x_0 = 1), hidden layers with bias units y_0 = 1, and outputs y_1, ..., y_K; each layer has its own weights]

- The error function is defined as before, where we use the target vector t_n to define the desired output for the network output y_n:
  E = \frac{1}{2} \sum_{n=1}^{N} (y_n(x_n, W^{1:L}) - t_n)^2
- The forward pass computes the outputs at each layer l = 1, ..., L:
  y_j^l = f(\sum_i w_{i,j}^l \, y_i^{l-1})
  with y^0 = x the input and y^L the output.
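A minimal sketch of this forward pass (the list-of-weight-matrices representation with a leading bias column is a convention of this sketch, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, f=sigmoid):
    """Compute y^l = f(W^l y^{l-1}) for l = 1..L, with y^0 = x.

    Each W^l has shape (units_out, units_in + 1); a constant 1 is
    prepended to the previous layer's output to act as the bias unit.
    """
    y = np.asarray(x, dtype=float)
    activations = [y]
    for W in weights:
        y = f(W @ np.concatenate(([1.0], y)))   # bias unit y_0 = 1
        activations.append(y)
    return activations                          # activations[-1] is the output y^L
```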
Deriving the gradient for a sigmoid neural network

[Figure: the same multi-layer network; the gradients are computed by a backward pass through the layers]

- The mathematical procedure for training is gradient descent, the same as before, except the gradients are more complex to derive.
- Convenient fact for the sigmoid non-linearity:
  \frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x)(1 - \sigma(x))
- The error function is as before:
  E = \frac{1}{2} \sum_{n=1}^{N} (y_n(x_n, W^{1:L}) - t_n)^2
- The backward pass computes the gradients: back-propagation.
  W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W}
- New problem: local minima.
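A sketch of back-propagation for this error function with sigmoid units, for a single pattern (the per-layer weight matrices with a leading bias column are a convention of this sketch, not of the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, t, weights):
    """Gradients of E = 1/2 ||y^L - t||^2 for one pattern, by back-propagation.

    weights[l] has shape (units_out, units_in + 1); column 0 is the bias.
    Returns a list of gradient arrays matching weights.
    """
    # Forward pass: store each layer's output y^l (with y^0 = x).
    ys = [np.asarray(x, dtype=float)]
    for W in weights:
        ys.append(sigmoid(W @ np.concatenate(([1.0], ys[-1]))))

    grads = [None] * len(weights)
    # Output layer: delta = dE/da^L = (y^L - t) * sigma'(a^L), using sigma'(a) = y (1 - y).
    delta = (ys[-1] - t) * ys[-1] * (1.0 - ys[-1])
    for l in range(len(weights) - 1, -1, -1):
        y_prev = np.concatenate(([1.0], ys[l]))
        grads[l] = np.outer(delta, y_prev)          # dE/dW^l for this layer
        if l > 0:
            # Propagate back through W^l (dropping the bias column)
            # and through the previous layer's sigmoid.
            delta = (weights[l][:, 1:].T @ delta) * ys[l] * (1.0 - ys[l])
    return grads
```

Batch gradients are obtained by summing these per-pattern gradients over n, after which the update W^{t+1} = W^t - eps * dE/dW is applied as before.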
Applications: Driving (output is analog: steering direction)

[Figure: a network with 1 hidden layer (4 hidden units) that maps a road image to a steering direction]

- Learns to drive on roads; demonstrated at highway speeds over 100s of miles.
- Training data: images plus the corresponding steering angle.
- Important: conditioning of the training data to generate new examples avoids overfitting.
- D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishers, 1993.
Real image input is augmented to avoid overfitting

[Figure: camera images from the driving network's training set, conditioned to generate new training examples]
Hand-written digits: LeNet

- Takes as input an image of a handwritten digit; each pixel is an input unit.
- Complex network with many layers; the output is the digit class.
- Tested on a large (50,000+) database of handwritten samples, with a very low error rate (<< 1%).
- Runs in real time and has been used commercially.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.
- http://yann.lecun.com/exdb/lenet/
LeNet

[Figure: LeNet digit-recognition examples; see http://yann.lecun.com/exdb/lenet/]
Object recognition

- LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
- http://www.cs.nyu.edu/~yann/research/norb/
Summary

- Decision boundaries
  - Bayes optimal
  - linear discriminant
  - linear separability
- Classification vs. regression
- Optimization by gradient descent
- Degeneracy of a multi-layer linear network
- Non-linearities: threshold, sigmoid, others?
- Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima, training can be slow, need to set the step size