
Artificial Intelligence: Representation and Problem Solving
15-381
January 16, 2007

Neural Networks
Michael S. Lewicki, Carnegie Mellon
Topics
- decision boundaries
- linear discriminants
- perceptron
- gradient learning
- neural networks
The Iris dataset with decision tree boundaries

[Figure: scatter plot of the Iris classes, petal length (cm) vs. petal width (cm), with decision tree boundaries overlaid]
The optimal decision boundary for C2 vs. C3

[Figure: Iris scatter plot, petal length (cm) vs. petal width (cm), with the class-conditional densities p(petal length | C2) and p(petal length | C3) overlaid]

- the optimal decision boundary is determined from the statistical distribution of the classes
- it is optimal only if the model is correct
- it assigns a precise degree of uncertainty to the classification
Optimal decision boundary

[Figure: the class-conditional densities p(petal length | C2) and p(petal length | C3) plotted together with the posteriors p(C2 | petal length) and p(C3 | petal length)]
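As a concrete illustration of how a posterior such as p(C2 | petal length) is obtained from the class-conditional densities, here is a hedged sketch using Bayes' rule with Gaussian class-conditionals and equal priors (the means and standard deviations are made-up values, not fitted to the Iris data):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def posterior_C2(x, prior_C2=0.5):
    """p(C2 | x) by Bayes' rule; the Bayes decision boundary is where this crosses 0.5."""
    num = gaussian_pdf(x, mean=4.3, std=0.5) * prior_C2          # p(x | C2) p(C2)
    den = num + gaussian_pdf(x, mean=5.6, std=0.6) * (1 - prior_C2)  # + p(x | C3) p(C3)
    return num / den

print(posterior_C2(4.0), posterior_C2(5.0), posterior_C2(6.0))
```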
Can we do better?

[Figure: the same class-conditional densities p(petal length | C2) and p(petal length | C3) overlaid on the Iris scatter plot]

- The only way is to use more information.
- Decision trees use both petal width and petal length.
Arbitrary decision boundaries would be more powerful

[Figure: the Iris scatter plot, petal length (cm) vs. petal width (cm), with a non-linear boundary sketched in: decision boundaries could be non-linear]
Defining a decision boundary

[Figure: two classes in the (x1, x2) plane separated by a linear decision boundary]

- consider just two classes
- we want points on one side of the line to be in class 1, otherwise class 2
- 2D linear discriminant function, in terms of scalars:
  y = m^T x + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b
- this defines a 2D plane, which leads to the decision:
  class 1 if y >= 0, class 2 if y < 0
- the decision boundary is where y = m^T x + b = 0, i.e.
  m_1 x_1 + m_2 x_2 = -b, or x_2 = -(m_1 x_1 + b) / m_2
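To make the decision rule concrete, here is a minimal sketch in Python (the weights m, bias b, and test point are illustrative values, not taken from the slides):

```python
import numpy as np

def classify(x, m, b):
    """Linear discriminant: class 1 if m.x + b >= 0, else class 2."""
    y = np.dot(m, x) + b
    return 1 if y >= 0 else 2

# Hypothetical boundary m1*x1 + m2*x2 + b = 0
m = np.array([1.0, -2.0])
b = -1.5
print(classify(np.array([5.0, 1.5]), m, b))  # evaluates the sign of m.x + b
```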
Linear separability

[Figure: the Iris classes again; one pair of classes is labeled "linearly separable", the other "not linearly separable"]

- Two classes are linearly separable if they can be separated by a linear combination of attributes:
  - 1D: threshold
  - 2D: line
  - 3D: plane
  - M-D: hyperplane
Diagramming the classifier as a neural network

[Figure: a feedforward network with input units x_1, ..., x_M, weights w_1, ..., w_M, a bias b, and one output unit y; in the equivalent form the bias is a weight w_0 on a dummy input x_0 = 1]

- The feedforward neural network is specified by weights w_i and bias b:
  y = w^T x + b = \sum_{i=1}^{M} w_i x_i + b
- It can be written equivalently as
  y = w^T x = \sum_{i=0}^{M} w_i x_i
  where w_0 = b is the bias and x_0 is a dummy input that is always 1.
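A small sketch of the bias trick described above (the numbers are illustrative):

```python
import numpy as np

x = np.array([0.5, 2.0])          # inputs x_1, x_2
w = np.array([1.0, -0.5])         # weights w_1, w_2
b = 0.25                          # bias

y1 = w @ x + b                    # y = w^T x + b

# Equivalent form: absorb the bias as w_0 on a dummy input x_0 = 1
x_aug = np.concatenate(([1.0], x))   # x_0 = 1
w_aug = np.concatenate(([b], w))     # w_0 = b
y2 = w_aug @ x_aug                   # y = w^T x, with the sum running from i = 0 to M

assert np.isclose(y1, y2)
```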
Determining (i.e., learning) the optimal linear discriminant

- First we must define an objective function, i.e., the goal of learning.
- Simple idea: adjust the weights so that the output y(x_n) matches the class c_n.
- Objective: minimize the sum-squared error over all patterns x_n:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
- Note the notation: x_n = {x_1, ..., x_M}_n defines a pattern vector.
- We define the desired class as:
  c_n = 0 if x_n is in class 1, c_n = 1 if x_n is in class 2.
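This objective translates directly into a few lines of NumPy (a sketch; X is assumed to hold one augmented pattern x_n per row and c the 0/1 class labels):

```python
import numpy as np

def sum_squared_error(w, X, c):
    """E = 1/2 * sum_n (w^T x_n - c_n)^2 over all patterns."""
    y = X @ w                      # one output y(x_n) per pattern
    return 0.5 * np.sum((y - c) ** 2)
```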
We've seen this before: curve fitting

[Figure: the curve-fitting example from Bishop (2006), Pattern Recognition and Machine Learning: training points (x_n, t_n) generated as t = sin(2*pi*x) plus noise, together with a fitted curve y(x_n, w)]
Neural networks compared to polynomial curve fitting

[Figure: polynomial fits of different order M to the same data; example from Bishop (2006), Pattern Recognition and Machine Learning]

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j

E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

For the linear network, M = 1 and there are multiple input dimensions.
General form of a linear network

[Figure: a one-layer linear network: inputs x_0 = 1, x_1, ..., x_M, weight matrix W, and outputs y_1, ..., y_K]

- A linear neural network is simply a linear transformation of the input:
  y_j = \sum_{i=0}^{M} w_{i,j} x_i
- Or, in matrix-vector form:
  y = W x
- Multiple outputs correspond to multivariate regression.
Training the network: Optimization by gradient descent

[Figure: successive gradient steps w_1, w_2, w_3, w_4 descending an error surface over the weights]

- We can adjust the weights incrementally to minimize the objective function.
- This is called gradient descent (or gradient ascent, if we're maximizing).
- The gradient descent rule for weight w_i is:
  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}
- Or in vector form:
  w^{t+1} = w^t - \epsilon \frac{\partial E}{\partial w}
- For gradient ascent, the sign of the gradient step changes.
Computing the gradient

- Idea: minimize the error by gradient descent.
- Take the derivative of the objective function with respect to the weights:
  E = \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - c_n)^2
  \frac{\partial E}{\partial w_i} = \frac{2}{2} \sum_{n=1}^{N} (w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n) \, x_{i,n}
                                  = \sum_{n=1}^{N} (w^T x_n - c_n) \, x_{i,n}
- And in vector form:
  \frac{\partial E}{\partial w} = \sum_{n=1}^{N} (w^T x_n - c_n) \, x_n
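The vector-form gradient maps directly onto a couple of lines of NumPy (a sketch; X is assumed to hold one augmented pattern x_n per row and c the 0/1 class labels):

```python
import numpy as np

def gradient(w, X, c):
    """dE/dw = sum_n (w^T x_n - c_n) x_n, computed over all patterns at once."""
    errors = X @ w - c      # shape (N,): w^T x_n - c_n for each pattern
    return X.T @ errors     # shape (M+1,): sum_n errors_n * x_n
```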
Simulation: learning the decision boundary

[Figure: left, the two classes and the current linear decision boundary in the (x1, x2) plane; right, the learning curve (error vs. iteration)]

- Each iteration updates the weights using the gradient:
  \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} (w^T x_n - c_n) \, x_{i,n}
  w_i^{t+1} = w_i^t - \epsilon \frac{\partial E}{\partial w_i}
- Epsilon is a small value: \epsilon = 0.1 / N
- Epsilon too large: learning diverges.
- Epsilon too small: convergence is slow.
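Putting the pieces together, a minimal sketch of the batch training loop described here (the arrays X and c are placeholders for the augmented patterns and class labels; the iteration count is illustrative):

```python
import numpy as np

def train(X, c, iterations=15):
    """Batch gradient descent on the sum-squared error, with eps = 0.1 / N."""
    N, M1 = X.shape
    w = np.zeros(M1)
    eps = 0.1 / N
    errors = []
    for t in range(iterations):
        grad = X.T @ (X @ w - c)     # dE/dw = sum_n (w^T x_n - c_n) x_n
        w = w - eps * grad           # gradient descent step
        errors.append(0.5 * np.sum((X @ w - c) ** 2))
    return w, errors                 # errors traces out the learning curve
```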
Simulation: learning the decision boundary

[Figure: the converged decision boundary and the corresponding learning curve (error vs. iteration)]

- Learning converges onto the solution that minimizes the error.
- For linear networks, this is guaranteed to converge to the minimum.
- It is also possible to derive a closed-form solution (covered later).
Learning is slow when epsilon is too small

[Figure: error as a function of a weight w, with small gradient steps]

- Here, larger step sizes would converge more quickly to the minimum.
Divergence when epsilon is too large

[Figure: error as a function of a weight w, with steps that overshoot the minimum]

- If the step size is too large, learning can oscillate between different sides of the minimum.
Multi-layer networks

[Figure: a two-layer linear network, x -> y -> z, with weight matrices W and V]

- Can we extend our network to multiple layers? We have:
  y_j = \sum_i w_{i,j} x_i
  z_k = \sum_j v_{j,k} y_j = \sum_j v_{j,k} \sum_i w_{i,j} x_i
- Or in matrix form:
  z = V y = V W x
- Thus a two-layer linear network is equivalent to a one-layer linear network with weights U = VW. It is not more powerful.
- How do we address this?
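A quick numerical check of this equivalence (random matrices; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # first layer: 5 inputs -> 3 units
V = rng.normal(size=(2, 3))    # second layer: 3 units -> 2 outputs
x = rng.normal(size=5)

z_two_layer = V @ (W @ x)      # z = V y = V W x
U = V @ W                      # equivalent single-layer weights
z_one_layer = U @ x

assert np.allclose(z_two_layer, z_one_layer)
```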
Non-linear neural networks

- Idea: introduce a non-linearity f:
  y_j = f(\sum_i w_{i,j} x_i)
  z_k = f(\sum_j v_{j,k} y_j) = f(\sum_j v_{j,k} f(\sum_i w_{i,j} x_i))
- Now, multiple layers are not equivalent to a single layer.
- Common nonlinearities:
  - threshold: y = 0 if x < 0, y = 1 if x >= 0
  - sigmoid: y = 1 / (1 + exp(-x))

[Figure: plots of the threshold and sigmoid functions f(x)]
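The two nonlinearities, written out as functions (a minimal sketch):

```python
import numpy as np

def threshold(x):
    """0 for x < 0, 1 for x >= 0."""
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    """1 / (1 + exp(-x)), a smooth version of the threshold."""
    return 1.0 / (1.0 + np.exp(-x))
```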
Modeling logical operators

[Figure: the truth tables of x1 AND x2, x1 OR x2, and x1 XOR x2 plotted in the (x1, x2) plane]

- A one-layer binary-threshold network, y_j = f(\sum_i w_{i,j} x_i) with f the threshold function (0 if x < 0, 1 if x >= 0), can implement the logical operators AND and OR, but not XOR.
- Why not?
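To make this concrete, here is a hedged sketch: single threshold units with hand-picked weights realize AND and OR, while XOR uses two layers (one standard construction; the specific weights are illustrative and not from the slides):

```python
import numpy as np

def threshold_unit(x, w, b):
    """Binary threshold unit: 1 if w.x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

def AND(x1, x2):   # fires only when both inputs are 1
    return threshold_unit([x1, x2], w=[1, 1], b=-1.5)

def OR(x1, x2):    # fires when at least one input is 1
    return threshold_unit([x1, x2], w=[1, 1], b=-0.5)

def XOR(x1, x2):   # needs two layers: x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2)
    return threshold_unit([OR(x1, x2), AND(x1, x2)], w=[1, -2], b=-0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2), XOR(x1, x2))
```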
Posterior odds interpretation of a sigmoid
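The usual derivation behind this interpretation (a sketch from standard sources, not recovered from the slide itself): for two classes, Bayes' rule writes the posterior as a sigmoid of the log posterior odds a:

```latex
p(C_1 \mid x)
  = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
  = \frac{1}{1 + e^{-a}} = \sigma(a),
\qquad
a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} .
```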
The general classification/regression problem

Data: D = {x_1, ..., x_T}, where each x_i = {x_1, ..., x_N}_i
- the input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous)

Desired output: y = {y_1, ..., y_K}
- for classification: y_i = 1 if x is in class C_i, and 0 otherwise
- regression for arbitrary y

Model: \theta = {\theta_1, ..., \theta_M}
- the model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters

Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.

A general multi-layer neural network

[Figure: a multi-layer network with inputs x_1, ..., x_M (plus x_0 = 1), hidden layers with bias units y_0 = 1, and outputs y_1, ..., y_K; each layer has its own weights]

- The error function is defined as before, where we use the target vector t_n to define the desired output for the network output y_n:
  E = \frac{1}{2} \sum_{n=1}^{N} (y_n(x_n, W^{1:L}) - t_n)^2
- The forward pass computes the outputs at each layer l = 1, ..., L:
  y_j^l = f(\sum_i w_{i,j}^l \, y_i^{l-1})
  with y^0 = x the input and y^L the output.
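A minimal sketch of this forward pass (the list-of-weight-matrices representation with a leading bias column is a convention of this sketch, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, f=sigmoid):
    """Compute y^l = f(W^l y^{l-1}) for l = 1..L, with y^0 = x.

    Each W^l has shape (units_out, units_in + 1); a constant 1 is
    prepended to the previous layer's output to act as the bias unit.
    """
    y = np.asarray(x, dtype=float)
    activations = [y]
    for W in weights:
        y = f(W @ np.concatenate(([1.0], y)))   # bias unit y_0 = 1
        activations.append(y)
    return activations                          # activations[-1] is the output y^L
```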
Deriving the gradient for a sigmoid neural network

[Figure: the same multi-layer network; the gradients are computed by a backward pass through the layers]

- The mathematical procedure for training is gradient descent, the same as before, except the gradients are more complex to derive.
- Convenient fact for the sigmoid non-linearity:
  \frac{d\sigma(x)}{dx} = \frac{d}{dx} \frac{1}{1 + \exp(-x)} = \sigma(x)(1 - \sigma(x))
- The error function is as before:
  E = \frac{1}{2} \sum_{n=1}^{N} (y_n(x_n, W^{1:L}) - t_n)^2
- The backward pass computes the gradients: back-propagation.
  W^{t+1} = W^t - \epsilon \frac{\partial E}{\partial W}
- New problem: local minima.
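A sketch of back-propagation for this error function with sigmoid units, for a single pattern (the per-layer weight matrices with a leading bias column are a convention of this sketch, not of the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, t, weights):
    """Gradients of E = 1/2 ||y^L - t||^2 for one pattern, by back-propagation.

    weights[l] has shape (units_out, units_in + 1); column 0 is the bias.
    Returns a list of gradient arrays matching weights.
    """
    # Forward pass: store each layer's output y^l (with y^0 = x).
    ys = [np.asarray(x, dtype=float)]
    for W in weights:
        ys.append(sigmoid(W @ np.concatenate(([1.0], ys[-1]))))

    grads = [None] * len(weights)
    # Output layer: delta = dE/da^L = (y^L - t) * sigma'(a^L), using sigma'(a) = y (1 - y).
    delta = (ys[-1] - t) * ys[-1] * (1.0 - ys[-1])
    for l in range(len(weights) - 1, -1, -1):
        y_prev = np.concatenate(([1.0], ys[l]))
        grads[l] = np.outer(delta, y_prev)          # dE/dW^l for this layer
        if l > 0:
            # Propagate back through W^l (dropping the bias column)
            # and through the previous layer's sigmoid.
            delta = (weights[l][:, 1:].T @ delta) * ys[l] * (1.0 - ys[l])
    return grads
```

Batch gradients are obtained by summing these per-pattern gradients over n, after which the update W^{t+1} = W^t - eps * dE/dW is applied as before.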
Applications: Driving (output is analog: steering direction)

[Figure: a network with 1 hidden layer (4 hidden units) that maps a road image to a steering direction]

- Learns to drive on roads; demonstrated at highway speeds over 100s of miles.
- Training data: images plus the corresponding steering angle.
- Important: conditioning of the training data to generate new examples avoids overfitting.
- D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishers, 1993.
Real image input is augmented to avoid overfitting

[Figure: camera images from the driving network's training set, conditioned to generate new training examples]
Hand-written digits: LeNet

- Takes as input an image of a handwritten digit; each pixel is an input unit.
- Complex network with many layers; the output is the digit class.
- Tested on a large (50,000+) database of handwritten samples, with a very low error rate (<< 1%).
- Runs in real time and has been used commercially.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.
- http://yann.lecun.com/exdb/lenet/
LeNet

[Figure: LeNet digit-recognition examples; see http://yann.lecun.com/exdb/lenet/]
Object recognition

- LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
- http://www.cs.nyu.edu/~yann/research/norb/
Summary

- Decision boundaries
  - Bayes optimal
  - linear discriminant
  - linear separability
- Classification vs. regression
- Optimization by gradient descent
- Degeneracy of a multi-layer linear network
- Non-linearities: threshold, sigmoid, others?
- Issues:
  - very general architecture, can solve many problems
  - large number of parameters: need to avoid overfitting
  - usually requires a large amount of data, or a special architecture
  - local minima, training can be slow, need to set the step size