decision boundaries
linear discriminants
perceptron
gradient learning
neural networks
The Iris dataset with decision tree boundaries
[Figure: the Iris data plotted as petal length (cm) vs. petal width (cm), with decision tree boundaries]
The optimal decision boundary for $C_2$ vs $C_3$
[Figure: the Iris data, petal length (cm) vs. petal width (cm), overlaid with the class-conditional densities $p(\text{petal length} \mid C_2)$ and $p(\text{petal length} \mid C_3)$]
The discriminant assigns:

class 1 if $y \ge 0$,
class 2 if $y < 0$.

The decision boundary is the line

$$m_1 x_1 + m_2 x_2 = b \quad\Longrightarrow\quad x_2 = \frac{-m_1 x_1 + b}{m_2}$$

Or in terms of scalars:

$$y = \mathbf{m}^T \mathbf{x} + b = m_1 x_1 + m_2 x_2 + b = \sum_i m_i x_i + b$$
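As a concrete illustration (not from the slides), a minimal sketch of this scalar form in Python; the weights and test point are made up:

```python
import numpy as np

# Hypothetical weights and bias, for illustration only.
m = np.array([1.0, -2.0])       # (m1, m2)
b = 0.5

def classify(x):
    """Linear discriminant: y = m^T x + b; class 1 if y >= 0, else class 2."""
    y = m @ x + b               # y = sum_i m_i x_i + b
    return 1 if y >= 0 else 2

print(classify(np.array([4.7, 1.4])))   # class depends on which side of the boundary x falls
```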
Linear separability
[Figure: the Iris data, petal length (cm) vs. petal width (cm), illustrating linearly separable classes]
The linear network computes

$$y = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{M} w_i x_i + b$$

The bias can be absorbed into the weights:

$$y = \mathbf{w}^T \mathbf{x} = \sum_{i=0}^{M} w_i x_i$$

where $w_0 = b$ is the bias and $x_0$ is a dummy input that is always 1.

[Figure: two equivalent diagrams: input units $x_1, \dots, x_M$ feeding an output unit $y$ through weights $w_1, \dots, w_M$ with bias $b$, and the same network with the bias as weight $w_0$ on a dummy input $x_0 = 1$]
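A quick numeric check of this equivalence (a sketch; the values are arbitrary):

```python
import numpy as np

# Arbitrary values, for illustration only.
w = np.array([0.3, -1.2])                 # w1, ..., wM
b = 0.7
x = np.array([2.0, 1.5])

y_explicit = w @ x + b                    # y = w^T x + b
w0 = np.concatenate(([b], w))             # prepend w0 = b to the weights
x0 = np.concatenate(([1.0], x))           # prepend the dummy input x0 = 1
y_absorbed = w0 @ x0                      # y = sum_{i=0}^{M} w_i x_i

assert np.isclose(y_explicit, y_absorbed)
```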
Determining, i.e. learning, the optimal linear discriminant
Choose $\mathbf{w}$ to minimize the squared error over the $N$ training examples:

$$E = \frac{1}{2} \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - c_n\right)^2$$

where $\mathbf{x}_n = \{x_1, \dots, x_M\}_n$ and the target for the $n$th example is

$$c_n = \begin{cases} 0 & \mathbf{x}_n \in \text{class 1} \\ 1 & \mathbf{x}_n \in \text{class 2} \end{cases}$$
We've seen this before: curve fitting
example from Bishop (2006), Pattern Recognition and Machine Learning
$$t = \sin(2\pi x) + \text{noise}$$

[Figure: training points $(x_n, t_n)$ and the fitted curve $y(x_n, \mathbf{w})$ over $x \in [0, 1]$, $t \in [-1, 1]$]
Neural networks compared to polynomial curve fitting
[Figure: four panels showing polynomial fits of increasing order to the same data, $x \in [0, 1]$, $t \in [-1, 1]$]
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2$$
example from Bishop (2006), Pattern Recognition and Machine Learning
For the linear network, $M = 1$ and there are multiple input dimensions.
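As an aside (not from the slides), a minimal sketch of this curve-fitting setup using NumPy's least-squares polynomial fit; the sample size, noise level, and order M are assumptions:

```python
import numpy as np

# Fit a degree-M polynomial to noisy samples of sin(2*pi*x)
# by minimizing the squared error E(w).
rng = np.random.default_rng(0)
N, M = 10, 3                                    # sample count and order are assumptions
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)

w = np.polyfit(x, t, deg=M)                     # least-squares estimate of w
y = np.polyval(w, x)                            # y(x_n, w)
E = 0.5 * np.sum((y - t) ** 2)                  # E(w) = 1/2 sum [y(x_n, w) - t_n]^2
print(f"E(w) = {E:.4f}")
```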
General form of a linear network
Each output unit sums over all inputs (with the bias as $w_{0,j}$ on the dummy input $x_0 = 1$):

$$y_j = \sum_{i=0}^{M} w_{i,j}\, x_i \qquad \text{or in matrix form:} \qquad \mathbf{y} = \mathbf{W} \mathbf{x}$$

[Figure: linear network with input units $x_0 = 1, x_1, \dots, x_M$, weight matrix $\mathbf{W}$, and output units $y_1, \dots, y_K$]
Training the network: Optimization by gradient descent
Adjust each weight in the direction that decreases the error, with step size $\epsilon$:

$$w_i^{t+1} = w_i^t - \epsilon\, \frac{\partial E}{\partial w_i}$$

Or in vector form:

$$\mathbf{w}^{t+1} = \mathbf{w}^t - \epsilon\, \frac{\partial E}{\partial \mathbf{w}}$$
Computing the gradient
$$E = \frac{1}{2} \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - c_n\right)^2$$

$$\frac{\partial E}{\partial w_i} = \frac{2}{2} \sum_{n=1}^{N} \left(w_0 x_{0,n} + \cdots + w_i x_{i,n} + \cdots + w_M x_{M,n} - c_n\right) x_{i,n} = \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - c_n\right) x_{i,n}$$

Or in vector form:

$$\frac{\partial E}{\partial \mathbf{w}} = \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - c_n\right) \mathbf{x}_n$$
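A sketch (not from the slides) verifying this analytic gradient against finite differences on made-up data:

```python
import numpy as np

# Check dE/dw = sum_n (w^T x_n - c_n) x_n against central differences.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                    # rows are x_n
X[:, 0] = 1.0                                   # dummy input x0 = 1
c = rng.integers(0, 2, size=20).astype(float)   # labels c_n in {0, 1}
w = rng.normal(size=3)

def E(w):
    return 0.5 * np.sum((X @ w - c) ** 2)

grad_analytic = X.T @ (X @ w - c)               # sum_n (w^T x_n - c_n) x_n

eps = 1e-6
grad_numeric = np.array([
    (E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be tiny, ~1e-8
```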
Simulation: learning the decision boundary

At each iteration, compute the gradient and take a step:

$$\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - c_n\right) x_{i,n}, \qquad w_i^{t+1} = w_i^t - \epsilon\, \frac{\partial E}{\partial w_i}$$

[Figure: successive frames of the decision boundary during learning, alongside the learning curve (error vs. iteration)]
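A minimal sketch of such a simulation (synthetic 2-D data; the step size and iteration count are assumptions, not from the slides):

```python
import numpy as np

# Gradient descent for the linear discriminant, tracking the learning curve.
rng = np.random.default_rng(2)
X1 = rng.normal([1.0, 1.0], 0.3, size=(50, 2))            # class 1 (c = 0)
X2 = rng.normal([2.5, 2.0], 0.3, size=(50, 2))            # class 2 (c = 1)
X = np.hstack([np.ones((100, 1)), np.vstack([X1, X2])])   # prepend x0 = 1
c = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(3)
epsilon = 0.005                                   # step size
learning_curve = []
for t in range(200):
    err = X @ w - c
    learning_curve.append(0.5 * np.sum(err ** 2)) # E at step t
    w -= epsilon * (X.T @ err)                    # w <- w - eps * dE/dw

print(w, learning_curve[0], learning_curve[-1])   # error should drop steadily
```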
Multi-layer linear networks

Stacking a second linear layer on $y_j = \sum_i w_{i,j}\, x_i$ gives

$$z_k = \sum_j v_{j,k}\, y_j = \sum_j v_{j,k} \sum_i w_{i,j}\, x_i$$

Or in matrix form:

$$\mathbf{z} = \mathbf{V} \mathbf{y} = \mathbf{V} \mathbf{W} \mathbf{x}$$

The composition is still a single linear transformation (see the check below). How do we address this?
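A quick numeric check of this collapse (a sketch; the shapes and values are arbitrary):

```python
import numpy as np

# Two stacked linear layers are equivalent to one linear map U = VW.
rng = np.random.default_rng(3)
W = rng.normal(size=(4, 3))   # layer 1: y = W x
V = rng.normal(size=(2, 4))   # layer 2: z = V y
x = rng.normal(size=3)

z_two_layer = V @ (W @ x)     # z = V(Wx)
z_one_layer = (V @ W) @ x     # z = (VW)x, a single linear layer

assert np.allclose(z_two_layer, z_one_layer)
```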
Non-linear neural networks
Common nonlinearities:
- threshold
- sigmoid
$$y_j = f\Big(\sum_i w_{i,j}\, x_i\Big)$$

$$z_k = f\Big(\sum_j v_{j,k}\, y_j\Big) = f\Big(\sum_j v_{j,k}\, f\Big(\sum_i w_{i,j}\, x_i\Big)\Big)$$
[Figure: the threshold and sigmoid nonlinearities $f(x)$ plotted for $x \in [-5, 5]$]

$$\text{threshold:}\quad y = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases} \qquad\qquad \text{sigmoid:}\quad y = \frac{1}{1 + \exp(-x)}$$
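These two functions in NumPy, as a small sketch (the weights in the forward-pass example at the end are made up):

```python
import numpy as np

def threshold(x):
    # y = 0 if x < 0, 1 if x >= 0
    return (x >= 0).astype(float)

def sigmoid(x):
    # y = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Two-layer nonlinear forward pass: z = f(V f(W x)), with made-up weights.
W = np.array([[1.0, -1.0], [0.5, 2.0]])
V = np.array([[1.0, 1.0]])
x = np.array([0.2, -0.4])
print(sigmoid(V @ sigmoid(W @ x)))
```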
Modeling logical operators
A one-layer binary-threshold network can implement the logical operators AND and OR, but not XOR. Why not? (See the sketch after the figure.)
[Figure: two-input threshold units for x1 AND x2, x1 OR x2, and x1 XOR x2, each computing $y_j = f\big(\sum_i w_{i,j}\, x_i\big)$ with the threshold nonlinearity ($f(x) = 0$ for $x < 0$, $1$ for $x \ge 0$)]
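A sketch illustrating the point (the AND/OR weights are standard textbook values, not from the slides): a single threshold unit realizes AND and OR, while a brute-force search over a weight grid finds none that realizes XOR, because XOR is not linearly separable:

```python
import itertools
import numpy as np

def unit(x1, x2, w1, w2, b):
    # Single binary-threshold unit: y = f(w1*x1 + w2*x2 + b)
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([unit(x1, x2, 1, 1, -1.5) for x1, x2 in inputs])   # AND: [0, 0, 0, 1]
print([unit(x1, x2, 1, 1, -0.5) for x1, x2 in inputs])   # OR:  [0, 1, 1, 1]

xor = [0, 1, 1, 0]
grid = np.linspace(-2, 2, 41)
found = any(
    [unit(x1, x2, w1, w2, b) for x1, x2 in inputs] == xor
    for w1, w2, b in itertools.product(grid, grid, grid)
)
print(found)   # False: no single threshold unit computes XOR
```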
Posterior odds interpretation of a sigmoid
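The standard derivation behind this interpretation (as in Bishop 2006): for two classes, Bayes' rule gives

$$p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)}$$

where

$$a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$$

is the log posterior odds, so the sigmoid of $a$ is exactly the posterior probability of class 1.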
The general classification/regression problem
Data: $D = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$, with each observation $\mathbf{x}_i = \{x_1, \dots, x_N\}_i$
Desired output: $\mathbf{y} = \{y_1, \dots, y_K\}$
Model parameters: $\theta = \{\theta_1, \dots, \theta_M\}$

Given data, we want to learn a model that can correctly classify novel observations or map the inputs to the outputs.

- The input is a set of T observations, each an N-dimensional vector (binary, discrete, or continuous).
- The model (e.g. a decision tree or a multi-layer neural network) is defined by M parameters.
- For classification:

$$y_i = \begin{cases} 1 & \text{if } \mathbf{x}_i \in C_i \text{ (class } i) \\ 0 & \text{otherwise} \end{cases}$$

- Regression handles arbitrary $\mathbf{y}$.
Multi-layer networks

$$E = \sum_{n=1}^{N} \left( y_n(\mathbf{x}_n, \mathbf{W}_{1:L}) - t_n \right)^2$$

[Figure: a multi-layer network: input units $x_0 = 1, x_1, \dots, x_M$, hidden layers with dummy units $y_0 = 1$, and output units $y_1, \dots, y_K$]

Each layer $l$ applies the same rule to the previous layer's outputs:

$$y_j^l = f\Big(\sum_i w_{i,j}^l\, y_i^{l-1}\Big), \qquad l = \{1, \dots, L\}$$

with input $\mathbf{y}^0 = \mathbf{x}$ and output $= \mathbf{y}^L$.
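A minimal sketch of this layer recursion in NumPy (layer sizes and weights are made up; each weight matrix carries its bias in the column for the dummy unit):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """Layer recursion y^l = f(sum_i w^l_{i,j} y^{l-1}_i).

    `weights` is a list of L matrices; a dummy unit y0 = 1 is prepended
    at every layer so each matrix includes its own bias weights.
    """
    y = x
    for W in weights:                       # l = 1, ..., L
        y = sigmoid(W @ np.concatenate(([1.0], y)))
    return y                                # output = y^L

# Made-up shapes for illustration: 2 inputs -> 3 hidden -> 1 output.
rng = np.random.default_rng(4)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
print(forward(np.array([0.5, -1.0]), weights))
```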
Deriving the gradient for a sigmoid neural network
The error function is the same squared error,

$$E = \sum_{n=1}^{N} \left( y_n(\mathbf{x}_n, \mathbf{W}_{1:L}) - t_n \right)^2$$

[Figure: the same multi-layer network: input, layer 1, layer 2, output]

and every layer's weights are updated by gradient descent:

$$\mathbf{W}^{t+1} = \mathbf{W}^t - \epsilon\, \frac{\partial E}{\partial \mathbf{W}}$$
New problem: local minima
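A minimal backprop sketch for a two-layer sigmoid network under this update rule (the data, layer sizes, step size, and the omission of biases are all simplifying assumptions, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))                 # inputs x_n
t = (X[:, 0] * X[:, 1] > 0).astype(float)     # made-up binary targets t_n
W1 = rng.normal(size=(2, 3)) * 0.5            # layer 1 weights
W2 = rng.normal(size=(3, 1)) * 0.5            # layer 2 weights
eps = 0.01                                    # step size

for step in range(500):
    h = sigmoid(X @ W1)                       # hidden layer y^1
    y = sigmoid(h @ W2)[:, 0]                 # output y^2
    E = np.sum((y - t) ** 2)                  # E = sum_n (y_n - t_n)^2
    # Backpropagate: chain rule through the sigmoids (f' = f(1 - f)).
    d_out = 2 * (y - t) * y * (1 - y)         # dE/d(pre-activation of output)
    dW2 = h.T @ d_out[:, None]
    d_hid = (d_out[:, None] @ W2.T) * h * (1 - h)
    dW1 = X.T @ d_hid
    W1 -= eps * dW1                           # W <- W - eps * dE/dW, per layer
    W2 -= eps * dW2

print(E)                                      # error after training
```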
Applications: Driving (output is analog: steering direction)
Real example: a network with 1 hidden layer (4 hidden units).
- Learns to drive on roads
- Demonstrated at highway speeds over 100s of miles
- Training data: images + the corresponding steering angle
- Important: conditioning of the training data to generate new examples, which avoids overfitting

D. Pomerleau. Neural network perception for mobile robot guidance. Kluwer Academic Publishing, 1993.
Real image input is augmented to avoid overfitting
Hand-written digits: LeNet

Real example:
- Takes as input an image of a handwritten digit; each pixel is an input unit
- Complex network with many layers
- Output is the digit class
- Tested on a large (50,000+) database of handwritten samples
- Real-time; used commercially
- Very low error rate (≪ 1%)

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998. http://yann.lecun.com/exdb/lenet/
LeNet
http://yann.lecun.com/exdb/lenet/
Object recognition
LeCun, Huang, Bottou (2004). Learning Methods for Generic Object Recognition
with Invariance to Pose and Lighting. Proceedings of CVPR 2004.
http://www.cs.nyu.edu/~yann/research/norb/
Summary
Decision boundaries
- Bayes optimal
- linear discriminant
- linear separability
Classification vs regression
Issues:
- very general architecture, can solve many problems
- large number of parameters: need to avoid overfitting
- usually requires a large amount of data, or a special architecture
- local minima, training can be slow, need to set the step size