Lecture 6: Artificial Neural Networks
MH4510 Statistical Learning and Data Mining
PUN Chi Seng
Neural Networks
The term neural network has evolved to encompass a large class of models and learning methods that were originally inspired by thoughts on how the brain might work. The goal is to mimic the functions and mechanisms of our brain.
• Very widely used in the 1980s and early 1990s; popularity diminished in the late 1990s.
• Recent resurgence: state-of-the-art technique for many applications.
Before we go into neural networks, it may be better to learn a bit about the structure of the human brain:
• Our brain consists of approximately 100 billion cells of a specific type known as neurons. Each of these neurons is typically connected to about 10,000 other neurons. These neurons do not regenerate.
• It is widely accepted that these neurons are responsible for our ability to memorize, learn, generalize, and think.
The exact function of these neurons is still a mystery, but a very simple mathematical model that mimics these neurons provides surprisingly good performance in pattern recognition, classification, and prediction.
Artificial Neuron
To mimic this neuron, we have the following artificial neuron: a unit that receives inputs $x_1, \ldots, x_p$, forms the weighted sum $z = b + \sum_{j=1}^p w_j x_j$, and emits the output $f(z)$ for some activation function $f$.
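As a minimal illustration in R (our own sketch, not from the slides), a single artificial neuron with a logistic activation can be written as:

sigmoid = function(x) { 1 / (1 + exp(-x)) }

# A single artificial neuron: weighted sum of inputs plus bias,
# passed through an activation function.
neuron = function(x, w, b, f = sigmoid) { f(b + sum(w * x)) }

# Example: the AND unit used later in this lecture.
neuron(c(1, 1), w = c(20, 20), b = -30)   # close to 1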
Activation Functions
Besides the linear function, there are some commonly used activation functions, notably the logistic (sigmoid) function $\sigma(x) = 1/(1 + e^{-x})$ and the hyperbolic tangent $\tanh(x)$.
Notice that $\tanh(x) = 2\,\mathrm{logistic}(2x) - 1$, so the two are equivalent except at the output units. Sometimes Gaussian radial basis functions ($\exp(-x^2/\sigma)$) are used, producing what is known as a radial basis function network.
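A quick numerical check of this identity in R (our own snippet):

logistic = function(x) { 1 / (1 + exp(-x)) }
x = seq(-3, 3, by = 0.5)
max(abs(tanh(x) - (2 * logistic(2 * x) - 1)))   # ~ 0, confirming tanh(x) = 2*logistic(2x) - 1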
Neural Networks
1. A feed-forward neural network is organized into layers of neurons: an input layer, a hidden layer, and an output layer.
2. The number of neurons in the input and output layers is determined by the nature of the problem, but the number of neurons in the hidden layer is user-defined.
3. Within each layer, neurons are not connected to each other. Neurons in one layer are connected only to neurons in the next layer (feed-forward).
4. Each line joining two neurons is associated with a weight $w_{ij}$. These weights are unknown parameters that need to be estimated from the training dataset.
Training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})$, where the response $y^{(i)}$ is one of
$$\underbrace{\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}}_{\text{Pedestrian}}, \quad \underbrace{\begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}}_{\text{Car}}, \quad \underbrace{\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}}_{\text{Motorcycle}}, \quad \underbrace{\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}}_{\text{Truck}}.$$
We want our output $h_\Theta(x)$ to be close to one of them.
AND/OR Function
In order to build up a network that fits the XNOR example, we start with something slightly simpler and show a network that fits the AND/OR functions. We can use simple logistic regression to compute the logical AND/OR functions:
$$y_{\text{AND}} \approx h_\Theta(x) = \sigma(-30 + 20x_1 + 20x_2), \qquad y_{\text{OR}} \approx h_\Theta(x) = \sigma(-10 + 20x_1 + 20x_2).$$

AND:                      OR:
 x1  x2  hΘ(x)             x1  x2  hΘ(x)
 0   0   σ(−30) ≈ 0        0   0   σ(−10) ≈ 0
 0   1   σ(−10) ≈ 0        0   1   σ(10) ≈ 1
 1   0   σ(−10) ≈ 0        1   0   σ(10) ≈ 1
 1   1   σ(10) ≈ 1         1   1   σ(30) ≈ 1
NOT Function
The NOT function can be computed as $y_{\text{NOT }x_1} \approx h_\Theta(x) = \sigma(10 - 20x_1)$, and similarly $y_{(\text{NOT }x_1)\text{ AND }(\text{NOT }x_2)} \approx \sigma(10 - 20x_1 - 20x_2)$:

NOT x1:              (NOT x1) AND (NOT x2):
 x1  hΘ(x)            x1  x2  hΘ(x)
 0   σ(10) ≈ 1        0   0   σ(10) ≈ 1
 1   σ(−10) ≈ 0       0   1   σ(−10) ≈ 0
                      1   0   σ(−10) ≈ 0
                      1   1   σ(−30) ≈ 0
XNOR Function
$$a_1^{(2)} = y_{\text{AND}} \approx \sigma(-30 + 20x_1 + 20x_2);$$
$$a_2^{(2)} = y_{(\text{NOT }x_1)\text{ AND }(\text{NOT }x_2)} \approx \sigma(10 - 20x_1 - 20x_2);$$
$$y_{\text{XNOR}} \approx y_{\text{OR}}(a) \approx \sigma(-10 + 20a_1^{(2)} + 20a_2^{(2)}).$$

 x1  x2  a1(2)  a2(2)  hΘ(x)
 0   0   0      1      1
 0   1   0      0      0
 1   0   0      0      0
 1   1   1      0      1
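Putting the pieces together in R (our own sketch, reusing the sigmoid from before):

sigmoid = function(x) { 1 / (1 + exp(-x)) }

# Two-layer network computing XNOR from the AND, (NOT)AND(NOT), and OR units.
xnor = function(x1, x2) {
  a1 = sigmoid(-30 + 20 * x1 + 20 * x2)   # AND unit
  a2 = sigmoid(10 - 20 * x1 - 20 * x2)    # (NOT x1) AND (NOT x2) unit
  sigmoid(-10 + 20 * a1 + 20 * a2)        # OR unit on the hidden activations
}

# Reproduce the truth table above.
grid = expand.grid(x1 = 0:1, x2 = 0:1)
cbind(grid, h = round(xnor(grid$x1, grid$x2)))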
For illustration, we consider a binary response $Y = 0$ or $1$, and both $f_h(x)$ and $f_o(x)$ are sigmoid functions $\sigma(x) = 1/(1 + e^{-x})$. With the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$, we use the following cost function:
$$J(\Theta) = \sum_{i=1}^N \left[ -y^{(i)} \log(h_\Theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\Theta(x^{(i)})) \right] =: \sum_{i=1}^N J^{(i)}(\Theta).$$
We adopt the gradient descent approach to minimize $J(\Theta)$. For $w$ being any one of $b_h^{(1)}, w_{h,j}^{(1)}, b^{(2)}, w_h^{(2)}$, $h = 1, \ldots, H$; $j = 1, \ldots, p$, we update $w$ as follows:
$$w \leftarrow w - \alpha \frac{\partial J(\Theta)}{\partial w}.$$
Gradient Computation
Denote
$$z_h^{(i,2)} = b_h^{(1)} + \sum_{j=1}^p w_{h,j}^{(1)} x_j^{(i)}, \qquad a_h^{(i)} = \sigma(z_h^{(i,2)}),$$
$$z^{(i,3)} = b^{(2)} + \sum_{h=1}^H w_h^{(2)} a_h^{(i)}, \qquad h_\Theta(x^{(i)}) = \sigma(z^{(i,3)}).$$
We first work on $w = b^{(2)}, w_h^{(2)}$ (output layer):
$$\frac{\partial J^{(i)}(\Theta)}{\partial w_h^{(2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial w_h^{(2)}} =: \delta^{(i,3)} a_h^{(i)}, \qquad \frac{\partial J^{(i)}(\Theta)}{\partial b^{(2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial b^{(2)}} = \delta^{(i,3)},$$
where
$$\delta^{(i,3)} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} = -y^{(i)} \frac{\sigma'(z^{(i,3)})}{\sigma(z^{(i,3)})} - (1 - y^{(i)}) \frac{-\sigma'(z^{(i,3)})}{1 - \sigma(z^{(i,3)})} = -y^{(i)}(1 - \sigma(z^{(i,3)})) + (1 - y^{(i)})\sigma(z^{(i,3)}) = \sigma(z^{(i,3)}) - y^{(i)},$$
using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
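To build confidence in this formula, one can compare the analytic gradient with a finite-difference approximation. A small R sketch under a toy setup of our own (one hidden unit, made-up values):

sigmoid = function(x) { 1 / (1 + exp(-x)) }

# Cross-entropy loss of a single output unit for one sample.
J = function(b2, w2, a, y) {
  h = sigmoid(b2 + w2 * a)
  -y * log(h) - (1 - y) * log(1 - h)
}

b2 = 0.3; w2 = -0.5; a = 0.8; y = 1
delta3 = sigmoid(b2 + w2 * a) - y   # analytic dJ/db2 = sigma(z) - y
eps = 1e-6
num = (J(b2 + eps, w2, a, y) - J(b2 - eps, w2, a, y)) / (2 * eps)
c(analytic = delta3, numeric = num)  # the two should agree closely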
• Online learning can also be carried out: processing each observation one at a time, updating the gradient after each training case, and cycling through the training cases many times.
• Online learning allows the network to handle very large training sets, and also to update the weights as new observations come in. This is the situation for perceptual learning in humans. It may work better if many training cases are redundant.
• The learning rate $\alpha$ for batch learning is usually taken to be a constant. However, for online learning to converge to a (local) minimum, we have to decrease $\alpha$ with time: $\alpha = \alpha_i$. This learning is a form of stochastic approximation; results in this field ensure convergence if $\alpha_i \to 0$, $\sum_i \alpha_i = \infty$, and $\sum_i \alpha_i^2 < \infty$ (satisfied, for example, by $\alpha_i = 1/i^\beta$, $0.5 < \beta \leq 1$). A small simulation illustrating such a schedule follows below.
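A small R sketch (our own, on a made-up logistic-regression problem) of online updates with the decreasing-learning-rate schedule $\alpha_i = 1/i^\beta$:

set.seed(1)
sigmoid = function(x) { 1 / (1 + exp(-x)) }

# Simulated data from a logistic model with true weights (-1, 2).
N = 5000
x = cbind(1, rnorm(N))
y = rbinom(N, 1, sigmoid(x %*% c(-1, 2)))

w = c(0, 0)    # starting values near zero
beta = 0.6     # 0.5 < beta <= 1 satisfies the step-size conditions
for (i in 1:N) {
  alpha = 1 / i^beta                                  # decreasing learning rate
  grad = (sigmoid(sum(x[i, ] * w)) - y[i]) * x[i, ]   # gradient of J^(i)
  w = w - alpha * grad                                # online (stochastic) update
}
w   # should end up near the true weights (-1, 2)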
Practical Considerations
Starting values: The use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting values for weights are therefore usually chosen to be random values near zero.
Overfitting: Often neural networks have too many weights and will overfit the data. An early stopping rule can be used to avoid this, since the weights start at a highly regularized (linear) solution. Cross-validation is useful for this job. Another explicit method is shrinking the weights, for example by introducing a ridge penalty.
Scaling of the inputs: It is best to standardize all inputs to have mean zero and standard deviation one (see the sketch after this list). With standardized inputs, it is typical to take the starting weights as random uniform variables over the range [−0.7, 0.7].
Number of hidden units and layers: It is better to have too many hidden units than too few. It is most common to put down a reasonably large number (5-100) of units and train them with regularization. The choice of the number of hidden layers is guided by background knowledge and experimentation.
Multiple minima: One must at least try a number of random starting configurations, and choose the solution giving the lowest (penalized) error.
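For instance, with the nnet package (our own sketch; the data frame d and the model formula are placeholders consistent with the iris example below):

library(nnet)

# Standardize the inputs to mean zero and standard deviation one,
# then fit with a ridge (weight-decay) penalty.
d.std = d
d.std[, 1:4] = scale(d[, 1:4])
iris.nn = nnet(Species ~ ., data = d.std, size = 2,
               decay = 1e-3,   # ridge penalty (weight decay)
               rang = 0.7,     # initial weights uniform on [-0.7, 0.7]
               maxit = 500)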
In R, we can compute the fitted values (for the whole training data) with the following code:
w1 = matrix(iris.nn$wts[1:10], nrow = 2, byrow = TRUE)   # hidden layer: 2 units x (bias + 4 inputs)
w2 = matrix(iris.nn$wts[11:13], nrow = 1, byrow = TRUE)  # output layer: 1 unit x (bias + 2 hidden units)
sigmoid = function(x) { 1 / (1 + exp(-x)) }
x = cbind(1, d[, 1:4])    # prepend a column of 1's for the hidden-layer biases
h = sigmoid(w1 %*% t(x))  # hidden-layer activations
h = rbind(1, h)           # prepend a row of 1's for the output bias
out = t(w2 %*% h)         # fitted values of the network
Alternatively, use iris.nn$fitted.values or predict(iris.nn, d).
A major problem of ANNs is that the solution depends on the starting values. Each time we run nnet(), we may obtain different results. Therefore, we have to run nnet() several times and save the best result. Here is an improved version of the nnet() function:
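A minimal sketch of such a wrapper (our own reconstruction based on the description above; the function name and the ntries argument are our choice):

library(nnet)

# Run nnet() several times from random starting weights and keep
# the fit with the smallest value of the fitting criterion.
best.nnet = function(..., ntries = 10) {
  best = NULL
  for (i in 1:ntries) {
    fit = nnet(...)
    if (is.null(best) || fit$value < best$value) best = fit
  }
  best
}

# Usage, e.g.: iris.nn = best.nnet(Species ~ ., data = d, size = 2, ntries = 20)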
Two-Class Classification
Let's relabel Species to 0 for classes 2 and 3 (hence Species becomes a binary response). Then we perform two-class classification (in fact, class 1 versus classes 2 and 3). When the response is binary, nnet uses the logistic output function by default.
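In R, the relabeling and refit might look as follows (our own sketch; it assumes d stores the iris data with Species coded numerically as 1, 2, 3):

d$Species = ifelse(d$Species == 1, 1, 0)   # class 1 versus classes 2 and 3
iris.nn = nnet(Species ~ ., data = d, size = 2, maxit = 500)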
In the general setting, activations propagate from layer to layer:
$$a_j^{(\ell)} = f_h\left(b_j^{(\ell-1)} + \sum_{i=1}^{n_{\ell-1}} w_{j,i}^{(\ell-1)} a_i^{(\ell-1)}\right), \qquad \ell = 2, \ldots, L-1,$$
where the bias vector $b^{(\ell-1)} = (b_1^{(\ell-1)}, \ldots, b_{n_\ell}^{(\ell-1)})^T$ collects the intercepts feeding layer $\ell$, and we assume all units in all hidden layers have the same activation function $f_h$. In addition, we assume all units in the output layer have the same activation function $f_o$. Then we have, for $k = 1, \ldots, K$,
$$(h_\Theta(X))_k = a_k^{(L)} = f_o\left(b_k^{(L-1)} + \sum_{i=1}^{n_{L-1}} w_{k,i}^{(L-1)} a_i^{(L-1)}\right), \qquad \Theta = \{b^{(\ell)}, W^{(\ell)}\}_{\ell=1}^{L-1},$$
i.e. $h_\Theta(X) = f_o\left(b^{(L-1)} + W^{(L-1)} a^{(L-1)}\right)$.
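A generic forward pass implementing these equations in R (our own sketch; weights is assumed to be a list of matrices $W^{(\ell)}$ and biases a list of vectors $b^{(\ell)}$):

sigmoid = function(x) { 1 / (1 + exp(-x)) }

# Forward propagation: a^(l+1) = f(b^(l) + W^(l) a^(l)),
# with f = fh for hidden layers and f = fo for the output layer.
forward = function(x, weights, biases, fh = sigmoid, fo = sigmoid) {
  a = x
  L1 = length(weights)   # number of weight layers, L - 1
  for (l in 1:L1) {
    z = biases[[l]] + weights[[l]] %*% a
    a = if (l < L1) fh(z) else fo(z)
  }
  a                      # h_Theta(x)
}

# Example: the XNOR network from earlier as a 2-2-1 network.
W = list(matrix(c(20, 20, -20, -20), nrow = 2, byrow = TRUE), matrix(c(20, 20), nrow = 1))
b = list(c(-30, 10), -10)
forward(c(1, 1), W, b)   # close to 1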
Cost Function
Suppose that we have N measurements $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$ and that the output of the network is $h_\Theta(x)$. Then the parameter set $\Theta$ is chosen to minimize an error function.
For regression, we use the sum-of-squared errors as our measure of fit (error function):
$$J(\Theta) = \sum_{i=1}^N \sum_{k=1}^K \left(y_k^{(i)} - (h_\Theta(x^{(i)}))_k\right)^2 = \sum_{i=1}^N \|y^{(i)} - h_\Theta(x^{(i)})\|^2.$$
For classification, we typically use the discrepancy function for multiple logistic regressions:
$$J(\Theta) = -\sum_{i=1}^N \sum_{k=1}^K \left[ y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k) \right]$$
or the cross-entropy (deviance):
$$J(\Theta) = -\sum_{i=1}^N \sum_{k=1}^K y_k^{(i)} \log(h_\Theta(x^{(i)}))_k.$$
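These error functions are straightforward to code; a sketch in R (our own convention: Y and H are N x K matrices of targets and network outputs):

# Sum-of-squared errors (regression).
J.sse = function(Y, H) { sum((Y - H)^2) }

# Discrepancy for multiple logistic regressions (classification).
J.logistic = function(Y, H) { -sum(Y * log(H) + (1 - Y) * log(1 - H)) }

# Cross-entropy / deviance (classification).
J.xent = function(Y, H) { -sum(Y * log(H)) }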
28/ 40 Lecture 6: Artificial Neural Networks MH4510 Statistical Learning and Data Mining PUN Chi Seng
Human Brain and Neurons Feed-Forward Neural Networks Fitting Neural Networks Example General Setting Example Summary
We minimize the penalized cost
$$J(\Theta) + \frac{\lambda}{2} \sum_{\ell} \sum_{j} \sum_{j' \neq 0} \left(w_{j,j'}^{(\ell)}\right)^2,$$
where the second term is a penalty introduced to address the potential overfitting problem, and the tuning parameter $\lambda$ is known as the weight decay; the biases are left unpenalized. Let $w_{j,0}^{(\ell-1)} := b_j^{(\ell-1)}$.
The gradient can be easily derived using the chain rule for differentiation. It can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit (the units to which it is connected).
Gradient Computation
Suppose we have data $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$, $x^{(i)} \in \mathbb{R}^p$, $y^{(i)} \in \mathbb{R}^K$. Denote
$$a_j^{(i,1)} = x_j^{(i)}, \qquad a_0^{(i,\ell-1)} = 1, \qquad z_j^{(i,\ell)} := \sum_{j'=0}^{n_{\ell-1}} w_{j,j'}^{(\ell-1)} a_{j'}^{(i,\ell-1)}, \quad \ell = 2, \ldots, L,$$
$$a_j^{(i,\ell)} = f_h(z_j^{(i,\ell)}), \quad \ell = 2, \ldots, L-1, \qquad (h_\Theta(x^{(i)}))_k = f_o(z_k^{(i,L)}).$$
We can apply the chain rule for partial derivatives to give
$$\frac{\partial J^{(i)}(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}} \frac{\partial z_j^{(i,\ell)}}{\partial w_{j,j'}^{(\ell-1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}}\, a_{j'}^{(i,\ell-1)} =: \delta_j^{(i,\ell)} a_{j'}^{(i,\ell-1)}.$$
Here, $\delta_j^{(i,\ell)}$ represents the "error" of unit $j$ in layer $\ell$ for sample $(x^{(i)}, y^{(i)})$.
For the output layer, we have
$$\delta_k^{(i,L)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_k^{(i,L)}} = \frac{f_o(z_k^{(i,L)}) - y_k^{(i)}}{f_o(z_k^{(i,L)})(1 - f_o(z_k^{(i,L)}))}\, f_o'(z_k^{(i,L)}).$$
For the hidden layers, we have, for $\ell = 2, \ldots, L-1$,
$$\delta_j^{(i,\ell)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}} = \sum_{m=1}^{n_{\ell+1}} \frac{\partial J^{(i)}(\Theta)}{\partial z_m^{(i,\ell+1)}} \frac{\partial z_m^{(i,\ell+1)}}{\partial z_j^{(i,\ell)}} = \sum_{m=1}^{n_{\ell+1}} \delta_m^{(i,\ell+1)} w_{m,j}^{(\ell)} f_h'(z_j^{(i,\ell)}).$$
Backpropagation Formula
We obtain the following backpropagation formula (equation):
$$\delta_k^{(i,L)} = \frac{f_o(z_k^{(i,L)}) - y_k^{(i)}}{f_o(z_k^{(i,L)})(1 - f_o(z_k^{(i,L)}))}\, f_o'(z_k^{(i,L)}),$$
$$\delta_j^{(i,\ell)} = f_h'(z_j^{(i,\ell)}) \sum_{m=1}^{n_{\ell+1}} w_{m,j}^{(\ell)} \delta_m^{(i,\ell+1)}, \qquad \ell = L-1, L-2, \ldots, 2,$$
which tells us that the value of $\delta$ for a particular hidden unit can be obtained by propagating the $\delta$'s backwards from units higher up in the network. Therefore,
$$\frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} = \sum_{i=1}^N \frac{\partial J^{(i)}(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} + \frac{\partial\, \text{Penalty}}{\partial w_{j,j'}^{(\ell-1)}} = \sum_{i=1}^N \delta_j^{(i,\ell)} a_{j'}^{(i,\ell-1)} + \begin{cases} \lambda w_{j,j'}^{(\ell-1)}, & \text{if } j' \neq 0, \\ 0, & \text{if } j' = 0. \end{cases}$$
Backpropagation Algorithm
1. Randomly initialize the weights $\{w_{j,j'}^{(\ell)}\}$, $\ell = 1, \ldots, L-1$; $j = 1, \ldots, n_{\ell+1}$; $j' = 0, 1, \ldots, n_\ell$.
2. Set $\Delta_{j,j'}^{(\ell)} = 0$ for all $\ell = 1, \ldots, L-1$; $j = 1, \ldots, n_{\ell+1}$; $j' = 0, 1, \ldots, n_\ell$.
3. For $i = 1, \ldots, N$:
   • Set $a^{(1)} = x^{(i)}$;
   • With the current weights $\{w_{j,j'}^{(\ell)}\}$, perform forward propagation to compute $a^{(\ell)}$ for $\ell = 2, \ldots, L$;
   • Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$;
   • Apply the backpropagation formula to compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$;
   • Set $\Delta_{j,j'}^{(\ell)} \leftarrow \Delta_{j,j'}^{(\ell)} + \delta_j^{(\ell+1)} a_{j'}^{(\ell)}$, i.e. $\Delta^{(\ell)} \leftarrow \Delta^{(\ell)} + \delta^{(\ell+1)} [1, (a^{(\ell)})^T]$.
4. Set
$$\frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell)}} = \Delta_{j,j'}^{(\ell)} + \begin{cases} \lambda w_{j,j'}^{(\ell)}, & \text{if } j' \neq 0, \\ 0, & \text{if } j' = 0. \end{cases}$$
5. Update the weights with the learning rate $\alpha > 0$:
$$w_{j,j'}^{(\ell)} \leftarrow w_{j,j'}^{(\ell)} - \alpha \frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell)}}.$$
6. Repeat steps 2-5 until convergence.
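As a concrete illustration, the following R sketch (entirely our own; function and variable names are not from the slides) implements this algorithm for the simplest network in this lecture: one hidden layer of H sigmoid units and one sigmoid output with the cross-entropy cost, so that $\delta^{(L)} = a^{(L)} - y$ as in step 3:

sigmoid = function(x) { 1 / (1 + exp(-x)) }

train.nn = function(X, y, H = 2, lambda = 0, alpha = 0.5, epochs = 5000) {
  N = nrow(X); p = ncol(X)
  # Step 1: random starting weights near zero; column 1 of each matrix is the bias.
  W1 = matrix(runif(H * (p + 1), -0.7, 0.7), nrow = H)
  W2 = matrix(runif(H + 1, -0.7, 0.7), nrow = 1)
  A1 = t(cbind(1, X))                   # (p+1) x N input activations with bias row
  for (epoch in 1:epochs) {
    # Forward sweep.
    A2 = rbind(1, sigmoid(W1 %*% A1))   # (H+1) x N hidden activations with bias row
    h  = sigmoid(W2 %*% A2)             # 1 x N outputs
    # Backward sweep: delta at the output, then at the hidden layer.
    D3 = h - y                          # delta^(3) = a^(3) - y
    D2 = (t(W2[, -1, drop = FALSE]) %*% D3) * A2[-1, ] * (1 - A2[-1, ])
    # Steps 2-4: accumulated gradients plus weight decay (biases unpenalized).
    G2 = D3 %*% t(A2) + lambda * cbind(0, W2[, -1, drop = FALSE])
    G1 = D2 %*% t(A1) + lambda * cbind(0, W1[, -1, drop = FALSE])
    # Step 5: gradient-descent update (gradients scaled by 1/N, i.e. a rescaled alpha).
    W2 = W2 - alpha * G2 / N
    W1 = W1 - alpha * G1 / N
  }
  list(W1 = W1, W2 = W2)
}

For example, on the XNOR truth table from earlier,

X = matrix(c(0, 0, 0, 1, 1, 0, 1, 1), ncol = 2, byrow = TRUE)
y = c(1, 0, 0, 1)
fit = train.nn(X, y, H = 2, alpha = 2, epochs = 20000)

the fitted outputs sigmoid(fit$W2 %*% rbind(1, sigmoid(fit$W1 %*% t(cbind(1, X))))) are often close to (1, 0, 0, 1), although, as noted under Multiple minima above, the result depends on the random starting weights.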
A Sophisticated Application
There are 320 digits in the training set and 160 in the test set, each with 256 pixel
inputs.
Figure: Examples of training cases from ZIP code data. Each image is a 16 × 16 8-bit
grayscale representation of a handwritten digit.
Checklist
□ Understand what a neural network is and its whole model structure. It is better to also know the difference between (logistic) regression and neural networks (check out the nonlinear classification example, which demonstrates the usefulness of neural networks).
□ Know how to write down the equations characterizing neural networks. Given the weights and inputs, know how to compute the output and draw the architecture diagram of the neural network.
□ Know the backpropagation theory for the simplest neural network (one hidden layer, one output).
□ Know how to implement neural networks in R and know how to read the R results.