Lecture 6: Artificial Neural Networks

PUN Chi Seng

MH4510 Statistical Learning and Data Mining

Human Brain and Neurons

Neural Networks
The term neural network has evolved to encompass a large class of models and
learning methods that were originally inspired by thoughts on how the brain might
work. The goal is to mimic the functions and mechanisms of the brain.
• Very widely used in the 80s and early 90s; popularity diminished in the late 90s.
• Recent resurgence: state-of-the-art technique for many applications.

Before we go into neural networks, it is helpful to learn a bit about the structure of the human brain:
• Our brain consists of approximately 100 billion cells of a specific type known as neurons. Each of these neurons is typically connected to about 10,000 other neurons. These neurons do not regenerate.
• It is widely accepted that these neurons are responsible for our ability to memorize, learn, generalize and think.

The exact function of these neurons remains a mystery, but a very simple mathematical model that mimics them provides surprisingly good performance in pattern recognition, classification, and prediction.


Human Brain and Neurons

Neurons in the Brain

Figure: Source: heart.cbl.utoronto.ca/~berj/projects.html

Within a neuron, there are four main components:


• Dendrites are for receiving information;
• Cell body is for processing information;
• Axon carries processed information to other neurons;
• Synapse is the junction between an axon terminal and the dendrites of other neurons.
These neurons are connected to form a huge and very complicated network.

Human Brain and Neurons

Artificial Neuron
To mimic this neuron, we construct the following artificial neuron:

• $X_1, X_2, \dots, X_p$ are the inputs received from other neurons or from the environment.
• The total input $I = w_0 + \sum_{j=1}^{p} w_j X_j$ is a linear combination of these inputs with weights $w_1, \dots, w_p$ ($w_0$ is known as the bias).
• The activation function (or transfer function) $f$ converts the input $I$ to the output $V$: $V = f(I)$.
• The output $V$ will go to other neurons as input.

Human Brain and Neurons

Activation Functions
Besides the linear function, there are some commonly used activation functions, notably the logistic (sigmoid) function and the hyperbolic tangent:

Notice that $\tanh(x) = 2 \times \mathrm{logistic}(2x) - 1$, so the first two are equivalent, up to a rescaling of the weights, except at the output units. Sometimes Gaussian radial basis functions ($\exp(-x^2/\sigma)$) are used, producing what is known as a radial basis function network.
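The original slide's plot of these functions is not reproduced here; as a rough substitute, the following minimal R sketch (not from the lecture) defines and plots the logistic, tanh, and Gaussian radial basis activation functions (sigma = 1 is an arbitrary choice):

logistic <- function(x) 1 / (1 + exp(-x))
gauss_rbf <- function(x, sigma = 1) exp(-x^2 / sigma)

x <- seq(-4, 4, by = 0.1)
plot(x, logistic(x), type = "l", ylim = c(-1, 1), ylab = "activation")
lines(x, tanh(x), lty = 2)        # tanh(x) = 2 * logistic(2x) - 1
lines(x, gauss_rbf(x), lty = 3)
legend("topleft", c("logistic", "tanh", "Gaussian RBF"), lty = 1:3)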


Neural Networks: Representation

Neural Networks

A feed-forward neural network is a series of (logistic) regression models stacked on top of each other, with the final layer being either another logistic or a linear regression model, depending on whether we are solving a classification or a regression problem.

• For regression, there is typically only one output unit.

• For $K$-class classification, there are $K$ output units, with the $k$th unit modeling the probability of class $k$. There are $K$ target measurements $Y_k$, $k = 1, \dots, K$, each coded as a 0-1 variable for the $k$th class.

Figure: A generic feed-forward network with a single hidden layer.


Neural Networks: Representation

Remarks on Feed-Forward Neural Network

1 The number of hidden layers can be zero, one, two, etc.

2 The number of neurons in the input and output layers is determined by the nature of the problem, but the number of neurons in the hidden layer is user-defined.

3 Within each layer, neurons are not connected to each other. Neurons in one layer are connected only to neurons in the next layer (feed-forward).

4 Each line joining two neurons is associated with a weight $w_{ij}$. These weights are unknown parameters that need to be estimated from the training dataset.

5 From a statistical point of view, neural networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.


Neural Networks: Representation

Single Output Unit


$$a_1^{(2)} = f_h\big(b_1^{(1)} + w_{1,1}^{(1)} X_1 + w_{1,2}^{(1)} X_2 + w_{1,3}^{(1)} X_3\big)$$
$$a_2^{(2)} = f_h\big(b_2^{(1)} + w_{2,1}^{(1)} X_1 + w_{2,2}^{(1)} X_2 + w_{2,3}^{(1)} X_3\big)$$
$$a_3^{(2)} = f_h\big(b_3^{(1)} + w_{3,1}^{(1)} X_1 + w_{3,2}^{(1)} X_2 + w_{3,3}^{(1)} X_3\big)$$
$$h_\Theta(x) = a_1^{(3)} = f_o\big(b_1^{(2)} + w_{1,1}^{(2)} a_1^{(2)} + w_{1,2}^{(2)} a_2^{(2)} + w_{1,3}^{(2)} a_3^{(2)}\big)$$

Another network architecture:


Neural Networks: Representation

Multiple Output Units: One-vs-all

Training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})$, where the response $y^{(i)}$ is one of

$$\underbrace{\begin{pmatrix}1\\0\\0\\0\end{pmatrix}}_{\text{Pedestrian}},\quad \underbrace{\begin{pmatrix}0\\1\\0\\0\end{pmatrix}}_{\text{Car}},\quad \underbrace{\begin{pmatrix}0\\0\\1\\0\end{pmatrix}}_{\text{Motorcycle}},\quad \underbrace{\begin{pmatrix}0\\0\\0\\1\end{pmatrix}}_{\text{Truck}}.$$

We want the output $h_\Theta(x)$ to be close to one of the vectors above.

Neural Networks: Representation

Non-Linear Classification Example: XOR/XNOR

Figure: y = 1: o; y = 0: ×. Left: A complex learning problem; Right: A simplified version (XOR operator).

Let $x_1, x_2$ be binary variables (0 or 1), and let $y_{\mathrm{XOR}} = x_1\ \mathrm{XOR}\ x_2$, $y_{\mathrm{XNOR}} = x_1\ \mathrm{XNOR}\ x_2$:

x1  x2  yXOR  yXNOR
0   0   0     1
0   1   1     0
1   0   1     0
1   1   0     1


Neural Networks: Representation

AND/OR Function
In order to build up a network that fits the XNOR example, we start with something slightly simpler and show a network that fits the AND/OR functions. Simple logistic regression alone can compute the logical AND/OR functions.

yAND ≈ hΘ (x) = σ(−30 + 20x1 + 20x2 ) yOR ≈ hΘ (x) = σ(−10 + 20x1 + 20x2 )

x1 x2 hΘ (x) x1 x2 hΘ (x)
0 0 σ(−30) ≈ 0 0 0 σ(−10) ≈ 0
0 1 σ(−10) ≈ 0 0 1 σ(10) ≈ 1
1 0 σ(−10) ≈ 0 1 0 σ(10) ≈ 1
1 1 σ(10) ≈ 1 1 1 σ(10) ≈ 1

Neural Networks: Representation

NOT Function

yNOT ≈ hΘ (x) = σ(10 − 20x1 )

x1 hΘ (x)
0 σ(10) ≈ 1
1 σ(−10) ≈ 0

y(NOT x1 ) AND (NOT x2 ) ≈ hΘ (x) = σ(10−20x1 −20x2 )

x1 x2 hΘ (x)
0 0 σ(10) ≈ 1
0 1 σ(−10) ≈ 0
1 0 σ(−10) ≈ 0
1 1 σ(−30) ≈ 0


Neural Networks: Representation

XNOR Function

$$a_1^{(2)} = y_{\mathrm{AND}} \approx \sigma(-30 + 20x_1 + 20x_2);$$
$$a_2^{(2)} = y_{(\mathrm{NOT}\,x_1)\ \mathrm{AND}\ (\mathrm{NOT}\,x_2)} \approx \sigma(10 - 20x_1 - 20x_2);$$
$$y_{\mathrm{XNOR}} \approx y_{\mathrm{OR}}(a) \approx \sigma(-10 + 20a_1^{(2)} + 20a_2^{(2)}).$$

x1  x2  a1^(2)  a2^(2)  hΘ(x)
0   0   0       1       1
0   1   0       0       0
1   0   0       0       0
1   1   1       0       1
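As a quick check (not from the slides), the following R sketch evaluates this two-layer XNOR network on all four binary inputs and recovers the truth table above:

sigmoid <- function(x) 1 / (1 + exp(-x))

xnor_net <- function(x1, x2) {
  a1 <- sigmoid(-30 + 20 * x1 + 20 * x2)   # hidden unit 1: approximately x1 AND x2
  a2 <- sigmoid( 10 - 20 * x1 - 20 * x2)   # hidden unit 2: approximately (NOT x1) AND (NOT x2)
  sigmoid(-10 + 20 * a1 + 20 * a2)         # output unit: approximately a1 OR a2
}

inputs <- expand.grid(x1 = 0:1, x2 = 0:1)
cbind(inputs, xnor = round(mapply(xnor_net, inputs$x1, inputs$x2)))   # gives 1, 0, 0, 1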


Neural Networks: Learning

Simplest Neural Network: One hidden layer, One output


Let X = (X1 , . . . , Xp ) and Y be the inputs and response. Consider a feed-forward
neural network with a single hidden layer, where we have H activation units:
$$a_h = f_h\Big(b_h^{(1)} + \sum_{j=1}^{p} w_{h,j}^{(1)} X_j\Big), \qquad h = 1, \dots, H,$$

and the output layer, where we have

$$h_\Theta(X) = \mathrm{Out} = f_o\Big(b^{(2)} + \sum_{h=1}^{H} w_h^{(2)} a_h\Big).$$

For illustration, we consider a binary response $Y = 0$ or $1$, and both $f_h(x)$ and $f_o(x)$ are sigmoid functions $\sigma(x) = 1/(1 + e^{-x})$. With the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, we use the following cost function:

$$J(\Theta) = \sum_{i=1}^{N} \Big[ -y^{(i)} \log\big(h_\Theta(x^{(i)})\big) - (1 - y^{(i)}) \log\big(1 - h_\Theta(x^{(i)})\big) \Big] =: \sum_{i=1}^{N} J^{(i)}(\Theta).$$


Neural Networks: Learning

Gradient Descent Approach


Since there are many parameters to be estimated in a neural network, we usually apply a regularization method and consider a cost function like

$$J(\Theta) = \sum_{i=1}^{N} J^{(i)}(\Theta) + \frac{\lambda}{2}\left(\sum_{j=1}^{p}\sum_{h=1}^{H} \big(w_{h,j}^{(1)}\big)^2 + \sum_{h=1}^{H} \big(w_h^{(2)}\big)^2\right),$$

where $\lambda$ is a tuning parameter. Note that we do NOT regularize the intercepts.

We adopt the gradient descent approach to minimize $J(\Theta)$. For $w$ being any one of $b_h^{(1)}, w_{h,j}^{(1)}, b^{(2)}, w_h^{(2)}$, $h = 1, \dots, H$; $j = 1, \dots, p$, we update $w$ as follows:

$$w \leftarrow w - \alpha \frac{\partial J(\Theta)}{\partial w}.$$

The key to this update is to derive

$$\frac{\partial J(\Theta)}{\partial w} = \sum_{i=1}^{N} \frac{\partial J^{(i)}(\Theta)}{\partial w} + \lambda w,$$

noting that for $w = b_h^{(1)}$ or $b^{(2)}$, we do not have the term $\lambda w$ above.


Neural Networks: Learning

Gradient Computation
Denote by

$$z_h^{(i,2)} = b_h^{(1)} + \sum_{j=1}^{p} w_{h,j}^{(1)} x_j^{(i)}, \qquad a_h^{(i)} = \sigma\big(z_h^{(i,2)}\big),$$
$$z^{(i,3)} = b^{(2)} + \sum_{h=1}^{H} w_h^{(2)} a_h^{(i)}, \qquad h_\Theta(x^{(i)}) = \sigma\big(z^{(i,3)}\big).$$

We first work on $w = b^{(2)}, w_h^{(2)}$ (output layer):

$$\frac{\partial J^{(i)}(\Theta)}{\partial w_h^{(2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial w_h^{(2)}} =: \delta^{(i,3)} a_h^{(i)},$$
$$\frac{\partial J^{(i)}(\Theta)}{\partial b^{(2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial b^{(2)}} = \delta^{(i,3)},$$

where

$$\delta^{(i,3)} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} = -y^{(i)} \frac{\sigma'(z^{(i,3)})}{\sigma(z^{(i,3)})} - (1 - y^{(i)}) \frac{-\sigma'(z^{(i,3)})}{1 - \sigma(z^{(i,3)})} = -y^{(i)}\big(1 - \sigma(z^{(i,3)})\big) + (1 - y^{(i)})\sigma(z^{(i,3)}) = \sigma(z^{(i,3)}) - y^{(i)}.$$


Neural Networks: Learning

Gradient Computation (cont’)


We now work on $w = b_h^{(1)}, w_{h,j}^{(1)}$ (hidden layer):

$$\frac{\partial J^{(i)}(\Theta)}{\partial w_{h,j}^{(1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_h^{(i,2)}} \frac{\partial z_h^{(i,2)}}{\partial w_{h,j}^{(1)}} =: \delta_h^{(i,2)} x_j^{(i)},$$
$$\frac{\partial J^{(i)}(\Theta)}{\partial b_h^{(1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_h^{(i,2)}} \frac{\partial z_h^{(i,2)}}{\partial b_h^{(1)}} = \delta_h^{(i,2)},$$

where

$$\delta_h^{(i,2)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_h^{(i,2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial z_h^{(i,2)}} = \delta^{(i,3)} \frac{\partial z^{(i,3)}}{\partial z_h^{(i,2)}},$$
$$\frac{\partial z^{(i,3)}}{\partial z_h^{(i,2)}} = \frac{\partial z^{(i,3)}}{\partial a_h^{(i)}} \frac{\partial a_h^{(i)}}{\partial z_h^{(i,2)}} = w_h^{(2)} \sigma'\big(z_h^{(i,2)}\big) = w_h^{(2)} \sigma\big(z_h^{(i,2)}\big)\big(1 - \sigma(z_h^{(i,2)})\big) = w_h^{(2)} a_h^{(i)}\big(1 - a_h^{(i)}\big).$$

Hence, we have the backpropagation formula:

$$\delta_h^{(i,2)} = \delta^{(i,3)} w_h^{(2)} a_h^{(i)}\big(1 - a_h^{(i)}\big).$$


Neural Networks: Learning

Backpropagation Algorithm

1 Randomly initialize weights.


2 For $w$ being any one of $b_h^{(1)}, w_{h,j}^{(1)}, b^{(2)}, w_h^{(2)}$, $h = 1, \dots, H$; $j = 1, \dots, p$:
(I) Set $\Delta = 0$ ($\Delta$ represents $\sum_{i=1}^{N} \frac{\partial J^{(i)}(\Theta)}{\partial w}$);
(II) For $i = 1, \dots, N$:
(a) With the current weights, compute $a_h^{(i)}$ and $h_\Theta(x^{(i)})$;
(b) $\delta^{(i,3)} = h_\Theta(x^{(i)}) - y^{(i)}$; by the backpropagation formula, $\delta_h^{(i,2)} = \delta^{(i,3)} w_h^{(2)} a_h^{(i)}(1 - a_h^{(i)})$;
(c) Set $\Delta \leftarrow \Delta + \frac{\partial J^{(i)}(\Theta)}{\partial w}$.
(III) Update $w$:
$$w \leftarrow w - \alpha(\Delta + \lambda w),$$
where for $w = b_h^{(1)}$ or $b^{(2)}$, we do not have the term $\lambda w$ above.
3 Repeat the second step until convergence.
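To make the procedure concrete, below is a minimal R sketch (not part of the lecture code) of this batch gradient-descent fit for one hidden layer and one sigmoid output; the function name fit_simple_nn and the default values of H, alpha, lambda and n_iter are illustrative choices only, and the nnet() function used later in the slides relies on a more sophisticated optimizer than plain gradient descent.

sigmoid <- function(z) 1 / (1 + exp(-z))

# X: N x p input matrix; y: 0/1 response vector; H hidden units; alpha: learning rate;
# lambda: weight decay on the slope weights (intercepts are not penalized).
fit_simple_nn <- function(X, y, H = 3, alpha = 0.01, lambda = 0, n_iter = 5000) {
  N <- nrow(X); p <- ncol(X)
  W1 <- matrix(rnorm(H * p, sd = 0.1), H, p); b1 <- rnorm(H, sd = 0.1)  # hidden layer
  w2 <- rnorm(H, sd = 0.1); b2 <- rnorm(1, sd = 0.1)                    # output layer
  for (it in 1:n_iter) {
    # (II)(a) forward pass for all samples at once
    A   <- sigmoid(sweep(X %*% t(W1), 2, b1, "+"))   # N x H matrix of a_h^(i)
    out <- sigmoid(drop(A %*% w2) + b2)              # h_Theta(x^(i)), length N
    # (II)(b) errors: delta^(i,3) and delta_h^(i,2)
    d3 <- out - y                                    # length N
    D2 <- (d3 %o% w2) * A * (1 - A)                  # N x H matrix
    # (II)(c) + (III): accumulate gradients over samples and update (penalty on slopes only)
    w2 <- w2 - alpha * (drop(t(A) %*% d3) + lambda * w2)
    b2 <- b2 - alpha * sum(d3)
    W1 <- W1 - alpha * (t(D2) %*% X + lambda * W1)
    b1 <- b1 - alpha * colSums(D2)
  }
  list(W1 = W1, b1 = b1, w2 = w2, b2 = b2)
}

For instance, one could try fit_simple_nn(scale(as.matrix(iris[, 1:4])), as.numeric(iris$Species == "setosa")) for a two-class fit similar in spirit to the later Iris example.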


Neural Networks: Learning

Discussion of Backpropagation Algorithm and Online Learning


• The advantages of backpropagation are its simplicity and its local nature.
• In the backpropagation algorithm, each hidden unit passes and receives information
only to and from units that share a connection. Hence it can be implemented
efficiently on a parallel architecture computer.
• The procedure in the previous slide is sometimes called “batch” learning, since the
derivatives from the training cases are handled as one batch.

• Online learning can also be carried out – processing each observation one at a
time, updating the gradient after each training case, and cycling through the
training cases many times.
• Online learning allows the network to handle very large training sets, and also to update the weights as new observations come in. This is the situation for perceptual learning in humans. It may also work better when many training cases are redundant.
• The learning rate $\alpha$ for batch learning is usually taken to be a constant. However, for online learning to converge to a (local) minimum, we have to decrease $\alpha$ with time: $\alpha = \alpha_i$. This learning is a form of stochastic approximation; results in this field ensure convergence if $\alpha_i \to 0$, $\sum_i \alpha_i = \infty$, and $\sum_i \alpha_i^2 < \infty$ (satisfied, for example, by $\alpha_i = 1/i^{\beta}$, $0.5 < \beta \leq 1$).


Neural Networks: Learning

Practical Considerations
Starting values Note that the use of exact zero weights leads to zero derivatives and
perfect symmetry, and the algorithm never moves. Starting values for
weights are usually chosen to be random values near zero.
Overfitting Often neural networks have too many weights and will overfit the data.
An early stopping rule can be used to avoid this: since the starting weights are
near zero, the model begins at a highly regularized (nearly linear) solution,
and stopping early keeps it from moving too far from that solution.
Cross-validation is useful for deciding when to stop. Another explicit method
is shrinking the weights, for example by introducing a ridge penalty.
Scaling of the inputs It is best to standardize all inputs to have mean zero and
standard deviation one. With standardized inputs, it is typical to take
the starting weights as random uniform variables over the range
[−0.7, 0.7].
Number of hidden units and layers It is better to have too many hidden units than too
few. It is most common to put down a reasonably large number
(5-100) of units and train them with regularization. Choice of the
number of hidden layers is guided by background knowledge and
experimentation.
Multiple Minima One must at least try a number of random starting configurations,
and choose the solution giving the lowest (penalized) error.

An R Example: Iris Data

Example: Iris Data


Species = 1, 2, or 3. Consider Species as a quantitative response (by force, for illustration purposes) and run a neural network on the Iris data; a sketch of such a fit is given below.
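The code and output on the original slide are shown as an image and are not reproduced; the following is a hedged sketch consistent with the later slides (a single hidden layer of 2 units, linear output, and objects named d and iris.nn), not necessarily the lecturer's exact call:

library(nnet)

d <- iris
d$Species <- as.numeric(d$Species)   # force Species to be quantitative: 1, 2, 3

set.seed(1)                          # results depend on the random starting weights
iris.nn <- nnet(Species ~ ., data = d, size = 2,  # 2 hidden units, matching the diagram
                linout = TRUE, maxit = 500)       # linear output unit (regression)
summary(iris.nn)                     # prints the fitted weights iris.nn$wts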


An R Example: Iris Data

Architecture Diagram of the artificial neural networks


An R Example: Iris Data

Compute the output manually

$$h_1 = \sigma(-85.74 - 26.3\,i_1 - 30.57\,i_2 + 51.6\,i_3 + 49.7\,i_4),$$
$$h_2 = \sigma(3.92 + 4.01\,i_1 + 5.37\,i_2 - 11.08\,i_3 - 7.67\,i_4),$$
$$o = 1.99 + 0.99\,h_1 - h_2,$$

where $\sigma(x) = 1/(1 + e^{-x})$.

Remark: this case has a linear output. If linout=F (the default in nnet), then $o = \sigma(\cdots)$.

In R, we can compute the fitted values (for the whole training data) with the following code:
w1 <- matrix(iris.nn$wts[1:10], nrow = 2, byrow = TRUE)   # hidden-layer weights: 2 units x (bias + 4 inputs)
w2 <- matrix(iris.nn$wts[11:13], nrow = 1, byrow = TRUE)  # output-layer weights: 1 unit x (bias + 2 hidden units)
sigmoid <- function(x) 1 / (1 + exp(-x))
x <- cbind(1, d[, 1:4])        # prepend the bias column to the inputs
h <- sigmoid(w1 %*% t(x))      # hidden-unit activations h1, h2
h <- rbind(1, h)               # prepend the bias row for the output layer
out <- t(w2 %*% h)             # fitted values (linear output o)

Alternatively, use iris.nn$fitted.values or predict(iris.nn, d).


An R Example: Iris Data

Improved version of artificial neural networks

A major problem with ANNs is that the solution depends on the starting values. Each time we run nnet(), we may obtain a different result. Therefore, we have to run nnet() several times and keep the best result. Here is an improved version of the nnet() function:
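The code on the original slide is shown as an image; a minimal sketch of such a wrapper (the name best.nnet and its arguments are illustrative, not necessarily the lecturer's) could be:

library(nnet)

# Run nnet() n.try times from different random starting weights and keep the fit
# with the smallest final value of the (penalized) fitting criterion.
best.nnet <- function(formula, data, size, n.try = 20, ...) {
  best <- NULL
  for (k in 1:n.try) {
    fit <- nnet(formula, data = data, size = size, trace = FALSE, ...)
    if (is.null(best) || fit$value < best$value) best <- fit   # $value: final objective
  }
  best
}

iris.nn <- best.nnet(Species ~ ., data = d, size = 2, linout = TRUE, maxit = 500)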


An R Example: Iris Data

Two-Class Classification

Let's relabel Species to 0 for classes 2 and 3 (hence Species becomes a binary response). Then we perform two-class classification (in fact, class 1 versus [classes 2 and 3]). When the response is binary, nnet uses the logistic output function by default; a sketch of such a fit is given below.
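The slide's R code is again an image; a hedged sketch of this binary fit (not necessarily the lecturer's exact call) is:

d2 <- d                                      # d codes Species as 1, 2, 3 (see the earlier slide)
d2$Species <- ifelse(d2$Species == 1, 1, 0)  # class 1 versus classes 2 and 3

set.seed(1)
iris.nn2 <- nnet(Species ~ ., data = d2, size = 2, maxit = 500)  # linout = FALSE: logistic output
table(observed = d2$Species, predicted = round(fitted(iris.nn2)))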


Multi-Layer, Multi-Output Neural Networks

Formulation of Neural Networks


• Let $X = (X_1, X_2, \dots, X_p)$ and $Y = (Y_1, Y_2, \dots, Y_K)$ be the inputs and responses;
• let $L$ be the number of layers (the 1st layer is the input layer, the $L$th layer is the output layer), and $n_\ell$ be the number of units in layer $\ell$, $\ell = 1, \dots, L$ ($n_1 = p$, $n_L = K$);
• let $a_j^{(\ell)}$ be the "activation" of unit $j$ in layer $\ell$, and $W^{(\ell-1)} = \{w_{j,j'}^{(\ell-1)}\}_{j=1,\dots,n_\ell;\ j'=0,1,\dots,n_{\ell-1}}$ be the matrix of weights controlling the function mapping from layer $\ell-1$ to layer $\ell$.

Then we have $a_j^{(1)} = X_j$ and, for $\ell = 2, \dots, L-1$; $j = 1, \dots, n_\ell$,

$$a_j^{(\ell)} = f_h\Big(b_j^{(\ell-1)} + \sum_{i=1}^{n_{\ell-1}} w_{j,i}^{(\ell-1)} a_i^{(\ell-1)}\Big), \quad \text{i.e.}\quad a^{(\ell)} := \begin{pmatrix} a_1^{(\ell)} \\ \vdots \\ a_{n_\ell}^{(\ell)} \end{pmatrix} = f_h\big(b^{(\ell-1)} + W^{(\ell-1)} a^{(\ell-1)}\big),$$

where the bias $b^{(\ell)} = (b_1^{(\ell)}, \dots, b_{n_\ell}^{(\ell)})^T$ and we assume all units in all hidden layers have the same activation function $f_h$. In addition, we assume all units in the output layer have the same activation function $f_o$. Then we have, for $k = 1, \dots, K$,

$$(h_\Theta(X))_k = a_k^{(L)} = f_o\Big(b_k^{(L-1)} + \sum_{i=1}^{n_{L-1}} w_{k,i}^{(L-1)} a_i^{(L-1)}\Big), \qquad \Theta = \{b^{(\ell)}, W^{(\ell)}\}_{\ell=1}^{L-1},$$

i.e. $h_\Theta(X) = f_o\big(b^{(L-1)} + W^{(L-1)} a^{(L-1)}\big)$.


Multi-Layer, Multi-Output Neural Networks

Activation Functions for Neural Networks


We have $a^{(1)} = X$,

$$a^{(\ell)} = f_h\big(b^{(\ell-1)} + W^{(\ell-1)} a^{(\ell-1)}\big), \quad \ell = 2, \dots, L-1,$$
$$h_\Theta(X) = f_o\big(b^{(L-1)} + W^{(L-1)} a^{(L-1)}\big),$$

where for a univariate function $f$, $f(\vec{x}) = (f(x_1), \dots, f(x_n))^T$.

The activation function $f_h(x)$, $x \in \mathbb{R}$, is usually chosen to be the sigmoid (logistic) function
$$f_h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}.$$

For the output function $f_o(T)$, $T \in \mathbb{R}^K$:
• For regression, we typically choose the identity function $f_o(T) = T$.
• For $K$-class classification, we usually use the sigmoid function $f_o(T) = (\sigma(T_1), \dots, \sigma(T_K))$ or the softmax function
$$f_o(T) = \left(\frac{e^{T_1}}{\sum_{k=1}^{K} e^{T_k}}, \cdots, \frac{e^{T_K}}{\sum_{k=1}^{K} e^{T_k}}\right).$$
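As a small aside (not from the slides), the softmax output is easy to compute directly in R; subtracting the maximum before exponentiating is a standard trick to avoid numerical overflow and does not change the result:

softmax <- function(scores) {          # scores = (T_1, ..., T_K)
  z <- exp(scores - max(scores))       # subtracting max(scores) leaves the result unchanged
  z / sum(z)
}
softmax(c(2, 1, -1))                   # probabilities summing to 1, largest for the first class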


Multi-Layer, Multi-Output Neural Networks

Cost Function
Suppose that we have $N$ measurements $(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$ and that the output of the network is $h_\Theta(x)$. Then the parameter set $\Theta$ is chosen to minimize the error function.

For regression, we use the sum of squared errors as our measure of fit (error function):

$$J(\Theta) = \sum_{i=1}^{N}\sum_{k=1}^{K} \big(y_k^{(i)} - (h_\Theta(x^{(i)}))_k\big)^2 = \sum_{i=1}^{N} \|y^{(i)} - h_\Theta(x^{(i)})\|^2.$$

For classification, we typically use the discrepancy function for multiple logistic regressions:

$$J(\Theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} \big[y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big)\big]$$

or the cross-entropy (deviance):

$$J(\Theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_k^{(i)} \log(h_\Theta(x^{(i)}))_k.$$


Multi-Layer, Multi-Output Neural Networks

Gradient Descent Approach


For illustration, we consider a classification problem with the following error function:

$$J(\Theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} \big[y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big)\big] + \frac{\lambda}{2}\sum_{\ell=2}^{L}\sum_{j=1}^{n_\ell}\sum_{j'=1}^{n_{\ell-1}} \big(w_{j,j'}^{(\ell-1)}\big)^2 =: \sum_{i=1}^{N} J^{(i)}(\Theta) + \frac{\lambda}{2}\sum_{\ell=2}^{L}\sum_{j=1}^{n_\ell}\sum_{j'=1}^{n_{\ell-1}} \big(w_{j,j'}^{(\ell-1)}\big)^2,$$

where the second term is a penalty introduced to address the potential overfitting problem and the tuning parameter $\lambda$ is known as weight decay. Let $w_{j,0}^{(\ell-1)} := b_j^{(\ell-1)}$.

The generic approach to minimizing $J(\Theta)$ is gradient descent, called the backpropagation algorithm in this setting:

$$w_{j,j'}^{(\ell-1)} \leftarrow w_{j,j'}^{(\ell-1)} - \alpha \frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell-1)}}.$$

The gradient can be easily derived using the chain rule for differentiation. It can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit (units to which it is connected).


Multi-Layer, Multi-Output Neural Networks

Gradient Computation
Suppose we have data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $x^{(i)} \in \mathbb{R}^p$, $y^{(i)} \in \mathbb{R}^K$. Denote

$$a_j^{(i,1)} = x_j^{(i)}, \qquad a_0^{(i,\ell-1)} = 1, \qquad z_j^{(i,\ell)} := \sum_{j'=0}^{n_{\ell-1}} w_{j,j'}^{(\ell-1)} a_{j'}^{(i,\ell-1)}, \quad \ell = 2, \dots, L,$$
$$a_j^{(i,\ell)} = f_h\big(z_j^{(i,\ell)}\big), \quad \ell = 2, \dots, L-1, \qquad (h_\Theta(x^{(i)}))_k = f_o\big(z_k^{(i,L)}\big).$$

We can apply the chain rule for partial derivatives to give

$$\frac{\partial J^{(i)}(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}} \frac{\partial z_j^{(i,\ell)}}{\partial w_{j,j'}^{(\ell-1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}}\, a_{j'}^{(i,\ell-1)} =: \delta_j^{(i,\ell)} a_{j'}^{(i,\ell-1)}.$$

Here, $\delta_j^{(i,\ell)}$ represents the "error" of unit $j$ in layer $\ell$ for sample $(x^{(i)}, y^{(i)})$.

For the output layer, we have

$$\delta_k^{(i,L)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_k^{(i,L)}} = \frac{f_o(z_k^{(i,L)}) - y_k^{(i)}}{f_o(z_k^{(i,L)})\big(1 - f_o(z_k^{(i,L)})\big)}\, f_o'\big(z_k^{(i,L)}\big).$$

For the hidden layers, we have, for $\ell = 2, \dots, L-1$,

$$\delta_j^{(i,\ell)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}} = \sum_{m=1}^{n_{\ell+1}} \frac{\partial J^{(i)}(\Theta)}{\partial z_m^{(i,\ell+1)}} \frac{\partial z_m^{(i,\ell+1)}}{\partial z_j^{(i,\ell)}} = \sum_{m=1}^{n_{\ell+1}} \delta_m^{(i,\ell+1)} w_{m,j}^{(\ell)} f_h'\big(z_j^{(i,\ell)}\big).$$


Multi-Layer, Multi-Output Neural Networks

Backpropagation Formula
We obtain the following backpropagation formula (equation):

$$\delta_k^{(i,L)} = \frac{f_o(z_k^{(i,L)}) - y_k^{(i)}}{f_o(z_k^{(i,L)})\big(1 - f_o(z_k^{(i,L)})\big)}\, f_o'\big(z_k^{(i,L)}\big),$$
$$\delta_j^{(i,\ell)} = f_h'\big(z_j^{(i,\ell)}\big) \sum_{m=1}^{n_{\ell+1}} w_{m,j}^{(\ell)}\, \delta_m^{(i,\ell+1)}, \qquad \ell = L-1, L-2, \dots, 2,$$

which tells us that the value of $\delta$ for a particular hidden unit can be obtained by propagating the $\delta$'s backwards from units higher up in the network. Therefore,

$$\frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} = \sum_{i=1}^{N} \frac{\partial J^{(i)}(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} + \frac{\partial\,\mathrm{Penalty}}{\partial w_{j,j'}^{(\ell-1)}} = \sum_{i=1}^{N} \delta_j^{(i,\ell)} a_{j'}^{(i,\ell-1)} + \begin{cases} \lambda w_{j,j'}^{(\ell-1)}, & \text{if } j' \neq 0, \\ 0, & \text{if } j' = 0. \end{cases}$$

If we assume sigmoid functions are used for $f_h$ and $f_o$, then we have

$$\delta_k^{(i,L)} = (h_\Theta(x^{(i)}))_k - y_k^{(i)},$$
$$\delta_j^{(i,\ell)} = a_j^{(i,\ell)}\big(1 - a_j^{(i,\ell)}\big) \sum_{m=1}^{n_{\ell+1}} w_{m,j}^{(\ell)}\, \delta_m^{(i,\ell+1)}, \qquad \ell = L-1, L-2, \dots, 2.$$


Multi-Layer, Multi-Output Neural Networks

Intuition of Backpropagation Formula

$$\delta_1^{(3)} \approx w_{1,1}^{(3)} \delta_1^{(4)},$$
$$\delta_2^{(3)} \approx w_{1,2}^{(3)} \delta_1^{(4)},$$
$$\delta_1^{(2)} \approx w_{1,1}^{(2)} \delta_1^{(3)} + w_{2,1}^{(2)} \delta_2^{(3)},$$
$$\delta_2^{(2)} \approx w_{1,2}^{(2)} \delta_1^{(3)} + w_{2,2}^{(2)} \delta_2^{(3)}.$$


Multi-Layer, Multi-Output Neural Networks

Vectorization of Gradient Computation


Activation forward propagation:

$$a^{(i,1)} = x^{(i)}, \qquad a^{(i,\ell)} = f_h\big(b^{(\ell-1)} + W^{(\ell-1)} a^{(i,\ell-1)}\big), \quad \ell = 2, \dots, L-1,$$
$$h_\Theta(x^{(i)}) = f_o\big(b^{(L-1)} + W^{(L-1)} a^{(i,L-1)}\big).$$

Error backward propagation:

Let $\delta^{(i,\ell)} = (\delta_1^{(i,\ell)}, \dots, \delta_{n_\ell}^{(i,\ell)})^T$, $\ell = 2, \dots, L$. We have

$$\delta^{(i,L)} = h_\Theta(x^{(i)}) - y^{(i)},$$
$$\delta^{(i,\ell)} = \big[(W^{(\ell)})^T \delta^{(i,\ell+1)}\big] \ast a^{(i,\ell)} \ast \big(1 - a^{(i,\ell)}\big), \qquad \ell = L-1, L-2, \dots, 2,$$

where $\ast$ denotes the elementwise product, and we obtain the gradient

$$\frac{\partial J(\Theta)}{\partial W^{(\ell-1)}} = \sum_{i=1}^{N} \delta^{(i,\ell)} \big(a^{(i,\ell-1)}\big)^T + \lambda W^{(\ell-1)},$$
$$\frac{\partial J(\Theta)}{\partial b^{(\ell-1)}} = \sum_{i=1}^{N} \delta^{(i,\ell)}.$$
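As an illustration (not from the slides), these vectorized formulas translate almost directly into R. The sketch below computes one forward and backward pass for a single sample, assuming sigmoid activations for both $f_h$ and $f_o$ and lists W and b of weight matrices and bias vectors indexed as above; the function name backprop_one_sample is illustrative.

sigmoid <- function(z) 1 / (1 + exp(-z))

# x: input vector; y: target vector; W[[l]], b[[l]] map layer l to layer l + 1.
# Assumes at least one hidden layer and sigmoid activations throughout.
backprop_one_sample <- function(x, y, W, b) {
  L <- length(W) + 1
  a <- vector("list", L); a[[1]] <- x
  for (l in 2:L)                                   # forward propagation
    a[[l]] <- sigmoid(b[[l - 1]] + W[[l - 1]] %*% a[[l - 1]])
  delta <- vector("list", L)
  delta[[L]] <- a[[L]] - y                         # delta^(L) = h_Theta(x) - y
  for (l in (L - 1):2)                             # backward propagation
    delta[[l]] <- (t(W[[l]]) %*% delta[[l + 1]]) * a[[l]] * (1 - a[[l]])
  list(gradW = lapply(1:(L - 1), function(l) delta[[l + 1]] %*% t(a[[l]])),  # delta^(l+1) (a^(l))^T
       gradb = lapply(1:(L - 1), function(l) delta[[l + 1]]))
}

Summing these per-sample contributions over $i$ and adding $\lambda W^{(\ell-1)}$ to the slope part gives the full gradient above.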


Multi-Layer, Multi-Output Neural Networks

Backpropagation Algorithm
1 Randomly initialize the weights $\{w_{j,j'}^{(\ell)}\}_{\ell=1,\dots,L-1;\ j=1,\dots,n_{\ell+1};\ j'=0,1,\dots,n_\ell}$.

2 Set $\Delta_{j,j'}^{(\ell)} = 0$ for all $\ell = 1, \dots, L-1$; $j = 1, \dots, n_{\ell+1}$; $j' = 0, 1, \dots, n_\ell$.

3 For $i = 1, \dots, N$:
• Set $a^{(1)} = x^{(i)}$;
• With the current weights $\{w_{j,j'}^{(\ell)}\}$, perform forward propagation to compute $a^{(\ell)}$ for $\ell = 2, \dots, L$;
• Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$;
• Apply the backpropagation formula to compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$;
• Set $\Delta_{j,j'}^{(\ell)} \leftarrow \Delta_{j,j'}^{(\ell)} + \delta_j^{(\ell+1)} a_{j'}^{(\ell)}$, or in matrix form $\Delta^{(\ell)} \leftarrow \Delta^{(\ell)} + \delta^{(\ell+1)} [1, (a^{(\ell)})^T]$.

4 Set
$$\frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell)}} = \Delta_{j,j'}^{(\ell)} + \begin{cases} \lambda w_{j,j'}^{(\ell)}, & \text{if } j' \neq 0, \\ 0, & \text{if } j' = 0. \end{cases}$$

5 Update the weights with the learning rate $\alpha > 0$:
$$w_{j,j'}^{(\ell)} \leftarrow w_{j,j'}^{(\ell)} - \alpha \frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell)}}.$$

6 Repeat steps 2-5 until convergence.


A Sophisticated Application

R Example Revisited: Multi-Class Classification


In R, the default output function for multi-class (class size K > 2) classification is the softmax function. Remember to make the response vector a factor. The outputs are the probabilities of the corresponding classes, and the predicted class is the one with the largest probability; a sketch of such a fit is given below.
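The slide's code is shown as an image; a hedged sketch of a three-class fit on the Iris data (not necessarily the lecturer's exact call) is:

library(nnet)

set.seed(1)
iris.nn3 <- nnet(Species ~ ., data = iris, size = 2, maxit = 500)   # Species kept as a factor
head(fitted(iris.nn3))                             # one column of class probabilities per class
pred <- predict(iris.nn3, iris, type = "class")    # class with the largest probability
table(observed = iris$Species, predicted = pred)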


A Sophisticated Application

Example: ZIP Code Data (ESL 11.7)


This example is a character recognition task: classification of handwritten numerals.
This problem captured the attention of the machine learning and neural network
community for many years, and has remained a benchmark problem in the field.
Details may be found in Le Cun (1989).

There are 320 digits in the training set and 160 in the test set, each with 256 pixel
inputs.

Figure: Examples of training cases from ZIP code data. Each image is a 16 × 16 8-bit
grayscale representation of a handwritten digit.

A Sophisticated Application

Architecture of the Networks: ZIP Code Data


Net-1: No hidden layer (=multi-logit model).
Net-2: 1 hidden layer with 12 units fully
connected.
Net-3: 2 hidden layers, locally connected.
• 1st hidden layer : 8 × 8, each takes input
from a 3 × 3 patch of the input layer; two
adjacent units overlap one row or column;
• 2nd hidden layer : 4 × 4, each takes inputs
from a 5 × 5 patch
Net-4: 2 hidden layers, locally connected with
weight sharing.
• 1st hidden layer : 8 × 8, each takes input
from a 3 × 3 patch of the input layer; same
set of 9 weights (weight sharing);
• 2nd hidden layer : 4 × 4, same as Net-3
with no weight sharing
Net-5: 2 hidden layers, locally connected, two
levels of weight sharing.


A Sophisticated Application

Results: ZIP Code Data


The networks all have sigmoidal output units, and were all fit with the sum-of-squares error (RSS).

Network Architecture              # Links   # Weights   % Correct
Net-1: Single layer network          2570        2570       80.0%
Net-2: Two layer network             3214        3214       87.0%
Net-3: Locally connected             1226        1226       88.5%
Net-4: Constrained network 1         2266        1132       94.0%
Net-5: Constrained network 2         5194        1060       98.4%

Table: Test set performance of 5 neural networks on a handwritten digit classification example.

Summary of Neural Networks

Pros and Cons of Neural Network as a Classifier

Strength • High tolerance to noisy data.


• Well-suited for continuous, ordinal and nominal-valued inputs and
outputs.
• Successful on an array of real-world data, e.g., hand-written
letters.
• Algorithms are inherently parallel.

Weakness • Long training time.


• Require a number of parameters that are typically best determined empirically, e.g., the network topology or "structure."
• Poor interpretability: networks work like a black box, and it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.


Summary of Neural Networks

Checklist
☐ Understand what a neural network is and its overall model structure. It is better to also know the difference between (logistic) regression and neural networks (check out the nonlinear classification example, which demonstrates the usefulness of neural networks).

☐ Know how to write down the equations characterizing a neural network. Given the weights and inputs, know how to compute the output and draw the architecture diagram of the neural network.

☐ Understand the basic idea of the backpropagation algorithm for fitting a neural network. It is basically a gradient descent method with some refinements.

☐ Know the backpropagation theory for the simplest neural network (one hidden layer, one output).

☐ Know how to implement neural networks in R and how to read the R output.

☐ Learn the practical considerations of neural networks.
