Lecture 6: Artificial Neural Networks

PUN Chi Seng

MH4510 Statistical Learning and Data Mining

Human Brain and Neurons

Neural Networks
The term neural network has evolved to encompass a large class of models and
learning methods that were originally inspired by thoughts on how the brain might
work. The goal is to mimic the functions and mechanisms of the brain.
• Very widely used in the 80s and early 90s; popularity diminished in the late 90s.
• Recent resurgence: state-of-the-art technique for many applications.

Before we go into neural networks, it is helpful to learn a bit about the structure of the human brain:
• Our brain consists of approximately 100 billion cells of a specific type known as neurons. Each of these neurons is typically connected to about 10,000 other neurons. These neurons do not regenerate.
• It is widely accepted that these neurons are responsible for our ability to memorize, learn, generalize and think.

The exact function of these neurons remains a mystery, but a very simple mathematical model that mimics them provides surprisingly good performance in pattern recognition, classification, and prediction.


Human Brain and Neurons

Neurons in the Brain

Figure: Source: heart.cbl.utoronto.ca/~berj/projects.html

Within a neuron, there are four main components:


• Dendrites are for receiving information;
• Cell body is for processing information;
• Axon carries processed information to other neurons;
• Synapse is the junction between an axon terminal and the dendrites of other neurons.
These neurons are connected to form a huge and very complicated network.

Human Brain and Neurons

Artificial Neuron
To mimic this neuron, we construct the following artificial neuron:

• $X_1, X_2, \dots, X_p$ are the inputs received from other neurons or from the environment.
• The total input $I = w_0 + \sum_{j=1}^{p} w_j X_j$ is a linear combination of these inputs with weights $w_1, \dots, w_p$ ($w_0$ is known as the bias).
• The activation function (or transfer function) $f$ converts the input $I$ to the output $V$: $V = f(I)$.
• The output $V$ will go to other neurons as input.

Human Brain and Neurons

Activation Functions
Besides the linear function, there are some commonly used activation functions, notably the logistic (sigmoid) function and the hyperbolic tangent:

Notice that $\tanh(x) = 2 \times \mathrm{logistic}(2x) - 1$, so the first two are equivalent, up to a rescaling of the weights, except at the output units. Sometimes Gaussian radial basis functions ($\exp(-x^2/\sigma)$) are used, producing what is known as a radial basis function network.
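The original slide's plot of these functions is not reproduced here; as a rough substitute, the following minimal R sketch (not from the lecture) defines and plots the logistic, tanh, and Gaussian radial basis activation functions (sigma = 1 is an arbitrary choice):

logistic <- function(x) 1 / (1 + exp(-x))
gauss_rbf <- function(x, sigma = 1) exp(-x^2 / sigma)

x <- seq(-4, 4, by = 0.1)
plot(x, logistic(x), type = "l", ylim = c(-1, 1), ylab = "activation")
lines(x, tanh(x), lty = 2)        # tanh(x) = 2 * logistic(2x) - 1
lines(x, gauss_rbf(x), lty = 3)
legend("topleft", c("logistic", "tanh", "Gaussian RBF"), lty = 1:3)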


Neural Networks: Representation

Neural Networks

A feed-forward neural network is a series of (logistic) regression models stacked on top of each other, with the final layer being either another logistic or a linear regression model, depending on whether we are solving a classification or a regression problem.

• For regression, there is typically only one output unit.

• For $K$-class classification, there are $K$ output units, with the $k$th unit modeling the probability of class $k$. There are $K$ target measurements $Y_k$, $k = 1, \dots, K$, each coded as a 0-1 variable for the $k$th class.

Figure: A generic feed-forward network with a single hidden layer.


Neural Networks: Representation

Remarks on Feed-Forward Neural Network

1 The number of hidden layers can be zero, one, two, etc.

2 The number of neurons in the input and output layers is determined by the nature of the problem, but the number of neurons in the hidden layer is user-defined.

3 Within each layer, neurons are not connected to each other. Neurons in one layer are connected only to neurons in the next layer (feed-forward).

4 Each line joining two neurons is associated with a weight $w_{ij}$. These weights are unknown parameters that need to be estimated from the training dataset.

5 From a statistical point of view, neural networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.


Neural Networks: Representation

Single Output Unit


$$a_1^{(2)} = f_h\big(b_1^{(1)} + w_{1,1}^{(1)} X_1 + w_{1,2}^{(1)} X_2 + w_{1,3}^{(1)} X_3\big)$$
$$a_2^{(2)} = f_h\big(b_2^{(1)} + w_{2,1}^{(1)} X_1 + w_{2,2}^{(1)} X_2 + w_{2,3}^{(1)} X_3\big)$$
$$a_3^{(2)} = f_h\big(b_3^{(1)} + w_{3,1}^{(1)} X_1 + w_{3,2}^{(1)} X_2 + w_{3,3}^{(1)} X_3\big)$$
$$h_\Theta(x) = a_1^{(3)} = f_o\big(b_1^{(2)} + w_{1,1}^{(2)} a_1^{(2)} + w_{1,2}^{(2)} a_2^{(2)} + w_{1,3}^{(2)} a_3^{(2)}\big)$$

Another network architecture:


Neural Networks: Representation

Multiple Output Units: One-vs-all

Training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})$, where the response $y^{(i)}$ is one of

$$\underbrace{\begin{pmatrix}1\\0\\0\\0\end{pmatrix}}_{\text{Pedestrian}},\quad \underbrace{\begin{pmatrix}0\\1\\0\\0\end{pmatrix}}_{\text{Car}},\quad \underbrace{\begin{pmatrix}0\\0\\1\\0\end{pmatrix}}_{\text{Motorcycle}},\quad \underbrace{\begin{pmatrix}0\\0\\0\\1\end{pmatrix}}_{\text{Truck}}.$$

We want the output $h_\Theta(x)$ to be close to one of the vectors above.

Neural Networks: Representation

Non-Linear Classification Example: XOR/XNOR

Figure: y = 1: o; y = 0: ×. Left: A complex learning problem; Right: A simplified version (XOR operator).

Let $x_1, x_2$ be binary variables (0 or 1), and let $y_{\mathrm{XOR}} = x_1\ \mathrm{XOR}\ x_2$, $y_{\mathrm{XNOR}} = x_1\ \mathrm{XNOR}\ x_2$:

x1  x2  yXOR  yXNOR
0   0   0     1
0   1   1     0
1   0   1     0
1   1   0     1


Neural Networks: Representation

AND/OR Function
In order to build up a network that fits the XNOR example, we start with something slightly simpler and show a network that fits the AND/OR functions. Simple logistic regression alone can compute the logical AND/OR functions.

yAND ≈ hΘ (x) = σ(−30 + 20x1 + 20x2 ) yOR ≈ hΘ (x) = σ(−10 + 20x1 + 20x2 )

x1 x2 hΘ (x) x1 x2 hΘ (x)
0 0 σ(−30) ≈ 0 0 0 σ(−10) ≈ 0
0 1 σ(−10) ≈ 0 0 1 σ(10) ≈ 1
1 0 σ(−10) ≈ 0 1 0 σ(10) ≈ 1
1 1 σ(10) ≈ 1 1 1 σ(10) ≈ 1

Neural Networks: Representation

NOT Function

yNOT ≈ hΘ (x) = σ(10 − 20x1 )

x1 hΘ (x)
0 σ(10) ≈ 1
1 σ(−10) ≈ 0

y(NOT x1 ) AND (NOT x2 ) ≈ hΘ (x) = σ(10−20x1 −20x2 )

x1 x2 hΘ (x)
0 0 σ(10) ≈ 1
0 1 σ(−10) ≈ 0
1 0 σ(−10) ≈ 0
1 1 σ(−30) ≈ 0


Neural Networks: Representation

XNOR Function

$$a_1^{(2)} = y_{\mathrm{AND}} \approx \sigma(-30 + 20x_1 + 20x_2);$$
$$a_2^{(2)} = y_{(\mathrm{NOT}\,x_1)\ \mathrm{AND}\ (\mathrm{NOT}\,x_2)} \approx \sigma(10 - 20x_1 - 20x_2);$$
$$y_{\mathrm{XNOR}} \approx y_{\mathrm{OR}}(a) \approx \sigma(-10 + 20a_1^{(2)} + 20a_2^{(2)}).$$

x1  x2  a1^(2)  a2^(2)  hΘ(x)
0   0   0       1       1
0   1   0       0       0
1   0   0       0       0
1   1   1       0       1
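As a quick check (not from the slides), the following R sketch evaluates this two-layer XNOR network on all four binary inputs and recovers the truth table above:

sigmoid <- function(x) 1 / (1 + exp(-x))

xnor_net <- function(x1, x2) {
  a1 <- sigmoid(-30 + 20 * x1 + 20 * x2)   # hidden unit 1: approximately x1 AND x2
  a2 <- sigmoid( 10 - 20 * x1 - 20 * x2)   # hidden unit 2: approximately (NOT x1) AND (NOT x2)
  sigmoid(-10 + 20 * a1 + 20 * a2)         # output unit: approximately a1 OR a2
}

inputs <- expand.grid(x1 = 0:1, x2 = 0:1)
cbind(inputs, xnor = round(mapply(xnor_net, inputs$x1, inputs$x2)))   # gives 1, 0, 0, 1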


Neural Networks: Learning

Simplest Neural Network: One hidden layer, One output


Let X = (X1 , . . . , Xp ) and Y be the inputs and response. Consider a feed-forward
neural network with a single hidden layer, where we have H activation units:
$$a_h = f_h\Big(b_h^{(1)} + \sum_{j=1}^{p} w_{h,j}^{(1)} X_j\Big), \qquad h = 1, \dots, H,$$

and the output layer, where we have

$$h_\Theta(X) = \mathrm{Out} = f_o\Big(b^{(2)} + \sum_{h=1}^{H} w_h^{(2)} a_h\Big).$$

For illustration, we consider a binary response $Y = 0$ or $1$, and both $f_h(x)$ and $f_o(x)$ are sigmoid functions $\sigma(x) = 1/(1 + e^{-x})$. With the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, we use the following cost function:

$$J(\Theta) = \sum_{i=1}^{N} \Big[ -y^{(i)} \log\big(h_\Theta(x^{(i)})\big) - (1 - y^{(i)}) \log\big(1 - h_\Theta(x^{(i)})\big) \Big] =: \sum_{i=1}^{N} J^{(i)}(\Theta).$$


Neural Networks: Learning

Gradient Descent Approach


Since there are many parameters to be estimated in a neural network, we usually apply a regularization method and consider a cost function like

$$J(\Theta) = \sum_{i=1}^{N} J^{(i)}(\Theta) + \frac{\lambda}{2}\left(\sum_{j=1}^{p}\sum_{h=1}^{H} \big(w_{h,j}^{(1)}\big)^2 + \sum_{h=1}^{H} \big(w_h^{(2)}\big)^2\right),$$

where $\lambda$ is a tuning parameter. Note that we do NOT regularize the intercepts.

We adopt the gradient descent approach to minimize $J(\Theta)$. For $w$ being any one of $b_h^{(1)}, w_{h,j}^{(1)}, b^{(2)}, w_h^{(2)}$, $h = 1, \dots, H$; $j = 1, \dots, p$, we update $w$ as follows:

$$w \leftarrow w - \alpha \frac{\partial J(\Theta)}{\partial w}.$$

The key to this update is to derive

$$\frac{\partial J(\Theta)}{\partial w} = \sum_{i=1}^{N} \frac{\partial J^{(i)}(\Theta)}{\partial w} + \lambda w,$$

noting that for $w = b_h^{(1)}$ or $b^{(2)}$, we do not have the term $\lambda w$ above.


Neural Networks: Learning

Gradient Computation
Denote by

$$z_h^{(i,2)} = b_h^{(1)} + \sum_{j=1}^{p} w_{h,j}^{(1)} x_j^{(i)}, \qquad a_h^{(i)} = \sigma\big(z_h^{(i,2)}\big),$$
$$z^{(i,3)} = b^{(2)} + \sum_{h=1}^{H} w_h^{(2)} a_h^{(i)}, \qquad h_\Theta(x^{(i)}) = \sigma\big(z^{(i,3)}\big).$$

We first work on $w = b^{(2)}, w_h^{(2)}$ (output layer):

$$\frac{\partial J^{(i)}(\Theta)}{\partial w_h^{(2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial w_h^{(2)}} =: \delta^{(i,3)} a_h^{(i)},$$
$$\frac{\partial J^{(i)}(\Theta)}{\partial b^{(2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial b^{(2)}} = \delta^{(i,3)},$$

where

$$\delta^{(i,3)} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} = -y^{(i)} \frac{\sigma'(z^{(i,3)})}{\sigma(z^{(i,3)})} - (1 - y^{(i)}) \frac{-\sigma'(z^{(i,3)})}{1 - \sigma(z^{(i,3)})} = -y^{(i)}\big(1 - \sigma(z^{(i,3)})\big) + (1 - y^{(i)})\sigma(z^{(i,3)}) = \sigma(z^{(i,3)}) - y^{(i)}.$$


Neural Networks: Learning

Gradient Computation (cont’)


We now work on $w = b_h^{(1)}, w_{h,j}^{(1)}$ (hidden layer):

$$\frac{\partial J^{(i)}(\Theta)}{\partial w_{h,j}^{(1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_h^{(i,2)}} \frac{\partial z_h^{(i,2)}}{\partial w_{h,j}^{(1)}} =: \delta_h^{(i,2)} x_j^{(i)},$$
$$\frac{\partial J^{(i)}(\Theta)}{\partial b_h^{(1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_h^{(i,2)}} \frac{\partial z_h^{(i,2)}}{\partial b_h^{(1)}} = \delta_h^{(i,2)},$$

where

$$\delta_h^{(i,2)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_h^{(i,2)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z^{(i,3)}} \frac{\partial z^{(i,3)}}{\partial z_h^{(i,2)}} = \delta^{(i,3)} \frac{\partial z^{(i,3)}}{\partial z_h^{(i,2)}},$$
$$\frac{\partial z^{(i,3)}}{\partial z_h^{(i,2)}} = \frac{\partial z^{(i,3)}}{\partial a_h^{(i)}} \frac{\partial a_h^{(i)}}{\partial z_h^{(i,2)}} = w_h^{(2)} \sigma'\big(z_h^{(i,2)}\big) = w_h^{(2)} \sigma\big(z_h^{(i,2)}\big)\big(1 - \sigma(z_h^{(i,2)})\big) = w_h^{(2)} a_h^{(i)}\big(1 - a_h^{(i)}\big).$$

Hence, we have the backpropagation formula:

$$\delta_h^{(i,2)} = \delta^{(i,3)} w_h^{(2)} a_h^{(i)}\big(1 - a_h^{(i)}\big).$$


Neural Networks: Learning

Backpropagation Algorithm

1 Randomly initialize weights.


2 For $w$ being any one of $b_h^{(1)}, w_{h,j}^{(1)}, b^{(2)}, w_h^{(2)}$, $h = 1, \dots, H$; $j = 1, \dots, p$:
(I) Set $\Delta = 0$ ($\Delta$ represents $\sum_{i=1}^{N} \frac{\partial J^{(i)}(\Theta)}{\partial w}$);
(II) For $i = 1, \dots, N$:
(a) With the current weights, compute $a_h^{(i)}$ and $h_\Theta(x^{(i)})$;
(b) $\delta^{(i,3)} = h_\Theta(x^{(i)}) - y^{(i)}$; by the backpropagation formula, $\delta_h^{(i,2)} = \delta^{(i,3)} w_h^{(2)} a_h^{(i)}(1 - a_h^{(i)})$;
(c) Set $\Delta \leftarrow \Delta + \frac{\partial J^{(i)}(\Theta)}{\partial w}$.
(III) Update $w$:
$$w \leftarrow w - \alpha(\Delta + \lambda w),$$
where for $w = b_h^{(1)}$ or $b^{(2)}$, we do not have the term $\lambda w$ above.
3 Repeat the second step until convergence.
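To make the procedure concrete, below is a minimal R sketch (not part of the lecture code) of this batch gradient-descent fit for one hidden layer and one sigmoid output; the function name fit_simple_nn and the default values of H, alpha, lambda and n_iter are illustrative choices only, and the nnet() function used later in the slides relies on a more sophisticated optimizer than plain gradient descent.

sigmoid <- function(z) 1 / (1 + exp(-z))

# X: N x p input matrix; y: 0/1 response vector; H hidden units; alpha: learning rate;
# lambda: weight decay on the slope weights (intercepts are not penalized).
fit_simple_nn <- function(X, y, H = 3, alpha = 0.01, lambda = 0, n_iter = 5000) {
  N <- nrow(X); p <- ncol(X)
  W1 <- matrix(rnorm(H * p, sd = 0.1), H, p); b1 <- rnorm(H, sd = 0.1)  # hidden layer
  w2 <- rnorm(H, sd = 0.1); b2 <- rnorm(1, sd = 0.1)                    # output layer
  for (it in 1:n_iter) {
    # (II)(a) forward pass for all samples at once
    A   <- sigmoid(sweep(X %*% t(W1), 2, b1, "+"))   # N x H matrix of a_h^(i)
    out <- sigmoid(drop(A %*% w2) + b2)              # h_Theta(x^(i)), length N
    # (II)(b) errors: delta^(i,3) and delta_h^(i,2)
    d3 <- out - y                                    # length N
    D2 <- (d3 %o% w2) * A * (1 - A)                  # N x H matrix
    # (II)(c) + (III): accumulate gradients over samples and update (penalty on slopes only)
    w2 <- w2 - alpha * (drop(t(A) %*% d3) + lambda * w2)
    b2 <- b2 - alpha * sum(d3)
    W1 <- W1 - alpha * (t(D2) %*% X + lambda * W1)
    b1 <- b1 - alpha * colSums(D2)
  }
  list(W1 = W1, b1 = b1, w2 = w2, b2 = b2)
}

For instance, one could try fit_simple_nn(scale(as.matrix(iris[, 1:4])), as.numeric(iris$Species == "setosa")) for a two-class fit similar in spirit to the later Iris example.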


Neural Networks: Learning

Discussion of Backpropagation Algorithm and Online Learning


• The advantages of backpropagation are its simplicity and its local nature.
• In the backpropagation algorithm, each hidden unit passes and receives information
only to and from units that share a connection. Hence it can be implemented
efficiently on a parallel architecture computer.
• The procedure in the previous slide is sometimes called “batch” learning, since the
derivatives from the training cases are handled as one batch.

• Online learning can also be carried out – processing each observation one at a
time, updating the gradient after each training case, and cycling through the
training cases many times.
• Online learning allows the network to handle very large training sets, and also to update the weights as new observations come in. This is the situation for perceptual learning in humans. It may also work better when many training cases are redundant.
• The learning rate $\alpha$ for batch learning is usually taken to be a constant. However, for online learning to converge to a (local) minimum, we have to decrease $\alpha$ with time: $\alpha = \alpha_i$. This learning is a form of stochastic approximation; results in this field ensure convergence if $\alpha_i \to 0$, $\sum_i \alpha_i = \infty$, and $\sum_i \alpha_i^2 < \infty$ (satisfied, for example, by $\alpha_i = 1/i^{\beta}$, $0.5 < \beta \leq 1$).


Neural Networks: Learning

Practical Considerations
Starting values Note that the use of exact zero weights leads to zero derivatives and
perfect symmetry, and the algorithm never moves. Starting values for
weights are usually chosen to be random values near zero.
Overfitting Often neural networks have too many weights and will overfit the data.
An early stopping rule can be used to avoid this: since the starting weights are
near zero, the model begins at a highly regularized (nearly linear) solution,
and stopping early keeps it from moving too far from that solution.
Cross-validation is useful for deciding when to stop. Another explicit method
is shrinking the weights, for example by introducing a ridge penalty.
Scaling of the inputs It is best to standardize all inputs to have mean zero and
standard deviation one. With standardized inputs, it is typical to take
the starting weights as random uniform variables over the range
[−0.7, 0.7].
Number of hidden units and layers It is better to have too many hidden units than too
few. It is most common to put down a reasonably large number
(5-100) of units and train them with regularization. Choice of the
number of hidden layers is guided by background knowledge and
experimentation.
Multiple Minima One must at least try a number of random starting configurations,
and choose the solution giving the lowest (penalized) error.

An R Example: Iris Data

Example: Iris Data


Species = 1, 2, or 3. Consider Species as a quantitative response (by force, for illustration purposes) and run a neural network on the Iris data; a sketch of such a fit is given below.
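The code and output on the original slide are shown as an image and are not reproduced; the following is a hedged sketch consistent with the later slides (a single hidden layer of 2 units, linear output, and objects named d and iris.nn), not necessarily the lecturer's exact call:

library(nnet)

d <- iris
d$Species <- as.numeric(d$Species)   # force Species to be quantitative: 1, 2, 3

set.seed(1)                          # results depend on the random starting weights
iris.nn <- nnet(Species ~ ., data = d, size = 2,  # 2 hidden units, matching the diagram
                linout = TRUE, maxit = 500)       # linear output unit (regression)
summary(iris.nn)                     # prints the fitted weights iris.nn$wts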


An R Example: Iris Data

Architecture Diagram of the artificial neural networks


An R Example: Iris Data

Compute the output manually

$$h_1 = \sigma(-85.74 - 26.3\,i_1 - 30.57\,i_2 + 51.6\,i_3 + 49.7\,i_4),$$
$$h_2 = \sigma(3.92 + 4.01\,i_1 + 5.37\,i_2 - 11.08\,i_3 - 7.67\,i_4),$$
$$o = 1.99 + 0.99\,h_1 - h_2,$$

where $\sigma(x) = 1/(1 + e^{-x})$.

Remark: this case has a linear output. If linout=F (the default in nnet), then $o = \sigma(\cdots)$.

In R, we can compute the fitted values (for the whole training data) with the following code:
w1 <- matrix(iris.nn$wts[1:10], nrow = 2, byrow = TRUE)   # hidden-layer weights: 2 units x (bias + 4 inputs)
w2 <- matrix(iris.nn$wts[11:13], nrow = 1, byrow = TRUE)  # output-layer weights: 1 unit x (bias + 2 hidden units)
sigmoid <- function(x) 1 / (1 + exp(-x))
x <- cbind(1, d[, 1:4])        # prepend the bias column to the inputs
h <- sigmoid(w1 %*% t(x))      # hidden-unit activations h1, h2
h <- rbind(1, h)               # prepend the bias row for the output layer
out <- t(w2 %*% h)             # fitted values (linear output o)

Alternatively, use iris.nn$fitted.values or predict(iris.nn, d).


An R Example: Iris Data

Improved version of artificial neural networks

A major problem with ANNs is that the solution depends on the starting values. Each time we run nnet(), we may obtain a different result. Therefore, we have to run nnet() several times and keep the best result. Here is an improved version of the nnet() function:
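The code on the original slide is shown as an image; a minimal sketch of such a wrapper (the name best.nnet and its arguments are illustrative, not necessarily the lecturer's) could be:

library(nnet)

# Run nnet() n.try times from different random starting weights and keep the fit
# with the smallest final value of the (penalized) fitting criterion.
best.nnet <- function(formula, data, size, n.try = 20, ...) {
  best <- NULL
  for (k in 1:n.try) {
    fit <- nnet(formula, data = data, size = size, trace = FALSE, ...)
    if (is.null(best) || fit$value < best$value) best <- fit   # $value: final objective
  }
  best
}

iris.nn <- best.nnet(Species ~ ., data = d, size = 2, linout = TRUE, maxit = 500)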


An R Example: Iris Data

Two-Class Classification

Let's relabel Species to 0 for classes 2 and 3 (hence Species becomes a binary response). Then we perform two-class classification (in fact, class 1 versus [classes 2 and 3]). When the response is binary, nnet uses the logistic output function by default; a sketch of such a fit is given below.
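The slide's R code is again an image; a hedged sketch of this binary fit (not necessarily the lecturer's exact call) is:

d2 <- d                                      # d codes Species as 1, 2, 3 (see the earlier slide)
d2$Species <- ifelse(d2$Species == 1, 1, 0)  # class 1 versus classes 2 and 3

set.seed(1)
iris.nn2 <- nnet(Species ~ ., data = d2, size = 2, maxit = 500)  # linout = FALSE: logistic output
table(observed = d2$Species, predicted = round(fitted(iris.nn2)))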


Multi-Layer, Multi-Output Neural Networks

Formulation of Neural Networks


• Let $X = (X_1, X_2, \dots, X_p)$ and $Y = (Y_1, Y_2, \dots, Y_K)$ be the inputs and responses;
• let $L$ be the number of layers (the 1st layer is the input layer, the $L$th layer is the output layer), and $n_\ell$ be the number of units in layer $\ell$, $\ell = 1, \dots, L$ ($n_1 = p$, $n_L = K$);
• let $a_j^{(\ell)}$ be the "activation" of unit $j$ in layer $\ell$, and $W^{(\ell-1)} = \{w_{j,j'}^{(\ell-1)}\}_{j=1,\dots,n_\ell;\ j'=0,1,\dots,n_{\ell-1}}$ be the matrix of weights controlling the function mapping from layer $\ell-1$ to layer $\ell$.

Then we have $a_j^{(1)} = X_j$ and, for $\ell = 2, \dots, L-1$; $j = 1, \dots, n_\ell$,

$$a_j^{(\ell)} = f_h\Big(b_j^{(\ell-1)} + \sum_{i=1}^{n_{\ell-1}} w_{j,i}^{(\ell-1)} a_i^{(\ell-1)}\Big), \quad \text{i.e.}\quad a^{(\ell)} := \begin{pmatrix} a_1^{(\ell)} \\ \vdots \\ a_{n_\ell}^{(\ell)} \end{pmatrix} = f_h\big(b^{(\ell-1)} + W^{(\ell-1)} a^{(\ell-1)}\big),$$

where the bias $b^{(\ell)} = (b_1^{(\ell)}, \dots, b_{n_\ell}^{(\ell)})^T$ and we assume all units in all hidden layers have the same activation function $f_h$. In addition, we assume all units in the output layer have the same activation function $f_o$. Then we have, for $k = 1, \dots, K$,

$$(h_\Theta(X))_k = a_k^{(L)} = f_o\Big(b_k^{(L-1)} + \sum_{i=1}^{n_{L-1}} w_{k,i}^{(L-1)} a_i^{(L-1)}\Big), \qquad \Theta = \{b^{(\ell)}, W^{(\ell)}\}_{\ell=1}^{L-1},$$

i.e. $h_\Theta(X) = f_o\big(b^{(L-1)} + W^{(L-1)} a^{(L-1)}\big)$.


Multi-Layer, Multi-Output Neural Networks

Activation Functions for Neural Networks


We have $a^{(1)} = X$,

$$a^{(\ell)} = f_h\big(b^{(\ell-1)} + W^{(\ell-1)} a^{(\ell-1)}\big), \quad \ell = 2, \dots, L-1,$$
$$h_\Theta(X) = f_o\big(b^{(L-1)} + W^{(L-1)} a^{(L-1)}\big),$$

where for a univariate function $f$, $f(\vec{x}) = (f(x_1), \dots, f(x_n))^T$.

The activation function $f_h(x)$, $x \in \mathbb{R}$, is usually chosen to be the sigmoid (logistic) function
$$f_h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}.$$

For the output function $f_o(T)$, $T \in \mathbb{R}^K$:
• For regression, we typically choose the identity function $f_o(T) = T$.
• For $K$-class classification, we usually use the sigmoid function $f_o(T) = (\sigma(T_1), \dots, \sigma(T_K))$ or the softmax function
$$f_o(T) = \left(\frac{e^{T_1}}{\sum_{k=1}^{K} e^{T_k}}, \cdots, \frac{e^{T_K}}{\sum_{k=1}^{K} e^{T_k}}\right).$$
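As a small aside (not from the slides), the softmax output is easy to compute directly in R; subtracting the maximum before exponentiating is a standard trick to avoid numerical overflow and does not change the result:

softmax <- function(scores) {          # scores = (T_1, ..., T_K)
  z <- exp(scores - max(scores))       # subtracting max(scores) leaves the result unchanged
  z / sum(z)
}
softmax(c(2, 1, -1))                   # probabilities summing to 1, largest for the first class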


Multi-Layer, Multi-Output Neural Networks

Cost Function
Suppose that we have $N$ measurements $(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$ and that the output of the network is $h_\Theta(x)$. Then the parameter set $\Theta$ is chosen to minimize the error function.

For regression, we use the sum of squared errors as our measure of fit (error function):

$$J(\Theta) = \sum_{i=1}^{N}\sum_{k=1}^{K} \big(y_k^{(i)} - (h_\Theta(x^{(i)}))_k\big)^2 = \sum_{i=1}^{N} \|y^{(i)} - h_\Theta(x^{(i)})\|^2.$$

For classification, we typically use the discrepancy function for multiple logistic regressions:

$$J(\Theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} \big[y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big)\big]$$

or the cross-entropy (deviance):

$$J(\Theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_k^{(i)} \log(h_\Theta(x^{(i)}))_k.$$


Multi-Layer, Multi-Output Neural Networks

Gradient Descent Approach


For illustration, we consider a classification problem with the following error function:

$$J(\Theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K} \big[y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big)\big] + \frac{\lambda}{2}\sum_{\ell=2}^{L}\sum_{j=1}^{n_\ell}\sum_{j'=1}^{n_{\ell-1}} \big(w_{j,j'}^{(\ell-1)}\big)^2 =: \sum_{i=1}^{N} J^{(i)}(\Theta) + \frac{\lambda}{2}\sum_{\ell=2}^{L}\sum_{j=1}^{n_\ell}\sum_{j'=1}^{n_{\ell-1}} \big(w_{j,j'}^{(\ell-1)}\big)^2,$$

where the second term is a penalty introduced to address the potential overfitting problem and the tuning parameter $\lambda$ is known as weight decay. Let $w_{j,0}^{(\ell-1)} := b_j^{(\ell-1)}$.

The generic approach to minimizing $J(\Theta)$ is gradient descent, called the backpropagation algorithm in this setting:

$$w_{j,j'}^{(\ell-1)} \leftarrow w_{j,j'}^{(\ell-1)} - \alpha \frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell-1)}}.$$

The gradient can be easily derived using the chain rule for differentiation. It can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit (units to which it is connected).


Multi-Layer, Multi-Output Neural Networks

Gradient Computation
Suppose we have data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $x^{(i)} \in \mathbb{R}^p$, $y^{(i)} \in \mathbb{R}^K$. Denote

$$a_j^{(i,1)} = x_j^{(i)}, \qquad a_0^{(i,\ell-1)} = 1, \qquad z_j^{(i,\ell)} := \sum_{j'=0}^{n_{\ell-1}} w_{j,j'}^{(\ell-1)} a_{j'}^{(i,\ell-1)}, \quad \ell = 2, \dots, L,$$
$$a_j^{(i,\ell)} = f_h\big(z_j^{(i,\ell)}\big), \quad \ell = 2, \dots, L-1, \qquad (h_\Theta(x^{(i)}))_k = f_o\big(z_k^{(i,L)}\big).$$

We can apply the chain rule for partial derivatives to give

$$\frac{\partial J^{(i)}(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}} \frac{\partial z_j^{(i,\ell)}}{\partial w_{j,j'}^{(\ell-1)}} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}}\, a_{j'}^{(i,\ell-1)} =: \delta_j^{(i,\ell)} a_{j'}^{(i,\ell-1)}.$$

Here, $\delta_j^{(i,\ell)}$ represents the "error" of unit $j$ in layer $\ell$ for sample $(x^{(i)}, y^{(i)})$.

For the output layer, we have

$$\delta_k^{(i,L)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_k^{(i,L)}} = \frac{f_o(z_k^{(i,L)}) - y_k^{(i)}}{f_o(z_k^{(i,L)})\big(1 - f_o(z_k^{(i,L)})\big)}\, f_o'\big(z_k^{(i,L)}\big).$$

For the hidden layers, we have, for $\ell = 2, \dots, L-1$,

$$\delta_j^{(i,\ell)} = \frac{\partial J^{(i)}(\Theta)}{\partial z_j^{(i,\ell)}} = \sum_{m=1}^{n_{\ell+1}} \frac{\partial J^{(i)}(\Theta)}{\partial z_m^{(i,\ell+1)}} \frac{\partial z_m^{(i,\ell+1)}}{\partial z_j^{(i,\ell)}} = \sum_{m=1}^{n_{\ell+1}} \delta_m^{(i,\ell+1)} w_{m,j}^{(\ell)} f_h'\big(z_j^{(i,\ell)}\big).$$


Multi-Layer, Multi-Output Neural Networks

Backpropagation Formula
We obtain the following backpropagation formula (equation):

$$\delta_k^{(i,L)} = \frac{f_o(z_k^{(i,L)}) - y_k^{(i)}}{f_o(z_k^{(i,L)})\big(1 - f_o(z_k^{(i,L)})\big)}\, f_o'\big(z_k^{(i,L)}\big),$$
$$\delta_j^{(i,\ell)} = f_h'\big(z_j^{(i,\ell)}\big) \sum_{m=1}^{n_{\ell+1}} w_{m,j}^{(\ell)}\, \delta_m^{(i,\ell+1)}, \qquad \ell = L-1, L-2, \dots, 2,$$

which tells us that the value of $\delta$ for a particular hidden unit can be obtained by propagating the $\delta$'s backwards from units higher up in the network. Therefore,

$$\frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} = \sum_{i=1}^{N} \frac{\partial J^{(i)}(\Theta)}{\partial w_{j,j'}^{(\ell-1)}} + \frac{\partial\,\mathrm{Penalty}}{\partial w_{j,j'}^{(\ell-1)}} = \sum_{i=1}^{N} \delta_j^{(i,\ell)} a_{j'}^{(i,\ell-1)} + \begin{cases} \lambda w_{j,j'}^{(\ell-1)}, & \text{if } j' \neq 0, \\ 0, & \text{if } j' = 0. \end{cases}$$

If we assume sigmoid functions are used for $f_h$ and $f_o$, then we have

$$\delta_k^{(i,L)} = (h_\Theta(x^{(i)}))_k - y_k^{(i)},$$
$$\delta_j^{(i,\ell)} = a_j^{(i,\ell)}\big(1 - a_j^{(i,\ell)}\big) \sum_{m=1}^{n_{\ell+1}} w_{m,j}^{(\ell)}\, \delta_m^{(i,\ell+1)}, \qquad \ell = L-1, L-2, \dots, 2.$$


Multi-Layer, Multi-Output Neural Networks

Intuition of Backpropagation Formula

$$\delta_1^{(3)} \approx w_{1,1}^{(3)} \delta_1^{(4)},$$
$$\delta_2^{(3)} \approx w_{1,2}^{(3)} \delta_1^{(4)},$$
$$\delta_1^{(2)} \approx w_{1,1}^{(2)} \delta_1^{(3)} + w_{2,1}^{(2)} \delta_2^{(3)},$$
$$\delta_2^{(2)} \approx w_{1,2}^{(2)} \delta_1^{(3)} + w_{2,2}^{(2)} \delta_2^{(3)}.$$


Multi-Layer, Multi-Output Neural Networks

Vectorization of Gradient Computation


Activation forward propagation:

$$a^{(i,1)} = x^{(i)}, \qquad a^{(i,\ell)} = f_h\big(b^{(\ell-1)} + W^{(\ell-1)} a^{(i,\ell-1)}\big), \quad \ell = 2, \dots, L-1,$$
$$h_\Theta(x^{(i)}) = f_o\big(b^{(L-1)} + W^{(L-1)} a^{(i,L-1)}\big).$$

Error backward propagation:

Let $\delta^{(i,\ell)} = (\delta_1^{(i,\ell)}, \dots, \delta_{n_\ell}^{(i,\ell)})^T$, $\ell = 2, \dots, L$. We have

$$\delta^{(i,L)} = h_\Theta(x^{(i)}) - y^{(i)},$$
$$\delta^{(i,\ell)} = \big[(W^{(\ell)})^T \delta^{(i,\ell+1)}\big] \ast a^{(i,\ell)} \ast \big(1 - a^{(i,\ell)}\big), \qquad \ell = L-1, L-2, \dots, 2,$$

where $\ast$ denotes the elementwise product, and we obtain the gradient

$$\frac{\partial J(\Theta)}{\partial W^{(\ell-1)}} = \sum_{i=1}^{N} \delta^{(i,\ell)} \big(a^{(i,\ell-1)}\big)^T + \lambda W^{(\ell-1)},$$
$$\frac{\partial J(\Theta)}{\partial b^{(\ell-1)}} = \sum_{i=1}^{N} \delta^{(i,\ell)}.$$
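As an illustration (not from the slides), these vectorized formulas translate almost directly into R. The sketch below computes one forward and backward pass for a single sample, assuming sigmoid activations for both $f_h$ and $f_o$ and lists W and b of weight matrices and bias vectors indexed as above; the function name backprop_one_sample is illustrative.

sigmoid <- function(z) 1 / (1 + exp(-z))

# x: input vector; y: target vector; W[[l]], b[[l]] map layer l to layer l + 1.
# Assumes at least one hidden layer and sigmoid activations throughout.
backprop_one_sample <- function(x, y, W, b) {
  L <- length(W) + 1
  a <- vector("list", L); a[[1]] <- x
  for (l in 2:L)                                   # forward propagation
    a[[l]] <- sigmoid(b[[l - 1]] + W[[l - 1]] %*% a[[l - 1]])
  delta <- vector("list", L)
  delta[[L]] <- a[[L]] - y                         # delta^(L) = h_Theta(x) - y
  for (l in (L - 1):2)                             # backward propagation
    delta[[l]] <- (t(W[[l]]) %*% delta[[l + 1]]) * a[[l]] * (1 - a[[l]])
  list(gradW = lapply(1:(L - 1), function(l) delta[[l + 1]] %*% t(a[[l]])),  # delta^(l+1) (a^(l))^T
       gradb = lapply(1:(L - 1), function(l) delta[[l + 1]]))
}

Summing these per-sample contributions over $i$ and adding $\lambda W^{(\ell-1)}$ to the slope part gives the full gradient above.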


Multi-Layer, Multi-Output Neural Networks

Backpropagation Algorithm
1 Randomly initialize the weights $\{w_{j,j'}^{(\ell)}\}_{\ell=1,\dots,L-1;\ j=1,\dots,n_{\ell+1};\ j'=0,1,\dots,n_\ell}$.

2 Set $\Delta_{j,j'}^{(\ell)} = 0$ for all $\ell = 1, \dots, L-1$; $j = 1, \dots, n_{\ell+1}$; $j' = 0, 1, \dots, n_\ell$.

3 For $i = 1, \dots, N$:
• Set $a^{(1)} = x^{(i)}$;
• With the current weights $\{w_{j,j'}^{(\ell)}\}$, perform forward propagation to compute $a^{(\ell)}$ for $\ell = 2, \dots, L$;
• Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$;
• Apply the backpropagation formula to compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$;
• Set $\Delta_{j,j'}^{(\ell)} \leftarrow \Delta_{j,j'}^{(\ell)} + \delta_j^{(\ell+1)} a_{j'}^{(\ell)}$, or in matrix form $\Delta^{(\ell)} \leftarrow \Delta^{(\ell)} + \delta^{(\ell+1)} [1, (a^{(\ell)})^T]$.

4 Set
$$\frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell)}} = \Delta_{j,j'}^{(\ell)} + \begin{cases} \lambda w_{j,j'}^{(\ell)}, & \text{if } j' \neq 0, \\ 0, & \text{if } j' = 0. \end{cases}$$

5 Update the weights with the learning rate $\alpha > 0$:
$$w_{j,j'}^{(\ell)} \leftarrow w_{j,j'}^{(\ell)} - \alpha \frac{\partial J(\Theta)}{\partial w_{j,j'}^{(\ell)}}.$$

6 Repeat steps 2-5 until convergence.


A Sophisticated Application

R Example Revisited: Multi-Class Classification


In R, the default output function for multi-class (class size K > 2) classification is the softmax function. Remember to make the response vector a factor. The outputs are the probabilities of the corresponding classes, and the predicted class is the one with the largest probability; a sketch of such a fit is given below.
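The slide's code is shown as an image; a hedged sketch of a three-class fit on the Iris data (not necessarily the lecturer's exact call) is:

library(nnet)

set.seed(1)
iris.nn3 <- nnet(Species ~ ., data = iris, size = 2, maxit = 500)   # Species kept as a factor
head(fitted(iris.nn3))                             # one column of class probabilities per class
pred <- predict(iris.nn3, iris, type = "class")    # class with the largest probability
table(observed = iris$Species, predicted = pred)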


A Sophisticated Application

Example: ZIP Code Data (ESL 11.7)


This example is a character recognition task: classification of handwritten numerals.
This problem captured the attention of the machine learning and neural network
community for many years, and has remained a benchmark problem in the field.
Details may be found in Le Cun (1989).

There are 320 digits in the training set and 160 in the test set, each with 256 pixel
inputs.

Figure: Examples of training cases from ZIP code data. Each image is a 16 × 16 8-bit
grayscale representation of a handwritten digit.

A Sophisticated Application

Architecture of the Networks: ZIP Code Data


Net-1: No hidden layer (=multi-logit model).
Net-2: 1 hidden layer with 12 units fully
connected.
Net-3: 2 hidden layers, locally connected.
• 1st hidden layer : 8 × 8, each takes input
from a 3 × 3 patch of the input layer; two
adjacent units overlap one row or column;
• 2nd hidden layer : 4 × 4, each takes inputs
from a 5 × 5 patch
Net-4: 2 hidden layers, locally connected with
weight sharing.
• 1st hidden layer : 8 × 8, each takes input
from a 3 × 3 patch of the input layer; same
set of 9 weights (weight sharing);
• 2nd hidden layer : 4 × 4, same as Net-3
with no weight sharing
Net-5: 2 hidden layers, locally connected, two
levels of weight sharing.


A Sophisticated Application

Results: ZIP Code Data


The networks all have sigmoidal output units, and were all fit with the sum-of-squares error (RSS).

Network Architecture              # Links   # Weights   % Correct
Net-1: Single layer network          2570        2570       80.0%
Net-2: Two layer network             3214        3214       87.0%
Net-3: Locally connected             1226        1226       88.5%
Net-4: Constrained network 1         2266        1132       94.0%
Net-5: Constrained network 2         5194        1060       98.4%

Table: Test set performance of 5 neural networks on a handwritten digit classification example.

Summary of Neural Networks

Pros and Cons of Neural Network as a Classifier

Strength • High tolerance to noisy data.


• Well-suited for continuous, ordinal and nominal-valued inputs and
outputs.
• Successful on an array of real-world data, e.g., hand-written
letters.
• Algorithms are inherently parallel.

Weakness • Long training time.


• Require a number of parameters that are typically best determined empirically, e.g., the network topology or "structure."
• Poor interpretability: networks work like a black box, and it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.


Summary of Neural Networks

Checklist
☐ Understand what a neural network is and its overall model structure. It is better to also know the difference between (logistic) regression and neural networks (check out the nonlinear classification example, which demonstrates the usefulness of neural networks).

☐ Know how to write down the equations characterizing a neural network. Given the weights and inputs, know how to compute the output and draw the architecture diagram of the neural network.

☐ Understand the basic idea of the backpropagation algorithm for fitting a neural network. It is basically a gradient descent method with some refinements.

☐ Know the backpropagation theory for the simplest neural network (one hidden layer, one output).

☐ Know how to implement neural networks in R and how to read the R output.

☐ Learn the practical considerations of neural networks.
