The human brain is a highly complex, nonlinear and parallel computer. It has the
capability to organize its structural constituents, known as neurons, so as to perform
certain computations many times faster than the fastest digital computer in existence
today.
A neural network is a massively parallel distributed processor made up of simple
processing units, which has a natural propensity for storing experiential knowledge and
making it available for use.
It resembles the brain in two respects:
1. Knowledge is acquired by the network from its environment through a learning process.
2. Interneuron connection strengths, known as synaptic weights, are used to
store the acquired knowledge.
The procedure used to perform the learning process is called a learning
algorithm. Its function is to modify the synaptic weights of the network to attain a
desired design objective.
Neural networks are also referred to as neurocomputers, connectionist
networks, or parallel distributed processors.
http://rajakishor.co.cc Page 45
1. Nonlinearity
2. Input-output mapping
Supervised learning
Working through training samples or task examples.
3. Adaptivity
Adapting the synaptic weights to change in the surrounding
environments.
4. Evidential response
5. Contextual information
6. Fault tolerance
7. VLSI implementability
8. Uniformity of analysis and design
9. Neurobiological analogy
Human Brain
The human nervous system may be viewed as a three-stage system.
Central to the nervous system is the brain. It is represented by the neural net.
The brain continually receives the information, perceives it, and makes appropriate
decisions. The arrows pointing from left to right indicate the forward transmission of
information-bearing signals through the system. The arrows pointing from right to left
signify the presence of feedback in the system.
The receptors convert stimuli from the human body or the external environment
into electrical impulses that convey information to the neural net (the brain). The
effectors convert electrical impulses generated by the neural net into discernible
responses as system outputs.
Typically, neurons are five to six orders of magnitude slower than silicon gates.
Events in a silicon chip happen in the nanosecond (10⁻⁹ s) range, whereas neural events
happen in the millisecond (10⁻³ s) range.
It is estimated that there are approximately 10 billion neurons and 60 trillion
synapses or connections in the human brain.
Synapses are elementary structural and functional units that mediate the
interactions between neurons. The most common kind of synapse is a chemical synapse.
A chemical synapse operates as follows. A pre-synaptic process liberates a
transmitter substance that diffuses across the synaptic junction between neurons and
then acts on a post-synaptic process. Thus, a synapse converts a pre-synaptic electrical
signal into a chemical signal and then back into a post-synaptic electrical signal.
Structural organization of levels in the brain
A neural microcircuit refers to an assembly of synapses organized into patterns
of connectivity to produce a functional operation of interest.
The neural microcircuits are grouped to form dendritic subunits within the
dendritic trees of individual neurons.
The whole neuron is about 100µm in size. It contains several dendritic subunits.
The local circuits are made up of neurons with similar or different properties.
Each circuit is about 1mm in size. The neural assemblies perform operations on
characteristics of a localized region in the brain.
The interregional circuits are made up of pathways, columns and topographic
maps, which involve multiple regions located in different parts of the brain.
Topographic maps are organized to respond to incoming sensory information.
The central nervous system is the final level of complexity where the
topographic maps and other interregional circuits mediate specific types of behavior.
Models of a Neuron
A neuron is an information-processing unit that is fundamental to the operation
of a neural network. Its model can be shown in the following block diagram.
The neuronal model has three basic elements:
1. A set of synapses each of which is characterized by a weight or strength of its
own. Each synapse has two parts: a signal xj and a weight wkj. Here wkj refers to the
weight of the kth neuron with respect to the jth input signal. The synaptic weight may
range through positive as well as negative values.
2. An adder for summing the input signals, weighted by the respective synapses of
the neuron.
3. An activation function for limiting the amplitude of the output of a neuron.
The neuron model also includes an externally applied bias, bk. The bias has the
effect of increasing or lowering the net input of the activation function, depending on
whether it is positive or negative, respectively.
A neuron k may be mathematically described by the following pair of equations:

uk = ∑ wkj xj (summing over j = 1, …, m) and yk = ψ(uk + bk) ---- (1)
where x1, x2, …, xm are the input signals; wk1, wk2, …, wkm are the synaptic weights of
neuron k; uk is the linear combiner output due to the input signals; bk is the bias; vk is
the induced local field; ψ(·) is the activation function; and yk is the output signal of
neuron k.
The use of bias bk has the effect of applying an affine transformation to the
output uk of the linear combiner.
So, we can have
vk = uk + bk ---- (2)
Now, equation (1) can be rewritten as follows:

vk = ∑ wkj xj + bk (summing over j = 1, …, m) and yk = ψ(vk)
The activation function defines the output of a neuron in terms of the induced
local field vk.
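Before looking at specific activation functions, the three-element model above (synaptic weights, adder, activation) can be sketched in a few lines of Python. The function name and the example numbers are illustrative, not taken from the text:

```python
# Minimal sketch of the neuronal model: weights, adder, activation function.
def neuron_output(x, w, b, activation):
    """y_k = psi(u_k + b_k), where u_k = sum_j w_kj * x_j."""
    u = sum(w_j * x_j for w_j, x_j in zip(w, x))  # linear combiner output u_k
    v = u + b                                     # induced local field v_k
    return activation(v)                          # activation limits the amplitude

# Illustrative example with a simple step (threshold) activation:
step = lambda v: 1 if v >= 0 else 0
y = neuron_output([1.0, 0.5], [0.4, -0.2], 0.1, step)  # v = 0.4 - 0.1 + 0.1 = 0.4
```

Here v = 0.4 is nonnegative, so the step activation yields y = 1.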
1. Threshold function
The function is defined as

ψ(v) = 1, if v ≥ 0
       0, if v < 0

This form of threshold function is also called the Heaviside function.
Correspondingly, the output of neuron k is expressed as

yk = 1, if vk ≥ 0
     0, if vk < 0

where

vk = ∑ wkj xj + bk (summing over j = 1, …, m)
This model is also called the McCulloch-Pitts model. In this model, the output of
a neuron is 1 if the induced local field of that neuron is nonnegative, and 0 otherwise.
This statement describes the all-or-none property of the model.
2. Piecewise-linear function
The activation function, here, is defined as

ψ(v) = 1, if v ≥ +1/2
       v, if +1/2 > v > −1/2
       0, if v ≤ −1/2

where the amplification factor inside the linear region of operation is assumed to be
unity. Two situations can be observed for this function:

A linear combiner arises if the linear region of operation is maintained without
running into saturation.

The piecewise-linear function reduces to a threshold function if the amplification
factor of the linear region is made infinitely large.
3. Sigmoid function
This is the most common form of activation function used in the construction of
artificial neural networks. It is defined as a strictly increasing function that exhibits a
graceful balance between linear and nonlinear behavior.
An example of sigmoid function is the logistic function, which is defined as
ψ(v) = 1 / (1 + e^(−av))

where a is the slope parameter of the sigmoid.
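The three activation functions above translate directly into code. A minimal sketch (the default slope a = 1 is an illustrative choice):

```python
import math

def threshold(v):
    # Heaviside / McCulloch-Pitts activation: all-or-none
    return 1 if v >= 0 else 0

def piecewise_linear(v):
    # unity amplification inside the linear region, saturation outside
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def logistic(v, a=1.0):
    # sigmoid: psi(v) = 1 / (1 + e^(-a*v)); a controls the slope at the origin
    return 1.0 / (1.0 + math.exp(-a * v))
```

As the slope parameter a grows, the logistic function approaches the threshold function, mirroring the relationship noted for the piecewise-linear case.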
Neural Networks and Directed Graphs
The neural network can be represented through a signal-flow graph. A signal-
flow graph is a network of directed links (branches) that are interconnected at certain
points called nodes. A typical node j has an associated node signal xj. A typical directed
link originates at node j and terminates on node k; it has an associated transfer function
or transmittance that specifies the manner in which the signal yk at node k depends on the
signal xj at node j.
The flow of signals in the various parts of the graph is directed by three basic
rules.
Rule-1:
A signal flows along a link only in the direction defined by the arrow on the link.
There are two types of links:
Synaptic links, whose behavior is governed by a linear input-output
relation. Here, we have yk = wkjxj.
Activation links, whose behavior is governed in general by a nonlinear
activation function. Here, we have yk = ψ(xj).
For example,
Rule-2:
A node signal equals the algebraic sum of all signals entering the pertinent node
via the incoming links. This is also called synaptic convergence or fan-in.
For example,
Rule-3:
The signal at a node is transmitted to each outgoing link originating from that
node.
For example,
1. Single-layer feedforward networks
In a layered neural network, the neurons are organized in the form of layers. The
simplest form of a layered network has an input layer of source nodes that project onto
an output layer but not vice versa.
For example,
The above network is of a feedforward or acyclic type. It is also called a single-
layer network, where the single layer refers to the output layer, since computations
take place only at the output nodes.
2. Multilayer feedforward networks
In this class, a neural network has one or more hidden layers, whose
computation nodes are called hidden neurons or hidden units. The function of hidden
neurons is to intervene between the external input and the network output in a useful
manner. By adding one or more hidden layers, the network is enabled to extract higher-
order statistics. This is essentially required when the size of the input layer is large. For
example,
The source nodes in the input layer supply respective elements of the
activation pattern (input vector), which constitutes the input signals
applied to the second layer.
The output signals of the second layer are used as inputs to the third layer,
and so on for the rest of the network.
The set of output signals of the neurons in the output layer constitutes the
overall response of the network to the activation pattern supplied by the
source nodes in the input layer.
Knowledge Representation
Knowledge refers to stored information or models used by a person or machine
to interpret, predict, and appropriately respond to the outside world.
Knowledge representation involves the following:
1. Identifying the information that is to be processed
2. Physically encoding the information for subsequent use
Knowledge representation is goal directed. In real-world applications of
“intelligent” machines, a good solution depends on a good representation of knowledge.
A major task for a neural network is to provide a model for the real-time
environment into which it is embedded. Knowledge of the world consists of two kinds of
information:
1. Prior information: It gives the known state of the world. It is represented by
facts about what is and what has been known.
2. Observations: These are measurements of the world, obtained by sensors
that probe the environment in which the neural network operates.
The set of input-output pairs, with each pair consisting of an input signal and the
corresponding desired response, is called a set of training data or a training sample.
Ex: Handwritten digit recognition.
image of digits and then produces the required output digit. This phase is called
generalization.
Note: The training data for a NN may consist of both positive and negative examples.
Example: A simple neuronal model for recognizing handwritten digits.
Consider an input set X of key patterns X1, X2, X3, ……
Each key pattern represents a specific handwritten digit.
The network has k neurons.
Let W = {w1j(i), w2j(i), w3j(i), ……}, for j = 1, 2, 3, …, k, be the set of weights of X1,
X2, X3, … with respect to each of the k neurons in the network. Here i refers to an
instance.
Let y(j) be the generated output of neuron j for j=1,2,…k.
Let d(j) be the desired output of neuron j, for j=1,2,…..k.
Let e(j)= d(j) – y(j) be the error that is calculated at neuron j, for j = 1,2,…,k.
Now we design the neuronal model for the system as follows.
In the above model, each neuron computes a specific digit j. With every key
pattern, synapses are established to every neuron in the model. We assumed that the
weights of each key pattern can be either 0 or 1.
Ex: Let the key pattern x1 correspond to the handwritten digit 1. Then its synaptic
weight w11(i) should be 1 for the 1st neuron, and all other synaptic weights for x1 must
be 0.
Weight matrix for the above model can be as follows.
Now, the Euclidean distance between Xi and Xj is defined by
d(Xi, Xj) = ||Xi − Xj|| = [ ∑ (xik − xjk)² ]^(1/2), summing over k = 1, …, m
The two inputs Xi and Xj are said to be similar if d(Xi, Xj) is minimum.
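The distance measure above is straightforward to compute; a small sketch (the function name is illustrative):

```python
import math

def euclidean_distance(xi, xj):
    """d(Xi, Xj) = ||Xi - Xj||: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

# The smaller the distance, the more similar the two input patterns:
d = euclidean_distance([1.0, 2.0], [4.0, 6.0])  # sqrt(9 + 16) = 5.0
```

Two key patterns representing the same digit should therefore yield a small d, while patterns of different digits yield a larger one.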
There are three techniques for rendering classifier-type NNs invariant to
transformations:
1. Invariance by structure
2. Invariance by training
3. Invariant feature space
Hebb’s Laws
Here the change in the weight vector is given by
∆Wi(t) = ηf(WiTa)a
Therefore, the jth component of ∆Wi is given by
∆wij = ηf(WiTa)aj
= ηsiaj, for j = 1, 2, …, M.
where si is the output signal of the ith unit. a is the input vector.
Hebb’s law states that the weight increment is proportional to the product of
the input data and the resulting output signal of the unit. This law requires the weights
to be initialized to small random values around wij = 0 prior to learning. It represents
an unsupervised learning.
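As a concrete sketch, Hebb's law for a single unit might look like the following; the learning rate η = 0.1 and the linear output function f are illustrative assumptions, not fixed by the law:

```python
def hebbian_update(w, a, eta=0.1, f=lambda x: x):
    # s_i = f(W_i^T a); then w_ij <- w_ij + eta * s_i * a_j
    s = f(sum(w_j * a_j for w_j, a_j in zip(w, a)))
    return [w_j + eta * s * a_j for w_j, a_j in zip(w, a)]

w = [0.01, 0.02]                 # small random values around zero, as required
w = hebbian_update(w, [1.0, 0.5])
```

Note that no desired output appears anywhere in the update, which is what makes the rule unsupervised.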
Perceptron Learning Law
Here the change in the weight vector is given by
∆Wi = η[di – sgn(WiTa)]a
where sgn(x) is sign of x. Therefore, we have
∆wij = η[di – sgn(WiTa)]aj
= η(di – si)aj, for j = 1, 2, …, M.
The perceptron law is applicable only for bipolar output functions f(.). This is
also called discrete perceptron learning law. The expression for ∆wij shows that the
weights are adjusted only if the actual output si is incorrect, since the term in the square
brackets is zero for the correct output.
This is a supervised learning law, as the law requires a desired output for each
input. In implementation, the weights can be initialized to any random initial values, as
they are not critical. The weights converge to the final values eventually by repeated use
of the input-output pattern pairs, provided the pattern pairs are representable by the
system.
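A sketch of the discrete perceptron law for one unit, with a bipolar sgn output as the law requires (the learning rate η = 0.5 is an illustrative choice):

```python
def sgn(x):
    return 1 if x >= 0 else -1   # bipolar output function

def perceptron_update(w, a, d, eta=0.5):
    s = sgn(sum(w_j * a_j for w_j, a_j in zip(w, a)))
    # (d - s) is zero when the output is already correct, so nothing changes then
    return [w_j + eta * (d - s) * a_j for w_j, a_j in zip(w, a)]

w = perceptron_update([0.0, 0.0], [1.0, 1.0], d=-1)   # output 1 is wrong, so adjust
```

With d = −1 and actual output s = 1, the bracketed term is −2 and both weights move in the direction of −a, as the law prescribes.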
Delta Learning Law
Here the change in the weight vector is given by
∆Wi = η[di – f(WiTa)] f′(WiTa)a
where f′(x) is the derivative of f with respect to x. Hence,
∆wij = η[di – f(WiTa)] f′(WiTa)aj
= η(di – si) f′(xi) aj, for j = 1, 2, …, M.
This law is valid only for a differentiable output function, as it depends on the
derivative of the output function f(.). It is a supervised learning law since the change in
the weight is based on the error between the desired and the actual output values for a
given input.
Delta learning law can also be viewed as a continuous perceptron learning law.
In implementation, the weights can be initialized to any random values as the
values are not very critical. The weights converge to the final values eventually by
repeated use of the input-output pattern pairs. The convergence can be more or less
guaranteed by using more layers of processing units in between the input and output
layers. The delta learning law can be generalized to the case of multiple layers of a
feedforward network.
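A sketch of the delta law using the logistic output function, whose derivative is f(v)(1 − f(v)); the learning rate and the particular choice of f are illustrative assumptions:

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def delta_update(w, a, d, eta=0.5):
    v = sum(w_j * a_j for w_j, a_j in zip(w, a))
    s = logistic(v)                  # actual output s_i = f(v)
    fprime = s * (1.0 - s)           # derivative of the logistic at v
    # w_ij <- w_ij + eta * (d_i - s_i) * f'(v) * a_j
    return [w_j + eta * (d - s) * fprime * a_j for w_j, a_j in zip(w, a)]

w = delta_update([0.0, 0.0], [1.0, 1.0], d=1.0)
```

Unlike the discrete perceptron law, the update here is graded: even a small error produces a proportionally small weight change, scaled by the slope of f at the operating point.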
Correlation Learning Law
Here, the change in the weight vector is given by
∆Wi = ηdia
Therefore,
∆wij = ηdiaj
This is a special case of Hebbian learning with the output signal (si) replaced by
the desired signal (di). But Hebbian learning is an unsupervised learning, whereas
correlation learning is a supervised learning, since it uses the desired output value to
adjust the weights. In the implementation of this learning law, the weights are
initialised to small random values close to zero, i.e., wij ≈ 0.
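The correlation rule is the simplest of the laws so far; a one-line sketch (η = 0.1 is an illustrative choice):

```python
def correlation_update(w, a, d, eta=0.1):
    # Hebbian form with the desired signal d_i in place of the output s_i
    return [w_j + eta * d * a_j for w_j, a_j in zip(w, a)]

w = correlation_update([0.0, 0.0], [1.0, 2.0], d=1.0)
```

Because the desired signal d replaces the actual output, the update no longer depends on what the unit currently computes, only on the input-target correlation.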
Instar (Winner-take-all) Learning Law
This is relevant for a collection of neurons organized in a layer as shown below.
All the inputs are connected to each of the units in the output layer in a
feedforward manner. For a given input vector a, the output from each unit i is computed
using the weighted sum WiTa. The unit k that gives the maximum output is identified. That is,

WkTa = max over i of (WiTa)
Then the weight vector leading to the kth unit is adjusted as follows:
∆Wk = η(a - Wk)
Therefore,
∆wkj = η(aj - wkj), for j = 1, 2, …, M.
The final weight vector tends to represent a group of input vectors within a small
neighbourhood. This is a case of unsupervised learning. In implementation, the values of
the weight vectors are initialized to random values prior to learning, and the vector
lengths are normalized during learning.
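The winner-take-all step and the update of the winning unit can be sketched as follows; the layer size, learning rate and example vectors are illustrative:

```python
def instar_update(W, a, eta=0.5):
    """W holds one weight vector per unit; only the winning unit k is moved
    toward the input vector a."""
    dots = [sum(w_j * a_j for w_j, a_j in zip(w, a)) for w in W]
    k = dots.index(max(dots))                            # winner-take-all
    W[k] = [w_j + eta * (a_j - w_j) for w_j, a_j in zip(W[k], a)]
    return k

W = [[0.2, 0.8], [0.9, 0.1]]
k = instar_update(W, [1.0, 0.0])    # unit 1 wins and moves toward the input
```

Repeated presentations pull each winning weight vector toward the centre of the cluster of inputs it responds to, which is how the small-neighbourhood representation emerges.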
Outstar Learning Law
The outstar learning law is also related to a group of units arranged in a layer as
shown below.
In this law the weights are adjusted so as to capture the desired output pattern
characteristics. The adjustment of the weights is given by

∆wjk = η(dj − wjk), for j = 1, 2, …, M

where the kth unit is the only active unit in the input layer. The vector d = (d1, d2, …, dM)T
is the desired response from the layer of M units.
The outstar learning is a supervised learning law, and it is used with a network of
instars to capture the characteristics of the input and output patterns for data
compression. In implementation, the weight vectors are initialized to zero prior to
learning.
Pattern Recognition
Data refers to a collection of raw facts, whereas a pattern refers to an
observed sequence of facts.
The main difference between human and machine intelligence comes from the
fact that humans perceive everything as a pattern, whereas for a machine everything is
data. Even in routine data consisting of integer numbers (like telephone numbers, bank
account numbers, car numbers) humans tend to perceive a pattern. If there is no
pattern, then it is very difficult for a human being to remember and reproduce the data
later.
Thus storage and recall operations in human beings and machines are performed
by different mechanisms. The pattern nature in storage and recall automatically gives
robustness and fault tolerance for the human system.
Pattern recognition tasks
Pattern recognition is the process of identifying a specified sequence that is
hidden in a large amount of data.
Following are the pattern recognition tasks.
1. Pattern association
2. Pattern classification
3. Pattern mapping
4. Pattern grouping
5. Feature mapping
6. Pattern variability
7. Temporal patterns
8. Stability-plasticity dilemma
1. Feedforward ANN
Pattern association
Pattern classification
Pattern mapping/classification
2. Feedback ANN
Autoassociation
Pattern storage (LTM)
Pattern environment storage (LTM)
3. Feedforward and Feedback (Competitive Learning) ANN
Pattern storage (STM)
Pattern clustering
Feature mapping
In any pattern recognition task we have a set of input patterns and the
corresponding output patterns. Depending on the nature of the output patterns and the
nature of the task environment, the problem could be identified as one of association or
classification or mapping.
The given set of input-output pattern pairs form only a few samples of an
unknown system. From these samples the pattern recognition model should capture the
characteristics of the system.
Applications of Neural Networks
Area Applications
Aerospace High performance aircraft autopilots
flight path simulations
aircraft control systems
autopilot enhancements
aircraft component simulations
aircraft component fault detectors
Automotive Automobile automatic guidance systems
warranty activity analyzers
Banking Check and other document readers
credit application evaluators
Defense Weapon steering
target tracking
object discrimination
facial recognition
new kinds of sensors
sonar
radar and image signal processing including
data compression
feature extraction and noise suppression
signal/image identification
Electronics Code sequence prediction
integrated circuit chip layout
process control
chip failure analysis
machine vision
voice synthesis
nonlinear modeling
Financial Real estate appraisal
loan advisor
mortgage screening
corporate bond rating
credit line use analysis
portfolio trading program
corporate financial analysis
currency price prediction
Manufacturing Manufacturing process control
product design and analysis
process and machine diagnosis
real-time particle identification
visual quality inspection systems
beer testing
welding quality analysis
paper quality prediction
computer chip quality analysis
analysis of grinding operations
chemical product design analysis
machine maintenance analysis
project bidding
planning and management
dynamic modeling of chemical process
systems
Medical Breast cancer cell analysis
EEG and ECG analysis
prosthesis design
optimization of transplant times
hospital expense reduction
hospital quality improvement
emergency room test advisement
Robotics Trajectory control
forklift robot
manipulator controllers
vision systems
Speech Speech recognition
speech compression
vowel classification
text to speech synthesis
Securities Market analysis
automatic bond rating
stock trading advisory systems
Telecommunications Image and data compression
automated information services
real-time translation of spoken language
customer payment processing systems
Transportation Truck brake diagnosis systems
vehicle scheduling
routing systems
Advantages and Disadvantages of Neural Networks
Advantages
prediction accuracy is generally high
robust, works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-
valued attributes
fast evaluation of the learned target function
good generalization ability
hardware (HW) realization is possible
Criticism
long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge
The above structure is a threshold logic unit (TLU). Its output is 1 or 0 depending
on whether or not the weighted sum of its inputs is greater than or equal to a threshold
value θ. It has also been called an Adaline (Adaptive linear element), an LTU (Linear
threshold unit), a perceptron, and a neuron.
The n-dimensional feature or input vector is denoted by X = (x1, …, xn). The
components of X can be any real-valued numbers, but we often specialize to the binary
numbers 0 and 1. The weights of a TLU are represented by an n-dimensional weight
vector W = (w1, …, wn). The TLU has output 1 if ∑ xi wi ≥ θ (summing over i = 1, …, n);
otherwise it has output 0.
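The TLU just described is easy to express in code; the AND example below is a standard illustration, with the particular weights and threshold chosen for this sketch:

```python
def tlu(x, w, theta):
    # output 1 iff the weighted sum of the inputs reaches the threshold theta
    return 1 if sum(x_i * w_i for x_i, w_i in zip(x, w)) >= theta else 0

# With W = (1, 1) and theta = 1.5, the TLU computes logical AND on 0/1 inputs:
outputs = [tlu([a, b], [1, 1], 1.5) for a in (0, 1) for b in (0, 1)]  # [0, 0, 0, 1]
```

Changing theta to 0.5 with the same weights turns the unit into logical OR, which shows how the threshold shifts the decision boundary.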
Perceptrons
A perceptron refers to a neural network whose weights and biases could be
trained to produce a correct target vector when presented with the corresponding input
vector. The training technique used is called the perceptron learning rule. Perceptrons
are especially suited for simple problems in pattern classification.
The perceptron could learn when initialized with random values for its weights
and biases.
In the above figure, the synaptic weights of the perceptron are denoted by w1, w2,
…, wk. Correspondingly, the inputs applied to the perceptron are denoted by x1, x2, …, xk.
The external bias is denoted by b. From the model we find that the hard limiter input,
or induced local field, of the neuron is

v = ∑ wi xi + b (summing over i = 1, …, k)
Generally there are two kinds of perceptrons that are in use.
1. Single layer perceptron.
2. Multi layer perceptron.
The error e = d – y determines the change ∆W to be made to the weight vector.
The weight update rule is
Wnew = Wold + ∆W
Case-1: If e=0, then make a change ∆W equal to 0.
Case-2: If e=1, then make a change ∆W equal to X.
Case-3: If e=-1, then make a change ∆W equal to –X.
In this perceptron learning process, along with the weights we also have to alter
the bias. The update rule for the bias is:
bnew = bold + e.
Example-1:
Suppose the classification problem is given by
X1T = [2 2], d1 = 0
X2T = [1 −2], d2 = 1
X3T = [−2 2], d3 = 0
X4T = [−1 1], d4 = 1
Solve it with a single-vector-input, two-element perceptron network.
Solution:
Step-1:
Assume the initial values W(0) = [0 0]T and b(0) = [0].
To calculate the output value we first calculate the induced local field, v:
v1 = X1TW(0) + b(0) = [2 2][0 0]T + [0] = 0 + 0 = 0
y = f(v1) = hardlim(0) = 1
The output y does not equal the target value d1 so use the perceptron rule to find
the incremental changes to the weights and biases based on the error.
e = d1 − y = 0 − 1 = −1
∆W = eX1 = (−1)[2 2]T = [−2 −2]T
∆b = e = −1
We can calculate the new weights and bias using the perceptron update rules.
Wnew = Wold +∆W
bnew = bold + ∆b
So,
Wnew = W(1) = [0 0]T + [−2 −2]T = [−2 −2]T
bnew = b(1) = [0] + [−1] = −1
Step-2:
Now take the next input vector
v2 = X2TW(1) + b(1) = [1 −2][−2 −2]T + [−1] = (−2 + 4) − 1 = 1
y = f(v2) = hardlim(1) = 1
On this occasion, the target d2 is 1, so the error is zero. Thus there are no changes
in weights or bias.
So,
W(2) = W(1) = [−2 −2]T and b(2) = b(1) = −1
Step-3:
Here,
v3 = X3TW(2) + b(2) = [−2 2][−2 −2]T + [−1] = (4 − 4) − 1 = −1
y = f(v3) = hardlim(−1) = 0
[hardlim(−1) = 0, since the perceptron output is only 0 or 1]
Comparing the output with the target, y = d3 = 0, so the error is 0. Thus there
are no changes in the weights or bias.
−2
So the weights W (3) = W (2) = and bias b(3) = −1
−2
Step-4:
Here,
v4 = X4TW(3) + b(3) = [−1 1][−2 −2]T + [−1] = (2 − 2) − 1 = −1
y = f(v4) = hardlim(−1) = 0
The output y does not equal the target value d4, so use the perceptron rule to find
the incremental changes to the weights and biases based on the error.
e = d4 − y = 1 − 0 = 1
∆W = eX4 = (1)[−1 1]T = [−1 1]T
∆b = e = 1
We can calculate the new weights and bias using the perceptron update rules.
So,
Wnew = W(4) = [−2 −2]T + [−1 1]T = [−3 −1]T
bnew = b(4) = [−1] + [1] = 0
To determine whether a satisfactory solution is obtained, make one pass through
all input vectors to see if they all produce the desired target values. This is not true for
the fourth input, so make another pass.
Step-5:
Now take X5 = X1 and d5 = d1.
v = [2 2][−3 −1]T + 0 = −6 − 2 = −8
y = hardlim(v) = hardlim(−8) = 0
Here, d1 = y.
So, e = 0, W(4) = W(5) and b(4) = b(5).
Step-6:
Now take X6 = X2 and d6 = d2.
v = [1 −2][−3 −1]T + 0 = −3 + 2 = −1
y = hardlim(v) = hardlim(−1) = 0
The output y does not equal the target value d6, so use the perceptron rule to find
the incremental changes to the weights and biases based on the error.
e = d6 − y = 1 − 0 = 1
∆W = eX6 = (1)[1 −2]T = [1 −2]T
∆b = e = 1
Wnew = W(6) = [−3 −1]T + [1 −2]T = [−2 −3]T
bnew = b(6) = [0] + [1] = 1
Now check whether the algorithm has converged on the sixth presentation of an
input, that is, with the weights W(6) = [−2 −3]T and b(6) = 1. With these values the
outputs match the targets, and the errors for the various inputs are 0 in the next pass.
Similarly, we can observe:
Step-7:
y = hardlim(4 − 6 + 1) = 0 = d7
Step-8:
y = hardlim(2 − 3 + 1) = 1 = d8
Step-9:
y = hardlim(−4 − 6 + 1) = 0 = d9
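The nine steps above can be reproduced with a short training loop: hardlim and the update rules W += eX, b += e are as defined earlier, and the loop simply repeats passes until every pattern is classified correctly:

```python
# Sketch of Example-1: train a single two-input perceptron until convergence.
def hardlim(v):
    return 1 if v >= 0 else 0

X = [[2, 2], [1, -2], [-2, 2], [-1, 1]]
D = [0, 1, 0, 1]
W, b = [0, 0], 0

converged = False
while not converged:
    converged = True
    for x, d in zip(X, D):
        y = hardlim(sum(w_i * x_i for w_i, x_i in zip(W, x)) + b)
        e = d - y
        if e != 0:                                   # adjust only on error
            W = [w_i + e * x_i for w_i, x_i in zip(W, x)]
            b += e
            converged = False

print(W, b)  # [-2, -3] 1, as obtained in steps 1-9 above
```

The loop performs the same two updates as the hand calculation (at X1 and X4 in the first pass, at X2 in the second) and then finds a clean pass, ending with W = [−2 −3]T and b = 1.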
Back Propagation Network
In multi-layer perceptron, the network receives input signals and processes
these signals along with their weights through the neurons of different layers to
compute the output. The computed output will be compared with the desired or target
output and the difference between them is reported as error. If the error is significant, it
will be propagated back to the network to update the weights for computing the
outputs. The process stops when the expected output is produced. The process of
propagating error to the network is called the back propagation.
Back propagation is a systematic method for training multi-layer artificial neural
networks. It is built on a strong mathematical foundation and has very good application
potential, even though it is not always highly practical. It is a multi-layer feedforward
network using an extended gradient-descent-based delta-learning rule, commonly known as the back
propagation rule. Back propagation provides a computationally efficient method for
changing the weights in a feed forward network, with differentiable activation function
units.
Being a gradient descent method, the back propagation algorithm minimizes the
total squared error of the output computed by the net. The network is trained by
supervised learning method. The aim of this network is to train the net to achieve a
balance between the ability to respond correctly to the input patterns that are used for
training and the ability to provide good responses to the input that are similar.
Generally, back propagation learning consists of two passes: a forward pass and a
backward pass. In the forward pass, an activity pattern is applied to the sensory nodes of
the network. Its last layer produces the outputs as the actual responses of the network.
During this pass the synaptic weights of the network are fixed. During backward pass,
the synaptic weights are adjusted in accordance with an error correction rule. The
actual response of the network is subtracted from a desired response to produce an
error signal. This error signal is then propagated backward through the network against
the direction of the synaptic connections. Hence this algorithm is also known as the error back
propagation algorithm or back propagation. The learning process performed with
the algorithm is called back propagation learning.
Back Propagation Algorithm
Actually, Back Propagation is the training or learning algorithm rather than the
neural network itself.
A Back Propagation network learns by example. Back Propagation networks are
ideal for simple Pattern Recognition and Mapping Tasks. We train the network through
a set of examples. Each example consists of a pair of input and an expected output for
that input. Once the network is trained, it will provide the desired output for any of the
input patterns.
Let’s now look at how the training works.
The network is first initialized by setting up all its weights to be small random
numbers – say between –1 and +1. Next, the input pattern is applied and the output is
calculated (this is called the forward pass). The calculated output is completely different
from the desired or target output, since all the weights are random. We then calculate
the Error of each neuron, which is the difference between the desired output and the
computed output. This error is then used mathematically to adjust the weights in such a
way that the error will get smaller. In other words, the Output of each neuron will get
much closer to its Target (this part is called the reverse pass). The process is repeated
again and again until the error is minimal or zero.
Propagate the input forward through the network:
1. Input the instance X to the network and compute the output yk
of every neuron k in the network.
Propagate the error backward through the network:
2. For each output neuron k, calculate its error term δk:
δk = yk(1 - yk)(tk - yk)
3. For each hidden unit h, calculate its error term δh
δh = yh(1 − yh) ∑ wkh δk, summing over k ∈ outputs
Example:
Let’s just look at a single connection initially, between a neuron in the output
layer and one in the hidden layer in the following figure.
The algorithm works as follows.
Step-1:
First apply the inputs to the network and compute the output – remember this
initial output could be anything, as the initial weights were random numbers.
Step-2:
Next compute the error for neuron B. The error:
ErrorB = OutputB (1-OutputB)(TargetB – OutputB)
The “Output(1-Output)” term is necessary in the equation because of the Sigmoid
Function. If we were only using a threshold neuron it would just be (Target –
Output).
Step-3:
Adjust the weight. Let W+AB be the new weight and WAB be the initial weight.
W+AB = WAB + (ErrorB * OutputA)
Notice that it is the output of the connecting neuron (neuron A) we use (not B).
We update all the weights in the output layer in this way.
Step-4:
Calculate the Errors for the neurons of the hidden layer. Unlike the output layer
we can’t calculate these directly, because we don’t have a Target. So we Back
Propagate them from the output layer. This is done by taking the Errors from the
output neurons and running them back through the weights to get the hidden
layer errors.
For example if neuron A is connected as shown to B and C then we take the
errors from B and C to generate an error for A.
ErrorA = Output A (1 - Output A)(ErrorB WAB + ErrorC WAC)
Again, the factor Output(1-Output) is present because of the sigmoid squashing
function.
Step-5:
Having obtained the Error for the hidden layer neurons now proceed as in step-3
to change the hidden layer weights. By repeating this method we can train a
network of any number of layers.
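The five steps above can be sketched in Python for a tiny 2-2-1 sigmoid network trained on a single input pattern. The network size, the random seed, and the specific input/target values are assumptions of this sketch; the update rule follows the text's W+AB = WAB + (ErrorB * OutputA), which has no explicit learning rate (practical implementations usually scale the update by one).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Forward pass: hidden outputs, then the single network output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    return h, y

def train_step(x, target, w_hidden, w_out):
    """One forward pass plus one reverse pass (steps 1-5), updating in place."""
    h, y = forward(x, w_hidden, w_out)
    delta_out = y * (1 - y) * (target - y)        # step 2: output error term
    for j in range(len(w_out)):                   # step 3: output-layer weights,
        w_out[j] += delta_out * h[j]              # using the hidden outputs
    for j, row in enumerate(w_hidden):            # step 4: hidden error terms
        delta_h = h[j] * (1 - h[j]) * delta_out * w_out[j]
        for i in range(len(row)):                 # step 5: hidden-layer weights
            row[i] += delta_h * x[i]
    return (target - y) ** 2

random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(2)]
errors = [train_step([0.35, 0.9], 0.5, w_hidden, w_out) for _ in range(200)]
print(errors[0], errors[-1])   # the squared error shrinks as passes repeat
```

Repeating the forward/reverse pass drives the error toward zero, which is exactly the "again and again until the error is minimal" loop described above.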
Worked Example:
Consider the simple network:
Answer:
1) Input to top neuron = (0.35 * 0.1) + (0.9 * 0.8) = 0.755; Output = 0.68.
Input to bottom neuron = (0.9 * 0.6) + (0.35 * 0.4) = 0.68; Output = 0.6637.
Input to final neuron = (0.3 * 0.68) + (0.9 * 0.6637) = 0.80133; Output = 0.69
2) Output error δ = (target-output)(1-output)output
= (0.5-0.69)(1-0.69)0.69 = -0.0406.
New weights for output layer
w1+ = w1 + (δ * input) = 0.3 + (-0.0406 * 0.68) = 0.272392.
w2+ = w2 + (δ * input) = 0.9 + (-0.0406 * 0.6637) = 0.87305.
Errors for hidden layers:
δ1 = δ * w1+ * Output1(1 - Output1) = -0.0406 * 0.272392 * 0.68 * (1 - 0.68) = -2.406 * 10⁻³
δ2 = δ * w2+ * Output2(1 - Output2) = -0.0406 * 0.87305 * 0.6637 * (1 - 0.6637) = -7.916 * 10⁻³
New hidden layer weights:
w3+ = 0.1 + (-2.406 * 10⁻³ * 0.35) = 0.09916
w4+ = 0.8 + (-2.406 * 10⁻³ * 0.9) = 0.7978
w5+ = 0.4 + (-7.916 * 10⁻³ * 0.35) = 0.3972
w6+ = 0.6 + (-7.916 * 10⁻³ * 0.9) = 0.5928
3) Old error was -0.19. New error is -0.18205. Therefore error has reduced.
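The arithmetic above can be checked with a short script. The weight names w1…w6 follow the answer; since the figure itself is not reproduced here, the wiring (w3, w4 into the top hidden neuron; w5, w6 into the bottom one) is inferred from the numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x1, x2 = 0.35, 0.9          # inputs
w3, w4 = 0.1, 0.8           # inferred: weights into the top hidden neuron
w5, w6 = 0.4, 0.6           # inferred: weights into the bottom hidden neuron
w1, w2 = 0.3, 0.9           # hidden -> output weights
target = 0.5

# 1) Forward pass
h_top = sigmoid(x1 * w3 + x2 * w4)        # input 0.755 -> ~0.68
h_bot = sigmoid(x2 * w6 + x1 * w5)        # input 0.68  -> ~0.6637
out = sigmoid(w1 * h_top + w2 * h_bot)    # ~0.69

# 2) Output error term and new output-layer weights
delta = (target - out) * (1 - out) * out  # ~ -0.0406
w1_new = w1 + delta * h_top               # ~ 0.2724
w2_new = w2 + delta * h_bot               # ~ 0.8731

# Hidden error terms (the notes use the already-updated output weights here)
d1 = delta * w1_new * h_top * (1 - h_top)
d2 = delta * w2_new * h_bot * (1 - h_bot)

# 3) New hidden-layer weights
w3_new, w4_new = w3 + d1 * x1, w4 + d1 * x2
w5_new, w6_new = w5 + d2 * x1, w6 + d2 * x2
print(round(out, 3), round(w1_new, 4), round(w3_new, 5))
```

The small discrepancies in the last decimal place against the worked answer come from the answer's use of rounded intermediate values (e.g., 0.68 instead of 0.6803).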
6
Bayesian Learning
Basics of Probability Theory
Probability measures the chance of occurrence of an event whose happening is not
certain.
If an experiment is conducted under essentially homogeneous conditions we
generally come across two types of situations:
1. The result or outcome is unique or certain. This kind of phenomenon is called
deterministic or predictable phenomenon.
2. The result is not certain but may be one of the several possible outcomes.
This kind of phenomenon is unpredictable or probabilistic.
Trial and event: The experiment that we conduct is a trial and the outcome that
is expected out of the experiment is an event.
Exhaustive events: Total number of possible outcomes in any trial is known as
exhaustive events.
Favourable events: The number of cases favourable to an event in a trial is the
number of outcomes which entail the happening of the event.
Mutually exclusive events: Events are said to be mutually exclusive or
incompatible if the happening of any one of them prevents or precludes the happening
of all the others, i.e., if no two or more events can happen simultaneously in the same trial.
Equally likely events: Outcomes of a trial are said to be equally likely if they
have equal chances of occurrence.
Independent events: Events are said to be independent if the happening or non-happening
of an event is not affected by supplementary knowledge concerning the
occurrence of any number of the remaining events.
Mathematical definition of probability
If a trial results in n exhaustive, mutually exclusive and equally likely cases and
m of them are favourable to the happening of an event E, then the probability p of
happening of E is given by
p = (Number of cases favourable to E) / (Number of exhaustive cases in the trial) = m/n
Also, the number of cases not favourable to the event E is n - m.
⇒ q = (n - m)/n = 1 - m/n = 1 - p
⇒ p + q = 1
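The m/n definition can be illustrated with a fair die; the event "an even number turns up" is an assumption chosen just for the example. Exact fractions keep p and q free of rounding.

```python
from fractions import Fraction

# A fair die: six exhaustive, mutually exclusive, equally likely outcomes
outcomes = [1, 2, 3, 4, 5, 6]
favourable = [o for o in outcomes if o % 2 == 0]   # event E: an even number

p = Fraction(len(favourable), len(outcomes))       # p = m/n
q = 1 - p                                          # cases not favourable to E
print(p, q, p + q)                                 # 1/2 1/2 1
```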
Sample space: The set of all possible outcomes of an experiment is called the
sample space.
Event: Every non-empty subset A of the sample space S of a random experiment E,
being a disjoint union of single-element subsets of S, is called an event.
Probability function
P(A) is the probability function defined on a σ-field B of events if the following
properties hold.
1. For each A є B, P(A) is defined, is real and P(A) ≥ 0.
2. P(S) = 1.
3. If {An} is any finite or infinite sequence of disjoint events in B, then
P(∪i Ai) = Σi P(Ai)
Note
1. If the events A and B are mutually exclusive, then
P ( A ∪ B ) = P ( A) + P ( B )
2. Probability of the impossible event is zero, i.e., P(ф) = 0.
3. Probability of the complementary event A’ of A is given by P(A’) = 1 – P(A).
4. For any two events A and B,
P ( A '∩ B ) = P ( B ) − P ( A ∩ B )
5.
If B ⊂ A, then
(i) P(A ∩ B')=P(A) - P(B)
(ii) P(B) ≤ P(A)
Multiplication Law of Probability and Conditional Probability
For two events A and B
P ( A ∩ B ) = P ( A).P ( B | A), P(A) > 0
=P(B).P(A|B), P(B) > 0
where P(B|A) represents the conditional probability of occurrence of B when the
event A has already happened, and P(A|B) is the conditional probability of happening of A,
given that B has already happened.
Now the conditional probabilities are obtained as follows:
P(B|A) = P(A ∩ B) / P(A)
P(A|B) = P(A ∩ B) / P(B)
Thus the conditional probabilities P(B|A) and P(A|B) are defined if and only if
P(A) ≠ 0 and P(B) ≠ 0 respectively.
Note:
1. For P(B) > 0, P(A|B) ≥ P(A ∩ B).
2. P(A|A) = 1.
3. If A1, A2, …, An are independent events then
P( A1 ∩ A2 ∩ ... ∩ An ) = P ( A1 ).P( A2 )...P( An )
4. For every three events A, B and C
P( A ∪ B | C ) = P( A | C ) + P( B | C ) − P( A ∩ B | C )
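The multiplication law and the conditional-probability formulas can be verified by enumerating a small sample space. The two-dice events A and B below are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favourable cases over exhaustive cases."""
    return Fraction(len([s for s in sample_space if event(s)]),
                    len(sample_space))

A = lambda s: s[0] == 6            # first die shows 6
B = lambda s: s[0] + s[1] == 10    # the sum is 10

p_a = prob(A)
p_ab = prob(lambda s: A(s) and B(s))
p_b_given_a = p_ab / p_a           # P(B|A) = P(A ∩ B) / P(A)
print(p_b_given_a)                 # 1/6
```

Multiplying back, P(A).P(B|A) recovers P(A ∩ B), which is the multiplication law stated above.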
Independent events
An event B is said to be independent of event A, if P(B|A) = P(B).
Note:
1. If A and B are independent events then A and B’ are also independent events.
2. If A and B are independent events then A’ and B’ are also independent events.
Bayes’ Theorem
If E1, E2, …, En are mutually disjoint events with P(Ei) ≠ 0, (i = 1, 2, …, n), then for
any arbitrary event A which is a subset of E1 ∪ E2 ∪ … ∪ En such that P(A) > 0, we have
P(Ei | A) = P(Ei).P(A | Ei) / Σj P(Ej).P(A | Ej)
This enables us to find the probabilities of the various events E1, E2, …, En which
cause A to occur.
Note:
1. The probabilities P(E1), P(E2), …, P(En) are termed prior probabilities
because they exist before we gain any information from the experiment itself.
2. The probabilities P(A|Ei), (i = 1, 2, …, n) are called likelihoods because they
indicate how likely the event A under consideration is to occur, given each
hypothesis Ei.
3. The probabilities P(Ei|A), (i = 1, 2, …, n) are called posterior probabilities
because they are determined after the results of the experiment are known.
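A small sketch of the theorem: given the priors P(Ei) and the likelihoods P(A|Ei), the posteriors follow directly from the formula above. The three-machine defect scenario is a made-up example, not from these notes.

```python
def bayes_posteriors(priors, likelihoods):
    """P(Ei|A) = P(Ei).P(A|Ei) / sum over j of P(Ej).P(A|Ej)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)              # P(A), by the total probability rule
    return [j / total for j in joint]

# Hypothetical: three machines make 50%, 30%, 20% of all items, with
# defect rates 1%, 2%, 3%; an item drawn at random is found defective.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.03]
post = bayes_posteriors(priors, likelihoods)
print([round(p, 3) for p in post])
```

The posteriors sum to 1, and the machine with the smallest share of production but the highest defect rate ends up as likely a cause as the largest producer.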
Bayesian Learning
Bayesian reasoning provides a probabilistic approach to inference. It is based on
the assumption that the quantities of interest are governed by probability distributions
and that optimal decisions can be made by reasoning about these probabilities together
with observed data.
Bayesian reasoning is important to machine learning because it provides a
quantitative approach to weighing the evidence supporting alternative hypotheses.
Bayesian reasoning provides the basis for learning algorithms that directly manipulate
probabilities, as well as a framework for analyzing the operation of other algorithms
that do not explicitly manipulate probabilities.
Bayesian learning methods are relevant to our study of machine learning for two
different reasons.
1. Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to
certain types of learning problems.
2. Bayesian learning algorithms provide a useful perspective for understanding
many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian learning methods include:
1. Each observed training example can incrementally decrease or increase the
estimated probability that a hypothesis is correct. This provides a more flexible
approach to learning than algorithms that completely eliminate a hypothesis if it
is found to be inconsistent with any single example.
2. Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and (2) a
probability distribution over observed data for each possible hypothesis.
3. Bayesian methods can accommodate hypotheses that make probabilistic
predictions (e.g., hypotheses such as "this pneumonia patient has a 93% chance
of complete recovery").
4. New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
5. Even in cases where Bayesian methods prove computationally intractable, they
can provide a standard of optimal decision making against which other practical
methods can be measured.
One practical difficulty in applying Bayesian methods is that they typically
require initial knowledge of many probabilities. When these probabilities are not
known in advance they are often estimated based on background knowledge, previously
available data, and assumptions about the form of the underlying distributions. A second
practical difficulty is the significant computational cost required to determine the Bayes
optimal hypothesis in the general case (linear in the number of candidate hypotheses).
Bayes Theorem in Machine Learning
In machine learning we are often interested in determining the best hypothesis
from some space H, given the observed training data D. One way to specify what we
mean by the best hypothesis is to say that we demand the most probable hypothesis,
given the data D plus any initial knowledge about the prior probabilities of the various
hypotheses in H. Bayes theorem provides a direct method for calculating such
probabilities. More precisely, Bayes theorem provides a way to calculate the probability
of a hypothesis based on its prior probability, the probabilities of observing various data
given the hypothesis, and the observed data itself.
To define Bayes theorem precisely, let us first introduce a little notation.
We shall write P(h) to denote the initial probability that hypothesis h holds,
before we have observed the training data. P(h) is often called the prior
probability of h and may reflect any background knowledge we have about the
chance that h is a correct hypothesis.
If we have no such prior knowledge, then we might simply assign the same prior
probability to each candidate hypothesis.
Similarly, we will write P(D) to denote the prior probability that training data D
will be observed.
Next, we will write P(D|h) to denote the probability of observing data D given
some world in which hypothesis h holds.
In machine learning problems we are interested in the probability P (h|D) that h
holds given the observed training data D. P(h|D) is called the posterior
probability of h, because it reflects our confidence that h holds after we have seen
the training data D.
The posterior probability P(h|D) reflects the influence of the training data D, in
contrast to the prior probability P(h) , which is independent of D.
Bayes theorem
P(h | D) = P(D | h) P(h) / P(D)
It is obvious that P(h|D) is directly proportional to P(D|h) and P(h).
In many learning scenarios, the learner considers some set of candidate
hypotheses H and is interested in finding the most probable hypothesis h є H given the
observed data D. Any such maximally probable hypothesis is called a maximum a
posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes
theorem to calculate the posterior probability of each candidate hypothesis.
hMAP = argmax h∈H P(h | D)
     = argmax h∈H P(D | h) P(h) / P(D)
     = argmax h∈H P(D | h) P(h)
Example
Consider a medical diagnosis problem in which there are two alternative
hypotheses - h1: the patient has a particular form of cancer and h2: the patient does not
have cancer. The available data is from a particular laboratory test with two possible
outcomes: Pv (positive) and Nv (negative). We have prior knowledge that over the
entire population of people only 0.008 have this disease. Furthermore, the lab test is
only an imperfect indicator of the disease. The test returns a correct positive result in
only 98% of the cases in which the disease is actually present and a correct negative
result in only 97% of the cases in which the disease is not present. In other cases, the
test returns the opposite result.
The above situation can be summarized by the following probabilities:
P(h1) = 0.008, P(h2) = 0.992
P(Pv|h1) = 0.98, P(Nv|h1) = 0.02
P(Pv|h2) = 0.03, P(Nv|h2) = 0.97
Suppose we now observe a new patient for whom the lab test returns a positive
result. Should we diagnose the patient as having cancer or not? The maximum a
posteriori (MAP) hypothesis can be found as follows:
P(h1|Pv) ∝ P(Pv|h1) * P(h1) = 0.98 * 0.008 = 0.0078
P(h2|Pv) ∝ P(Pv|h2) * P(h2) = 0.03 * 0.992 = 0.0298
Thus hMAP = h2, i.e., the patient does not have cancer.
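The same computation in a few lines of Python (the variable names are illustrative); since P(Pv) is the same for both hypotheses, it can be dropped from the argmax:

```python
# Priors and likelihoods from the diagnosis example
p_h1, p_h2 = 0.008, 0.992        # cancer / no cancer
p_pv_h1, p_pv_h2 = 0.98, 0.03    # P(positive result | hypothesis)

# Unnormalized posteriors P(Pv|h).P(h); P(Pv) cancels in the argmax
score_h1 = p_pv_h1 * p_h1        # 0.00784
score_h2 = p_pv_h2 * p_h2        # 0.02976
h_map = "h1 (cancer)" if score_h1 > score_h2 else "h2 (no cancer)"
print(score_h1, score_h2, h_map)
```

Normalizing the two scores shows the posterior probability of cancer given a positive test is only about 0.00784 / (0.00784 + 0.02976) ≈ 0.21, which is why hMAP is still h2.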
Bayes Optimal Classifier
Consider a hypothesis space containing three hypotheses, h1, h2, and h3.
Suppose that the posterior probabilities of these hypotheses given the training data are
0.4, 0.3, and 0.3 respectively. Thus, h1 is the MAP hypothesis. Suppose a new instance x
is encountered, which is classified positive by h1, but negative by h2 and h3. Taking all
hypotheses into account, the probability that x is positive is 0.4 (the probability
associated with h1), and the probability that it is negative is therefore 0.6. The most
probable classification (negative) in this case is different from the classification
generated by the MAP hypothesis.
In general, the most probable classification of the new instance is obtained by
combining the predictions of all hypotheses, weighted by their posterior probabilities. If
the possible classification of the new example can take on any value vj from some set V,
then the probability P(vj|D) that the correct classification for the new instance is vj, is
just
P(vj | D) = Σ hi∈H P(vj | hi) P(hi | D)
The optimal classification of the new instance is the value vj, for which P (vj|D) is
maximum.
argmax vj∈V Σ hi∈H P(vj | hi) P(hi | D)
This method maximizes the probability that the new instance is classified
correctly, given the available data, hypothesis space, and prior probabilities over the
hypotheses.
To illustrate in terms of the above example, the set of possible classifications of
the new instance is V = {Pv, Nv}, and
P(h1|D) = 0.4, P(Nv|h1) = 0, P(Pv|h1) = 1
P(h2|D) = 0.3, P(Nv|h2) = 1, P(Pv|h2) = 0
P(h3|D) = 0.3, P(Nv|h3) = 1, P(Pv|h3) = 0
therefore
Σ hi∈H P(Pv | hi) P(hi | D) = 0.4 and Σ hi∈H P(Nv | hi) P(hi | D) = 0.6,
and the Bayes optimal classification of the new instance is Nv.
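This weighted vote over the three hypotheses can be reproduced directly:

```python
# Posterior probabilities of the three hypotheses given the training data D
p_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# Each hypothesis's prediction for the new instance x
p_pv = {"h1": 1.0, "h2": 0.0, "h3": 0.0}
p_nv = {"h1": 0.0, "h2": 1.0, "h3": 1.0}

# Weight each prediction by the posterior of the hypothesis making it
score_pv = sum(p_pv[h] * p_h[h] for h in p_h)   # 0.4
score_nv = sum(p_nv[h] * p_h[h] for h in p_h)   # 0.6
best = "Pv" if score_pv > score_nv else "Nv"
print(score_pv, score_nv, best)
```

Even though h1 is the MAP hypothesis and predicts Pv, the combined vote of h2 and h3 makes Nv the Bayes optimal classification.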
Naïve Bayesian Classifier
Let X be a data sample and let C1, C2, …, Cm be the classes. By Bayes theorem,
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized.
Derivation of naïve Bayesian classifier
A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
P(X | Ci) = Π k=1..n P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
This greatly reduces the computation cost: only the class distribution needs to be counted.
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak divided
by |Ci,D| (the number of tuples of Ci in D).
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x - μ)² / (2σ²))
and P(xk|Ci) is estimated as
P(xk | Ci) = g(xk, μCi, σCi)
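A minimal sketch of the density g; the attribute, mean, and standard deviation used in the call (an age of 30 against a class mean of 38 with deviation 12) are hypothetical values, since the example dataset below uses only categorical attributes.

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): the normal density used for continuous attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical continuous attribute: P(age = 30 | Ci) for a class whose
# ages have mean 38 and standard deviation 12
likelihood = gaussian(30, 38.0, 12.0)
print(round(likelihood, 4))
```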
Example:
Consider the following dataset
Age Income Student Credit_rating Buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31…40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31…40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31…40 Medium No Excellent Yes
31…40 High Yes Fair Yes
>40 Medium No Excellent No
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
Class probabilities:
P(C1): P(buys_computer = “yes”) = 9/14 = 0.643
P(C2): P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
Computing P(X|C1):
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(X|C1) = P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|C1)*P(C1) = P(X|buys_computer=“yes”) * P(buys_computer = “yes”) = 0.028
Compute P(X|C2):
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
P(X|C2) = P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|C2)*P(C2) = P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Now, Max(P(X|C1)*P(C1), P(X|C2)*P(C2)) = Max(0.028, 0.007) = 0.028, which implies
that P(X|C1)*P(C1) is maximum.
Therefore, X belongs to class C1, i.e., (buys_computer = “yes”)
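The whole computation on this dataset can be reproduced by counting, as a check on the numbers above; exact fractions avoid intermediate rounding.

```python
from fractions import Fraction

# (age, income, student, credit_rating, buys_computer) -- the 14 training tuples
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31..40","high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31..40","low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31..40","medium", "no",  "excellent", "yes"),
    ("31..40","high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def score(x, cls):
    """P(X|Ci) * P(Ci), with every P(xk|Ci) a simple count ratio."""
    rows = [r for r in data if r[-1] == cls]
    prior = Fraction(len(rows), len(data))
    likelihood = Fraction(1)
    for k, value in enumerate(x):
        likelihood *= Fraction(len([r for r in rows if r[k] == value]), len(rows))
    return prior * likelihood

x = ("<=30", "medium", "yes", "fair")
s_yes, s_no = score(x, "yes"), score(x, "no")
prediction = "yes" if s_yes > s_no else "no"
print(float(s_yes), float(s_no), prediction)
```

The two scores match the hand computation (≈0.028 vs ≈0.007), so X is classified as buys_computer = "yes".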
Bayesian Belief Network
Bayesian belief network allows a subset of the variables to be conditionally
independent.
A Bayesian belief network is a graphical model of causal relationships that:
1. Represents dependency among the variables.
2. Gives a specification of the joint probability distribution.
The conditional probability table (CPT) for variable LungCancer (LC):
The CPT shows the conditional probability for each possible combination of the values of
its parents.
Derivation of the probability of a particular combination of values of X, from CPT:
P(x1, …, xn) = Π i=1..n P(xi | Parents(Xi))
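A toy illustration of this product: a network in which FamilyHistory and Smoker are the parents of LungCancer, as in the usual LC example. The CPT numbers below are hypothetical, standing in for the figure's table.

```python
# Root variables have no parents, so their "CPTs" are plain priors
p_fh = {True: 0.5, False: 0.5}    # FamilyHistory
p_s  = {True: 0.4, False: 0.6}    # Smoker
# P(LungCancer = True | FamilyHistory, Smoker) -- hypothetical CPT
p_lc = {
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(fh, s, lc) = P(fh) * P(s) * P(lc | Parents(LC))."""
    p_lc_given = p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given

total = sum(joint(fh, s, lc) for fh in (True, False)
            for s in (True, False) for lc in (True, False))
print(joint(True, True, True), total)   # the joint probabilities sum to 1
```

Each joint probability is just the product of one entry per variable's CPT, which is exactly the factorization the formula above specifies.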