
5

Artificial Neural Networks


What is a Neural Network?

The human brain is a highly complex, nonlinear and parallel computer. It has the
capability to organize its structural constituents, known as neurons, so as to perform
certain computations many times faster than the fastest digital computer in existence
today.
A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use.
It resembles the brain in two respects:
1. Knowledge is acquired by the network from its environment through a learning process.
2. Interneuron connection strengths, known as synaptic weights, are used to
store the acquired knowledge.
The procedure used to perform the learning process is called a learning
algorithm. Its function is to modify the synaptic weights of the network to attain a
desired design objective.
Neural networks are also referred to as neurocomputers, connectionist networks, or parallel distributed processors.

Benefits of Neural Networks


A neural network derives its computing power through
1. its massively parallel distributed structure
2. its ability to learn and generalize

Properties and Capabilities of Neural Networks


1. Nonlinearity
Nonlinearity is a highly important property, particularly if the underlying
physical mechanism responsible for generation of the input signal is inherently
nonlinear.

2. Input-output mapping
 Supervised learning
 Working through training samples or task examples.
3. Adaptivity
 Adapting the synaptic weights to change in the surrounding
environments.
4. Evidential response
5. Contextual information
6. Fault tolerance
7. VLSI implementability
8. Uniformity of analysis and design
9. Neurobiological analogy

Human Brain
The human nervous system may be viewed as a three-stage system.

Central to the nervous system is the brain. It is represented by the neural net.
The brain continually receives information, perceives it, and makes appropriate decisions. The arrows pointing from left to right indicate the forward transmission of information-bearing signals through the system. The arrows pointing from right to left signify the presence of feedback in the system.
The receptors convert stimuli from the human body or the external environment
into electrical impulses that convey information to the neural net (the brain). The
effectors convert electrical impulses generated by the neural net into discernible responses as system outputs.
Typically, neurons are five to six orders of magnitude slower than silicon logic gates. Events in a silicon chip happen in the nanosecond (10⁻⁹ s) range, whereas neural events happen in the millisecond (10⁻³ s) range.

It is estimated that there are approximately 10 billion neurons and 60 trillion
synapses or connections in the human brain.
Synapses are elementary structural and functional units that mediate the
interactions between neurons. The most common kind of synapse is a chemical synapse.
A chemical synapse operates as follows. A pre-synaptic process liberates a transmitter substance that diffuses across the synaptic junction between neurons and then acts on a post-synaptic process. Thus, a synapse converts a pre-synaptic electrical signal into a chemical signal and then back into a post-synaptic electrical signal.
Structural organization of levels in the brain

The synapses represent the most fundamental level, depending on molecules and ions for their action.

A neural microcircuit refers to an assembly of synapses organized into patterns
of connectivity to produce a functional operation of interest.
The neural microcircuits are grouped to form dendritic subunits within the
dendritic trees of individual neurons.
The whole neuron is about 100µm in size. It contains several dendritic subunits.
The local circuits are made up of neurons with similar or different properties.
Each circuit is about 1mm in size. The neural assemblies perform operations on
characteristics of a localized region in the brain.
The interregional circuits are made up of pathways, columns and topographic maps, which involve multiple regions located in different parts of the brain.
Topographic maps are organized to respond to incoming sensory information.
The central nervous system is the final level of complexity where the
topographic maps and other interregional circuits mediate specific types of behavior.

Models of a Neuron
A neuron is an information-processing unit that is fundamental to the operation of a neural network. Its model can be shown in the following block diagram.

The neuronal model has three basic elements:
1. A set of synapses each of which is characterized by a weight or strength of its
own. Each synapse has two parts: a signal xj and a weight wkj. Wkj refers to the weight of the kth neuron with respect to the jth input signal. The synaptic weight may range through positive as well as negative values.
2. An adder for summing the input signals, weighted by the respective synapses of
the neuron.
3. An activation function for limiting the amplitude of the output of a neuron.
The neuron model also includes an externally applied bias, bk. The bias has the
effect of increasing or lowering the net input of the activation function, depending on
whether it is positive or negative, respectively.
A neuron k may be mathematically described by the following pair of equations:

uk = ∑_{j=1}^{m} Wkj xj ---- (1)

yk = ψ(uk + bk)
where x1, x2, …, xm are the input signals; Wk1, Wk2, …, Wkm are the synaptic weights of the neuron k; uk is the linear combiner output due to the input signals; bk is the bias; vk is the induced local field; ψ(·) is the activation function; and yk is the output signal of the neuron k.
The use of bias bk has the effect of applying an affine transformation to the
output uk of the linear combiner.
So, we can have
vk = uk + bk ---- (2)

Now, the equation (1) will be written as follows:

vk = ∑_{j=0}^{m} Wkj xj and yk = ψ(vk)
Due to this affine transformation, the graph of vk versus uk no longer passes through the origin.
vk is called the induced local field or activation potential of neuron k. In vk we have added a new synapse; its input is x0 = +1 and its weight is Wk0 = bk.
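For illustration, here is a minimal Python sketch of this neuron model; the input, weight, and bias values below are made up for the example, and the threshold activation is just one possible choice.

def neuron_output(x, w, b, activation):
    u = sum(wj * xj for wj, xj in zip(w, x))   # linear combiner output uk, equation (1)
    v = u + b                                  # induced local field vk = uk + bk, equation (2)
    return activation(v)                       # yk = psi(vk)

def threshold(v):                              # one possible activation function
    return 1 if v >= 0 else 0

# Folding the bias in as an extra synapse with x0 = +1 and Wk0 = bk gives the same result:
y1 = neuron_output([0.5, -1.0, 2.0], [0.4, 0.1, 0.25], -0.3, threshold)
y2 = neuron_output([1.0, 0.5, -1.0, 2.0], [-0.3, 0.4, 0.1, 0.25], 0.0, threshold)
print(y1, y2)   # 1 1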

Types of Activation Function


1. Threshold function
2. Piecewise-linear function
3. Sigmoid function

The activation function defines the output of a neuron in terms of the induced
local field vk.
1. Threshold function
The function is defined as
Ψ(v) = 1, if v ≥ 0
     = 0, if v < 0
This form of a threshold function is also called the Heaviside function.
Correspondingly, the output of neuron k is expressed as
yk = 1, if vk ≥ 0
   = 0, if vk < 0

where

vk = ∑_{j=1}^{m} Wkj xj + bk

This model is also called the McCulloch-Pitts model. In this model, the output of
a neuron is 1, if the induced local field of that neuron is nonnegative, and 0 otherwise.
This statement describes the all-or-none property of the model.

2. Piecewise-linear function
The activation function, here, is defined as

 1
1, v≥ +
2

 1 1
Ψ (v ) = v , + > v > −
 2 2
 1
 0, v≤−
 2

where the amplification factor inside the linear region of operation is assumed to be
unit. Two situations can be observed for this function:
 A linear combiner arises if the linear region of operation is maintained without running into saturation.
 The piecewise-linear function reduces to a threshold function if the amplification
factor of the linear region is made infinitely large.

3. Sigmoid function
This is the most common form of activation function used in the construction of
artificial neural networks. It is defined as a strictly increasing function that exhibits a
graceful balance between linear and nonlinear behavior.
An example of sigmoid function is the logistic function, which is defined as

Ψ(v) = 1 / (1 + e^{−av})

where a is the slope parameter of the sigmoid function.
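The three activation functions can be sketched in Python as follows; the piecewise-linear form follows the definition above, with unity gain in the linear region, and the sample inputs are arbitrary.

import math

def threshold(v):                       # Heaviside / McCulloch-Pitts function
    return 1 if v >= 0 else 0

def piecewise_linear(v):                # unity amplification inside the linear region
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def logistic(v, a=1.0):                 # sigmoid; a is the slope parameter
    return 1.0 / (1.0 + math.exp(-a * v))

print(threshold(-0.2), piecewise_linear(0.3), logistic(0.0))   # 0 0.3 0.5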

Neural Networks and Directed Graphs
The neural network can be represented through a signal-flow graph. A signal-flow graph is a network of directed links (branches) that are interconnected at certain points called nodes. A typical node j has an associated node signal xj. A typical directed link originates at node j and terminates on node k; it has an associated transfer function or transmittance that specifies the manner in which the signal yk at node k depends on the signal xj at node j.
The flow of signals in the various parts of the graph is directed by three basic
rules.
Rule-1:
A signal flows along a link only in the direction defined by the arrow on the link.
There are two types of links:
 Synaptic links: whose behavior is governed by a linear input-output relation. Here, we have yk = Wkjxj.
For example,

 Activation links: whose behavior is governed by a nonlinear input-output relation.
For example,

Rule-2:
A node signal equals the algebraic sum of all signals entering the pertinent node via the incoming links. This is also called synaptic convergence or fan-in.
For example,

Rule-3:
The signal at a node is transmitted to each outgoing link originating from that
node.
For example,

This rule is also called the synaptic divergence or fan-out.


A neural network is a directed graph consisting of nodes with interconnecting
synaptic and activation links. It is characterized by four properties:
1. Each neuron is represented by a set of linear synaptic links, an externally
applied bias, and a possibly nonlinear activation link. The bias is represented
by a synaptic link connected to an input fixed at +1.
2. The synaptic links of a neuron weight their respective input signals.
3. The weighted sum of the input signals defines the induced local field of the
neuron under study.
4. The activation link squashes the induced local field of the neuron to produce
an output.
Note: A digraph describes not only the signal flow from neuron to neuron, but also the
signal flow inside each neuron.

Neural Networks and Architectures


There are three fundamentally different classes of network architectures:
1. Single-layer feedforward networks
2. Multilayer feedforward networks
3. Recurrent networks or neural networks with feedback

1. Single-layer feedforward networks
In a layered neural network, the neurons are organized in the form of layers. The
simplest form of a layered network has an input layer of source nodes that project onto
an output layer but not vice versa.
For example,

The above network is of a feedforward or acyclic type. This is also called a single-layer network. The single layer refers to the output layer, as computations take place only at the output nodes.
2. Multilayer feedforward networks
In this class, a neural network has one or more hidden layers, whose
computation nodes are called hidden neurons or hidden units. The function of hidden
neurons is to intervene between the external input and the network output in a useful
manner. By adding one or more hidden layers, the network is enabled to extract higher-order statistics. This is essentially required when the size of the input layer is large. For
example,

 The source nodes in the input layer supply respective elements of the
activation pattern (input vector), which constitutes the input signals
applied to the second layer.
 The output signals of the second layer are used as inputs to the third layer, and so on for the rest of the network.
 The set of output signals of the neurons in the output layer constitutes the
overall response of the network to the activation pattern supplied by the
source nodes in the input layer.

3. Recurrent networks or neural networks with feedback


In this class, a network will have at least one feedback loop.
For example,

The above is a recurrent network with no hidden neurons. The presence of feedback loops has an impact on the learning capability of the network and on its performance. Moreover, the feedback loops involve the use of unit-delay elements (denoted by z⁻¹), which result in a nonlinear dynamical behavior of the network.

Knowledge Representation
Knowledge refers to stored information or models used by a person or machine
to interpret, predict, and appropriately respond to the outside world.
Knowledge representation involves the following:
1. Identifying the information that is to be processed
2. Physically encoding the information for subsequent use
Knowledge representation is goal directed. In real-world applications of “intelligent” machines, a good solution depends on a good representation of knowledge.
A major task for a neural network is to provide a model for the real-time environment into which it is embedded. Knowledge of the world consists of two kinds of information:
1. Prior information: It gives the known state of the world. It is represented by
facts about what is and what has been known.
2. Observations: These are the measures of the world. These are obtained by the
sensors that probe the environment where the neural network operates.
The set of input-output pairs, with each pair consisting of an input signal and the corresponding desired response, is called a set of training data or a training sample.
Ex: Handwritten digit recognition.

The training sample consists of a large variety of handwritten digits that are representative of a real-time situation. Given such a set of examples, the design of a neural network may proceed as follows:
Step-1: Select an appropriate architecture for the NN, with an input layer consisting of source nodes equal in number to the pixels of an input image, and an output layer consisting of 10 neurons (one for each digit). A subset of examples is then used to train the network by means of a suitable algorithm. This phase is the learning phase.
Step-2: The recognition performance of the trained network is tested with data not seen before. Here, an input image is presented to the network without its corresponding digit. The NN now compares the input image with the stored images of digits and then produces the required output digit. This phase is called generalization.

Note: The training data for a NN may consist of both positive and negative examples.
Example: A simple neuronal model for recognizing handwritten digits.
Consider an input set X of key patterns X1, X2, X3, ……
Each key pattern represents a specific handwritten digit.
The network has k neurons.
Let W = {w1j(i), w2j(i), w3j(i), ……}, for j = 1, 2, 3, …, k, be the set of weights of X1, X2, X3, … with respect to each of the k neurons in the network. Here i refers to an instance.
Let y(j) be the generated output of neuron j for j=1,2,…k.
Let d(j) be the desired output of neuron j, for j=1,2,…..k.
Let e(j)= d(j) – y(j) be the error that is calculated at neuron j, for j = 1,2,…,k.
Now we design the neuronal model for the system as follows.

In the above model, each neuron computes a specific digit j. With every key
pattern, synapses are established to every neuron in the model. We assumed that the
weights of each key pattern can be either 0 or 1.

Ex: Let the key pattern x1 correspond to the handwritten digit 1. Its synaptic weight W11(i) should then be 1 for the 1st neuron, and all other synaptic weights for x1 must be 0.
Weight matrix for the above model can be as follows.

Now the output for neuron 1 will be computed as follows.

Y(1) = w11x1 + w21x2 + w31x3 + … + w91x9
     = 1·(x1) + 0·(x2) + 0·(x3) + … + 0·(x9)
     = x1
This means that neuron 1 is designed to recognize only the key pattern x1, which corresponds to the handwritten digit 1. In the same way, all other neurons in the model recognize their respective digits.
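A toy version of this model can be written in Python; the nine key patterns and the 0/1 identity weight matrix below follow the assumptions of the example above.

# Toy digit-recognition model: with an identity weight matrix, neuron j
# responds only to key pattern xj (the 0/1 weights follow the example).
k = 9
W = [[1 if i == j else 0 for i in range(k)] for j in range(k)]

def outputs(x):
    # Y(j) = w1j*x1 + w2j*x2 + ... + wkj*xk for each neuron j
    return [sum(W[j][i] * x[i] for i in range(k)) for j in range(k)]

x1 = [1, 0, 0, 0, 0, 0, 0, 0, 0]        # key pattern for the handwritten digit 1
print(outputs(x1))                       # only neuron 1 fires: [1, 0, 0, ..., 0]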
Rules for knowledge representation
Rule-1: Similar inputs from similar classes should produce similar representations inside the network, and should belong to the same category. The concept of Euclidean distance is used as a measure of the similarity between inputs.
Let Xi denote the m x 1 vector
Xi = [xi1, xi2, …, xim]ᵀ.
The vector Xi defines a point in an m-dimensional space called Euclidean space, denoted by Rᵐ.

Now, the Euclidean distance between Xi and Xj is defined by

d(Xi, Xj) = ||Xi − Xj|| = [ ∑_{k=1}^{m} (xik − xjk)² ]^{1/2}

The two inputs Xi and Xj are said to be similar if d(Xi, Xj) is minimum.
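As a quick sketch, the similarity measure of Rule-1 in Python; the two vectors are made-up examples.

def euclidean(xi, xj):
    # d(Xi, Xj) = ||Xi - Xj|| = sqrt of the sum over k of (xik - xjk)^2
    return sum((a - b) ** 2 for a, b in zip(xi, xj)) ** 0.5

print(euclidean([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))   # ~1.118: the two inputs are close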

Rule-2: Items to be categorized as separate classes should be given widely different representations in the network.
This rule is the exact opposite of rule-1.
Rule-3: If a particular feature is important, then there should be a large number
of neurons involved in the representation of that item in the network.
Ex: A radar application involving the detection of a target in the presence
of clutter. The detection performance of such a radar system is measured
in terms of two probabilities.
 Probability of detection
 Probability of false alarm
Rule-4: Prior information and invariances should be built into the design of a NN,
thereby simplifying the network design by not having to learn them.

How to build prior information into NN design?


We can use a combination of two techniques:
1. Restricting the network architecture through the use of local connections
known as receptive fields.
2. Constraining the choice of synaptic weight through the use of weight sharing.

How to build invariances into NN design?


 Coping with a range of transformations of the observed signals.
 Pattern recognition.
 Need of a system that is capable of understanding the whole environment.
A primary requirement of pattern recognition is to design a classifier that is
invariant to the transformations.

There are three techniques for rendering classifier-type NNs invariant to
transformations:
1. Invariance by structure
2. Invariance by training
3. Invariant feature space

Basic Learning Laws


A neural network learns about its environment through an interactive process of
adjustments applied to its synaptic weights and bias levels. The network becomes more
knowledgeable after each iteration of the learning process.
Learning is a process by which the free parameters of a neural network are
adapted through a process of stimulation by the environment in which the network is
embedded.
The operation of a neural network is governed by neuronal dynamics. Neuronal
dynamics consists of two parts: one corresponding to the dynamics of the activation
state and the other corresponding to the dynamics of the synaptic weights.
The Short Term Memory (STM) in neural networks is modeled by the activation
state of the network. The Long Term Memory (LTM) corresponds to the encoded
pattern information in the synaptic weights due to learning.
Learning laws are merely implementation models of synaptic dynamics.
Typically, a model of synaptic dynamics is described in terms of expressions for the first
derivative of the weights. They are called learning equations.
Learning laws describe the weight vector for the ith processing unit at time
instant (t+1) in terms of the weight vector at time instant (t) as follows:
Wi(t+1) = Wi(t) + ∆Wi(t)
where ∆Wi(t) is the change in the weight vector.
There are different methods for implementing the learning feature of a neural
network, leading to several learning laws. Some basic learning laws are discussed
below. All these learning laws use only local information for adjusting the weight of the
connection between two units.

Hebb’s Law
Here the change in the weight vector is given by
∆Wi(t) = ηf(WiTa)a
Therefore, the jth component of ∆Wi is given by
∆wij = ηf(WiTa)aj
= ηsiaj, for j = 1, 2, …, M.
where si is the output signal of the ith unit. a is the input vector.
The Hebb’s law states that the weight increment is proportional to the product of
the input data and the resulting output signal of the unit. This law requires weight
initialization to small random values around wij = 0 prior to learning. This law
represents an unsupervised learning.
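A minimal sketch of one Hebbian update in Python; the learning rate, output function, and vectors are illustrative assumptions, not values from the text.

def hebb_update(w, a, eta, f):
    s = f(sum(wj * aj for wj, aj in zip(w, a)))          # output signal si = f(Wi^T a)
    return [wj + eta * s * aj for wj, aj in zip(w, a)]   # wij += eta * si * aj

w = [0.01, -0.02, 0.005]        # small random values around wij = 0
a = [1.0, 0.5, -1.0]            # input vector
w = hebb_update(w, a, eta=0.1, f=lambda x: x)   # a linear output function is assumed here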
Perceptron Learning Law
Here the change in the weight vector is given by
∆Wi = η[di – sgn(WiTa)]a
where sgn(x) is sign of x. Therefore, we have
∆wij = η[di – sgn(WiTa)]aj
= η(di – si) aj, for j = 1, 2, …, M.

The perceptron law is applicable only for bipolar output functions f(.). This is
also called discrete perceptron learning law. The expression for ∆wij shows that the
weights are adjusted only if the actual output si is incorrect, since the term in the square
brackets is zero for the correct output.
This is a supervised learning law, as the law requires a desired output for each
input. In implementation, the weights can be initialized to any random initial values, as
they are not critical. The weights converge to the final values eventually by repeated use
of the input-output pattern pairs, provided the pattern pairs are representable by the
system.
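One step of the discrete perceptron learning law might look as follows in Python, with the bipolar sgn output the law requires; the weights, input, and learning rate are illustrative.

def sgn(x):
    return 1 if x >= 0 else -1

def perceptron_update(w, a, d, eta):
    s = sgn(sum(wj * aj for wj, aj in zip(w, a)))        # actual bipolar output si
    # the bracketed term (d - s) is zero when the output is already correct
    return [wj + eta * (d - s) * aj for wj, aj in zip(w, a)]

w = perceptron_update([0.2, -0.4], [1.0, 1.0], d=1, eta=0.1)   # -> [0.4, -0.2]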

Delta Learning Law
Here the change in the weight vector is given by
∆Wi = η[di – f(WiTa)] f′(WiTa)a
where f′(x) is the derivative of f with respect to x. Hence,
∆wij = η[di – f(WiTa)] f′(WiTa)aj
     = η[di − si] f′(xi) aj, for j = 1, 2, …, M.
This law is valid only for a differentiable output function, as it depends on the
derivative of the output function f(.). It is a supervised learning law since the change in
the weight is based on the error between the desired and the actual output values for a
given input.
Delta learning law can also be viewed as a continuous perceptron learning law.
In implementation, the weights can be initialized to any random values as the
values are not very critical. The weights converge to the final values eventually by
repeated use of the input-output pattern pairs. The convergence can be more or less
guaranteed by using more layers of processing units in between the input and output
layers. The delta learning law can be generalized to the case of multiple layers of a
feedforward network.
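A sketch of one delta-rule update with a logistic output function, whose derivative is f′(x) = f(x)(1 − f(x)); the learning rate and data are assumptions for the example.

import math

def delta_update(w, a, d, eta):
    x = sum(wj * aj for wj, aj in zip(w, a))     # activation value xi = Wi^T a
    s = 1.0 / (1.0 + math.exp(-x))               # si = f(xi), logistic output
    grad = (d - s) * s * (1.0 - s)               # [di - si] * f'(xi)
    return [wj + eta * grad * aj for wj, aj in zip(w, a)]

w = delta_update([0.1, -0.2], [1.0, 0.5], d=1.0, eta=0.5)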

Widrow and Hoff LMS Learning Law


Here, the change in the weight vector is given by
∆Wi = η[di - WiTa]a
Hence
∆wij = η[di - WiTa]aj, for j = 1, 2, …, M.
This is a supervised learning law and is a special case of the delta learning law,
where the output function is assumed linear, i.e., f(xi) = xi.
In this case the change in the weight is made proportional to the negative
gradient of the error between the desired output and the continuous activation value,
which is also the continuous output signal due to linearity of the output function. Hence,
this is also called the Least Mean Squared (LMS) error learning law.
In implementation, the weights may be initialized to any values. The input-
output pattern pairs data is applied several times to achieve convergence of the weights
for a given set of training data. The convergence is not guaranteed for any arbitrary
training data set.

Correlation Learning Law
Here, the change in the weight vector is given by
∆Wi = ηdia
Therefore,
∆wij = ηdiaj
This is a special case of the Hebbian learning with the output signal (si) being replaced by the desired signal (di). But the Hebbian learning is an unsupervised learning, whereas the correlation learning is a supervised learning, since it uses the desired output value to adjust the weights. In the implementation of the learning law, the weights are initialised to small random values close to zero, i.e., wij ≈ 0.

Instar (Winner-take-all) Learning Law
This is relevant for a collection of neurons, organized in a layer as shown below.

All the inputs are connected to each of the units in the output layer in a feedforward manner. For a given input vector a, the output from each unit i is computed using the weighted sum WiTa. The unit k that gives the maximum output is identified. That is,

WkTa = max_i (WiTa)

Then the weight vector leading to the kth unit is adjusted as follows:
∆Wk = η(a - Wk)
Therefore,
∆wkj = η(aj - wkj), for j = 1, 2, …, M.
The final weight vector tends to represent a group of input vectors within a small neighbourhood. This is a case of unsupervised learning. In implementation, the values of the weight vectors are initialized to random values prior to learning, and the vector lengths are normalized during learning.
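A sketch of one winner-take-all step in Python; length normalization is omitted for brevity, and the weights, input, and learning rate are illustrative.

def instar_update(W, a, eta):
    # identify the winner k with the maximum weighted sum Wk^T a
    k = max(range(len(W)), key=lambda i: sum(w * x for w, x in zip(W[i], a)))
    # move only the winner's weight vector toward the input: dWk = eta * (a - Wk)
    W[k] = [w + eta * (x - w) for w, x in zip(W[k], a)]
    return W

W = instar_update([[0.2, 0.8], [0.9, 0.1]], a=[1.0, 0.0], eta=0.5)
# the winner is unit 1 (0.9 > 0.2); its weights move halfway toward [1, 0]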

Outstar Learning Law
The outstar learning law is also related to a group of units arranged in a layer as shown below.

In this law the weights are adjusted so as to capture the desired output pattern characteristics. The adjustment of the weights is given by
∆wjk = η(dj − wjk), for j = 1, 2, …, M
where the kth unit is the only active unit in the input layer. The vector d = (d1, d2, …, dM)ᵀ is the desired response from the layer of M units.
The outstar learning is a supervised learning law, and it is used with a network of instars to capture the characteristics of the input and output patterns for data compression. In implementation, the weight vectors are initialized to zero prior to learning.

Pattern Recognition
Data refers to the collection of raw facts, whereas, the pattern refers to an
observed sequence of facts.
The main difference between human and machine intelligence comes from the fact that humans perceive everything as a pattern, whereas for a machine everything is data. Even in routine data consisting of integer numbers (like telephone numbers, bank account numbers, car numbers), humans tend to perceive a pattern. If there is no pattern, then it is very difficult for a human being to remember and reproduce the data later.
Thus storage and recall operations in human beings and machines are performed
by different mechanisms. The pattern nature in storage and recall automatically gives
robustness and fault tolerance for the human system.

Pattern recognition tasks
Pattern recognition is the process of identifying a specified sequence that is
hidden in a large amount of data.
Following are the pattern recognition tasks.
1. Pattern association
2. Pattern classification
3. Pattern mapping
4. Pattern grouping
5. Feature mapping
6. Pattern variability
7. Temporal patterns
8. Stability-plasticity dilemma

Basic ANN Models for Pattern Recognition Problems

1. Feedforward ANN
 Pattern association
 Pattern classification
 Pattern mapping/classification
2. Feedback ANN
 Autoassociation
 Pattern storage (LTM)
 Pattern environment storage (LTM)
3. Feedforward and Feedback (Competitive Learning) ANN
 Pattern storage (STM)
 Pattern clustering
 Feature mapping
In any pattern recognition task we have a set of input patterns and the
corresponding output patterns. Depending on the nature of the output patterns and the
nature of the task environment, the problem could be identified as one of association or
classification or mapping.
The given set of input-output pattern pairs form only a few samples of an
unknown system. From these samples the pattern recognition model should capture the
characteristics of the system.

Applications of Neural Networks

Area Applications
Aerospace  High performance aircraft autopilots
 flight path simulations
 aircraft control systems
 autopilot enhancements
 aircraft component simulations
 aircraft component fault detectors
Automotive  Automobile automatic guidance systems
 warranty activity analyzers
Banking  Check and other document readers
 credit application evaluators
Defense  Weapon steering
 target tracking
 object discrimination
 facial recognition
 new kinds of sensors
 sonar
 radar and image signal processing including
data compression
 feature extraction and noise suppression
 signal/image identification
Electronics  Code sequence prediction
 integrated circuit chip layout
 process control
 chip failure analysis
 machine vision
 voice synthesis
 nonlinear modeling
Financial  Real estate appraisal
 loan advisor
 mortgage screening
 corporate bond rating
 credit line use analysis
 portfolio trading program
 corporate financial analysis
 currency price prediction

Manufacturing  Manufacturing process control
 product design and analysis
 process and machine diagnosis
 real-time particle identification
 visual quality inspection systems
 beer testing
 welding quality analysis
 paper quality prediction
 computer chip quality analysis
 analysis of grinding operations
 chemical product design analysis
 machine maintenance analysis
 project bidding
 planning and management
 dynamic modeling of chemical process
systems
Medical  Breast cancer cell analysis
 EEG and ECG analysis
 prosthesis design
 optimization of transplant times
 hospital expense reduction
 hospital quality improvement
 emergency room test advisement
Robotics  Trajectory control
 forklift robot
 manipulator controllers
 vision systems
Speech  Speech recognition
 speech compression
 vowel classification
 text to speech synthesis
Securities  Market analysis
 automatic bond rating
 stock trading advisory systems
Telecommunications  Image and data compression
 automated information services
 real-time translation of spoken language
 customer payment processing systems
Transportation  Truck brake diagnosis systems
 vehicle scheduling
 routing systems

Advantages and Disadvantages of Neural Networks
Advantages
 prediction accuracy is generally high
 robust, works when training examples contain errors
 output may be discrete, real-valued, or a vector of several discrete or real-
valued attributes
 fast evaluation of the learned target function
 Generalization
 hardware (HW) realizations are feasible
Criticism
 long training time
 difficult to understand the learned function (weights)
 not easy to incorporate domain knowledge

Threshold Logic Units


Linearly separable (threshold) functions are implemented in a straightforward
way by summing the weighted inputs and comparing this sum to a threshold value as
shown in the following figure.

The above structure is a threshold logic unit (TLU). Its output is 1 or 0 depending on whether or not the weighted sum of its inputs is greater than or equal to a threshold value θ. It has also been called an Adaline (adaptive linear element), an LTU (linear threshold unit), a perceptron, and a neuron.
The n-dimensional feature or input vector is denoted by X = (x1, …, xn). The components of X can be any real-valued numbers, but we often specialize to the binary numbers 0 and 1. The weights of a TLU are represented by an n-dimensional weight vector W = (w1, …, wn). The TLU has output 1 if ∑_{i=1}^{n} xi wi ≥ θ; otherwise it has output 0.

The weighted sum that is calculated by the TLU can be represented as a vector dot product X·W.

Perceptrons
A perceptron refers to a neural network whose weights and biases could be
trained to produce a correct target vector when presented with the corresponding input
vector. The training technique used is called the perceptron learning rule. Perceptrons
are especially suited for simple problems in pattern classification.
The perceptron could learn when initialized with random values for its weights
and biases.

In the above figure, the synaptic weights of the perceptron are denoted by w1, w2, …, wk. Correspondingly, the inputs applied to the perceptron are denoted by x1, x2, …, xk. The external bias is denoted by b. From the model we find that the hard limiter input, or induced local field of the neuron, is

v = ∑_{i=1}^{k} wi xi + b

and the output is y = Ψ(v).

Generally there are two kinds of perceptrons that are in use.
1. Single layer perceptron.
2. Multi layer perceptron.

Single Layer Perceptron


A single layer perceptron is the simplest form of neural network used for the
classification of patterns that are linearly separable. Its architecture is shown above.

Perceptron Learning Rule


The perceptron learning rule is based on a set of input/target output pairs, where X is an input to the network and d is the corresponding desired (target or expected) output. The objective is to reduce the error e, which is the difference between the neuron response y and the target vector d.
The perceptron learning rule calculates desired changes to the perceptron's
weights and biases, given an input vector X and the associated error e. The target vector
d must contain values of either 0 or 1, because perceptrons with hardlim transfer
functions can only output these values.
Each time the network is executed, the perceptron has a better chance of producing
the desired outputs. The perceptron rule is proven to converge on a solution in a finite
number of iterations if a solution exists.
If a bias is not used, the network works to find a solution by altering only the weight vector W to point toward input vectors to be classified as 1 and away from vectors to be classified as 0. This results in a decision boundary that is perpendicular to W and that properly classifies the input vectors.
There are three conditions that can occur for a single neuron once an input
vector X is presented and the network's response y is calculated:
Case-1 If an input vector is presented and the output of the neuron
is correct (y = d and e = d - y = 0), then the weight vector W
is not changed.
Case-2 If the neuron output is 0 and should have been 1 (y = 0 and
d = 1, and e = d - y = 1), the input vector X is added to the
weight vector W.
Case-3 If the neuron output is 1 and should have been 0 (y = 1 and d = 0, and e = d - y = -1), the input vector X is subtracted from the weight vector W.

Let e = d – y, and let ∆W be the change to be made to the weight vector. The weight update rule is
Wnew = Wold + ∆W
Case-1: If e=0, then make a change ∆W equal to 0.
Case-2: If e=1, then make a change ∆W equal to X.
Case-3: If e=-1, then make a change ∆W equal to –X.
In this perceptron learning process along with weights we also have to alter the
bias.
The alter rule of bias is:
bnew = bold + e.

Example-1:
Suppose the classification task is given by
{X1 = [2 2]ᵀ, d1 = 0}
{X2 = [1 −2]ᵀ, d2 = 1}
{X3 = [−2 2]ᵀ, d3 = 0}
{X4 = [−1 1]ᵀ, d4 = 1}
Solve it with a single-vector-input, two-element perceptron network.
Solution:
Step-1:
Assume the initial values W(0) = [0 0]ᵀ and b(0) = [0].
To calculate the output value we first calculate the induced local field, v.

v1 = X1ᵀW(0) + b(0) = [2 2]·[0 0]ᵀ + [0] = 0 + 0 = 0
y = f(v1) = hardlim(0) = 1
The output y does not equal the target value d1 so use the perceptron rule to find
the incremental changes to the weights and biases based on the error.
e = d1 − y = 0 − 1 = −1
∆W = eX1 = (−1)[2 2]ᵀ = [−2 −2]ᵀ
∆b = e = −1
We can calculate the new weights and bias using the perceptron update rules.
Wnew = Wold +∆W
bnew = bold + ∆b

So,

0   −2   −2 
Wnew = W (1) =   +   =  
0   −2   −2 
bnew = b(1) = [0] + [−1] = −1
Step-2:
Now take the next input vector

v2 = X2ᵀW(1) + b(1) = [1 −2]·[−2 −2]ᵀ + [−1] = 2 − 1 = 1
y = f(v2) = hardlim(1) = 1
On this occasion, the target d2 is 1, so the error is zero. Thus there are no changes
in weights or bias.

So,
 −2 
W (2) = W (1) =   and
 −2 
b(2) = b(1) = −1
Step-3:
Here,

 −2 
v3 = [−2 2]   + [−1]
 −2 
= [0] + [−1] = −1
y = f (v3 ) = hardlim(−1) = 0
[hardlim[-1]=0, since percetron output is only 0,1]
Compare the output with the target, then y = d3 = 0, so the error is 0. Thus there
are no changes in weights or bias.
 −2 
So the weights W (3) = W (2) =   and bias b(3) = −1
 −2 
Step-4:
Here,

 −2 
v4 = [−1 1]   + [−1] = −1
 −2 
y = f (v4 ) = hardlim(−1) = 0
The output y does not equal the target value d4, so use the perceptron rule to find
the incremental changes to the weights and biases based on the error.
e = d4 − y = 1 − 0 = 1
∆W = eX4 = (1)[−1 1]ᵀ = [−1 1]ᵀ
∆b = e = 1
We can calculate the new weights and bias using the perceptron update rules.

So,

 −2   −1  −3
Wnew = W (4) =   +   =  
 −2   1   −1
bnew = b(4) = [−1] + 1 = 0
To determine whether a satisfactory solution is obtained, make one pass through
all input vectors to see if they all produce the desired target values. This is not true for
the fourth input, so make another pass.
Step-5:
Now take X5 = X1 and d5 = d1.

 −3 
v = [ 2 2 ]   + 0 = −8
 −1
y = hardlim(v), hardlim(−8) = 0
Here, d1 = y.
So, e = 0, W(4) = W(5) and b(4) = b(5).
Step-6:
Now take X6 = X2 and d6 = d2.

 −3
v = [1 -2]   + 0 = −1
 −1
y = hardlim(v), hardlim(−1) = 0
The output y does not equal the target value d6, so use the perceptron rule to find
the incremental changes to the weights and biases based on the error.
e = d6 − y = 1 − 0 = 1
∆W = eX6 = (1)[1 −2]ᵀ = [1 −2]ᵀ
∆b = e = 1
Wnew = W(6) = [−3 −1]ᵀ + [1 −2]ᵀ = [−2 −3]ᵀ
bnew = b(6) = [0] + [1] = 1
Now check whether the algorithm has converged after the sixth presentation of an input, that is, with the weights W(6) = [−2 −3]ᵀ and b(6) = 1. With these values the outputs equal the targets, and the errors for the various inputs are 0 in the next pass.
Similarly, we can observe:
Step-7:
y = hardlim(4 − 6 + 1) = 0 = d7
Step-8:
y = hardlim(2 − 3 + 1) = 1 = d8
Step-9:
y = hardlim(−4 − 6 + 1) = 0 = d9
In this way we can check the next pass.
It is clear that the weights now need not be changed, since the perceptron has converged.
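The whole of Example-1 can be reproduced with a short Python loop; under the rules above it converges to W = [−2, −3] and b = 1, exactly as derived by hand.

def hardlim(v):
    return 1 if v >= 0 else 0

X = [[2, 2], [1, -2], [-2, 2], [-1, 1]]
d = [0, 1, 0, 1]
W, b = [0, 0], 0

converged = False
while not converged:
    converged = True
    for x, target in zip(X, d):
        y = hardlim(sum(w * xi for w, xi in zip(W, x)) + b)
        e = target - y
        if e != 0:                                       # adjust only on errors
            W = [w + e * xi for w, xi in zip(W, x)]      # delta W = e * X
            b = b + e                                    # b_new = b_old + e
            converged = False

print(W, b)   # [-2, -3] 1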

Multi-Layer Perceptron


A single-layer perceptron is only used for the linear classification of patterns separable by a hyperplane. For nonlinear classification we have to use a multi-layer perceptron.
The multi-layer perceptron consists of a set of sensory units (the input layer), a minimum of one hidden layer, and an output layer. The input signal propagates through the network in a forward direction, layer by layer.

Back Propagation Network
In multi-layer perceptron, the network receives input signals and processes
these signals along with their weights through the neurons of different layers to
compute the output. The computed output will be compared with the desired or target
output and the difference between them is reported as error. If the error is significant, it
will be propagated back to the network to update the weights for computing the
outputs. The process stops when the expected output is produced. The process of
propagating error to the network is called the back propagation.
Back propagation is a systematic method for training multi-layer artificial neural networks. It is built on a sound mathematical foundation and has very good application potential, even though it is not always highly practical. It is a multi-layer feedforward network using the extended gradient-descent-based delta learning rule, commonly known as the back propagation rule. Back propagation provides a computationally efficient method for changing the weights in a feedforward network with differentiable activation function units.
Being a gradient descent method, the back propagation algorithm minimizes the
total squared error of the output computed by the net. The network is trained by
supervised learning method. The aim of this network is to train the net to achieve a
balance between the ability to respond correctly to the input patterns that are used for
training and the ability to provide good responses to the input that are similar.
Generally, back propagation learning consists of two passes: a forward pass and a
backward pass. In the forward pass, an activity pattern is applied to the sensory nodes of
the network. Its last layer produces the outputs as the actual responses of the network.
During this pass the synaptic weights of the network are fixed. During backward pass,
the synaptic weights are adjusted in accordance with an error correction rule. The
actual response of the network is subtracted from a desired response to produce an
error signal. This error signal is then propagated backward through the network, against the direction of synaptic connections. Hence this algorithm is also known as the error back propagation algorithm, or simply back propagation. The learning process performed with the algorithm is called back propagation learning.

Back Propagation Algorithm
Actually, Back Propagation is the training or learning algorithm rather than the
neural network itself.
A Back Propagation network learns by example. Back Propagation networks are
ideal for simple Pattern Recognition and Mapping Tasks. We train the network through
a set of examples. Each example consists of a pair of input and an expected output for
that input. Once the network is trained, it will provide the desired output for any of the
input patterns.
Let’s now look at how the training works.
The network is first initialized by setting up all its weights to be small random
numbers – say between –1 and +1. Next, the input pattern is applied and the output is
calculated (this is called the forward pass). The calculated output is completely different
from the desired or target output, since all the weights are random. We then calculate
the Error of each neuron, which is the difference between the desired output and the
computed output. This error is then used mathematically to adjust the weights in such a
way that the error will get smaller. In other words, the Output of each neuron will get
much closer to its Target (this part is called the reverse pass). The process is repeated
again and again until the error is minimal or zero.

Back Propagation Algorithm


BACKPROPAGATION(training_examples, η, nin, nout, nhidden)
Each training example is a pair of the form (X, T), where X is the input vector and T is the target output vector of the network.
η is the learning rate (e.g., 0.5). nin is the number of network inputs, nhidden the
number of neurons in the hidden layer, and nout the number of neurons in the
output layer.
The input from neuron i into neuron j is denoted as xji, and the weight from neuron i to neuron j is denoted as wji.
Step-1: Create a feed-forward network with nin inputs, nhidden hidden neurons,
and nout output neurons.
Step-2: Initialize all network weights to small random numbers (e.g. between
-0.05 and 0.05).
Step-3: Until the termination condition is met, Do
For each (X, T) in training_examples, Do

Propagate the input forward through the network:
1. Input the instance X to the network and compute the output yk
of every neuron k in the network.
Propagate the error backword through the network:
2. For each output neuron k, calculate its error term δk:
δk = yk(1 - yk)(tk - yk)
3. For each hidden unit h, calculate its error term δh:

δh = yh(1 − yh) ∑_{k ∈ outputs} wkh δk

4. Update each network weight wji


wji = wji + ∆wji
where
∆wji = ηδjxji
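The algorithm above can be sketched in Python for a single hidden layer of sigmoid units; bias weights are omitted to keep the sketch short, and the epoch count, learning rate, and example data are whatever the caller supplies.

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backpropagation(examples, eta, n_in, n_hidden, n_out, epochs):
    # w_h[j][i]: weight from input i into hidden unit j; w_o[k][j]: hidden j -> output k
    w_h = [[random.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_hidden)]
    w_o = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # propagate the input forward through the network
            h = [sigmoid(sum(w * xi for w, xi in zip(wj, x))) for wj in w_h]
            y = [sigmoid(sum(w * hj for w, hj in zip(wk, h))) for wk in w_o]
            # propagate the error backward through the network
            d_o = [yk * (1 - yk) * (tk - yk) for yk, tk in zip(y, t)]
            d_h = [hj * (1 - hj) * sum(w_o[k][j] * d_o[k] for k in range(n_out))
                   for j, hj in enumerate(h)]
            # update each network weight: w_ji = w_ji + eta * delta_j * x_ji
            for k in range(n_out):
                w_o[k] = [w + eta * d_o[k] * hj for w, hj in zip(w_o[k], h)]
            for j in range(n_hidden):
                w_h[j] = [w + eta * d_h[j] * xi for w, xi in zip(w_h[j], x)]
    return w_h, w_o

# e.g. w_h, w_o = backpropagation([([0.35, 0.9], [0.5])], 0.5, 2, 2, 1, epochs=100)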

Example:
Let’s just look at a single connection initially, between a neuron in the output
layer and one in the hidden layer in the following figure.

The connection we’re interested in is between neuron A (a hidden layer neuron) and neuron B (an output neuron), which has the weight WAB. The diagram also shows another connection, between neurons A and C, which has the weight WAC.

The algorithm works as follows.
Step-1:
First apply the inputs to the network and compute the output – remember this
initial output could be anything, as the initial weights were random numbers.
Step-2:
Next compute the error for neuron B. The error:
ErrorB = OutputB (1-OutputB)(TargetB – OutputB)
The “Output(1-Output)” term is necessary in the equation because of the Sigmoid
Function. If we were only using a threshold neuron it would just be (Target –
Output).
Step-3:
Adjust the weight. Let W+AB be the new weight and WAB be the initial weight.
W+AB = WAB + (ErrorB * OutputA)
Notice that it is the output of the connecting neuron (neuron A) we use (not B).
We update all the weights in the output layer in this way.
Step-4:
Calculate the Errors for the neurons of the hidden layer. Unlike the output layer
we can’t calculate these directly, because we don’t have a Target. So we Back
Propagate them from the output layer. This is done by taking the Errors from the
output neurons and running them back through the weights to get the hidden
layer errors.
For example if neuron A is connected as shown to B and C then we take the
errors from B and C to generate an error for A.
ErrorA = Output A (1 - Output A)(ErrorB WAB + ErrorC WAC)
Again, the factor Output(1-Output) is present because of the sigmoid squashing
function.
Step-5:
Having obtained the Error for the hidden layer neurons now proceed as in step-3
to change the hidden layer weights. By repeating this method we can train a
network of any number of layers.

Worked Example:
Consider the simple network:

Assume that the neurons have a sigmoid activation function, and:

1) Perform a forward pass on the network.
2) Perform a reverse pass (training) once (target = 0.5).
3) Perform a further forward pass and comment on the result.

Answer:
1) Input to top neuron = (0.35 * 0.1) + (0.9 * 0.8) = 0.755; Output = 0.68.
Input to bottom neuron = (0.9 * 0.6) + (0.35 * 0.4) = 0.68; Output = 0.6637.
Input to final neuron = (0.3 * 0.68) + (0.9 * 0.6637) = 0.80133; Output = 0.69
2) Output error δ = (target-output)(1-output)output
= (0.5-0.69)(1-0.69)0.69 = -0.0406.
New weights for output layer
w1+ = w1 + (δ * input) = 0.3 + (-0.0406 * 0.68) = 0.272392.
w2+ = w2 + (δ * input) = 0.9 + (-0.0406 * 0.6637) = 0.87305.
Errors for hidden layers:
δ1 = δ * w1 * (1-output)output = -0.0406 * 0.272392 * (1-0.68)(0.68) = -2.406 × 10⁻³
δ2 = δ * w2 * (1-output)output = -0.0406 * 0.87305 * (1-0.6637)(0.6637) = -7.916 × 10⁻³
New hidden layer weights:
w3+ = 0.1 + (-2.406 × 10⁻³ × 0.35) = 0.09916
w4+ = 0.8 + (-2.406 × 10⁻³ × 0.9) = 0.7978
w5+ = 0.4 + (-7.916 × 10⁻³ × 0.35) = 0.3972
w6+ = 0.6 + (-7.916 × 10⁻³ × 0.9) = 0.5928
3) Old error was -0.19. New error is -0.18205. Therefore error has reduced.
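These hand calculations can be checked with a few lines of Python; the weight names w1…w6 follow the usage above, and, as in the hand calculation, the hidden deltas use the already-updated output weights.

import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

x1, x2, target = 0.35, 0.9, 0.5
w3, w4, w5, w6 = 0.1, 0.8, 0.4, 0.6     # hidden-layer weights
w1, w2 = 0.3, 0.9                        # output-layer weights

# 1) forward pass
o1, o2 = sig(w3*x1 + w4*x2), sig(w5*x1 + w6*x2)      # 0.68, 0.6637
out = sig(w1*o1 + w2*o2)                              # 0.69

# 2) reverse pass
d = (target - out) * (1 - out) * out                  # -0.0406
w1, w2 = w1 + d*o1, w2 + d*o2                         # 0.272392, 0.87305
d1 = d * w1 * (1 - o1) * o1                           # -2.406e-3
d2 = d * w2 * (1 - o2) * o2                           # -7.916e-3
w3, w4 = w3 + d1*x1, w4 + d1*x2                       # 0.09916, 0.7978
w5, w6 = w5 + d2*x1, w6 + d2*x2                       # 0.3972, 0.5928

# 3) second forward pass: the error shrinks from -0.19 to about -0.182
out2 = sig(w1*sig(w3*x1 + w4*x2) + w2*sig(w5*x1 + w6*x2))
print(target - out, target - out2)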

6

Bayesian Learning
Basics of Probability Theory
Probability measures the chance of occurrence of an event whose happening is not certain.
If an experiment is conducted under essentially homogeneous conditions we
generally come across two types of situations:
1. The result or outcome is unique or certain. This kind of phenomenon is called
deterministic or predictable phenomenon.
2. The result is not certain but may be one of the several possible outcomes.
This kind of phenomenon is unpredictable or probabilistic.
Trial and event: The experiment that we conduct is a trial and the outcome that
is expected out of the experiment is an event.
Exhaustive events: Total number of possible outcomes in any trial is known as
exhaustive events.
Favourable events: The number of cases favourable to an event in a trial is the
number of outcomes which entail the happening of the event.
Mutually exclusive events: Events are said to be mutually exclusive or incompatible if the happening of any one of them prevents or precludes the happening of all the others, i.e., if no two or more events can happen simultaneously in the same trial.
Equally likely events: Outcomes of a trial are said to be equally likely if they
have equal chances of occurrence.
Independent events: Events are said to be independent if the happening or non-happening of an event is not affected by supplementary knowledge concerning the occurrence of any number of the remaining events.
Mathematical definition of probability
If a trial results in n exhaustive, mutually exclusive and equally likely cases and
m of them are favourable to the happening of an event E, then the probability p of
happening of E is given by
p = (Number of cases favourable to E) / (Number of exhaustive cases in the trial) = m/n

Also, the number of cases not favourable to the event E is n − m.

⇒ q = (n − m)/n = 1 − m/n = 1 − p
⇒ p + q = 1

Sample space: The set of all possible outcomes of an experiment is called the
sample space.
Event: Every non-empty subset A of S, which is a disjoint union of single-element subsets of the sample space S of a random experiment E, is called an event.

Probability function
P(A) is the probability function defined on a σ-field B of events if the following
properties hold.
1. For each A є B, P(A) is defined, is real and P(A) ≥ 0.
2. P(S) = 1.
3. If {An} is any finite or infinite sequence of disjoint events in B, then

P(∪_{i=1}^{n} Ai) = ∑_{i=1}^{n} P(Ai)

Note
1. If the events A and B are mutually exclusive, then
P ( A ∪ B ) = P ( A) + P ( B )
2. Probability of the impossible event is zero, i.e., P(ф) = 0.
3. Probability of the complementary event A’ of A is given by P(A’) = 1 – P(A).
4. For any two events A and B,
P(A′ ∩ B) = P(B) − P(A ∩ B)
5. If B ⊂ A, then
(i) P(A ∩ B′) = P(A) − P(B)
(ii) P(B) ≤ P(A)

Law of addition of probabilities:


If A and B are any two events and are not disjoint, then
P ( A ∪ B ) = P ( A) + P ( B ) − P ( A ∩ B )

Multiplication Law of Probability and Conditional Probability
For two events A and B
P ( A ∩ B ) = P ( A).P ( B | A), P(A) > 0
=P(B).P(A|B), P(B) > 0
where P(B|A) represents the conditional probability of occurrence of B, when the
event has already happened and P(A|B) is the conditional probability of happening of A,
given that B has already happened.
Now the conditional probabilities are obtained as follows:

P(B|A) = P(A ∩ B) / P(A)

P(A|B) = P(A ∩ B) / P(B)
Thus the conditional probabilities P(B|A) and P(A|B) are defined if and only if
P(A) ≠ 0 and P(B) ≠ 0 respectively.
Note:
1. For P(B) > 0, P(A|B) ≤ P(A).
2. P(A|A) = 1.
3. If A1, A2, …, An are independent events then
P( A1 ∩ A2 ∩ ... ∩ An ) = P ( A1 ).P( A2 )...P( An )
4. For every three events A, B and C
P( A ∪ B | C ) = P( A | C ) + P( B | C ) − P( A ∩ B | C )

Independent events
An event B is said to be independent of event A if P(B|A) = P(B).
Note:
1. If A and B are independent events then A and B’ are also independent events.
2. If A and B are independent events then A’ and B’ are also independent events.

Bayes’ Theorem
If E1, E2, …, En are mutually disjoint events with P(Ei) ≠ 0 (i = 1, 2, …, n), then for any arbitrary event A which is a subset of ∪_{i=1}^{n} Ei such that P(A) > 0, we have

P(Ei|A) = P(Ei) P(A|Ei) / ∑_{i=1}^{n} P(Ei) P(A|Ei)

This enables us to find the probabilities of the various events E1, E2, …, En which
cause A to occur.
Note:
1. The probabilities P(E1), P(E2), …, P(En) are termed a priori probabilities because they exist before we gain any information from the experiment itself.
2. The probabilities P(A|Ei), (i = 1, 2, …, n) are called likelihoods because they indicate how likely the event A under consideration is to occur, given each and every a priori probability.
3. The probabilities P(Ei|A), (i = 1, 2, …, n) are called a posteriori probabilities because they are determined after the results of the experiments are known.

Bayesian Learning
Bayesian reasoning provides a probabilistic approach to inference. It is based on
the assumption that the quantities of interest are governed by probability distributions
and that optimal decisions can be made by reasoning about these probabilities together
with observed data.
Bayesian reasoning is important to machine learning because it provides a
quantitative approach to weighing the evidence supporting alternative hypotheses.
Bayesian reasoning provides the basis for learning algorithms that directly manipulate
probabilities, as well as a framework for analyzing the operation of other algorithms
that do not explicitly manipulate probabilities.

Bayesian learning methods are relevant to our study of machine learning for two
different reasons.
1. Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to
certain types of learning problems.
2. Bayesian learning algorithms provide a useful perspective for understanding
many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian learning methods include:
 Each observed training example can incrementally decrease or increase the
estimated probability that a hypothesis is correct. This provides a more flexible
approach to learning than algorithms that completely eliminate a hypothesis if it
is found to be inconsistent with any single example.
 Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and (2) a
probability distribution over observed data for each possible hypothesis.
 Bayesian methods can accommodate hypotheses that make probabilistic
predictions (e.g., hypotheses such as "this pneumonia patient has a 93% chance
of complete recovery").
 New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
 Even in cases where Bayesian methods prove computationally intractable, they
can provide a standard of optimal decision making against which other practical
methods can be measured.
One practical difficulty in applying Bayesian methods is that they typically
require initial knowledge of many probabilities. When these probabilities are not
known in advance they are often estimated based on background knowledge, previously
available data, and assumptions about the form of the underlying distributions. A second
practical difficulty is the significant computational cost required to determine the Bayes
optimal hypothesis in the general case (linear in the number of candidate hypotheses).

Bayes Theorem in Machine Learning
In machine learning we are often interested in determining the best hypothesis
from some space H, given the observed training data D. One way to specify what we
mean by the best hypothesis is to say that we demand the most probable hypothesis,
given the data D plus any initial knowledge about the prior probabilities of the various
hypotheses in H. Bayes theorem provides a direct method for calculating such
probabilities. More precisely, Bayes theorem provides a way to calculate the probability
of a hypothesis based on its prior probability, the probabilities of observing various data
given the hypothesis, and the observed data itself.
To define Bayes theorem precisely, let us first introduce a little notation.
 We shall write P(h) to denote the initial probability that hypothesis h holds,
before we have observed the training data. P(h) is often called the prior probability of h and may reflect any background knowledge we have about the chance that h is a correct hypothesis.
 If we have no such prior knowledge, then we might simply assign the same prior
probability to each candidate hypothesis.
 Similarly, we will write P(D) to denote the prior probability that training data D
will be observed.
 Next, we will write P(D|h) to denote the probability of observing data D given
some world in which hypothesis h holds.
 In machine learning problems we are interested in the probability P(h|D) that h holds given the observed training data D. P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.
 The posterior probability P(h|D) reflects the influence of the training data D, in
contrast to the prior probability P(h) , which is independent of D.

Bayesian theorem

P(h|D) = P(D|h) P(h) / P(D)
It is obvious that P(h|D) is directly proportional to P(D|h) and P(h).

In many learning scenarios, the learner considers some set of candidate
hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the
observed data D. Any such maximally probable hypothesis is called a maximum a
posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes
theorem to calculate the posterior probability of each candidate hypothesis.

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

In the final step, P(D) is dropped because it is a constant independent of h.

Example
Consider a medical diagnosis problem in which there are two alternative
hypotheses - h1: the patient has a particular form of cancer and h2: the patient does not
have cancer. The available data is from a particular laboratory test with two possible
outcomes: Pv (positive) and Nv (negative). We have prior knowledge that over the
entire population only 0.8% of people (0.008) have this disease. Furthermore, the lab test is
only an imperfect indicator of the disease. The test returns a correct positive result in
only 98% of the cases in which the disease is actually present and a correct negative
result in only 97% of the cases in which the disease is not present. In other cases, the
test returns the opposite result.
The above situation can be summarized by the following probabilities:
P(h1) = 0.008, P(h2) = 0.992
P(Pv|h1) = 0.98, P(Nv|h1) = 0.02
P(Pv|h2) = 0.03, P(Nv|h2) = 0.97
Suppose we now observe a new patient for whom the lab test returns a positive
result. Should we diagnose the patient as having cancer or not? The maximum a
posteriori (MAP) hypothesis can be found as follows:
P(h1|Pv) ∝ P(Pv|h1) P(h1) = 0.98 × 0.008 = 0.0078
P(h2|Pv) ∝ P(Pv|h2) P(h2) = 0.03 × 0.992 = 0.0298
Thus hMAP = h2, i.e., the patient does not have cancer. (The exact posteriors can be obtained by normalizing: P(h2|Pv) = 0.0298 / (0.0078 + 0.0298) ≈ 0.79.)
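
The same computation can be written as a short Python sketch; the dictionaries and the small helper below are illustrative, not part of the original notes:

```python
# Unnormalized posteriors P(Pv|h) * P(h) for the cancer-test example above.
priors = {"cancer": 0.008, "no_cancer": 0.992}        # P(h1), P(h2)
likelihood_pos = {"cancer": 0.98, "no_cancer": 0.03}  # P(Pv | h)

def map_hypothesis(priors, likelihoods):
    """Return the hypothesis maximizing P(D|h) * P(h), with all scores."""
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    return max(scores, key=scores.get), scores

best, scores = map_hypothesis(priors, likelihood_pos)
print(scores)  # {'cancer': 0.00784, 'no_cancer': 0.02976}
print(best)    # 'no_cancer', i.e., hMAP = h2
```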
B ayes O ptimal C lassifier
Consider a hypothesis space containing three hypotheses, h1, h2, and h3.
Suppose that the posterior probabilities of these hypotheses given the training data are
0.4, 0.3, and 0.3 respectively. Thus, h1 is the MAP hypothesis. Suppose a new instance x
is encountered, which is classified positive by h1, but negative by h2 and h3. Taking all
hypotheses into account, the probability that x is positive is 0.4 (the probability
associated with h1), and the probability that it is negative is therefore 0.6. The most
probable classification (negative) in this case is different from the classification
generated by the MAP hypothesis.
In general, the most probable classification of the new instance is obtained by
combining the predictions of all hypotheses, weighted by their posterior probabilities. If
the possible classification of the new example can take on any value vj from some set V,
then the probability P(vj|D) that the correct classification for the new instance is vj, is
just

P(vj|D) = Σ_{hi ∈ H} P(vj|hi) P(hi|D)

The optimal classification of the new instance is the value vj, for which P (vj|D) is
maximum.

Bayes optimal classification

Any system that classifies new instances according to the following equation is called a Bayes optimal classifier, or Bayes optimal learner:

argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)

This method maximizes the probability that the new instance is classified
correctly, given the available data, hypothesis space, and prior probabilities over the
hypotheses.
To illustrate in terms of the above example, the set of possible classifications of
the new instance is V = {Pv, Nv}, and
P(h1|D) = 0.4, P(Nv|h1) = 0, P(Pv|h1) = 1
P(h2|D) = 0.3, P(Nv|h2) = 1, P(Pv|h2) = 0
P(h3|D) = 0.3, P(Nv|h3) = 1, P(Pv|h3) = 0

Therefore,

Σ_{hi ∈ H} P(Pv|hi) P(hi|D) = (1)(0.4) + (0)(0.3) + (0)(0.3) = 0.4
Σ_{hi ∈ H} P(Nv|hi) P(hi|D) = (0)(0.4) + (1)(0.3) + (1)(0.3) = 0.6

and

argmax_{vj ∈ {Pv, Nv}} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = Nv

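The same weighted vote can be expressed as a minimal Python sketch of Bayes optimal classification for the three-hypothesis example above (the data structures and names are illustrative):

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}  # P(hi | D)
predictions = {                                  # P(vj | hi)
    "h1": {"Pv": 1.0, "Nv": 0.0},
    "h2": {"Pv": 0.0, "Nv": 1.0},
    "h3": {"Pv": 0.0, "Nv": 1.0},
}

def bayes_optimal(posteriors, predictions, values=("Pv", "Nv")):
    """Return the value vj maximizing sum_i P(vj|hi) * P(hi|D)."""
    scores = {
        v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
        for v in values
    }
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal(posteriors, predictions)
print(scores)  # {'Pv': 0.4, 'Nv': 0.6}
print(label)   # 'Nv'
```
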
N aïve B ayesian C lassifier

Let D be a training set of tuples, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn). Suppose there are m classes C1, C2, …, Cm. Classification derives the maximum a posteriori class, i.e., the class Ci with maximal P(Ci|X).
This can be derived from Bayes’ theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only

P(X|Ci) P(Ci)

needs to be maximized.
Derivation of naïve Bayesian classifier
A simplifying assumption: the attributes are conditionally independent given the class (i.e., there is no dependence relation between attributes):

P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

This greatly reduces the computation cost: only the class distributions need to be counted.
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided
by |Ci, D| (# of tuples of Ci in D).
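
For a categorical attribute this estimate is just a ratio of counts, as in the following minimal Python sketch (the three toy records and all names are illustrative, not the full table used later):

```python
records = [
    {"student": "yes", "buys": "yes"},
    {"student": "no",  "buys": "yes"},
    {"student": "yes", "buys": "no"},
]

def p_cond(attr, value, cls):
    """P(attr = value | buys = cls) = count within class / class size."""
    in_class = [r for r in records if r["buys"] == cls]
    return sum(r[attr] == value for r in in_class) / len(in_class)

print(p_cond("student", "yes", "yes"))  # 1/2 = 0.5
```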
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x-μ)² / (2σ²))

and P(xk|Ci) is

P(xk|Ci) = g(xk, μ_Ci, σ_Ci)

where μ_Ci and σ_Ci are the mean and standard deviation of attribute Ak for the tuples of class Ci.
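
A minimal Python sketch of this density follows; the mean and standard deviation in the usage line are made-up numbers for illustration only:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): density of a normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# If 'age' in class Ci had mean 38 and standard deviation 12, then
# P(age = 30 | Ci) would be approximated by:
print(gaussian(30, 38, 12))  # ~0.0266
```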
Example:
Consider the following dataset
Age Income Student Credit_rating Buys_computer
<=30 High No Fair No
<=30 High No Excellent No
31…40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31…40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31…40 Medium No Excellent Yes
31…40 High Yes Fair Yes
>40 Medium No Excellent No
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)
Class probabilities:
P(C1): P(buys_computer = “yes”) = 9/14 = 0.643
P(C2): P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
Computing P(X|C1):
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(X|C1) = P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|C1)*P(C1) = P(X|buys_computer=“yes”) * P(buys_computer = “yes”) = 0.028

Computing P(X|C2):
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
P(X|C2) = P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|C2)*P(C2) = P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Now, max(P(X|C1)·P(C1), P(X|C2)·P(C2)) = max(0.028, 0.007) = 0.028, which implies
that P(X|C1)·P(C1) is the maximum.
Therefore, X belongs to class C1, i.e., buys_computer = “yes”.
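
The whole computation can be expressed as a short Python sketch; the dictionaries simply restate the counts derived above, and all names are illustrative:

```python
priors = {"yes": 9 / 14, "no": 5 / 14}  # P(buys_computer = Ci)

# Conditional probabilities P(xk | Ci) estimated by counting from the table.
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

def classify(x):
    """Return the class maximizing P(X|Ci) * P(Ci) under the naive assumption."""
    scores = {}
    for c in priors:
        p = priors[c]
        for attr in x:
            p *= cond[c][attr]  # P(X|Ci) = product of the P(xk|Ci)
        scores[c] = p
    return max(scores, key=scores.get), scores

label, scores = classify(x)
print(scores)  # {'yes': ~0.028, 'no': ~0.007}
print(label)   # 'yes'
```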
B ayesian B elief N etwork
A Bayesian belief network allows a subset of the variables to be conditionally independent.
A Bayesian belief network is a graphical model of causal relationships:
 Represents dependency among the variables
 Gives a specification of joint probability distribution

Example: Consider the following network (the figure is not reproduced here), in which FamilyHistory (FH) and Smoker (S) are the parent nodes of the variable LungCancer (LC).

The conditional probability table (CPT) for variable LungCancer (LC):

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC        0.8       0.5        0.7        0.1
~LC       0.2       0.5        0.3        0.9

The CPT gives the conditional probability of each value of LC for each possible combination of values of its parents.
Derivation of the probability of a particular combination of values x1, …, xn of X from the CPTs:

P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(xi))
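
A sketch of how this product can be evaluated follows. The CPT for LC is the one given above, but the priors for FH and S are invented placeholders, since the notes give only the LC table; the dictionary layout is likewise illustrative:

```python
parents = {"FH": [], "S": [], "LC": ["FH", "S"]}

# P(node = True | parent values), keyed by the tuple of parent values.
cpt = {
    "FH": {(): 0.1},  # assumed P(FH) = 0.1 (not given in the notes)
    "S":  {(): 0.3},  # assumed P(S)  = 0.3 (not given in the notes)
    "LC": {(True, True): 0.8, (True, False): 0.5,
           (False, True): 0.7, (False, False): 0.1},
}

def joint(assignment):
    """P(x1,...,xn) = product over nodes of P(xi | Parents(xi))."""
    p = 1.0
    for node, value in assignment.items():
        key = tuple(assignment[par] for par in parents[node])
        p_true = cpt[node][key]
        p *= p_true if value else (1.0 - p_true)
    return p

# P(FH = True, S = True, LC = True) = 0.1 * 0.3 * 0.8 = 0.024
print(joint({"FH": True, "S": True, "LC": True}))
```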