
www.StudentRockStars.com

Learning Processes

By T. Sunil Chowdary
Learning
 Learning is a process by which the free parameters of a NN are
adapted through stimulation from the environment
 Sequence of Events
 stimulated by an environment

 undergoes changes in its free parameters
 responds in a new way to the environment

 Learning Algorithm
 prescribed steps of a process to make a system learn
 ways to adjust the synaptic weights of a neuron
 No unique learning algorithm - a kit of tools

 The Chapter covers


 five learning rules, learning paradigms, issues of learning tasks
 probabilistic and statistical aspects of learning
Error Correction Learning (I)
 Error signal, ek(n)
ek(n) = dk(n) - yk(n)
where n denotes time step
 The error signal activates a control mechanism for corrective
adjustment of the synaptic weights
 Minimizing a cost function, E(n), or index of performance
E(n) = ½ ek²(n)
 Also called instantaneous value of error energy
 step-by-step adjustment until
 system reaches steady state; synaptic weights are stabilized
 Also called the delta rule, or Widrow-Hoff rule
Error Correction Learning (II)
∆wkj(n) = ηek(n)xj(n)
 η : rate of learning; learning-rate parameter
wkj(n+1) = wkj(n) + ∆wkj(n)
wkj(n) = z⁻¹[wkj(n+1)]
 z⁻¹ is the unit-delay operator

 the adjustment is proportional to the product of the error signal
and the input signal
 error-correction learning is local
 The learning rate η determines stability and convergence
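The update rule above can be sketched as a short Python loop (illustrative only; the two-input linear neuron and the target d = 2x1 − x2 are hypothetical):

```python
import numpy as np

def delta_rule_step(w, x, d, eta):
    """One error-correction step: e(n) = d(n) - y(n), Δw_kj(n) = η e_k(n) x_j(n)."""
    y = w @ x              # linear neuron output y(n)
    e = d - y              # error signal e(n)
    return w + eta * e * x, e

# train on samples from a hypothetical environment with d = 2 x1 - x2
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(500):
    x = rng.uniform(-1.0, 1.0, 2)
    d = 2.0 * x[0] - x[1]
    w, e = delta_rule_step(w, x, d, eta=0.5)
# w converges toward the environment's weights (2, -1)
```

A small η gives slow but stable convergence; too large an η destabilizes the loop.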
Memory-based Learning
 Past experiences are stored as a memory of correctly
classified input-output examples
 retrieve and analyze “local neighborhood”
 Essential Ingredients
 Criterion used for defining the local neighborhood
 Learning rule applied to the training examples

 Nearest Neighbor Rule (NNR)
 the vector X'N ∈ { X1, X2, …, XN } is the nearest neighbor of Xtest if
mini d(Xi, Xtest) = d(X'N, Xtest)
 the class of X'N is assigned to Xtest

Nearest Neighbor Rule
 Cover and Hart
 Examples are independent and identically distributed
 The sample size N is infinitely large
then error(NNR) < 2 × error(Bayes rule)
 Half of the classification information is contained in the nearest neighbor

K-nearest Neighbor Rule
 a variant of the NNR
 select the k nearest neighbors of Xtest and use a majority vote
 acts like an averaging device
 discriminates against outliers

 Radial-basis function network is a memory-based classifier
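The k-nearest-neighbor rule can be sketched in Python (illustrative; the two toy clusters are hypothetical data):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels, x_test, k=3):
    """k-nearest-neighbor rule: majority vote among the k stored
    examples closest to x_test in Euclidean distance."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority vote

# hypothetical training set: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [-0.1, 0.1],
              [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y = np.array([0, 0, 0, 1, 1, 1])
```

The majority vote averages over the local neighborhood, which is what makes the rule robust against a single outlying example.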

Hebbian learning process
 If two neurons on either side of a synapse connection are activated simultaneously,
then the strength of that synapse is increased.
 If two neurons on either side of a synapse are activated asynchronously,
then the strength of that synapse is weakened or eliminated.

Hebbian Learning
 If two neurons of a connection are activated
 simultaneously(synchronously), then its strength is increased
 asynchronously, then the strength is weakened or eliminated

 Hebbian synapse
 time-dependent
 depends on the exact time of occurrence of the two signals
 local
 locally available information is used
 interactive mechanism
 learning is done by the interaction of two signals
 conjunctional or correlational mechanism
 co-occurrence of two signals (presynaptic & postsynaptic)
 Hebbian learning is found in the hippocampus
Math Model of Hebbian Modification
∆wkj(n) = F(yk(n), xj(n))
 Hebb’s hypothesis
∆wkj(n) = ηyk(n)xj(n)
where η is the rate of learning
 also called the activity product rule
 repeated application of xj leads to exponential growth of wkj

 Covariance hypothesis
∆wkj(n) = η(xj − x̄)(yk − ȳ)
1. wkj is enhanced if xj > x̄ and yk > ȳ
2. wkj is depressed if (xj > x̄ and yk < ȳ) or (xj < x̄ and yk > ȳ)
where x̄ and ȳ are time averages
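Both hypotheses reduce to one-line updates, sketched here in Python (illustrative only; the single-output neuron and the numeric values are hypothetical):

```python
import numpy as np

def hebb_update(w, x, y, eta):
    """Activity product rule: Δw_kj = η y_k x_j."""
    return w + eta * np.outer(y, x)

def covariance_update(w, x, y, x_bar, y_bar, eta):
    """Covariance hypothesis: Δw_kj = η (y_k - ȳ_k)(x_j - x̄_j)."""
    return w + eta * np.outer(y - y_bar, x - x_bar)

# hypothetical single-output neuron with two inputs
w = np.zeros((1, 2))
w = hebb_update(w, x=np.array([1.0, 0.5]), y=np.array([2.0]), eta=0.1)
```

Unlike the activity product rule, the covariance rule can also depress a weight when one signal is below its time average, which prevents unbounded growth.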


Competitive Learning
 Output neurons of the NN compete to become active
 Only a single neuron is active at any one time
 a salient feature for pattern classification
 Basic Elements
 A set of neurons that are all the same except for their synaptic weight distribution
 responds differently to a given set of input patterns
 A limit on the strength of each neuron
 A mechanism to compete to respond to a given input
 winner-takes-all
 Neurons learn to respond to specialized conditions
 become feature detectors

Competitive NN

Figure: a layer of source input nodes feeds a single layer of output neurons; the feedforward connections are excitatory, while the lateral connections are inhibitory (lateral inhibition).
www.StudentRockStars.com
Competitive Learning
Output signal
yk = 1 if vk > vj for all j, j ≠ k; 0 otherwise
 wkj denotes the synaptic weight, with ∑j wkj = 1 for all k

 Competitive Learning Rule
∆wkj = η(xj − wkj) if neuron k wins the competition
∆wkj = 0 if neuron k loses the competition
 If a neuron does not respond to a particular input, no learning
takes place
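The winner-takes-all rule can be sketched in Python (illustrative; the two-neuron network and the two input patterns are hypothetical). With the inputs and weight rows each summing to 1, the update preserves the constraint ∑j wkj = 1:

```python
import numpy as np

def competitive_step(W, x, eta):
    """Winner-takes-all: the neuron with the largest induced local field
    v_k = Σ_j w_kj x_j moves its weight vector toward the input;
    losing neurons do not learn."""
    k = int(np.argmax(W @ x))
    W[k] += eta * (x - W[k])      # Δw_kj = η (x_j - w_kj) for the winner only
    return k

# hypothetical two-neuron network clustering two input patterns
W = np.array([[0.6, 0.4], [0.4, 0.6]])
for _ in range(100):
    competitive_step(W, np.array([1.0, 0.0]), eta=0.2)
    competitive_step(W, np.array([0.0, 1.0]), eta=0.2)
# each weight vector converges to one cluster center: neurons become feature detectors
```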
Competitive Learning
 x has some constant Euclidean length and ∑j w²kj = 1 for all k
 performs clustering through competitive learning

Boltzmann Learning
 Rooted in statistical mechanics
 Boltzmann Machine: a NN based on Boltzmann learning
 The neurons constitute a recurrent structure
 operate in a binary manner: “on”: +1 and “off”: -1

 Visible neurons and hidden neurons
 energy function
E = −½ ∑j ∑k, j≠k wkj xk xj
 j ≠ k means no self-feedback

 The goal of Boltzmann learning is to maximize the likelihood function
 a set of synaptic weights is called a model of the environment if it
leads to the same probability distribution of the states of the visible units
Boltzmann Machine Operation
 choose a neuron k at random, then flip the state of the
neuron from state xk to state −xk with probability
P(xk → −xk) = 1 / (1 + exp(−∆Ek / T))
where ∆Ek is the energy change resulting from such a flip and T is the pseudo-temperature
 If this rule is applied repeatedly, the machine reaches thermal
equilibrium.
 Two modes of operation
 Clamped condition : visible neurons are clamped onto specific states
 Free-running condition: all neurons are allowed to operate freely
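The flip rule can be sketched in Python (an illustrative sketch, not from the slides; the 2-neuron weight matrix is hypothetical, and symmetric weights with zero self-feedback are assumed):

```python
import numpy as np

def flip_probability(x, W, k, T):
    """P(x_k -> -x_k) = 1 / (1 + exp(-ΔE_k / T)).
    Here ΔE_k is taken as the energy *decrease* caused by flipping x_k,
    derived from E = -1/2 Σ w_kj x_k x_j with symmetric W and w_kk = 0."""
    delta_E = -2.0 * x[k] * (W[k] @ x)
    return 1.0 / (1.0 + np.exp(-delta_E / T))

# hypothetical 2-neuron machine with symmetric weight w_01 = w_10 = 1
W = np.array([[0.0, 1.0], [1.0, 0.0]])
```

Flips that lower the energy are accepted with probability above 1/2; as T grows, every flip probability approaches 1/2 and the dynamics become a random walk.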

Boltzmann Learning Rule
 Let ρ+kj denote the correlation between the states of
neurons j and k in the clamped condition
ρ+kj = ∑xα∈ℑ ∑xβ P(Xβ = xβ | Xα = xα) xk xj
 Let ρ−kj denote the correlation between the states of
neurons j and k in the free-running condition
ρ−kj = ∑x P(X = x) xk xj
 Boltzmann Learning Rule (Hinton and Sejnowski, 1986)
∆wkj = η(ρ+kj − ρ−kj), j ≠ k
where η is a learning-rate parameter
Credit-Assignment Problem
 The problem of assigning credit or blame for overall outcomes
to each of the internal decisions
 Two sub-problems
 assignment of credit to actions

 temporal credit-assignment problem: time instances of actions
 many actions are taken; which action is responsible for the outcome
 assignment of credit to internal decisions
 structural credit-assignment problem: internal structure of actions
 multiple components contribute; which one does best

 The CAP occurs in error-correction learning
 in multiple-layer feed-forward NNs
 How do we assign credit or blame for the actions of hidden neurons?

Learning with a Teacher
 Supervised learning
 Teacher has knowledge of environment to learn
 input and desired output pairs are given as a training set
 Parameters are adjusted based on an error signal step-by-step
 System performance measure
 mean-square error
 sum of squared errors over the training sample
 visualized as an error surface with the parameters as coordinates
 Move toward a minimum point of the error surface
 may not be a global minimum
 use the gradient of the error surface - the direction of steepest descent

 Good for pattern recognition and function approximation


Learning without a Teacher
 Reinforcement learning
 No teacher to provide a direct (desired) response at each step
 example: good/bad, win/lose
 must solve the temporal credit-assignment problem
 since the critic's feedback may be given only at the time of the final output

Figure: the environment sends primary reinforcement to the critic, which converts it into heuristic reinforcement for the learning system.
Unsupervised Learning
 Self-organized learning
 No external teacher or critic
 Task-independent measure of quality is required to learn
 Network parameters are optimized with respect to the measure

 competitive learning rule is a case of unsupervised learning

Learning Tasks
 Pattern Association
 Pattern Recognition
 Function Approximation
 Control
 Filtering
 Beamforming

Pattern Association
 Associative memory is a distributed memory that learns by
association
 a predominant feature of human memory
 xk → yk, k = 1, 2, …, q


 Storage capacity, q
 Storage phase and Recall phase
 The NN is required to store a set of patterns by repeated presentation
 a partial description or distorted pattern is used to retrieve the
particular pattern
 xk acts as a stimulus that determines the location of the memorized pattern
 Autoassociation: when xk = yk
 Heteroassociation: when xk ≠ yk
 Make q as large as possible while still recalling correctly
Pattern Recognition
 Process whereby a received pattern is assigned to one of
prescribed classes (categories)
 Pattern recognition by a NN is statistical in nature
 a pattern is a point in a multidimensional decision space
 decision boundaries - regions associated with each class
 decision boundaries are determined by training

Figure: an input pattern (m-D) passes through unsupervised feature extraction to a feature vector (q-D), then through supervised classification to one of r output classes.
Function Approximation
Given a set of labeled examples ℑ = {(xi, di)}, i = 1, …, N,
drawn from d = f(x),
find F(x) which approximates the unknown function f(x)
s.t. ||F(x) − f(x)|| < ε for all x
 Supervised learning is good for this task
 System identification
 construct an inverse system
Figure: the input drives both the unknown system (output di) and the NN model (output yi); the error ei = di − yi adjusts the model.
Control
 Feedback control system
Figure: the error e = d − y between the reference signal d and the plant output y drives the controller, whose output u drives the plant.

 search for the Jacobian matrix J = [∂yk/∂uj]
 for error-correction learning
 indirect learning
 using input-output measurements on the plant, construct a NN
model
 the NN model is used to estimate the Jacobian
 direct learning

Filtering
 Extract information from a set of noisy data
 information processing tasks
 filtering
 using data measured up to and including time n

 smoothing
 using data measured up to and after time n
 prediction
 predict at n+n0 ( n0 > 0) using data measured up to time n

 Cocktail party problem
 focus on one speaker in a noisy environment
 it is believed that a preattentive, preconscious analysis is involved

Blind source separation

Figure: in an unknown environment, source signals u1(n)…um(n) pass through an unknown mixer A to produce the observations x1(n)…xm(n); a demixer W recovers the outputs y1(n)…ym(n).
Memory
 Relatively enduring neural alterations induced by interaction
with the environment - a neurobiological definition
 accessible to influence future behavior
 activity patterns are stored by the learning process
 memory and learning are connected
 Short-term memory
 compilation of knowledge representing the current state of the environment
 Long-term memory
 knowledge stored for a long time or permanently

Associative memory
 Memory is distributed
 stimulus pattern and response pattern consist of data vectors
 information is stored as spatial patterns of neural activities
 the stimulus pattern determines not only the storage
location but also the address for its retrieval
 although neurons are not reliable, memory is
 there may be interaction between the stored patterns (not a
simple storage of patterns), so there is a possibility of error in the
recall process

Correlation Matrix Memory
 Association of key vector xk with memorized vector yk
yk = W(k)xk, k = 1, 2, 3, …, q
 Total experience
M = ∑ W(k), summed over k = 1, …, q
 recursively: Mk = Mk−1 + W(k), k = 1, 2, …, q

 Adding W(k) to Mk−1 loses the distinct identity of the mixture of
contributions that form Mk
 but information about the stimuli is not lost

Correlation Matrix memory
 Learning: estimate of the memory matrix
M̂ = ∑ yk xkT, summed over k = 1, …, q
 called the Outer Product Rule
 a generalization of Hebb's postulate of learning
 in matrix form: M̂ = [y1, y2, …, yq] [x1, x2, …, xq]T = YXT
 recursively: M̂k = M̂k−1 + yk xkT, k = 1, 2, …, q
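The outer product rule and recall can be sketched in Python (illustrative; the three orthonormal keys and memorized vectors are hypothetical):

```python
import numpy as np

def build_memory(X, Y):
    """Outer product rule: M = Σ_k y_k x_k^T = Y X^T, with the key vectors x_k
    and memorized vectors y_k stored as the columns of X and Y."""
    return Y @ X.T

def recall(M, x):
    """Recall y = M x; exact when the keys form an orthonormal set."""
    return M @ x

# hypothetical example: orthonormal keys (the standard basis vectors)
X = np.eye(3)
Y = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0]])
M = build_memory(X, Y)
```

Because the keys here are orthonormal, each recall returns the memorized vector with zero noise; with non-orthogonal keys a crosstalk term would be added.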

Recall
y = M̂ xj, where xj is the key applied for recall
y = ∑ yk xkT xj = ∑ (xkT xj) yk, summed over k = 1, …, q
= (xjT xj) yj + ∑k≠j (xkT xj) yk
 if each key vector is normalized to unit length, xjT xj = 1, so
y = yj + vj
 vj is the noise vector: vj = ∑k=1, k≠j (xkT xj) yk
 with unit-length keys, cos(xk, xj) = xkT xj / (||xk|| ||xj||) = xkT xj
www.StudentRockStars.com
Orthogonal set
vj = ∑k=1, k≠j cos(xk, xj) yk
 for an orthonormal set, cos(xk, xj) = 0 for k ≠ j
xkT xj = 1 if k = j; 0 if k ≠ j
 so the noise vector vanishes and recall is perfect
 Storage Capacity
 rank: the number of independent columns
 limited by the dimensionality of the input space

Error of Associative Memory
 Real-life keys are neither orthogonal nor highly separated
 Lower bound γ
xkT xj ≥ γ for k ≠ j
 if γ is large, the memory may fail to distinguish patterns
 the memory acts as if presented with patterns it has never seen before
 termed animal logic

Adaptation
 Space and time : fundamental dimension
 spatiotemporal nature of learning
 Stationary system
 learned parameters are frozen to be used later
 Nonstationary system
 continually adapts its parameters to variations in the incoming signal
in a real-time fashion
 continuous learning; learning-on-the-fly

 Pseudostationary
 considered stationary over a short time window
 speech signal: 10~30 msec
 weather forecasting: a period of minutes
Statistical Nature of Learning Process

 X is a random vector consisting of a set of independent variables
 D is a random scalar representing the dependent variable
Statistical Nature of Learning Process

 Training sample: ℑ = {(xi, di)}, i = 1, …, n
 Functional relationship between X and D: D = f(X) + ε
 called the regressive model
 f(X): a deterministic function of the random variable X
 ε: the expectational error, a random variable
 The mean value of the error is zero: E[ε | x] = 0
 hence f(x) = E[D | X = x]
 The error is uncorrelated with the regression function (principle of orthogonality):
E[ε f(X)] = 0
 The cost function is equal to the average of the 1/2 squared difference
between the desired response d and the actual response of the neural network:
Ξ(w) = ½ ∑ (di − F(xi, w))², summed over i = 1, …, N
 With Eℑ denoting the average operator taken over the entire training sample:
Ξ(w) = ½ Eℑ[(d − F(x, ℑ))²]
Minimizing Cost Function

Ξ(w) = ½ ∑ (di − F(xi, w))², summed over i = 1, …, N
Ξ(w) = ½ Eℑ[(d − F(x, ℑ))²], where Eℑ is the averaging operator
Ξ(w) = ½ Eℑ[ε²] + ½ Eℑ[(f(x) − F(x, ℑ))²]
 A natural measure of effectiveness:
Lav(f(x), F(x, w)) = Eℑ[(f(x) − F(x, ℑ))²]
= Eℑ[(E[D | X = x] − F(x, ℑ))²]
 since f(x) = E[D | X = x]
Bias/Variance Dilemma
E[D | X = x] − F(x, ℑ) = (E[D | X = x] − Eℑ[F(x, ℑ)]) + (Eℑ[F(x, ℑ)] − F(x, ℑ))
Bias: B(w) = Eℑ[F(x, ℑ)] − E[D | X = x]
Variance: V(w) = Eℑ[(F(x, ℑ) − Eℑ[F(x, ℑ)])²]
 Bias
 represents the inability of the NN to approximate the regression function
 Variance
 reflects the inadequacy of the information contained in the training set ℑ
 We can reduce both bias and variance only with large
training samples
 purposely introduce “harmless” bias to reduce variance
 designed bias - a constrained network incorporates prior knowledge