
Perceptron

Dr Ashutosh Gupta

Single Layer Discrete Perceptron Networks


Outline

Perceptron learning rule
Perceptron training
Learning algorithm
Properties of perceptrons
What does a perceptron do? Regression and classification
Decision boundary
Linear classification via hyperplanes
The perceptron algorithm
How perceptron updates work
Perceptron convergence theorem
Example: a simple problem

Outline (continued)

Hyperplane-based classification
Linear machines and minimum-distance classification
Single-layer continuous perceptron networks for linearly separable classification
Delta rule
Perceptron vs. delta rule
XOR problem
Generalization and early stopping
Overfitting
Training time
Limitations of perceptrons: linear inseparability
Multilayer perceptron
Example: perceptrons as constraint satisfaction networks

Linear threshold unit (LTU)


[Figure: a discrete perceptron (linear threshold unit). Inputs x0 = 1, x1, x2, . . . , xn are weighted by w0, w1, w2, . . . , wn and summed.]

o(x) = +1 if Σ_{i=0..n} w_i x_i > 0, and -1 otherwise

The perceptron takes a vector of real-valued inputs (x1, . . . , xn), weights them with (w1, . . . , wn), and calculates the linear combination of these inputs. w0 denotes a threshold value and x0 is always 1. The unit outputs +1 if the weighted sum is greater than 0, and -1 otherwise.

Representational Power

Many boolean functions can be represented by a perceptron: AND, OR, NAND, NOR. A perceptron represents a hyperplane decision surface in the n-dimensional space of instances. Some sets of examples cannot be separated by any hyperplane; those that can be separated are called linearly separable.
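To make this concrete, here is a minimal Python sketch (not from the slides) of the discrete perceptron output, together with hand-picked weights that realize AND and OR on bipolar inputs; the particular weight values are illustrative assumptions, one of many valid choices.

```python
def ltu(weights, inputs):
    """Discrete perceptron: o(x) = +1 if sum_i w_i x_i > 0, else -1 (x0 = 1 is the bias input)."""
    total = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))  # prepend x0 = 1
    return 1 if total > 0 else -1

# Illustrative weights (w0, w1, w2); inputs are bipolar (-1 = false, +1 = true).
AND_W = (-1.0, 1.0, 1.0)   # fires only when both inputs are +1
OR_W  = ( 1.0, 1.0, 1.0)   # fires when at least one input is +1

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, "AND:", ltu(AND_W, (a, b)), "OR:", ltu(OR_W, (a, b)))
```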

Perceptron Learning Rule


Problem: determine a weight vector w that causes the perceptron to produce the correct output for each training example.

Perceptron training rule (weight adjustment):
Δwi = c [di - sgn(wi^T x)] x = c [di - oi] x
wi ← wi + Δwi
where di is the desired response, oi is the perceptron output, and c is a small constant (e.g. 0.1) called the learning rate.

Perceptron Learning Rule


Algorithm:
1. Initialize w to random weights.
2. Repeat, until each training example is classified correctly:
   (a) apply the perceptron training rule to each training example.
If the output is correct (di = oi), the weights wi are not changed. If the output is incorrect (di ≠ oi), the weights wi are changed such that the output of the perceptron for the new weights is closer to di. The algorithm converges to the correct classification if the training data is linearly separable and c is sufficiently small.
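A minimal sketch of this training loop in Python, assuming bipolar targets d in {-1, +1}, inputs augmented with x0 = 1, and a made-up toy data set; it is a direct transcription of the rule Δwi = c [di - oi] x, not a production implementation.

```python
import random

def sgn(z):
    return 1 if z > 0 else -1

def train_perceptron(samples, c=0.1, max_epochs=100):
    """samples: list of (x, d) with x a tuple of features and d in {-1, +1}."""
    n = len(samples[0][0]) + 1                         # +1 for the bias weight w0
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]  # 1. initialize to random weights
    for _ in range(max_epochs):                        # 2. repeat ...
        errors = 0
        for x, d in samples:
            xa = [1.0] + list(x)                       # augment with x0 = 1
            o = sgn(sum(wi * xi for wi, xi in zip(w, xa)))
            if o != d:                                 # (a) apply the training rule
                w = [wi + c * (d - o) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                                # ... until every example is correct
            break
    return w

# Example: a linearly separable toy set (bipolar AND).
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
print(train_perceptron(data))
```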

Supervised Learning
Training and test data sets. The training set consists of inputs and targets.

Perceptron Training

Output = 1 if Σ_{i=0..n} w_i x_i > t, and 0 otherwise

A linear threshold is used: W are the weight values and t is the threshold value.

Simple network

[Figure: a simple network. A constant input 1 and inputs X and Y feed a linear threshold unit; the connection weights shown are W = 1.5 and W = 1, and the threshold is t = 0.0.]

Output = 1 if Σ_{i=0..n} w_i x_i > t, and 0 otherwise

Training Perceptrons
[Figure: a network with a constant input 1 and inputs x and y, each connected by an unknown weight W = ? to a threshold unit with t = 0.0.]

For AND:
A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

What are the weight values? Initialize with random weight values.

Learning algorithm
Epoch: one presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1.

Learning algorithm
Target value, T: when training a network we present not only the input but also the value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the target value will be 1.
Output, O: the output value from the neuron.
Ij: the inputs being presented to the neuron.
Wj: the weight from input neuron Ij to the output neuron.
LR: the learning rate. This dictates how quickly the network converges. It is set by experimentation; typically 0.1.

Properties of Perceptrons
Separability: some parameter settings get the training set perfectly correct.
Convergence: if the training data is separable, the perceptron will eventually converge (binary case).
Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability.

What Does a Perceptron Do?


Regression: y = w x + w0. Classification: y = 1(w x + w0 > 0), i.e. the output is 1 on one side of the hyperplane and 0 on the other.

Hyperplane

A hyperplane separates a D-dimensional space into two half-spaces. It is defined by an outward-pointing normal vector w, and w is orthogonal to any vector lying on the hyperplane.
Assumption: the hyperplane passes through the origin. If not, we add a bias term b; we then need both w and b to define it. b > 0 means moving the hyperplane parallel to itself along w (b < 0 means moving in the opposite direction).

Decision boundaries

In simple cases, we divide the feature space by drawing a hyperplane across it, known as a decision boundary. A discriminant function returns different values on opposite sides of the boundary (a straight line in two dimensions). Problems that can be classified in this way are linearly separable.

Linear Classification via Hyperplanes


Linear classifiers represent the decision boundary by a hyperplane.

[Figure: a hyperplane (decision boundary / surface) with normal vector w separating decision region R1 from decision region R2.]

For binary classification, w is assumed to point towards the positive class. Classification rule:

w^T x + b > 0  =>  y = +1
w^T x + b < 0  =>  y = -1

Question: what about the points x for which w^T x + b = 0?
Goal: learn the hyperplane (w, b) using the training data, i.e. find the equation of the decision boundary w^T x + b = 0.

Concept of Margins

The geometric margin gn of an example xn is its distance from the hyperplane. The geometric margin may be positive (if yn = +1) or negative (if yn = -1). The margin of a set {x1, . . . , xN} is the minimum absolute geometric margin.

The functional margin of a training example is yn(w^T xn + b):
positive if the prediction is correct;
negative if the prediction is incorrect.

The absolute value of the functional margin measures the confidence in the predicted label (or the misconfidence if the prediction is wrong): a large margin means high confidence.
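As a small illustration, the two margins can be computed as follows. The sketch assumes the standard formula that the geometric margin is the signed distance (w^T x + b)/||w||, which the slide states in words but not symbols; the weights and points are made up.

```python
import math

def functional_margin(w, b, x, y):
    """y * (w.x + b): positive iff the prediction is correct; magnitude = confidence."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))

w, b = [2.0, 1.0], -1.0
points = [([1.0, 1.0], +1), ([0.0, 0.0], -1)]
print([functional_margin(w, b, x, y) for x, y in points])      # both positive: correct predictions
print(min(abs(geometric_margin(w, b, x)) for x, _ in points))  # margin of the set
```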

The Perceptron Algorithm


One of the earliest algorithms for linear classification (Rosenblatt, 1958). Based on finding a separating hyperplane of the data. Guaranteed to find a separating hyperplane if the data is linearly separable.

If the data is not linearly separable: make it linearly separable, or use a combination of multiple perceptrons (neural networks).

Cycles through the training data, processing training examples one at a time (an online algorithm). Starts with some initialization for (w, b) (e.g., w = [0, . . . , 0], b = 0). It is an iterative, mistake-driven learning algorithm for updating (w, b): don't update if w correctly predicts the label of the current training example; update w when it mispredicts the label of the current training example (the true label is +1 but sign(w^T x + b) = -1, or vice versa).

The Perceptron Algorithm

Batch vs online learning algorithms:
Batch algorithms operate on the entire training data. Online algorithms can process one example at a time; they are usually more efficient (computationally and memory-footprint-wise) than batch algorithms. Often batch problems can be solved using online learning.

The Perceptron Algorithm: Formally

Given: a sequence of N training examples {(x1, y1), . . . , (xN, yN)}
Initialize: w = [0, . . . , 0], b = 0
Repeat until convergence:
  For n = 1, . . . , N:
    if sign(w^T xn + b) ≠ yn (i.e., a mistake is made):
      w = w + yn xn
      b = b + yn

Stopping condition: stop when either
- all training examples are classified correctly (may overfit, so less common in practice),
- a fixed number of iterations is completed, or some convergence criterion is met, or
- one pass over the data is completed (each example seen once), e.g. when examples arrive in a streaming fashion and cannot be stored in memory (more passes are just not possible).

Note: sign(w^T xn + b) ≠ yn is equivalent to yn(w^T xn + b) < 0.
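A direct, minimal transcription of the algorithm into Python; the toy data set, the bounded number of passes, and the tie-breaking convention for a score of exactly zero are assumptions, not part of the slides.

```python
def sign(z):
    return 1 if z > 0 else -1      # convention: a score of exactly 0 counts as the negative class

def perceptron(examples, max_passes=100):
    """examples: list of (x, y) with x a tuple of features and y in {-1, +1}."""
    d = len(examples[0][0])
    w, b = [0.0] * d, 0.0                            # Initialize: w = [0, ..., 0], b = 0
    for _ in range(max_passes):                      # "Repeat until convergence" (bounded passes here)
        mistakes = 0
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if sign(score) != y:                     # a mistake is made
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w = w + yn xn
                b += y                                      # b = b + yn
                mistakes += 1
        if mistakes == 0:                            # all training examples classified correctly
            break
    return w, b

# Toy linearly separable data: the label is the sign of the first feature.
data = [((2.0, 1.0), +1), ((1.0, -1.0), +1), ((-1.5, 0.5), -1), ((-2.0, -1.0), -1)]
print(perceptron(data))
```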

Why Perceptron Updates Work?


Let's look at a misclassified positive example: the perceptron (wrongly) thinks w_old^T xn + b_old < 0. The updates (with yn = +1) would be:

w_new = w_old + yn xn = w_old + xn
b_new = b_old + yn = b_old + 1

Then

w_new^T xn + b_new = (w_old + xn)^T xn + b_old + 1 = (w_old^T xn + b_old) + xn^T xn + 1

Thus w_new^T xn + b_new is less negative than w_old^T xn + b_old, so we are making ourselves more correct on this example!

Why Perceptron Updates Work (Pictorially)?

Why Perceptron Updates Work?


Let's look at a misclassified negative example: the perceptron (wrongly) thinks w_old^T xn + b_old > 0. The updates (with yn = -1) would be:

w_new = w_old + yn xn = w_old - xn
b_new = b_old + yn = b_old - 1

Then

w_new^T xn + b_new = (w_old - xn)^T xn + b_old - 1 = (w_old^T xn + b_old) - xn^T xn - 1

Thus w_new^T xn + b_new is less positive than w_old^T xn + b_old, so we are making ourselves more correct on this example!
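A quick numeric check of this argument, with made-up numbers: after one update on a misclassified negative example, w^T x + b decreases by exactly x^T x + 1.

```python
# Misclassified negative example: y = -1 but w_old.x + b_old > 0.
w_old, b_old = [1.0, 2.0], 0.5
x, y = [2.0, 1.0], -1

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

before = dot(w_old, x) + b_old                     # 1*2 + 2*1 + 0.5 = 4.5  (wrongly positive)
w_new = [wi + y * xi for wi, xi in zip(w_old, x)]  # w - x  = [-1.0, 1.0]
b_new = b_old + y                                  # b - 1  = -0.5
after = dot(w_new, x) + b_new                      # -2 + 1 - 0.5 = -1.5

print(before, after, before - after, dot(x, x) + 1)  # the decrease equals x.x + 1 = 6.0
```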

Why Perceptron Updates Work (Pictorially)?

Perceptron convergence theorem


The perceptron convergence theorem states that if the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates. The number of updates depends on the data set and also on the step-size parameter. If the data is not linearly separable, there will be oscillation (which can be detected automatically).

Example: a simple problem


Four points, linearly separable:

X1 = (-1, 1/2),  d1 = -1
X2 = (-1, 1),    d2 = -1
X3 = (1/2, 1),   d3 = +1
X4 = (1, 1/2),   d4 = +1

[Figure: the four points plotted in the x1-x2 plane.]

Perceptron learning rule: Δwi = c [di - sgn(wi^T x)] x = c [di - oi] x, with wi ← wi + Δwi, where di is the desired response, oi is the perceptron output, and c is a small constant (e.g. 0.1) called the learning rate.
Initial weights

Initial conditions: learning constant c = 0.1; the training vectors x1, x2, x3, x4 are augmented with x0 = 1; initial weight vector w(0) = (w1, w2, w0) = (0, 1, 1).

Initial discriminant function: w1 x1 + w2 x2 + w0 = 0  =>  x2 + 1 = 0, i.e. x2 = -1

Step 1:  =>  0.2 x1 + 0.9 x2 + 0.8 = 0.  Point (-1, 1/2) is still misclassified!

[Figure: the initial discriminant function and the step 1 modification of the discriminant function in the x1-x2 plane.]

Step 2:  =>  0.4 x1 + 0.7 x2 + 0.6 = 0.  Point (-1, 1/2) is still misclassified!

Step 3:  =>  0.4 x1 + 0.7 x2 + 0.6 = 0  (no change in the discriminant function).

Step 4:  =>  0.4 x1 + 0.7 x2 + 0.6 = 0  (no change in the discriminant function).

Step 5:  =>  0.6 x1 + 0.6 x2 + 0.4 = 0.  Point (-1, 1/2) is still misclassified!

Step 6:  =>  0.8 x1 + 0.4 x2 + 0.2 = 0  (correct classification).

[Figure: the discriminant function after each modification, from the initial line to the final one.]

In the 6th step all points are correctly (linearly) classified; 0.8 x1 + 0.4 x2 + 0.2 = 0 is the final discriminant function.
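The six steps above can be reproduced in a few lines of Python. This sketch assumes the patterns are presented cyclically in the order x1, x2, x3, x4 and starts from the augmented weight vector (w1, w2, w0) = (0, 1, 1) that matches the initial discriminant x2 + 1 = 0; with c = 0.1 it ends at 0.8 x1 + 0.4 x2 + 0.2 = 0 after six presentations.

```python
def sgn(z):
    return 1 if z > 0 else -1

# Augmented patterns (x1, x2, x0=1) and desired responses.
X = [(-1.0, 0.5, 1.0), (-1.0, 1.0, 1.0), (0.5, 1.0, 1.0), (1.0, 0.5, 1.0)]
D = [-1, -1, 1, 1]

w = [0.0, 1.0, 1.0]       # initial weights (w1, w2, w0): discriminant x2 + 1 = 0
c = 0.1                   # learning constant

for step in range(1, 7):                       # six presentations, cycling through x1..x4
    x, d = X[(step - 1) % 4], D[(step - 1) % 4]
    o = sgn(sum(wi * xi for wi, xi in zip(w, x)))
    w = [wi + c * (d - o) * xi for wi, xi in zip(w, x)]
    print(f"step {step}: {w[0]:.1f} x1 + {w[1]:.1f} x2 + {w[2]:.1f} = 0")

# All four points are now classified correctly:
print(all(sgn(sum(wi * xi for wi, xi in zip(w, x))) == d for x, d in zip(X, D)))
```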

The Best Hyperplane Separator?


The perceptron finds one of the many possible hyperplanes separating the data, if one exists. Of the many possible choices, which one is the best?

Intuitively, we want the hyperplane having the maximum margin; a large margin leads to good generalization on the test data.

Linear Machines and Minimum Distance Classification

Consider two clusters of patterns, each cluster belonging to one known category (class). The center points (centers of gravity) of the clusters of classes 1 and 2 are the vectors x1 and x2, respectively. The decision hyperplane contains the midpoint of the line segment connecting the prototype points P1 and P2, and is normal to the vector x1 - x2, which is directed toward P1.

Linear Machines and Minimum Distance Classification


Decision hyperplane equation:

(x1 - x2)^T x + (1/2)(x2^T x2 - x1^T x1) = 0   (3.9)

Hyperplane equation in terms of w and x for an n-dimensional space:

w1 x1 + w2 x2 + . . . + wn xn + wn+1 = 0   (3.10)

The weighting coefficients w1, w2, . . . , wn+1 are obtained by comparing (3.9) and (3.10): w = x1 - x2 (componentwise, wi = x1i - x2i for i = 1, . . . , n) and wn+1 = (1/2)(x2^T x2 - x1^T x1), where x1 and x2 are the class prototype (center) vectors.

Linear Machines and Minimum Distance Classification

Let us assume that a minimum-distance classification is required to classify patterns into one of R categories. Each of the R classes is represented by prototype points P1, P2, . . . , PR, being vectors x1, x2, . . . , xR, respectively. The Euclidean distance between an input pattern x and the prototype pattern vector xi is expressed by the norm of the vector x - xi:

||x - xi|| = sqrt( (x - xi)^T (x - xi) )   (3.12)

A minimum-distance classifier computes the distance from a pattern x of unknown classification to each prototype.

Linear Machines and Minimum Distance Classification


The category number of the closest (smallest-distance) prototype is assigned to the unknown pattern. Calculating the squared distance from Equation (3.12) yields

||x - xi||^2 = x^T x - 2 xi^T x + xi^T xi   (3.13)

The term x^T x is independent of i, so only the R terms 2 xi^T x - xi^T xi, for i = 1, . . . , R, in (3.13) need to be compared; we determine for which xi this term takes the largest of all R values. Choosing the largest of the terms xi^T x - 0.5 xi^T xi is equivalent to choosing the smallest of the distances ||x - xi||. This property is used to equate the highlighted term with a discriminant function gi(x) [remember: hyperplane w = x1 - x2]:

gi(x) = xi^T x - 0.5 xi^T xi,  for i = 1, 2, . . . , R   (3.14)

Also, gi(x) = wi^T x + wi,n+1,  for i = 1, 2, . . . , R   (3.15)

Comparing (3.14) and (3.15), we get

wi = xi  and  wi,n+1 = -0.5 xi^T xi,  for i = 1, 2, . . . , R   (3.16)

Linear Machines and Minimum Distance Classification


Minimum-distance classifiers are linear classifiers, hence they are called linear machines. Since minimum-distance classifiers assign category membership based on the closest match, this is also referred to as correlation classification.

block diagram of a linear machine employing linear discriminant functions as in Equation (3.15)

Linear Machines and Minimum Distance Classification

The decision surface Sij for the contiguous decision regions Ri, Rj is a hyperplane given by the equation

Sij:  gi(x) - gj(x) = 0   (3.17)

Linear Machines and Minimum Distance Classification

Example: the assumed prototype points are as shown; their coordinates are x1 = [10 2]^T, x2 = [2 -5]^T, x3 = [-5 5]^T (values consistent with the computations for the input pattern below). It is assumed that each prototype point index corresponds to its class number.

Linear Machines and Minimum Distance Classification

Using formula (3.16) for n = 2, R = 3, the weight vectors are obtained as

w1 = [10  2  -52]^T,  w2 = [2  -5  -14.5]^T,  w3 = [-5  5  -25]^T

The corresponding linear discriminant functions are:

g1(x) = 10 x1 + 2 x2 - 52
g2(x) = 2 x1 - 5 x2 - 14.5
g3(x) = -5 x1 + 5 x2 - 25   (3.20)

Linear Machines and Minimum Distance Classification The resulting classifier is shown in Figure 3.8(d).

Linear Machines and Minimum Distance Classification


There are three decision lines, S12, S13, and S23. The decision lines can be calculated using condition (3.17) and the discriminant functions (3.20) as:

S12:  8 x1 + 7 x2 - 37.5 = 0
S13:  15 x1 - 3 x2 - 27 = 0
S23:  7 x1 - 10 x2 + 10.5 = 0

Linear Machines and Minimum Distance Classification


For the input pattern x^T = [6 1]:

g1(x) = 60 + 2 - 52 = 10
g2(x) = 12 - 5 - 14.5 = -7.5
g3(x) = -30 + 5 - 25 = -50

The maximum is g1(x), so pattern x belongs to class 1.
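A minimal Python sketch of this linear machine, using the discriminants gi(x) = xi^T x - 0.5 xi^T xi from (3.14) and the prototype points of the example; it reproduces the three values just computed.

```python
# Minimum-distance classifier via the linear discriminants g_i(x) = x_i.x - 0.5 * x_i.x_i   (3.14)
prototypes = {1: (10.0, 2.0), 2: (2.0, -5.0), 3: (-5.0, 5.0)}   # prototype points from the example

def g(xi, x):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return dot(xi, x) - 0.5 * dot(xi, xi)

x = (6.0, 1.0)                                   # input pattern x^T = [6 1]
scores = {i: g(xi, x) for i, xi in prototypes.items()}
print(scores)                                    # {1: 10.0, 2: -7.5, 3: -50.0}
print("class:", max(scores, key=scores.get))     # class 1 (largest discriminant)
```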

Single layer continuous perceptron networks for linearly separable classification


The TLU element with weights will be replaced by the continuous perceptron. There are two objectives: (1) gain finer control over the training procedure, and (2) facilitate working with differentiable characteristics of the threshold element, thus enabling computation of the error gradient, so that the weight modification problem can be solved by the gradient, or steepest descent, procedure.

Single layer continuous perceptron networks


The descent procedure is simple. Starting from an arbitrarily chosen weight vector w, the gradient ∇E(w) of the current error function is computed. The next value of w is obtained by moving in the direction of the negative gradient along the multidimensional error surface. The direction of the negative gradient is the direction of steepest descent. The algorithm can be summarized as:

w(k+1) = w(k) - η ∇E(w(k))   (3.40)

where η is the positive constant called the learning constant and the superscript (k) denotes the step number. The expression for the classification error to be minimized is:

E = (1/2)(d - o)^2   (3.41)

Single layer continuous perceptron networks


The error function has a single minimum at w = wf, which can be reached using negative-gradient descent starting at the initial weight vector w(0).

Single layer continuous perceptron networks


The error minimization algorithm (3.40) requires computation of the gradient of the error (3.41), written as a function of the weights:

E(w) = (1/2) [d - f(w^T x)]^2,  where net = w^T x   (3.42)

The (n + 1)-dimensional gradient vector (3.43) is defined as follows:

∇E(w) = [∂E/∂w1, ∂E/∂w2, . . . , ∂E/∂wn+1]^T   (3.43)

Single layer continuous perceptron networks


Using (3.42), we obtain for the gradient vector

∇E(w) = -(d - o) f'(net) x

and therefore the weight adjustment is Δw = η (d - o) f'(net) x, which is the training rule of the continuous perceptron (the delta rule).

Delta rule for single layer continuous perceptron


The learning rule performs a search within the solution's vector space towards a global minimum.
The error surface itself is a hyperparaboloid, but it is seldom as smooth as typically depicted. In most problems the solution space is quite irregular, with numerous pits and hills which may cause the network to settle into a local minimum (not the best overall solution). Epochs are repeated until a stopping criterion is reached (error magnitude, number of iterations, change of weights, etc.).

Single layer continuous perceptron networks


Let us express f'(net) in terms of the continuous perceptron output. Using the bipolar continuous activation function f(net) of the form

f(net) = 2 / (1 + exp(-net)) - 1

we obtain f'(net) = (1/2)(1 - o^2), where o = f(net).

Single layer continuous perceptron networks


The complete delta training rule for the bipolar continuous activation function results from (3.40) as:

Δw = (1/2) η (d - o)(1 - o^2) x
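As an illustration, a minimal sketch of one training run with this delta rule, assuming the bipolar activation f(net) = 2/(1 + e^{-net}) - 1 (so that f'(net) = (1/2)(1 - o^2)), inputs augmented with x0 = 1, and a small made-up data set; the learning rate and epoch count are arbitrary choices.

```python
import math

def f(net):                       # bipolar continuous activation
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def delta_rule(samples, eta=0.5, epochs=200):
    """samples: list of (x, d) with bipolar targets d in {-1, +1}; x is augmented with x0 = 1 here."""
    n = len(samples[0][0]) + 1
    w = [0.0] * n
    for _ in range(epochs):
        for x, d in samples:
            xa = [1.0] + list(x)
            net = sum(wi * xi for wi, xi in zip(w, xa))
            o = f(net)
            grad = 0.5 * (d - o) * (1.0 - o * o)        # (d - o) f'(net), with f' = 0.5 (1 - o^2)
            w = [wi + eta * grad * xi for wi, xi in zip(w, xa)]
    return w

data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]   # bipolar AND
w = delta_rule(data)
print(w, [f(sum(wi * xi for wi, xi in zip(w, [1] + list(x)))) for x, _ in data])
```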

Perceptron vs. Delta Rule


Perceptron training rule: uses a thresholded unit; converges after a finite number of iterations; the output hypothesis classifies the training data perfectly; linear separability is necessary.
Delta rule: uses an unthresholded linear unit; converges asymptotically toward a minimum-error hypothesis; termination is not guaranteed; linear separability is not necessary.

XOR

A single-layer perceptron cannot solve the XOR problem!

Different Non-Linearly Separable Problems

Structure     | Types of decision regions
Single-layer  | Half plane bounded by a hyperplane
Two-layer     | Convex open or closed regions
Three-layer   | Arbitrary (complexity limited by the number of nodes)

(For each structure, the original figure also illustrates the exclusive-OR problem, classes with meshed regions, and the most general region shapes, using two classes A and B.)

Delta Rule
The perceptron rule fails if the data is not linearly separable; the delta rule converges toward a best-fit approximation. It uses gradient descent to search the hypothesis space.
The discrete perceptron cannot be used here, because it is not differentiable; hence an unthresholded linear unit is appropriate. Error measure:

E(w) = (1/2) Σ_i (d_i - o_i)^2,  summed over the training examples i

Error Surface

To understand gradient descent, it is helpful to visualize the entire hypothesis space, with all possible weight vectors and their associated E values. The axes w0, w1 represent possible values for the two weights of a simple linear unit. For a linear unit, the error surface is parabolic with a single global minimum.

Generalization and Early Stopping


By proper training, a neural network may produce reasonable outputs for inputs not seen during training: this is generalization. Generalization is particularly useful for the analysis of noisy data (e.g. time series). Overtraining will not improve the ability of a neural network to produce good output; on the contrary, it will start to fit the noise as if it were real data and lose its generality.

Generalization and Early Stopping


[Figure: overfitting vs. generalization. Error curves for the learning (training) data set and the validation data set are plotted against the number of iterations in the optimization; the early-stopping area is where the validation error begins to rise.]

Overfitting
With sufficient nodes a network can classify any training set exactly, but it may have poor generalisation ability. Cross-validation with some held-out patterns (typically 30% of the training patterns) is used: the validation-set error is checked each epoch, and training is stopped if the validation error goes up.

Training time
How many epochs of training?
Stop if the error fails to improve (has reached a minimum).
Stop if the rate of improvement drops below a certain level.
Stop if the error reaches an acceptable level.
Stop when a certain number of epochs have passed.
(A code sketch combining these criteria with early stopping follows below.)
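A minimal, framework-agnostic sketch of such a stopping loop, combining the criteria above with the validation-based early stopping described earlier. train_one_epoch and validation_error are hypothetical callables standing in for whatever model and data are being trained; all thresholds are illustrative.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, target_error=1e-3,
                              min_improvement=1e-6, patience=10):
    """train_one_epoch() -> training error; validation_error() -> error on held-out data."""
    best_val, bad_epochs, prev_train = float("inf"), 0, float("inf")
    for epoch in range(max_epochs):                     # stop after a fixed number of epochs
        train_err = train_one_epoch()
        if train_err <= target_error:                   # stop if error reaches an acceptable level
            break
        if prev_train - train_err < min_improvement:    # stop if the rate of improvement is too small
            break
        prev_train = train_err
        val_err = validation_error()                    # validation error is checked each epoch
        if val_err < best_val:
            best_val, bad_epochs = val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                  # validation error keeps going up: early stop
                break
    return epoch

# Tiny demo with synthetic error curves (for illustration only).
errs = iter([0.5, 0.3, 0.2, 0.15, 0.14, 0.139, 0.1389])
vals = iter([0.6, 0.4, 0.35, 0.36, 0.37, 0.38, 0.39])
print(train_with_early_stopping(lambda: next(errs), lambda: next(vals), patience=2))
```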

Limitations of Simple Neural Networks


The Limitations of Perceptrons
(Minsky and Papert, 1969)

Perceptrons are able to form only linear discriminant functions, i.e. they can separate only classes which can be divided by a line or hyperplane. Most functions are more complex, i.e. they are nonlinear or not linearly separable. This crippled research in neural net theory for 15 years...

Linear inseparability
A single-layer perceptron with threshold units fails if the problem is not linearly separable. Example: XOR.

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane; no single line separates the two classes.]

Minsky and Papert's book showing these negative results was very influential.

Solution in 1980s: Multilayer perceptrons


Multilayer perceptrons remove many limitations of single-layer networks; in particular, they can solve XOR.

Exercise: draw a two-layer perceptron that computes the XOR function, with two binary inputs X and Y (0 or 1), one binary output (0 or 1), and one hidden layer. Find the appropriate weights and thresholds.

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane.]

Multilayer Neural Network

[Figure: a multilayer neural network with an input layer (X, Y), a hidden layer of neurons, and an output layer.]

EXAMPLE

Logical XOR function:
X Y | Z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane.]

Two hidden neurons are needed; their combined results can produce a good classification.

Solution in 1980s: Multilayer perceptrons


Two examples of two-layer perceptrons that compute XOR.

Logical XOR function:
x y | z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

Each hidden node realizes one of the lines. Writing (condition) for the 0/1 output of a hidden threshold unit:
Left-side network: the output is 1 if and only if (x + y - 0.5 > 0) - 2 (x + y - 1.5 > 0) - 0.5 > 0.
Right-side network: the output is 1 if and only if x + y - 2 (x + y - 1.5 > 0) - 0.5 > 0.
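The two constructions can be checked directly over the truth table. In the sketch below, step(z) plays the role of a 0/1 threshold unit, so the two functions mirror the formulas above; this is a verification sketch, not code from the slides.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_left(x, y):
    h1 = step(x + y - 0.5)          # OR-like hidden unit
    h2 = step(x + y - 1.5)          # AND-like hidden unit
    return step(h1 - 2 * h2 - 0.5)

def xor_right(x, y):
    h2 = step(x + y - 1.5)          # single hidden unit; the inputs also feed the output directly
    return step(x + y - 2 * h2 - 0.5)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_left(x, y), xor_right(x, y))   # both columns match XOR: 0, 1, 1, 0
```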

EXAMPLE

More complex multilayer networks are needed to solve more difficult problems.

Multilayer Perceptron

[Figure: a multilayer perceptron with input nodes, one or more layers of hidden units (hidden layers), and output neurons.]

The most common output function is the sigmoid (a nonlinear squashing function):

g(a) = 1 / (1 + e^(-a))
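For illustration only, here is a minimal forward pass through such a network, with the sigmoid g(a) = 1/(1 + e^(-a)) applied in both the hidden and output layers; the layer sizes and random weights are assumptions, not values from the slides.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))        # nonlinear squashing function

def layer(weights, biases, inputs):
    """One fully connected layer of sigmoid units."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]   # 2 inputs -> 3 hidden units
b1 = [0.0] * 3
W2 = [[random.uniform(-1, 1) for _ in range(3)]]                     # 3 hidden -> 1 output
b2 = [0.0]

x = [0.5, -1.0]
hidden = layer(W1, b1, x)
output = layer(W2, b2, hidden)
print(hidden, output)
```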

Multilayer Perceptron (MLP)


[Figure: MLP structure. Input signals (external stimuli) enter at the input layer; adjustable weights connect it to the output layer, which produces the output values.]

Types of Layers
The input layer: introduces input values into the network; no activation function or other processing.

The hidden layer(s): perform classification of features; two hidden layers are sufficient to solve any problem; features imply more layers may be better.

The output layer: functionally just like the hidden layers; its outputs are passed on to the world outside the neural network.

Example: Perceptrons as Constraint Satisfaction Networks

[Figures: a sequence of slides showing a two-layer perceptron interpreted as a constraint-satisfaction network over inputs x and y. Each hidden threshold unit tests one linear constraint, (1/2)x - y < 0 (versus > 0) and 2x - y > 0 (versus < 0); the accompanying x-y plots show the corresponding line with the two half-planes labelled "= 1" and "= 0". The output unit, whose weights are initially marked "?", combines the hidden outputs so that the network responds 1 only in the region where the constraints are jointly satisfied.]
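The figures above are only partially legible, so the following is a hedged Python sketch rather than the slides' exact network: it assumes the two hidden threshold units implement the constraints x/2 - y < 0 and 2x - y > 0, and that the output unit simply ANDs them, so the network fires in the wedge between the lines y = x/2 and y = 2x. The specific weights and test points are illustrative.

```python
def step(z):
    return 1 if z > 0 else 0

def region(x, y):
    """Two-layer perceptron as a constraint-satisfaction network (assumed constraints)."""
    c1 = step(y - 0.5 * x)          # satisfied when x/2 - y < 0, i.e. above the line y = x/2
    c2 = step(2 * x - y)            # satisfied when 2x - y > 0, i.e. below the line y = 2x
    return step(c1 + c2 - 1.5)      # AND of the two constraints

for point in [(1.0, 1.0), (1.0, 0.25), (0.5, 2.0), (-1.0, 1.0)]:
    print(point, region(*point))    # 1 only for points between the two lines (here: (1.0, 1.0))
```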
