
Perceptron

Dr Ashutosh Gupta

Single Layer Discrete Perceptron Networks


Outline

Perceptron learning rule
Perceptron training
Learning algorithm
Properties of perceptrons
What does a perceptron do? Regression and classification
Decision boundary
Linear classification via hyperplanes
The perceptron algorithm
How perceptron updates work
Perceptron convergence theorem
Example: a simple problem

Outline (continued)

Hyperplane-based classification
Linear machines and minimum-distance classification
Single-layer continuous perceptron networks for linearly separable classification
Delta rule
Perceptron vs. delta rule
XOR problem
Generalization and early stopping
Overfitting
Training time
Limitations of perceptrons: linear inseparability
Multilayer perceptron
Example: perceptrons as constraint satisfaction networks

Linear threshold unit (LTU)


[Figure: a discrete perceptron (linear threshold unit). Inputs x0 = 1, x1, x2, . . . , xn are weighted by w0, w1, w2, . . . , wn and summed.]

o(x) = +1 if Σ_{i=0..n} w_i x_i > 0, and -1 otherwise

The perceptron takes a vector of real-valued inputs (x1, . . . , xn), weights them with (w1, . . . , wn), and calculates the linear combination of these inputs. w0 denotes a threshold value and x0 is always 1. The unit outputs +1 if the weighted sum is greater than 0, and -1 otherwise.

Representational Power

Many boolean functions can be represented by a perceptron: AND, OR, NAND, NOR. A perceptron represents a hyperplane decision surface in the n-dimensional space of instances. Some sets of examples cannot be separated by any hyperplane; those that can be separated are called linearly separable.
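To make this concrete, here is a minimal Python sketch (not from the slides) of the discrete perceptron output, together with hand-picked weights that realize AND and OR on bipolar inputs; the particular weight values are illustrative assumptions, one of many valid choices.

```python
def ltu(weights, inputs):
    """Discrete perceptron: o(x) = +1 if sum_i w_i x_i > 0, else -1 (x0 = 1 is the bias input)."""
    total = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))  # prepend x0 = 1
    return 1 if total > 0 else -1

# Illustrative weights (w0, w1, w2); inputs are bipolar (-1 = false, +1 = true).
AND_W = (-1.0, 1.0, 1.0)   # fires only when both inputs are +1
OR_W  = ( 1.0, 1.0, 1.0)   # fires when at least one input is +1

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, "AND:", ltu(AND_W, (a, b)), "OR:", ltu(OR_W, (a, b)))
```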

Perceptron Learning Rule


Problem: determine a weight vector w that causes the perceptron to produce the correct output for each training example.

Perceptron training rule (weight adjustment):
Δwi = c [di - sgn(wi^T x)] x = c [di - oi] x
wi ← wi + Δwi
where di is the desired response, oi is the perceptron output, and c is a small constant (e.g. 0.1) called the learning rate.

Perceptron Learning Rule


Algorithm:
1. Initialize w to random weights.
2. Repeat, until each training example is classified correctly:
   (a) apply the perceptron training rule to each training example.
If the output is correct (di = oi), the weights wi are not changed. If the output is incorrect (di ≠ oi), the weights wi are changed such that the output of the perceptron for the new weights is closer to di. The algorithm converges to the correct classification if the training data is linearly separable and c is sufficiently small.
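A minimal sketch of this training loop in Python, assuming bipolar targets d in {-1, +1}, inputs augmented with x0 = 1, and a made-up toy data set; it is a direct transcription of the rule Δwi = c [di - oi] x, not a production implementation.

```python
import random

def sgn(z):
    return 1 if z > 0 else -1

def train_perceptron(samples, c=0.1, max_epochs=100):
    """samples: list of (x, d) with x a tuple of features and d in {-1, +1}."""
    n = len(samples[0][0]) + 1                         # +1 for the bias weight w0
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]  # 1. initialize to random weights
    for _ in range(max_epochs):                        # 2. repeat ...
        errors = 0
        for x, d in samples:
            xa = [1.0] + list(x)                       # augment with x0 = 1
            o = sgn(sum(wi * xi for wi, xi in zip(w, xa)))
            if o != d:                                 # (a) apply the training rule
                w = [wi + c * (d - o) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                                # ... until every example is correct
            break
    return w

# Example: a linearly separable toy set (bipolar AND).
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
print(train_perceptron(data))
```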

Supervised Learning
Training and test data sets. The training set consists of inputs and targets.

Perceptron Training

Output = 1 if Σ_{i=0..n} w_i x_i > t, and 0 otherwise

A linear threshold is used: W are the weight values and t is the threshold value.

Simple network

[Figure: a simple network. A constant input 1 and inputs X and Y feed a linear threshold unit; the connection weights shown are W = 1.5 and W = 1, and the threshold is t = 0.0.]

Output = 1 if Σ_{i=0..n} w_i x_i > t, and 0 otherwise

Training Perceptrons
[Figure: a network with a constant input 1 and inputs x and y, each connected by an unknown weight W = ? to a threshold unit with t = 0.0.]

For AND:
A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

What are the weight values? Initialize with random weight values.

Learning algorithm
Epoch: one presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1.

Learning algorithm
Target value, T: when training a network we present not only the input but also the value that we require the network to produce. For example, if we present the network with [1,1] for the AND function, the target value will be 1.
Output, O: the output value from the neuron.
Ij: the inputs being presented to the neuron.
Wj: the weight from input neuron Ij to the output neuron.
LR: the learning rate. This dictates how quickly the network converges. It is set by experimentation; typically 0.1.

Properties of Perceptrons
Separability: some parameter settings get the training set perfectly correct.
Convergence: if the training data is separable, the perceptron will eventually converge (binary case).
Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability.

What Does a Perceptron Do?


Regression: y = w x + w0. Classification: y = 1(w x + w0 > 0), i.e. the output is 1 on one side of the hyperplane and 0 on the other.

Hyperplane

A hyperplane separates a D-dimensional space into two half-spaces. It is defined by an outward-pointing normal vector w, and w is orthogonal to any vector lying on the hyperplane.
Assumption: the hyperplane passes through the origin. If not, we add a bias term b; we then need both w and b to define it. b > 0 means moving the hyperplane parallel to itself along w (b < 0 means moving in the opposite direction).

Decision boundaries

In simple cases, we divide the feature space by drawing a hyperplane across it, known as a decision boundary. A discriminant function returns different values on opposite sides of the boundary (a straight line in two dimensions). Problems that can be classified in this way are linearly separable.

Linear Classification via Hyperplanes


Linear classifiers represent the decision boundary by a hyperplane.

[Figure: a hyperplane (decision boundary / surface) with normal vector w separating decision region R1 from decision region R2.]

For binary classification, w is assumed to point towards the positive class. Classification rule:

w^T x + b > 0  =>  y = +1
w^T x + b < 0  =>  y = -1

Question: what about the points x for which w^T x + b = 0?
Goal: learn the hyperplane (w, b) using the training data, i.e. find the equation of the decision boundary w^T x + b = 0.

Concept of Margins

The geometric margin gn of an example xn is its distance from the hyperplane. The geometric margin may be positive (if yn = +1) or negative (if yn = -1). The margin of a set {x1, . . . , xN} is the minimum absolute geometric margin.

The functional margin of a training example is yn(w^T xn + b):
positive if the prediction is correct;
negative if the prediction is incorrect.

The absolute value of the functional margin measures the confidence in the predicted label (or the misconfidence if the prediction is wrong): a large margin means high confidence.
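As a small illustration, the two margins can be computed as follows. The sketch assumes the standard formula that the geometric margin is the signed distance (w^T x + b)/||w||, which the slide states in words but not symbols; the weights and points are made up.

```python
import math

def functional_margin(w, b, x, y):
    """y * (w.x + b): positive iff the prediction is correct; magnitude = confidence."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))

w, b = [2.0, 1.0], -1.0
points = [([1.0, 1.0], +1), ([0.0, 0.0], -1)]
print([functional_margin(w, b, x, y) for x, y in points])      # both positive: correct predictions
print(min(abs(geometric_margin(w, b, x)) for x, _ in points))  # margin of the set
```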

The Perceptron Algorithm


One of the earliest algorithms for linear classification (Rosenblatt, 1958). Based on finding a separating hyperplane of the data. Guaranteed to find a separating hyperplane if the data is linearly separable.

If the data is not linearly separable: make it linearly separable, or use a combination of multiple perceptrons (neural networks).

Cycles through the training data, processing training examples one at a time (an online algorithm). Starts with some initialization for (w, b) (e.g., w = [0, . . . , 0], b = 0). It is an iterative, mistake-driven learning algorithm for updating (w, b): don't update if w correctly predicts the label of the current training example; update w when it mispredicts the label of the current training example (the true label is +1 but sign(w^T x + b) = -1, or vice versa).

The Perceptron Algorithm

Batch vs online learning algorithms:
Batch algorithms operate on the entire training data. Online algorithms can process one example at a time; they are usually more efficient (computationally and memory-footprint-wise) than batch algorithms. Often batch problems can be solved using online learning.

The Perceptron Algorithm: Formally

Given: a sequence of N training examples {(x1, y1), . . . , (xN, yN)}
Initialize: w = [0, . . . , 0], b = 0
Repeat until convergence:
  For n = 1, . . . , N:
    if sign(w^T xn + b) ≠ yn (i.e., a mistake is made):
      w = w + yn xn
      b = b + yn

Stopping condition: stop when either
- all training examples are classified correctly (may overfit, so less common in practice),
- a fixed number of iterations is completed, or some convergence criterion is met, or
- one pass over the data is completed (each example seen once), e.g. when examples arrive in a streaming fashion and cannot be stored in memory (more passes are just not possible).

Note: sign(w^T xn + b) ≠ yn is equivalent to yn(w^T xn + b) < 0.
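A direct, minimal transcription of the algorithm into Python; the toy data set, the bounded number of passes, and the tie-breaking convention for a score of exactly zero are assumptions, not part of the slides.

```python
def sign(z):
    return 1 if z > 0 else -1      # convention: a score of exactly 0 counts as the negative class

def perceptron(examples, max_passes=100):
    """examples: list of (x, y) with x a tuple of features and y in {-1, +1}."""
    d = len(examples[0][0])
    w, b = [0.0] * d, 0.0                            # Initialize: w = [0, ..., 0], b = 0
    for _ in range(max_passes):                      # "Repeat until convergence" (bounded passes here)
        mistakes = 0
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if sign(score) != y:                     # a mistake is made
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w = w + yn xn
                b += y                                      # b = b + yn
                mistakes += 1
        if mistakes == 0:                            # all training examples classified correctly
            break
    return w, b

# Toy linearly separable data: the label is the sign of the first feature.
data = [((2.0, 1.0), +1), ((1.0, -1.0), +1), ((-1.5, 0.5), -1), ((-2.0, -1.0), -1)]
print(perceptron(data))
```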

Why Perceptron Updates Work?


Let's look at a misclassified positive example: the perceptron (wrongly) thinks w_old^T xn + b_old < 0. The updates (with yn = +1) would be:

w_new = w_old + yn xn = w_old + xn
b_new = b_old + yn = b_old + 1

Then

w_new^T xn + b_new = (w_old + xn)^T xn + b_old + 1 = (w_old^T xn + b_old) + xn^T xn + 1

Thus w_new^T xn + b_new is less negative than w_old^T xn + b_old, so we are making ourselves more correct on this example!

Why Perceptron Updates Work (Pictorially)?

Why Perceptron Updates Work?


Let's look at a misclassified negative example: the perceptron (wrongly) thinks w_old^T xn + b_old > 0. The updates (with yn = -1) would be:

w_new = w_old + yn xn = w_old - xn
b_new = b_old + yn = b_old - 1

Then

w_new^T xn + b_new = (w_old - xn)^T xn + b_old - 1 = (w_old^T xn + b_old) - xn^T xn - 1

Thus w_new^T xn + b_new is less positive than w_old^T xn + b_old, so we are making ourselves more correct on this example!
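A quick numeric check of this argument, with made-up numbers: after one update on a misclassified negative example, w^T x + b decreases by exactly x^T x + 1.

```python
# Misclassified negative example: y = -1 but w_old.x + b_old > 0.
w_old, b_old = [1.0, 2.0], 0.5
x, y = [2.0, 1.0], -1

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

before = dot(w_old, x) + b_old                     # 1*2 + 2*1 + 0.5 = 4.5  (wrongly positive)
w_new = [wi + y * xi for wi, xi in zip(w_old, x)]  # w - x  = [-1.0, 1.0]
b_new = b_old + y                                  # b - 1  = -0.5
after = dot(w_new, x) + b_new                      # -2 + 1 - 0.5 = -1.5

print(before, after, before - after, dot(x, x) + 1)  # the decrease equals x.x + 1 = 6.0
```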

Why Perceptron Updates Work (Pictorially)?

Perceptron convergence theorem


The perceptron convergence theorem states that if the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates. The number of updates depends on the data set and also on the step-size parameter. If the data is not linearly separable, there will be oscillation (which can be detected automatically).

Example: a simple problem


Four points, linearly separable:

X1 = (-1, 1/2),  d1 = -1
X2 = (-1, 1),    d2 = -1
X3 = (1/2, 1),   d3 = +1
X4 = (1, 1/2),   d4 = +1

[Figure: the four points plotted in the x1-x2 plane.]

Perceptron learning rule: Δwi = c [di - sgn(wi^T x)] x = c [di - oi] x, with wi ← wi + Δwi, where di is the desired response, oi is the perceptron output, and c is a small constant (e.g. 0.1) called the learning rate.
Initial weights

Initial conditions: learning constant c = 0.1; the training vectors x1, x2, x3, x4 are augmented with x0 = 1; initial weight vector w(0) = (w1, w2, w0) = (0, 1, 1).

Initial discriminant function: w1 x1 + w2 x2 + w0 = 0  =>  x2 + 1 = 0, i.e. x2 = -1

Step 1:  =>  0.2 x1 + 0.9 x2 + 0.8 = 0.  Point (-1, 1/2) is still misclassified!

[Figure: the initial discriminant function and the step 1 modification of the discriminant function in the x1-x2 plane.]

Step 2:  =>  0.4 x1 + 0.7 x2 + 0.6 = 0.  Point (-1, 1/2) is still misclassified!

Step 3:  =>  0.4 x1 + 0.7 x2 + 0.6 = 0  (no change in the discriminant function).

Step 4:  =>  0.4 x1 + 0.7 x2 + 0.6 = 0  (no change in the discriminant function).

Step 5:  =>  0.6 x1 + 0.6 x2 + 0.4 = 0.  Point (-1, 1/2) is still misclassified!

Step 6:  =>  0.8 x1 + 0.4 x2 + 0.2 = 0  (correct classification).

[Figure: the discriminant function after each modification, from the initial line to the final one.]

In the 6th step all points are correctly (linearly) classified; 0.8 x1 + 0.4 x2 + 0.2 = 0 is the final discriminant function.
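The six steps above can be reproduced in a few lines of Python. This sketch assumes the patterns are presented cyclically in the order x1, x2, x3, x4 and starts from the augmented weight vector (w1, w2, w0) = (0, 1, 1) that matches the initial discriminant x2 + 1 = 0; with c = 0.1 it ends at 0.8 x1 + 0.4 x2 + 0.2 = 0 after six presentations.

```python
def sgn(z):
    return 1 if z > 0 else -1

# Augmented patterns (x1, x2, x0=1) and desired responses.
X = [(-1.0, 0.5, 1.0), (-1.0, 1.0, 1.0), (0.5, 1.0, 1.0), (1.0, 0.5, 1.0)]
D = [-1, -1, 1, 1]

w = [0.0, 1.0, 1.0]       # initial weights (w1, w2, w0): discriminant x2 + 1 = 0
c = 0.1                   # learning constant

for step in range(1, 7):                       # six presentations, cycling through x1..x4
    x, d = X[(step - 1) % 4], D[(step - 1) % 4]
    o = sgn(sum(wi * xi for wi, xi in zip(w, x)))
    w = [wi + c * (d - o) * xi for wi, xi in zip(w, x)]
    print(f"step {step}: {w[0]:.1f} x1 + {w[1]:.1f} x2 + {w[2]:.1f} = 0")

# All four points are now classified correctly:
print(all(sgn(sum(wi * xi for wi, xi in zip(w, x))) == d for x, d in zip(X, D)))
```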

The Best Hyperplane Separator?


The perceptron finds one of the many possible hyperplanes separating the data, if one exists. Of the many possible choices, which one is the best?

Intuitively, we want the hyperplane having the maximum margin; a large margin leads to good generalization on the test data.

Linear Machines and Minimum Distance Classification

Consider two clusters of patterns, each cluster belonging to one known category (class). The center points (centers of gravity) of the clusters of classes 1 and 2 are the vectors x1 and x2, respectively. The decision hyperplane contains the midpoint of the line segment connecting the prototype points P1 and P2, and is normal to the vector x1 - x2, which is directed toward P1.

Linear Machines and Minimum Distance Classification


Decision hyperplane equation:

(x1 - x2)^T x + (1/2)(x2^T x2 - x1^T x1) = 0   (3.9)

Hyperplane equation in terms of w and x for an n-dimensional space:

w1 x1 + w2 x2 + . . . + wn xn + wn+1 = 0   (3.10)

The weighting coefficients w1, w2, . . . , wn+1 are obtained by comparing (3.9) and (3.10): w = x1 - x2 (componentwise, wi = x1i - x2i for i = 1, . . . , n) and wn+1 = (1/2)(x2^T x2 - x1^T x1), where x1 and x2 are the class prototype (center) vectors.

Linear Machines and Minimum Distance Classification

Let us assume that a minimum-distance classification is required to classify patterns into one of R categories. Each of the R classes is represented by prototype points P1, P2, . . . , PR, being vectors x1, x2, . . . , xR, respectively. The Euclidean distance between an input pattern x and the prototype pattern vector xi is expressed by the norm of the vector x - xi:

||x - xi|| = sqrt( (x - xi)^T (x - xi) )   (3.12)

A minimum-distance classifier computes the distance from a pattern x of unknown classification to each prototype.

Linear Machines and Minimum Distance Classification


The category number of the closest (smallest-distance) prototype is assigned to the unknown pattern. Calculating the squared distance from Equation (3.12) yields

||x - xi||^2 = x^T x - 2 xi^T x + xi^T xi   (3.13)

The term x^T x is independent of i, so only the R terms 2 xi^T x - xi^T xi, for i = 1, . . . , R, in (3.13) need to be compared; we determine for which xi this term takes the largest of all R values. Choosing the largest of the terms xi^T x - 0.5 xi^T xi is equivalent to choosing the smallest of the distances ||x - xi||. This property is used to equate the highlighted term with a discriminant function gi(x) [remember: hyperplane w = x1 - x2]:

gi(x) = xi^T x - 0.5 xi^T xi,  for i = 1, 2, . . . , R   (3.14)

Also, gi(x) = wi^T x + wi,n+1,  for i = 1, 2, . . . , R   (3.15)

Comparing (3.14) and (3.15), we get

wi = xi  and  wi,n+1 = -0.5 xi^T xi,  for i = 1, 2, . . . , R   (3.16)

Linear Machines and Minimum Distance Classification


Minimum-distance classifiers are linear classifiers, hence they are called linear machines. Since minimum-distance classifiers assign category membership based on the closest match, this is also referred to as correlation classification.

block diagram of a linear machine employing linear discriminant functions as in Equation (3.15)

Linear Machines and Minimum Distance Classification

The decision surface Sij for the contiguous decision regions Ri, Rj is a hyperplane given by the equation

Sij:  gi(x) - gj(x) = 0   (3.17)

Linear Machines and Minimum Distance Classification

Example: the assumed prototype points are as shown; their coordinates are x1 = [10 2]^T, x2 = [2 -5]^T, x3 = [-5 5]^T (values consistent with the computations for the input pattern below). It is assumed that each prototype point index corresponds to its class number.

Linear Machines and Minimum Distance Classification

Using formula (3.16) for n = 2, R = 3, the weight vectors are obtained as

w1 = [10  2  -52]^T,  w2 = [2  -5  -14.5]^T,  w3 = [-5  5  -25]^T

The corresponding linear discriminant functions are:

g1(x) = 10 x1 + 2 x2 - 52
g2(x) = 2 x1 - 5 x2 - 14.5
g3(x) = -5 x1 + 5 x2 - 25   (3.20)

Linear Machines and Minimum Distance Classification The resulting classifier is shown in Figure 3.8(d).

Linear Machines and Minimum Distance Classification


There are three decision lines, S12, S13, and S23. The decision lines can be calculated using condition (3.17) and the discriminant functions (3.20) as:

S12:  8 x1 + 7 x2 - 37.5 = 0
S13:  15 x1 - 3 x2 - 27 = 0
S23:  7 x1 - 10 x2 + 10.5 = 0

Linear Machines and Minimum Distance Classification


For the input pattern x^T = [6 1]:

g1(x) = 60 + 2 - 52 = 10
g2(x) = 12 - 5 - 14.5 = -7.5
g3(x) = -30 + 5 - 25 = -50

The maximum is g1(x), so pattern x belongs to class 1.
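A minimal Python sketch of this linear machine, using the discriminants gi(x) = xi^T x - 0.5 xi^T xi from (3.14) and the prototype points of the example; it reproduces the three values just computed.

```python
# Minimum-distance classifier via the linear discriminants g_i(x) = x_i.x - 0.5 * x_i.x_i   (3.14)
prototypes = {1: (10.0, 2.0), 2: (2.0, -5.0), 3: (-5.0, 5.0)}   # prototype points from the example

def g(xi, x):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return dot(xi, x) - 0.5 * dot(xi, xi)

x = (6.0, 1.0)                                   # input pattern x^T = [6 1]
scores = {i: g(xi, x) for i, xi in prototypes.items()}
print(scores)                                    # {1: 10.0, 2: -7.5, 3: -50.0}
print("class:", max(scores, key=scores.get))     # class 1 (largest discriminant)
```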

Single layer continuous perceptron networks for linearly separable classification


The TLU element with weights will be replaced by the continuous perceptron. There are two objectives: (1) gain finer control over the training procedure, and (2) facilitate working with differentiable characteristics of the threshold element, thus enabling computation of the error gradient, so that the weight modification problem can be solved by the gradient, or steepest descent, procedure.

Single layer continuous perceptron networks


The descent procedure is simple. Starting from an arbitrarily chosen weight vector w, the gradient ∇E(w) of the current error function is computed. The next value of w is obtained by moving in the direction of the negative gradient along the multidimensional error surface. The direction of the negative gradient is the direction of steepest descent. The algorithm can be summarized as:

w(k+1) = w(k) - η ∇E(w(k))   (3.40)

where η is the positive constant called the learning constant and the superscript (k) denotes the step number. The expression for the classification error to be minimized is:

E = (1/2)(d - o)^2   (3.41)

Single layer continuous perceptron networks


The error function has a single minimum at w = wf, which can be reached using negative-gradient descent starting at the initial weight vector w(0).

Single layer continuous perceptron networks


The error minimization algorithm (3.40) requires computation of the gradient of the error (3.41), written as a function of the weights:

E(w) = (1/2) [d - f(w^T x)]^2,  where net = w^T x   (3.42)

The (n + 1)-dimensional gradient vector (3.43) is defined as follows:

∇E(w) = [∂E/∂w1, ∂E/∂w2, . . . , ∂E/∂wn+1]^T   (3.43)

Single layer continuous perceptron networks


Using (3.42), we obtain for the gradient vector

∇E(w) = -(d - o) f'(net) x

and therefore the weight adjustment is Δw = η (d - o) f'(net) x, which is the training rule of the continuous perceptron (the delta rule).

Delta rule for single layer continuous perceptron


The learning rule performs a search within the solution's vector space towards a global minimum.
The error surface itself is a hyperparaboloid, but it is seldom as smooth as typically depicted. In most problems the solution space is quite irregular, with numerous pits and hills which may cause the network to settle into a local minimum (not the best overall solution). Epochs are repeated until a stopping criterion is reached (error magnitude, number of iterations, change of weights, etc.).

Single layer continuous perceptron networks


Let us express f'(net) in terms of the continuous perceptron output. Using the bipolar continuous activation function f(net) of the form

f(net) = 2 / (1 + exp(-net)) - 1

we obtain f'(net) = (1/2)(1 - o^2), where o = f(net).

Single layer continuous perceptron networks


The complete delta training rule for the bipolar continuous activation function results from (3.40) as:

Δw = (1/2) η (d - o)(1 - o^2) x
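As an illustration, a minimal sketch of one training run with this delta rule, assuming the bipolar activation f(net) = 2/(1 + e^{-net}) - 1 (so that f'(net) = (1/2)(1 - o^2)), inputs augmented with x0 = 1, and a small made-up data set; the learning rate and epoch count are arbitrary choices.

```python
import math

def f(net):                       # bipolar continuous activation
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def delta_rule(samples, eta=0.5, epochs=200):
    """samples: list of (x, d) with bipolar targets d in {-1, +1}; x is augmented with x0 = 1 here."""
    n = len(samples[0][0]) + 1
    w = [0.0] * n
    for _ in range(epochs):
        for x, d in samples:
            xa = [1.0] + list(x)
            net = sum(wi * xi for wi, xi in zip(w, xa))
            o = f(net)
            grad = 0.5 * (d - o) * (1.0 - o * o)        # (d - o) f'(net), with f' = 0.5 (1 - o^2)
            w = [wi + eta * grad * xi for wi, xi in zip(w, xa)]
    return w

data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]   # bipolar AND
w = delta_rule(data)
print(w, [f(sum(wi * xi for wi, xi in zip(w, [1] + list(x)))) for x, _ in data])
```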

Perceptron vs. Delta Rule


Perceptron training rule: uses a thresholded unit; converges after a finite number of iterations; the output hypothesis classifies the training data perfectly; linear separability is necessary.
Delta rule: uses an unthresholded linear unit; converges asymptotically toward a minimum-error hypothesis; termination is not guaranteed; linear separability is not necessary.

XOR

A single-layer perceptron cannot solve the XOR problem!

Different Non-Linearly Separable Problems

Structure     | Types of decision regions
Single-layer  | Half plane bounded by a hyperplane
Two-layer     | Convex open or closed regions
Three-layer   | Arbitrary (complexity limited by the number of nodes)

(For each structure, the original figure also illustrates the exclusive-OR problem, classes with meshed regions, and the most general region shapes, using two classes A and B.)

Delta Rule
The perceptron rule fails if the data is not linearly separable; the delta rule converges toward a best-fit approximation. It uses gradient descent to search the hypothesis space.
The discrete perceptron cannot be used here, because it is not differentiable; hence an unthresholded linear unit is appropriate. Error measure:

E(w) = (1/2) Σ_i (d_i - o_i)^2,  summed over the training examples i

Error Surface

To understand gradient descent, it is helpful to visualize the entire hypothesis space, with all possible weight vectors and their associated E values. The axes w0, w1 represent possible values for the two weights of a simple linear unit. For a linear unit, the error surface is parabolic with a single global minimum.

Generalization and Early Stopping


By proper training, a neural network may produce reasonable outputs for inputs not seen during training: this is generalization. Generalization is particularly useful for the analysis of noisy data (e.g. time series). Overtraining will not improve the ability of a neural network to produce good output; on the contrary, it will start to fit the noise as if it were real data and lose its generality.

Generalization and Early Stopping


[Figure: overfitting vs. generalization. Error curves for the learning (training) data set and the validation data set are plotted against the number of iterations in the optimization; the early-stopping area is where the validation error begins to rise.]

Overfitting
With sufficient nodes a network can classify any training set exactly, but it may have poor generalisation ability. Cross-validation with some held-out patterns (typically 30% of the training patterns) is used: the validation-set error is checked each epoch, and training is stopped if the validation error goes up.

Training time
How many epochs of training?
Stop if the error fails to improve (has reached a minimum).
Stop if the rate of improvement drops below a certain level.
Stop if the error reaches an acceptable level.
Stop when a certain number of epochs have passed.
(A code sketch combining these criteria with early stopping follows below.)
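A minimal, framework-agnostic sketch of such a stopping loop, combining the criteria above with the validation-based early stopping described earlier. train_one_epoch and validation_error are hypothetical callables standing in for whatever model and data are being trained; all thresholds are illustrative.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, target_error=1e-3,
                              min_improvement=1e-6, patience=10):
    """train_one_epoch() -> training error; validation_error() -> error on held-out data."""
    best_val, bad_epochs, prev_train = float("inf"), 0, float("inf")
    for epoch in range(max_epochs):                     # stop after a fixed number of epochs
        train_err = train_one_epoch()
        if train_err <= target_error:                   # stop if error reaches an acceptable level
            break
        if prev_train - train_err < min_improvement:    # stop if the rate of improvement is too small
            break
        prev_train = train_err
        val_err = validation_error()                    # validation error is checked each epoch
        if val_err < best_val:
            best_val, bad_epochs = val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                  # validation error keeps going up: early stop
                break
    return epoch

# Tiny demo with synthetic error curves (for illustration only).
errs = iter([0.5, 0.3, 0.2, 0.15, 0.14, 0.139, 0.1389])
vals = iter([0.6, 0.4, 0.35, 0.36, 0.37, 0.38, 0.39])
print(train_with_early_stopping(lambda: next(errs), lambda: next(vals), patience=2))
```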

Limitations of Simple Neural Networks


The Limitations of Perceptrons
(Minsky and Papert, 1969)

Perceptrons are able to form only linear discriminant functions, i.e. they can separate only classes which can be divided by a line or hyperplane. Most functions are more complex, i.e. they are nonlinear or not linearly separable. This crippled research in neural net theory for 15 years...

Linear inseparability
A single-layer perceptron with threshold units fails if the problem is not linearly separable. Example: XOR.

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane; no single line separates the two classes.]

Minsky and Papert's book showing these negative results was very influential.

Solution in 1980s: Multilayer perceptrons


Multilayer perceptrons remove many limitations of single-layer networks; in particular, they can solve XOR.

Exercise: draw a two-layer perceptron that computes the XOR function, with two binary inputs X and Y (0 or 1), one binary output (0 or 1), and one hidden layer. Find the appropriate weights and thresholds.

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane.]

Multilayer Neural Network

[Figure: a multilayer neural network with an input layer (X, Y), a hidden layer of neurons, and an output layer.]

EXAMPLE

Logical XOR function:
X Y | Z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane.]

Two hidden neurons are needed; their combined results can produce a good classification.

Solution in 1980s: Multilayer perceptrons


Two examples of two-layer perceptrons that compute XOR.

Logical XOR function:
x y | z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

Each hidden node realizes one of the lines. Writing (condition) for the 0/1 output of a hidden threshold unit:
Left-side network: the output is 1 if and only if (x + y - 0.5 > 0) - 2 (x + y - 1.5 > 0) - 0.5 > 0.
Right-side network: the output is 1 if and only if x + y - 2 (x + y - 1.5 > 0) - 0.5 > 0.
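The two constructions can be checked directly over the truth table. In the sketch below, step(z) plays the role of a 0/1 threshold unit, so the two functions mirror the formulas above; this is a verification sketch, not code from the slides.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_left(x, y):
    h1 = step(x + y - 0.5)          # OR-like hidden unit
    h2 = step(x + y - 1.5)          # AND-like hidden unit
    return step(h1 - 2 * h2 - 0.5)

def xor_right(x, y):
    h2 = step(x + y - 1.5)          # single hidden unit; the inputs also feed the output directly
    return step(x + y - 2 * h2 - 0.5)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_left(x, y), xor_right(x, y))   # both columns match XOR: 0, 1, 1, 0
```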

EXAMPLE

More complex multilayer networks are needed to solve more difficult problems.

Multilayer Perceptron

[Figure: a multilayer perceptron with input nodes, one or more layers of hidden units (hidden layers), and output neurons.]

The most common output function is the sigmoid (a nonlinear squashing function):

g(a) = 1 / (1 + e^(-a))
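For illustration only, here is a minimal forward pass through such a network, with the sigmoid g(a) = 1/(1 + e^(-a)) applied in both the hidden and output layers; the layer sizes and random weights are assumptions, not values from the slides.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))        # nonlinear squashing function

def layer(weights, biases, inputs):
    """One fully connected layer of sigmoid units."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]   # 2 inputs -> 3 hidden units
b1 = [0.0] * 3
W2 = [[random.uniform(-1, 1) for _ in range(3)]]                     # 3 hidden -> 1 output
b2 = [0.0]

x = [0.5, -1.0]
hidden = layer(W1, b1, x)
output = layer(W2, b2, hidden)
print(hidden, output)
```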

Multilayer Perceptron (MLP)


[Figure: MLP structure. Input signals (external stimuli) enter at the input layer; adjustable weights connect it to the output layer, which produces the output values.]

Types of Layers
The input layer: introduces input values into the network; no activation function or other processing.

The hidden layer(s): perform classification of features; two hidden layers are sufficient to solve any problem; features imply more layers may be better.

The output layer: functionally just like the hidden layers; its outputs are passed on to the world outside the neural network.

Example: Perceptrons as Constraint Satisfaction Networks

[Figures: a sequence of slides showing a two-layer perceptron interpreted as a constraint-satisfaction network over inputs x and y. Each hidden threshold unit tests one linear constraint, (1/2)x - y < 0 (versus > 0) and 2x - y > 0 (versus < 0); the accompanying x-y plots show the corresponding line with the two half-planes labelled "= 1" and "= 0". The output unit, whose weights are initially marked "?", combines the hidden outputs so that the network responds 1 only in the region where the constraints are jointly satisfied.]
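The figures above are only partially legible, so the following is a hedged Python sketch rather than the slides' exact network: it assumes the two hidden threshold units implement the constraints x/2 - y < 0 and 2x - y > 0, and that the output unit simply ANDs them, so the network fires in the wedge between the lines y = x/2 and y = 2x. The specific weights and test points are illustrative.

```python
def step(z):
    return 1 if z > 0 else 0

def region(x, y):
    """Two-layer perceptron as a constraint-satisfaction network (assumed constraints)."""
    c1 = step(y - 0.5 * x)          # satisfied when x/2 - y < 0, i.e. above the line y = x/2
    c2 = step(2 * x - y)            # satisfied when 2x - y > 0, i.e. below the line y = 2x
    return step(c1 + c2 - 1.5)      # AND of the two constraints

for point in [(1.0, 1.0), (1.0, 0.25), (0.5, 2.0), (-1.0, 1.0)]:
    print(point, region(*point))    # 1 only for points between the two lines (here: (1.0, 1.0))
```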
