Outline
- Linear machines and minimum distance classification
- Single-layer continuous perceptron networks for linearly separable classification
- Delta rule
- Perceptron vs. delta rule
- XOR problem
- Generalization and early stopping
- Overfitting
- Training time
Discrete Perceptron
(figure: a discrete perceptron with inputs x0 = 1, x1, x2, ..., xn, weights w0, w1, w2, ..., wn, and a threshold output unit)

o(x) = 1 if Σ(i=0..n) wi xi > 0, and -1 otherwise
A discrete perceptron:
- takes a vector of real-valued inputs (x1, ..., xn) weighted with (w1, ..., wn)
- calculates the linear combination of these inputs
- w0 denotes a threshold value; x0 is always 1
- outputs 1 if the result is greater than 0, otherwise -1
Representational Power
- Many boolean functions can be represented by a perceptron: AND, OR, NAND, NOR.
- A perceptron represents a hyperplane decision surface in the n-dimensional space of instances.
- Some sets of examples cannot be separated by any hyperplane; those that can be separated are called linearly separable.
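As a concrete illustration of this representational power, here is a minimal sketch of a perceptron computing AND; the weights are hand-picked and are just one valid choice among many.

```python
def perceptron(x, w):
    """Discrete perceptron: output 1 if w.x > 0, else -1 (x is augmented with x0 = 1)."""
    s = sum(wi * xi for wi, xi in zip(w, (1,) + x))  # w[0] is the threshold weight w0
    return 1 if s > 0 else -1

# One possible weight choice realizing AND: fires only when both inputs are 1.
w_and = [-1.5, 1.0, 1.0]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w_and))  # -> -1, -1, -1, 1
```

OR is the same structure with a lower threshold, e.g. w = [-0.5, 1.0, 1.0].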
Supervised Learning
- Training and test data sets
- Training set: input & target
Perceptron Training
Simple network:

(figure: a one-unit network with threshold t = 0.0)

Output = 1 if Σ(i=0..n) wi xi > t, and 0 otherwise
Training Perceptrons
(figure: a two-input perceptron, inputs x and y, with unknown weights W=? and threshold t = 0.0)

For AND:
A B | Output
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

What are the weight values? Initialize with random weight values.
Learning algorithm
Epoch: presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1.
Target value, T: when training a network we present it not only with the input but also with the value we require it to produce. For example, if we present the network with [1,1] for the AND function, the training value will be 1.
Output, O: the output value from the neuron.
Ij: the inputs being presented to the neuron.
Wj: the weight from input neuron Ij to the output neuron.
LR: the learning rate. This dictates how quickly the network converges. It is set by experimentation, typically to 0.1.
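These quantities combine into the weight update Wj = Wj + LR * (T - O) * Ij, the rule the definitions above support. A minimal sketch of that update, assuming 0/1 outputs and a bias input fixed at 1 (both choices of mine, not fixed by the definitions):

```python
LR = 0.1  # learning rate, as suggested above

def step(s, t=0.0):
    """Threshold unit: 1 if the weighted sum exceeds t, else 0."""
    return 1 if s > t else 0

def train_epoch(w, samples):
    """One epoch: present every (inputs, target) pair once, updating the weights."""
    for I, T in samples:
        O = step(sum(wj * ij for wj, ij in zip(w, I)))        # current output
        w = [wj + LR * (T - O) * ij for wj, ij in zip(w, I)]  # Wj += LR*(T-O)*Ij
    return w

# AND, with a constant bias input of 1 so the threshold is learned as a weight.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(20):  # a handful of epochs suffices for AND
    w = train_epoch(w, data)
print(w)
```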
Properties of Perceptrons
- Separability: some parameter setting classifies the training set perfectly.
- Convergence: if the training set is separable, the perceptron will eventually converge (binary case).
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin, i.e. the degree of separability.
Hyperplane
A hyperplane is defined by an outward-pointing normal vector w; w is orthogonal to any vector lying on the hyperplane. Assumption: the hyperplane passes through the origin. If not, it has a bias term b, and we then need both w and b to define it. b > 0 moves the hyperplane parallel to itself along w (b < 0 moves it in the opposite direction).
In simple cases, we divide the feature space by drawing a hyperplane across it, known as a decision boundary. A discriminant function returns different values on opposite sides of it (in two dimensions, a straight line). Problems which can be classified in this way are linearly separable.
Decision boundaries
Discriminant function
For binary classification, w is assumed to point towards the positive class. Classification rule:

wt x + b > 0  =>  y = +1
wt x + b < 0  =>  y = -1

Question: what about the points x for which wt x + b = 0?
Goal: to learn the hyperplane (w, b) using the training data, i.e. to find the equation wt x + b = 0 of the decision boundary.
Concept of Margins
The geometric margin of an example xn is yn(wt xn + b) / ||w||; it is positive when the prediction agrees with the label yn and negative otherwise. The margin of a set {x1, . . . , xN} is the minimum absolute geometric margin, as sketched below.
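A small sketch of these margin computations (the function names are mine):

```python
import math

def geometric_margin(w, b, x, y):
    """y * (w.x + b) / ||w||: positive iff the prediction agrees with the label y."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def set_margin(w, b, labeled_points):
    """Margin of a set: the minimum absolute geometric margin over its points."""
    return min(abs(geometric_margin(w, b, x, y)) for x, y in labeled_points)

# Example: two labeled points relative to the line x1 + x2 - 1 = 0.
print(set_margin([1.0, 1.0], -1.0, [((2, 2), +1), ((0, 0), -1)]))
```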
If the data are not linearly separable: make them linearly separable (e.g. by transforming the features), or use a combination of multiple perceptrons (neural networks).
The perceptron algorithm cycles through the training data, processing training examples one at a time (an online algorithm). It starts with some initialization for (w, b) (e.g., w = [0, . . . , 0], b = 0) and is an iterative, mistake-driven learning algorithm for updating (w, b): don't update if w correctly predicts the label of the current training example; update when it mispredicts the label, i.e. the true label is +1 but sign(wt x + b) = -1, or vice versa.
Given: a sequence of N training examples {(x1, y1), . . . , (xN, yN)}
Initialize: w = [0, . . . , 0], b = 0
Repeat until convergence:
  For n = 1, . . . , N:
    if sign(wt xn + b) != yn (i.e., a mistake is made):
      w = w + yn xn (adds xn since yn = +1, or subtracts xn since yn = -1)
      b = b + yn

Stopping condition: stop when either
- all training examples are classified correctly (may overfit, so less common in practice),
- a fixed number of iterations is completed, or some convergence criterion is met, or
- one pass over the data is completed (each example seen once), e.g. when examples arrive in a streaming fashion and can't be stored in memory (more passes are just not possible).

Note: sign(wt xn + b) != yn is equivalent to yn(wt xn + b) < 0. A sketch of the loop follows.
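A minimal sketch of this mistake-driven loop; the max_passes cap is my safeguard, not part of the algorithm as stated.

```python
def train_perceptron(examples, max_passes=100):
    """Online perceptron: update (w, b) only on misclassified examples."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_passes):
        mistakes = 0
        for x, y in examples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:  # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]  # w = w + y*x
                b += y                                     # b = b + y
                mistakes += 1
        if mistakes == 0:  # every training example classified correctly
            break
    return w, b

# AND with labels in {-1, +1}: linearly separable, so the loop converges.
w, b = train_perceptron([((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)])
print(w, b)
```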
(figure: four training points in the plane)

Training set:
x1 = (-1, 1/2), d1 = -1
x2 = (-1, 1), d2 = -1
x3 = (1/2, 1), d3 = 1
x4 = (1, 1/2), d4 = 1
Discrete perceptron learning rule:

Δw = c [di - sgn(wt x)] x = c [di - oi] x

where di is the desired response, oi is the perceptron output, and c is a small constant (e.g. 0.1) called the learning rate. The update is w = w + Δw.
Initial weights: w(0) = (0, 1), with the bias weight w0 = 1 in the augmented vector.

(figure: the four training points and the initial decision boundary)
Training example, using the rule Δw = c [di - oi] x above. Conditions: learning constant c = 0.1; the training vectors x1, x2, x3, x4 are augmented with x0 = 1.
(figure: the perceptron with inputs x1, x2, augmented input x0 = 1, and weights w0, w1, w2)

o(x) = 1 if Σ(i=0..2) wi xi > 0, and -1 otherwise
Initial discriminant function: w1 x1 + w2 x2 + w0 = 0  =>  x2 + 1 = 0  =>  x2 = -1
Step 1:
=> 0.2 x1 + 0.9 x2 + 0.8 = 0. Point (-1, 0.5) is still misclassified!
Steps 2 through 6: (figure: the decision boundary after each successive weight update). A sketch replaying all six steps follows.
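This sketch replays the six correction steps, assuming (consistent with the boundaries shown above) the augmented initial weight vector w = (0, 1, 1) in the order (w1, w2, w0) and cyclic presentation of x1, x2, x3, x4.

```python
def sgn(s):
    return 1 if s > 0 else -1

c = 0.1                                    # learning constant
# Augmented patterns (x1, x2, x0 = 1) and desired responses.
X = [(-1, 0.5, 1), (-1, 1, 1), (0.5, 1, 1), (1, 0.5, 1)]
d = [-1, -1, 1, 1]
w = [0.0, 1.0, 1.0]                        # initial boundary: x2 + 1 = 0

for step in range(1, 7):                   # the six steps from the slides
    i = (step - 1) % 4                     # cycle x1, x2, x3, x4, x1, x2
    o = sgn(sum(wi * xi for wi, xi in zip(w, X[i])))
    w = [wi + c * (d[i] - o) * xi for wi, xi in zip(w, X[i])]
    print(step, [round(wi, 2) for wi in w])
```

Step 1 reproduces the boundary 0.2 x1 + 0.9 x2 + 0.8 = 0 above, and after step 6 (w = (0.8, 0.4, 0.2)) all four patterns are classified correctly.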
Intuitively, we want the hyperplane having the maximum margin: a large margin leads to good generalization on the test data.
Linear Machines and Minimum Distance Classification
Consider two clusters of patterns, each cluster belonging to one known category (class). The center points (centers of gravity) of the clusters of classes 1 and 2 are the vectors x1 and x2, respectively. The decision hyperplane contains the midpoint of the line segment connecting prototype points P1 and P2, and is normal to the vector x1 - x2, which is directed toward P1.
The weighting coefficients w1, w2, . . . , wn+1 are obtained by comparing (3.9) and (3.10) as follows:

wi = x1i - x2i, for i = 1, . . . , n
wn+1 = (1/2)(||x2||^2 - ||x1||^2)
Let us assume that a minimum-distance classification is required to classify patterns into one of the R categories. Each of the R classes is represented by the prototype points P1, P2, . . . , PR, with vectors x1, x2, . . . , xR, respectively. The Euclidean distance between an input pattern x and the prototype pattern vector xi is expressed by the norm of the vector x - xi, i.e. ||x - xi||. A minimum-distance classifier computes the distance from a pattern x of unknown classification to each prototype.
(figure: block diagram of a linear machine employing linear discriminant functions as in Equation (3.15))
The decision surface Sij for the contiguous decision regions Ri, Rj is a hyperplane given by the equation gi(x) - gj(x) = 0.
Example: the assumed prototype points are as shown, and their coordinates (recoverable from the discriminant values computed below) are x1 = (10, 2), x2 = (2, -5), x3 = (-5, 5). It is assumed that each prototype point index corresponds to its class number.
Using formula (3.16) for n = 2, R = 3, the weight vectors are obtained as w1 = (10, 2, -52), w2 = (2, -5, -14.5), w3 = (-5, 5, -25).
The resulting classifier is shown in Figure 3.8(d).
For the input pattern x = (6, 1):
g1(x) = 60 + 2 - 52 = 10
g2(x) = 12 - 5 - 14.5 = -7.5
g3(x) = -30 + 5 - 25 = -50
The maximum is g1(x), so pattern x belongs to class 1.
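A sketch of this linear machine; the prototype coordinates and the input pattern are the values inferred above, so treat them as a reconstruction of the example rather than given data.

```python
def discriminant(prototype, x):
    """gi(x) = xi^T x - (1/2) xi^T xi: largest for the nearest prototype."""
    dot = sum(pi * xi for pi, xi in zip(prototype, x))
    return dot - 0.5 * sum(pi * pi for pi in prototype)

prototypes = [(10, 2), (2, -5), (-5, 5)]      # class prototypes x1, x2, x3
x = (6, 1)                                    # input pattern
scores = [discriminant(p, x) for p in prototypes]
print(scores)                                 # [10.0, -7.5, -50.0] -> class 1
```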
where c is the positive constant called the learning constant and the superscript (k) denotes the step number. The expression for the classification error to be minimized is E(k) = (1/2)(d(k) - o(k))^2, the squared difference between the desired and actual responses at step k.
XOR
(figures: two-layer and three-layer network structures for solving XOR)
Delta Rule
- The perceptron rule fails if the data are not linearly separable.
- The delta rule converges toward a best-fit approximation.
- It uses gradient descent to search the hypothesis space.
The discrete perceptron cannot be used here, because it is not differentiable; hence an unthresholded linear unit is appropriate. Error measure:

E(w) = (1/2) Σ(d in D) (td - od)^2

where D is the set of training examples, td the target output, and od the linear unit's output for example d.
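A minimal batch gradient-descent sketch for this error measure; the learning rate and the epoch count are arbitrary choices, not prescribed values.

```python
def train_delta(examples, eta=0.05, epochs=200):
    """Delta rule: batch gradient descent on E(w) = 1/2 * sum (t - o)^2 for o = w.x."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # unthresholded linear output
            for i in range(dim):
                grad[i] += (t - o) * x[i]              # accumulates -dE/dwi
        w = [wi + eta * g for wi, g in zip(w, grad)]   # move down the error surface
    return w

# Fit o = w1*x1 + w2*x2 to a toy data set; converges to roughly w = (1, -1).
print(train_delta([((1, 0), 1.0), ((0, 1), -1.0), ((1, 1), 0.0)]))
```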
To understand gradient descent, it is helpful to visualize the entire hypothesis space: all possible weight vectors and their associated E values. For a simple linear unit with two weights, the axes w0, w1 represent the possible values of the weights, and E forms a surface above that plane.
Error Surface
Overfitting
- With sufficient nodes, a network can classify any training set exactly.
- It may then have poor generalisation ability.
- Cross-validation: hold out some patterns, typically 30% of the training patterns.
- The validation set error is checked each epoch.
- Stop training if the validation error goes up (sketched below).
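A sketch of this early-stopping loop; the model interface (train_epoch, error, weights) is hypothetical scaffolding standing in for whatever training code is used.

```python
def train_with_early_stopping(model, train_set, val_set, max_epochs=1000):
    """Stop when the validation error rises; keep the best weights seen so far."""
    best_err = float("inf")
    best_weights = list(model.weights)
    for _ in range(max_epochs):
        model.train_epoch(train_set)          # one pass over the training patterns
        err = model.error(val_set)            # validation error, checked each epoch
        if err < best_err:
            best_err = err
            best_weights = list(model.weights)
        else:
            break                             # validation error went up: stop
    model.weights = best_weights
    return model
```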
Training time
How many epochs of training?
- Stop if the error fails to improve (has reached a minimum).
- Stop if the rate of improvement drops below a certain level.
- Stop if the error reaches an acceptable level.
- Stop when a certain number of epochs have passed.
A single-layer perceptron is able to form only linear discriminant functions, i.e. classes which can be divided by a line or hyperplane. Most functions are more complex, i.e. they are nonlinear or not linearly separable. This crippled research in neural net theory for 15 years...
Linear inseparability
A single-layer perceptron with threshold units fails if the problem is not linearly separable.
Example: XOR
(figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the X-Y plane; no single straight line separates the two classes)
EXAMPLE
Logical XOR function:
X Y | Z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

(figure: the XOR points plotted in the X-Y plane)
Two neurons are needed! Their combined results can produce good classification.
Each hidden node realizes one of the lines bounding the decision region. E.g. the left-side network:
Output is 1 if and only if (x + y - 0.5 > 0) - 2(x + y - 1.5 > 0) - 0.5 > 0, where each parenthesized test evaluates to 1 when true and 0 otherwise.
E.g. the right-side network: output is 1 if and only if x + y - 2(x + y - 1.5 > 0) - 0.5 > 0. A sketch verifying the left-side form follows.
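A short sketch checking the left-side network's formula with step units:

```python
def step(s):
    """Hard threshold: 1 if s > 0, else 0."""
    return 1 if s > 0 else 0

def xor_net(x, y):
    h1 = step(x + y - 0.5)          # fires for (0,1), (1,0), (1,1): an OR-like unit
    h2 = step(x + y - 1.5)          # fires only for (1,1): an AND-like unit
    return step(h1 - 2 * h2 - 0.5)  # 1 exactly when h1 = 1 and h2 = 0

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, xor_net(x, y))      # -> 0, 1, 1, 0
```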
More complex multilayer networks are needed to solve more difficult problems.
Multilayer Perceptron
(figure: a multilayer perceptron with input nodes feeding hidden and output neurons; the sigmoid activation g(a) is plotted against a)

g(a) = 1 / (1 + e^(-a))
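A sketch of this activation and of a forward pass through one hidden layer; the two-layer shape and the weight values are illustrative, not taken from a particular figure.

```python
import math

def g(a):
    """Logistic sigmoid g(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_hidden, W_out):
    """Hidden and output units apply g to their weighted input sums."""
    h = [g(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    return [g(sum(w * hi for w, hi in zip(row, h))) for row in W_out]

# Two inputs -> two hidden units -> one output (arbitrary example weights).
print(forward([1.0, 0.5], [[0.5, -0.3], [0.8, 0.1]], [[1.0, -1.0]]))
```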
Types of Layers
The input layer introduces input values into the network; it applies no activation function or other processing.
(figures: worked panel-by-panel evaluation of the two-layer XOR network, showing each threshold unit's half-plane test over x and y and the unit outputs marked = 0 or = 1 for the four input points)