
Homework 4 Sample Solutions, 15-681 Machine Learning

1.

S = { < ? ? High Strong ? Change > }

G = { < ? ? High Strong ? ? >,
      < ? ? ? ? ? Change > }
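
As a quick sanity check, here is a small Python sketch of the generality relation behind these boundaries, assuming the usual attribute-vector representation in which '?' matches any value: every member of G should be more general than or equal to the member of S.

def more_general_or_equal(g, s):
    # g is at least as general as s if, attribute by attribute,
    # g's constraint is '?' or matches s's constraint exactly
    return all(gv == '?' or gv == sv for gv, sv in zip(g, s))

S = [('?', '?', 'High', 'Strong', '?', 'Change')]
G = [('?', '?', 'High', 'Strong', '?', '?'),
     ('?', '?', '?', '?', '?', 'Change')]

print(all(more_general_or_equal(g, s) for g in G for s in S))   # True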

2.

Almost nobody gave the same answer for this one, so here are some boundaries on what was
accepted:
In order to get a reasonable upper bound on the necessary number of training examples, a
reasonable upper bound on the size of the hypothesis space is required.
A reasonable bound on the number of depth-2-or-less decision trees must be dominated by
an n^3 term, as a consequence of the number of different combinations of attribute nodes in
a full, depth-2 tree (n at the root and n-1 for each child yields n^3 - 3n^2 + 2n).
Moreover, for each different combination of attribute nodes, there is more than one labeling
of the leaf nodes corresponding to a distinct concept. Therefore, the n^3 term must have a
coefficient c greater than 1.
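
For concreteness, here is a small Python sketch of how such a bound would be used, assuming the standard sample-complexity formula for a consistent learner, m >= (1/eps)(ln|H| + ln(1/delta)), and an assumed coefficient of c = 16 on the n^3 term (roughly the 2^4 leaf labelings of a full depth-2 tree); only the order of growth matters.

import math

def sample_bound(n, eps, delta, c=16):
    # |H| <= c * n^3; the value of c here is an assumption, as argued above
    hypothesis_space_size = c * n ** 3
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / eps)

print(sample_bound(10, eps=0.1, delta=0.05))   # 127 with these illustrative numbers
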
3.

There are a lot of arguments that can be made about both sides of this topic, but the most
popular answers had to do with small sets:
- Pro: There are fewer short hypotheses than long ones, so short ones are less likely to be
consistent with the data by chance, and more likely to reflect underlying generalities.
- Con: There are many small, arbitrary sets of hypotheses about which one could make
the same argument - what's so special about sets of short hypotheses? Also: which
hypotheses are long and which are short depends on the particular representation
employed.
4.

Because we can't strictly count the size of the hypothesis space, it's necessary to employ the
formula for sample complexity that uses the VC dimension.
The VC dimension of rectangles over points in the plane is 4. Not all sets of points of size
4 can be shattered, but an example of one that can is four points arranged at the endpoints
of a plus-sign. It wasn't necessary to prove that the VC dimension is 4, but to see that it
can't be 5, note that given any set of 5 points, there will be a set of 4 of them (the leftmost,
rightmost, topmost, and bottommost) that force any rectangle containing them to be of
maximal size. The fifth point will then be on the boundary of that rectangle or in its interior,
so it cannot be excluded while the other four are included.
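
This can be checked exhaustively. The Python sketch below uses one concrete choice of coordinates for the plus-sign points (an illustrative assumption) and tests each of the 16 labelings by asking whether the bounding box of the positively labeled points excludes every negatively labeled one.

from itertools import product

points = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # one concrete "plus sign"

def achievable(labels):
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:                               # all-negative labeling: use an empty rectangle
        return True
    xs = [x for x, _ in pos]
    ys = [y for _, y in pos]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return not any(lo_x <= x <= hi_x and lo_y <= y <= hi_y
                   for (x, y), lab in zip(points, labels) if not lab)

print(all(achievable(labels) for labels in product([False, True], repeat=4)))
# prints True: all 16 labelings are realizable, so the 4 points are shattered
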
5.

For each input p, the output is

o_p = w_0 + w_1 w_0 x_{1,p} + w_2 w_0 x_{2,p}
The error is then

E = Σ_p (t_p - o_p)^2

To construct the learning rule, we need to get the partial derivative of the error with respect
to each weight. In the general case, we get,

∂E/∂w_i = ∂/∂w_i Σ_p (t_p - o_p)^2
        = Σ_p ∂/∂w_i (t_p - o_p)^2
        = Σ_p 2 (t_p - o_p) ∂/∂w_i (t_p - o_p)
        = 2 Σ_p (t_p - o_p) ∂/∂w_i (-o_p)

For w_0,

∂/∂w_0 (-o_p) = -(1 + w_1 x_{1,p} + w_2 x_{2,p})

for w_1,

∂/∂w_1 (-o_p) = -w_0 x_{1,p}

and for w_2,

∂/∂w_2 (-o_p) = -w_0 x_{2,p}

Now we can construct a learning rule to descend the gradient of the error surface with respect
to each weight at a rate of η. In the general case,

Δw_i = -η (∂E/∂w_i)

Plugging in from above for each i = 0, 1, 2, the minus sign associated with the direction of
descent cancels the minus sign associated with ∂E/∂w_i, and we incorporate the constant 2
into η, yielding:

Δw_0 = η Σ_p (t_p - o_p)(1 + w_1 x_{1,p} + w_2 x_{2,p})

Δw_1 = η Σ_p (t_p - o_p)(w_0 x_{1,p})

Δw_2 = η Σ_p (t_p - o_p)(w_0 x_{2,p})
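
Here is a minimal Python sketch of batch gradient descent using these update rules; the training data, target function, initial weights, learning rate, and epoch count are made-up values for illustration.

def train(examples, eta=0.01, epochs=5000):
    w0, w1, w2 = 0.1, 0.1, 0.1                    # arbitrary starting weights
    for _ in range(epochs):
        dw0 = dw1 = dw2 = 0.0
        for (x1, x2), t in examples:
            o = w0 + w1 * w0 * x1 + w2 * w0 * x2  # the unit's output o_p
            err = t - o                           # (t_p - o_p)
            dw0 += eta * err * (1 + w1 * x1 + w2 * x2)
            dw1 += eta * err * (w0 * x1)
            dw2 += eta * err * (w0 * x2)
        w0, w1, w2 = w0 + dw0, w1 + dw1, w2 + dw2
    return w0, w1, w2

# e.g. fit the made-up target t = 1 + 2*x1 - x2 on a few points
data = [((0, 0), 1.0), ((1, 0), 3.0), ((0, 1), 0.0), ((1, 1), 2.0)]
print(train(data))
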

6.

In each example, the perceptrons output 1 if the weighted sum of their inputs is ≥ 0, and -1
otherwise.
A two-input perceptron that implements the function A ∧ ¬B:

        1
         \  w = -1
          \
           \
w = 2       \/---\
A -----------|   |-------->
            /\---/
           /
          /
         /  w = -2
        B

A two-layer network of perceptrons that implements the function A XOR B:

        1
         \  w = -1
          \
           \
w = 2       \/---\
A -----------|   |---
            /\---/   \
           /          \  w = 1
w = -2    /            \
         /   w = -1     \/---\
        B 1 -------------|   |-------->
         \              /\---/
w = 2     \            /
           \          /  w = 1
            \/---\   /
A -----------|   |---
w = -2      /\---/
           /
          /
         /  w = -1
        1
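
For anyone who wants to check them, here is a quick Python truth-table test of the circuits above. It assumes true and false are encoded as +1 and -1 on the inputs (the text only fixes the output convention), and it wires the XOR net's output unit with just the two weight-1 links from the hidden units, a simplification for this sketch rather than a claim about the drawing.

def perceptron(weighted_inputs):
    # output 1 if the weighted sum is >= 0, and -1 otherwise
    return 1 if sum(w * x for w, x in weighted_inputs) >= 0 else -1

def a_and_not_b(a, b):
    return perceptron([(-1, 1), (2, a), (-2, b)])   # bias 1, then A, then B

def xor(a, b):
    h1 = perceptron([(-1, 1), (2, a), (-2, b)])     # A AND NOT B
    h2 = perceptron([(-1, 1), (-2, a), (2, b)])     # B AND NOT A
    return perceptron([(1, h1), (1, h2)])           # fires if either hidden unit fires

for a in (1, -1):
    for b in (1, -1):
        assert a_and_not_b(a, b) == (1 if a == 1 and b == -1 else -1)
        assert xor(a, b) == (1 if a != b else -1)
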

Hope you enjoy the hokey ASCII drawings . . .
