You must answer any nine questions out of the following twelve. Each question is worth 11 points.
You must fill out your name and your andrew userid clearly and in block capital letters on the front page. You will be awarded 1 point for doing this correctly.
If you answer more than 9 questions, your best 9 scores will be used to derive your total.
Unless the question asks for explanation, no explanation is required for any answer. But you are welcome to provide explanation if you wish.
1 Bayes Nets Inference
(a) Kangaroos.

[Bayes net: K -> A, with P(K) = 2/3, P(A|K) = 1/2, P(A|~K) = 1/10.]

Half of all kangaroos in the zoo are angry, and 2/3 of the zoo is comprised of kangaroos. Only 1 in 10 of the other animals are angry. What's the probability that a randomly-chosen animal is an angry kangaroo?
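A worked check (this line is not part of the original answer key): an angry kangaroo requires both A and K, so
P(A ^ K) = P(A|K) P(K) = 1/2 * 2/3 = 1/3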
(b) Stupidity.

[Bayes net: S -> C, with P(S) = 0.5, P(C|S) = 0.5, P(C|~S) = 0.2.]

Half of all people are stupid. If you're stupid then you're more likely to be confused. A randomly-chosen person is confused. What's the chance they're stupid?
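A worked check via Bayes' rule (not part of the original answer key):
P(C) = P(C|S)P(S) + P(C|~S)P(~S) = 0.5*0.5 + 0.2*0.5 = 0.35
P(S|C) = P(C|S)P(S) / P(C) = 0.25/0.35 = 5/7, roughly 0.71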
(c) Potatoes.

[Bayes net: B -> T -> L, with P(B) = 1/2, P(T|B) = 1/2, P(T|~B) = 1/10, P(L|T) = 1/2, P(L|~T) = 1/10.]

Half of all potatoes are big. A big potato is more likely to be tall. A tall potato is more likely to be lovable. What's the probability that a big lovable potato is tall?
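A worked check (not part of the original answer key), using the chain structure B -> T -> L, under which L is independent of B given T:
P(T ^ L | B) = P(L|T) P(T|B) = 1/2 * 1/2 = 1/4
P(~T ^ L | B) = P(L|~T) P(~T|B) = 1/10 * 1/2 = 1/20
P(T | B ^ L) = (1/4) / (1/4 + 1/20) = 5/6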
P(W|S) = 1/2
P(W|~S) = 1

What's P(W ^ F)?

P(W^F) = P(W^F^H) + P(W^F^~H), but P(W^F^H) = 0.
P(W^F^~H) = P(F|~H) P(W^~H) = 1/2 * (P(W^S^~H) + P(W^~S^~H)) = 1/2 * (1/8 + 1/4) = 3/16
2 Bayes Nets and HMMs

(a) Let nbs(m) = the number of possible Bayes Network graph structures using m attributes. (Note that two networks with the same structure but different probabilities in their tables do not count as different structures.) Which of the following statements is true?
(iii) m(m-1)/2 <= nbs(m) < 2^m

(iv) 2^m <= nbs(m) < 2^(m(m-1)/2)

(v) 2^(m(m-1)/2) <= nbs(m)

Answer is (v) because the number of undirected graphs with m vertices is 2^[m choose 2], and there are even more acyclic directed graphs.
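To make the counting concrete, here is a small brute-force sketch (my own illustration, not from the exam) that counts labeled DAGs on m nodes and compares against 2^(m(m-1)/2):

```python
from itertools import product

def is_acyclic(m, edges):
    """Kahn's algorithm: repeatedly strip nodes with in-degree 0."""
    indeg = [0] * m
    adj = [[] for _ in range(m)]
    for i, j in edges:
        indeg[j] += 1
        adj[i].append(j)
    stack = [v for v in range(m) if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in adj[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == m

def count_dags(m):
    """Brute-force count of labeled DAGs on m nodes: try every subset
    of the m*(m-1) possible directed edges, keep the acyclic ones."""
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    count = 0
    for mask in product([0, 1], repeat=len(pairs)):
        edges = [p for p, bit in zip(pairs, mask) if bit]
        if is_acyclic(m, edges):
            count += 1
    return count

for m in range(1, 5):
    print(m, count_dags(m), 2 ** (m * (m - 1) // 2))
# DAG counts 1, 3, 25, 543 dominate 2^(m choose 2) = 1, 2, 8, 64
```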
(a) Consider the following data with one input and one output.

[Figure: scatterplot of the datapoints; X (input) and Y (output) both range over 0-3.]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w0 + w1 x)?

0

(ii) What is the mean squared test set error of running linear regression on this data, assuming the rightmost three points are in the test set, and the others are in the training set?

0
(b) Consider the following data with one input and one output.

X Y
1 1
2 2
3 1

[Figure: the three datapoints plotted; X (input) and Y (output) both range over 0-3.]

(i) What is the mean squared training set error of running linear regression on this data (using the model y = w0 + w1 x)? (Hint: by symmetry it is clear that the best fit to the three datapoints is a horizontal line.)

The best-fit horizontal line is y = 4/3, so
SSE = (1/3)^2 + (2/3)^2 + (1/3)^2 = 6/9
MSE = SSE/3 = 2/9
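A quick numerical check of this answer (my own illustration, not from the exam):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.0, 2.0, 1.0])

# Least-squares fit of y = w0 + w1*x; polyfit returns [slope, intercept].
w1, w0 = np.polyfit(X, Y, 1)
pred = w0 + w1 * X
mse = np.mean((Y - pred) ** 2)
print(w0, w1, mse)   # ~1.333, ~0.0, ~0.2222 (= 2/9)
```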
[Three figure panels, each with X (input) ranging over 0-6 and output over 0-1; the question text for this part did not survive extraction.]

NO
Regression trees are a kind of decision tree used for learning from data with a real-valued output instead of a categorical output. They were discussed in the "Eight favorite regression algorithms" lecture.

On the next page you will see pseudocode for building a regression tree in the special case where all the input attributes are boolean (they can have values 0 or 1).

The MakeTree function takes two arguments: D, a set of datapoints, and A, a set of input attributes. It then makes the best regression tree it can using only the datapoints and attributes passed to it. It is a recursive procedure. The full algorithm is run by calling MakeTree with D containing every record and A containing every attribute. Note that this code does no pruning, and that it assumes that all input attributes are binary-valued.

Now read the code on the next page, after which question (a) will ask you about bugs in the code.
MakeTree(D,A)
Returns a Regression Tree
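The exam's actual pseudocode (including its three planted bugs) did not survive extraction. For reference, here is my own minimal, bug-free sketch of the procedure the text describes, assuming datapoints are (attribute-dict, output) pairs:

```python
from statistics import mean

def split_sse(D, a):
    """Summed squared error if we split on attribute a and predict
    each child's mean output."""
    total = 0.0
    for v in (0, 1):
        ys = [y for x, y in D if x[a] == v]
        if ys:
            m = mean(ys)
            total += sum((y - m) ** 2 for y in ys)
    return total

def make_tree(D, A):
    """D: list of (inputs, y) pairs, inputs mapping attribute -> 0/1.
    A: set of attribute names still available for splitting.
    Returns a leaf predicting the mean output, or an internal node."""
    ys = [y for _, y in D]
    # Stop if no attributes remain or all outputs are identical.
    if not A or len(set(ys)) == 1:
        return ('leaf', mean(ys))
    # Greedily pick the split minimizing the children's summed squared error.
    best = min(A, key=lambda a: split_sse(D, a))
    left = [(x, y) for x, y in D if x[best] == 0]
    right = [(x, y) for x, y in D if x[best] == 1]
    if not left or not right:   # degenerate split: stop here
        return ('leaf', mean(ys))
    rest = A - {best}
    return ('node', best, make_tree(left, rest), make_tree(right, rest))
```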
(a) Beyond the obvious problem that there is no pruning, there are three bugs in the above code. They are all very distinct. One of them is at the level of a typographical error. The other two are more serious errors in logic. Identify the three bugs (remembering that the lack of pruning is not one of the three bugs), explaining why each one is a bug. It is not necessary to explain how to fix any bug, though you are welcome to do so if that's the easiest way to explain the bug.
In the left of the following two pictures I show a dataset. In the right figure I sketch the globally maximally likely mixture of three Gaussians for the given data.

Assume we have protective code in place that prevents any degenerate solutions in which some Gaussian grows infinitesimally small. And assume a GMM model in which all parameters (class probabilities, class centroids and class covariances) can be varied.
[Two figures: left, the dataset; right, the sketched three-Gaussian mixture. X (input) and Y (output) both range over 0-3.]
(a) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians.

[Figure: blank axes for the sketch; X (input) and Y (output) both range over 0-3.]
(b) Using the same notation and the same assumptions, sketch a mixture of three distinct Gaussians that is stuck in a suboptimal configuration (i.e. in which infinitely many more iterations of the EM algorithm would remain in essentially the same suboptimal configuration). (You must not give an answer in which two or more Gaussians all have the same mean vectors; we are looking for an answer in which all the Gaussians have distinct means.)

[Figure: blank axes for the sketch; X (input) and Y (output) both range over 0-3.]
(c) Using the same notation and the same assumptions, sketch the globally maximally likely mixture of two Gaussians in the following, new, dataset.

[Figure: the new dataset; X (input) and Y (output) both range over 0-3.]
(d) Now, suppose we ran k-means with k = 2 on this dataset. Show the rough locations of the centers of the two clusters in the configuration with globally minimal distortion.

[Figure: the same dataset; X (input) and Y (output) both range over 0-3.]
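Not part of the exam, but to make "stuck configurations" concrete, here is a minimal Lloyd's-algorithm sketch on a hypothetical two-blob dataset (the exam's figures are lost); a bad initialization converges to a configuration that further iterations never leave, with visibly worse distortion than the global optimum:

```python
import numpy as np

def kmeans(X, k, centers, iters=50):
    """Plain Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    centers = centers.copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):                 # empty clusters stay put
                centers[j] = pts.mean(axis=0)
    distortion = ((X - centers[labels]) ** 2).sum()
    return centers, distortion

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.5, 0.5], 0.1, size=(20, 2)),
               rng.normal([2.5, 2.5], 0.1, size=(10, 2))])

good = np.array([[0.5, 0.5], [2.5, 2.5]])    # near the true blob centers
bad = np.array([[1.5, 1.5], [10.0, 10.0]])   # second center captures no points
for init in (good, bad):
    c, dist = kmeans(X, 2, init)
    print(np.round(c, 2), round(float(dist), 3))
```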
6 Regression algorithms

For each empty box in the following table, write in "Y" if the statement at the top of the column applies to the regression algorithm. Write "N" if the statement does not apply.

Column 1: No matter what the training data is, the predicted output is guaranteed to be a continuous function of the input (i.e. there are no discontinuities in the prediction). If a predictor gives continuous but undifferentiable predictions then you should answer "Y".

Column 2: The cost of training on a dataset with R records is at least O(R^2): quadratic (or worse) in R. For iterative algorithms marked with (*) simply consider the cost of one iteration of the algorithm through the data.

                                                     Column 1   Column 2
Linear Regression                                       Y          N
Quadratic Regression                                    Y          N
Perceptrons with sigmoid activation functions (*)       Y          N
1-hidden-layer Neural Nets with sigmoid
  activation functions (*)                              Y          N
1-nearest neighbor                                      N          N
10-nearest neighbor                                     N          N
Kernel Regression                                       Y          N
Locally Weighted Regression                             Y          N
Radial Basis Function Regression with 100
  Gaussian basis functions                              Y          N
Regression Trees                                        N          N
Cascade correlation (with sigmoid activation
  functions)                                            Y          N
Multilinear interpolation                               Y          N
MARS                                                    Y
7 Hidden Markov Models

Warning: this is a question that will take a few minutes if you really understand HMMs, but could take hours if you don't. Assume we are working with this HMM:

[Figure: a left-to-right HMM with three states. S1 loops to itself with probability 1/2 and moves to S2 with probability 1/2; S2 loops to itself with probability 1/2 and moves to S3 with probability 1/2; S3 loops to itself with probability 1. S1's observable symbols are X and Y, S2's are X and Z, and S3's are Y and Z.]

The emission probabilities are written b_i(k) = P(O_t = k | q_t = S_i).

The observation sequence is X Z X Y Y Z Y Z Z (in long-hand: O1 = X, O2 = Z, O3 = X, O4 = Y, O5 = Y, O6 = Z, O7 = Y, O8 = Z, O9 = Z).

Fill in this table with alpha_t(i) values, remembering the definition:

alpha_t(i) = P(O1 ^ O2 ^ ... ^ Ot ^ q_t = S_i)

So for example,

alpha_3(2) = P(O1 = X ^ O2 = Z ^ O3 = X ^ q3 = S2)
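The table itself did not survive extraction, but the alpha values follow from the standard forward recursion. A sketch in Python, assuming (as the diagram suggests) that each state emits its two symbols with probability 1/2 apiece and that the chain starts in S1:

```python
import numpy as np

# Transition matrix A[i][j] = P(S_j at t+1 | S_i at t), read off the diagram.
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
# Emission matrix B[i][k] over symbols X, Y, Z; assuming each state emits
# its two observable symbols with probability 1/2 each.
B = np.array([[0.5, 0.5, 0.0],    # S1 emits X or Y
              [0.5, 0.0, 0.5],    # S2 emits X or Z
              [0.0, 0.5, 0.5]])   # S3 emits Y or Z
pi = np.array([1.0, 0.0, 0.0])    # assuming the chain starts in S1

obs = list("XZXYYZYZZ")
sym = {"X": 0, "Y": 1, "Z": 2}

# Forward algorithm: alpha[t, i] = P(O_1 .. O_t, q_t = S_i).
alpha = np.zeros((len(obs), 3))
alpha[0] = pi * B[:, sym[obs[0]]]
for t in range(1, len(obs)):
    alpha[t] = (alpha[t - 1] @ A) * B[:, sym[obs[t]]]
print(alpha)
# e.g. alpha[2, 1] = P(O1=X, O2=Z, O3=X, q3=S2) = 0.03125 under these assumptions
```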
At some point during this question you may find it useful to use the fact that if U and V are two independent real-valued random variables then Var[aU + bV] = a^2 Var[U] + b^2 Var[V].

Suppose you have 10,000 datapoints {(x_k, y_k) : k = 1, 2, ..., 10000}. Your dataset has one input and one output. The kth datapoint is generated by the following recipe:

x_k = k/10000
y_k ~ N(0, 2^2)

So that y_k is all noise: drawn from a Gaussian with mean 0 and variance sigma^2 = 4 (and standard deviation sigma = 2). Note that its value is independent of all the other y values. You are considering two learning algorithms:

Algorithm NN: 1-nearest neighbor.
Algorithm Zero: Always predict zero.

(a) What is the expected Mean Squared Training Error for Algorithm NN?

(b) What is the expected Mean Squared Training Error for Algorithm Zero?

(c) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm NN?

8 = E[(y_k - y_{k+1})^2] = Var[y_k] + Var[y_{k+1}] = 4 + 4

(d) What is the expected Mean Squared Leave-one-out Cross-validation Error for Algorithm Zero?

4
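A Monte Carlo check of these answers (the simulation, and the stated answers for (a) and (b), are my additions, not from the original answer key):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = rng.normal(0.0, 2.0, size=n)   # y_k ~ N(0, 2^2); x_k = k/n is a fixed grid

# (a) 1-NN training error: each training point is its own nearest
#     neighbor, so the prediction is exact and the expected MSE is 0.
# (b) Zero's training error: E[(y_k - 0)^2] = Var[y_k] = 4.
print(np.mean(y ** 2))                  # ~4
# (c) 1-NN leave-one-out: the nearest remaining neighbor of x_k is an
#     adjacent grid point, so the error is (y_k - y_{k+1})^2 with
#     expectation Var[y_k] + Var[y_{k+1}] = 8.
print(np.mean((y[1:] - y[:-1]) ** 2))   # ~8
# (d) Zero's LOOCV error equals its training error in expectation: 4.
```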
10 Neural Nets

(a) Suppose we are learning a 1-hidden-layer neural net with a sign-function activation:

Sign(z) = +1 if z >= 0
Sign(z) = -1 if z < 0

[Network diagram: inputs x1 and x2 feed two hidden units,
h1 = sign(w11 x1 + w21 x2) and h2 = sign(w12 x1 + w22 x2),
and the output is W1 h1 + W2 h2. There are six blanks to fill in:
w11, w12, w21, w22, W1 and W2.]

We give it this training set, which represents the exclusive-or function if you interpret -1 as false and +1 as true:

X1 X2  Y
 1  1 -1
 1 -1  1
-1  1  1
-1 -1 -1

On the diagram above you must write in six numbers: a set of weights that would give zero training error. (Note that constant terms are not being used anywhere, and note too that the output does not need to go through a sign function.) Or... if it is impossible to find a satisfactory set of weights, just write "impossible".

Impossible
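A brute-force justification (my addition, not from the original answer key): a hidden unit's outputs on the four training points depend only on the signs of (w_a + w_b) and (w_a - w_b), and scanning (w_a, w_b) over {-1, 0, 1}^2 realizes all nine sign combinations, so the following search over hidden-layer behaviors is exhaustive:

```python
import itertools
import numpy as np

# The four XOR training points and their targets.
pts = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
Y = np.array([-1.0, 1.0, 1.0, -1.0])

def sgn(z):
    return 1.0 if z >= 0 else -1.0

solvable = False
for w11, w21, w12, w22 in itertools.product([-1, 0, 1], repeat=4):
    h1 = np.array([sgn(w11 * a + w21 * b) for a, b in pts])
    h2 = np.array([sgn(w12 * a + w22 * b) for a, b in pts])
    H = np.column_stack([h1, h2])
    # Best (W1, W2) for this hidden layer, by least squares; if the
    # system W1*h1 + W2*h2 = Y is solvable, the residual is zero.
    W, *_ = np.linalg.lstsq(H, Y, rcond=None)
    if np.allclose(H @ W, Y):
        solvable = True
print(solvable)   # False: no weights achieve zero training error
```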
(b) You have a dataset with one real-valued input x and one real-valued output y in which you believe

y_k = exp(w x_k) + epsilon_k

where (x_k, y_k) is the kth datapoint and epsilon_k is Gaussian noise. This is thus a neural net with just one weight: w.

Give the update equation for a gradient descent approach to finding the value of w that minimizes the mean squared error.
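A worked derivation (not part of the original answer key): with E(w) = sum_k (y_k - exp(w x_k))^2,
dE/dw = -2 sum_k (y_k - exp(w x_k)) x_k exp(w x_k)
so a gradient descent update with learning rate eta > 0 is
w <- w + 2 eta sum_k (y_k - exp(w x_k)) x_k exp(w x_k)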
11 Support Vector Machines

Consider the following dataset. We are going to learn a linear SVM from it of the form f(x) = sign(wx + b).

X    Y (class)
1     -1
2     -1
3.5   -1
4      1
5      1

[Figure: the five points plotted along X (input) from 0 to 5, with one symbol denoting Class -1 and another denoting Class 1.]
(a) What values for w and b will be learned by the linear SVM?
w = 4, b = -15
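A quick check (not part of the original answer key): the support vectors are the closest opposite-class pair, x = 3.5 and x = 4, so the maximum-margin boundary is their midpoint x = 3.75; scaling so that w*3.5 + b = -1 and w*4 + b = +1 gives w = 4 and b = -15.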
(b) What is the training set error of the above example? (expressed as the percentage of training points misclassified)
(d) True or False: Even with the clever SVM Kernel trick, it is impossibly computationally expensive to do the following: given a dataset with 200 datapoints and 50 attributes, learn an SVM classifier with full 20th-degree-polynomial basis functions and then apply what you've learned to predict the classes of 1000 test datapoints.
FALSE
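A sketch of why this is FALSE (my illustration, with randomly generated stand-in data): the polynomial kernel evaluates the 20th-degree feature-space dot product in time linear in the number of attributes, so the required kernel matrices are cheap to build:

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(200, 50))    # 200 training points, 50 attributes
Xte = rng.normal(size=(1000, 50))   # 1000 test points

# K(x, z) = (1 + x.z)^20 equals the dot product in a full 20th-degree
# polynomial feature space, computed in O(#attributes) per pair; the
# astronomically many basis functions are never materialized.
K_train = (1.0 + Xtr @ Xtr.T) ** 20   # 200 x 200 Gram matrix for training
K_test = (1.0 + Xte @ Xtr.T) ** 20    # 1000 x 200 kernel values for prediction
print(K_train.shape, K_test.shape)
```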
12 VC Dimension

(a) Suppose we have one input variable x and one output variable y. We are using the machine f1(x, theta) = sign(x + theta). What is the VC dimension of f1?

(b) Suppose we have one input variable x and one output variable y. We are using the machine f2(x, theta) = sign(theta x + 1). What is the VC dimension of f2?
(c) Now assume our inputs are m-dimensional and we use the following two-level, two-choice decision tree to make our classification:

                 is x[A] < B?
                 /          \
          if no /            \ if yes
               /              \
      is x[C] < D?        is x[E] < F?
       /        \          /        \
  if no          if yes  if no       if yes
     |              |       |           |
  Predict      Predict   Predict    Predict
  Class G      Class H   Class I    Class J
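For reference, the tree transcribed as a function (my own transcription of the diagram above):

```python
def predict(x, A, B, C, D, E, F, G, H, I, J):
    """The two-level, two-choice tree: A, C, E are attribute indices,
    B, D, F are thresholds, and G, H, I, J are the leaf class labels."""
    if x[A] < B:                       # root test
        return J if x[E] < F else I    # "yes" branch
    else:
        return H if x[C] < D else G    # "no" branch
```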