
Section A:

(answer all 8 questions on this sheet. DO NOT answer on scratch paper.)
1. Answer TRUE(T) or FALSE(F) for the following statements. [16]
A: P(A) = Σ_B P(A|B)
B: P(A, B, C) = P(A|B, C)P(B|C)P(C)

C: In polynomial curve fitting, very large weight values indicate that the model overfits the data.
D: For a linear regression model with Gaussian basis functions, the mean and standard deviation of each basis function are learned during the training process.
E: In linear regression with a squared error function, the regularization coefficient is non-negative.
F: Both the generative and the discriminative approach try to find a class-conditional probability p(x|C_k) and make decisions based on it.
G: Algorithm A is better than algorithm B if the training error of algorithm A is lower than that of B.
H: In logistic regression, the posterior probability of class C_1 is written as a logistic sigmoid acting on a linear function of the feature vector φ: P(C_1|φ) = σ(w^T φ).
2. Questions 2A and 2B concern properties of the random variables A and B, which have the following joint probability distribution P(A, B).
            B = 0    B = 1
  A = 0      1/4      1/4
  A = 1      3/8      1/8
A: A and B are independent random variables
i. True
ii. False
iii. Not enough information to tell
[3]
B: Calculate all possible values of P(A|B).
[4]
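For reference, a minimal sketch of how P(B) and P(A|B) follow from the joint table above (the numbers are copied from the table; the use of NumPy and the variable names are assumptions for illustration, not part of the exam):

# Conditional probabilities from the joint distribution P(A, B) given above.
import numpy as np

joint = np.array([[1/4, 1/4],   # row A=0: P(A=0, B=0), P(A=0, B=1)
                  [3/8, 1/8]])  # row A=1: P(A=1, B=0), P(A=1, B=1)

p_b = joint.sum(axis=0)         # marginal P(B), summing over A
p_a_given_b = joint / p_b       # P(A|B) = P(A, B) / P(B); each column sums to 1

print("P(B):", p_b)
print("P(A|B):\n", p_a_given_b)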
3. Which of the following statements about k-fold cross validation is correct? [4]
A: Divide the data into k partitions, and then pick k−1 of the partitions as the held-out testing set. Train on one of the k partitions and then test on the k−1 partitions that are not used in training. Repeat this procedure for all possible k choices of the training partition. Results from all the k runs are averaged.
B: Divide the data into k partitions, and then pick one of the partitions as the held-out testing set. Train on all of the k−1 remaining partitions and then test on the one that is not used in training. Repeat this procedure for all possible k choices of the held-out partition. Results from all the k runs are averaged.
C: Divide the data into k partitions, and then pick one of the partitions as the held-out testing set. Train on all of the k−1 remaining partitions and then test on the one that is not used in training. Repeat this procedure for all possible k choices of the held-out partition. The best result from all the k runs is reported.
D: Pick some of the samples as testing data, and then divide the remaining data into k partitions. Train on each one of the k partitions separately and then test on the testing data. Results from all the k runs are averaged.
E: Divide the data into k partitions, and then randomly pick one of the partitions as the held-out testing set. Train on all of the k−1 partitions that do not overlap with the testing set. The result from this single run is reported.
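For reference, a minimal sketch of the standard k-fold procedure (the helper name k_fold_cv, the scoring callback, and the use of NumPy are assumptions for illustration only):

# Standard k-fold cross validation: hold out one partition at a time,
# train on the remaining k-1 partitions, test on the held-out partition,
# and average the k results.
import numpy as np

def k_fold_cv(X, y, k, train_and_score, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.mean(scores)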
4. Consider a regression problem where the two-dimensional input points x = [x_1, x_2]^T are constrained to lie within [−1, 1]. The training and test input points x are sampled uniformly at random within this range. The target outputs y are governed by the following model

y ∼ N(μ, 1)

where μ = x_1² x_2 − 8 x_1 x_2 + 5 x_1² + 2 x_2 − 3. In other words, the outputs are normally distributed with mean x_1² x_2 − 8 x_1 x_2 + 5 x_1² + 2 x_2 − 3 and variance 1.

We learn to predict y given x using polynomial regression models of order 1 to 9. The performance criterion is the mean squared error. We first train a 1st, 2nd and 9th order model using n_train = 20 training points, and then test the predictions on a large independently sampled test set (n_test > 1000).

For each column, select the model that you would expect to match the description. [10]
              Lowest training error   Highest training error   Lowest test error
  1st order           [ ]                      [ ]                    [ ]
  2nd order           [ ]                      [ ]                    [ ]
  9th order           [ ]                      [ ]                    [ ]
Briefly explain your selection.
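A minimal sketch of this experiment, for reference (the data-generating mean and noise follow the question; the use of scikit-learn for the polynomial fits is an assumption):

# Train 1st/2nd/9th order polynomial regression on 20 samples of
# y ~ N(x1^2*x2 - 8*x1*x2 + 5*x1^2 + 2*x2 - 3, 1) and compare train/test MSE.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def sample(n, rng):
    X = rng.uniform(-1, 1, size=(n, 2))
    mu = X[:, 0]**2 * X[:, 1] - 8*X[:, 0]*X[:, 1] + 5*X[:, 0]**2 + 2*X[:, 1] - 3
    return X, mu + rng.normal(0.0, 1.0, size=n)

rng = np.random.default_rng(0)
X_train, y_train = sample(20, rng)
X_test, y_test = sample(2000, rng)

for order in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(order), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"order {order}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")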
5. Assuming a linear regression model of the form y = x^T w + ε, where ε is a noise term sampled from a zero-mean Gaussian distribution, that is, ε ∼ N(0, σ²), how is y distributed? [2]
A: y ∼ N(0, σ²)
B: y ∼ N(x^T w, σ²)
C: y ∼ N(x^T w/σ², σ² + ε)
D: y ∼ N(x^T w + ε, σ²)
E: y ∼ N(0, σ²/(x^T w))
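For reference, the reasoning can be written in one line (a sketch, not part of the original paper): for a given input x, adding the fixed quantity x^T w to a zero-mean Gaussian noise term shifts the mean and leaves the variance unchanged,
\[
y = x^{\top}w + \epsilon,\quad \epsilon \sim \mathcal{N}(0, \sigma^{2})
\;\Longrightarrow\;
y \sim \mathcal{N}\!\left(x^{\top}w,\, \sigma^{2}\right).
\]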
6. Mary trained several models for a regression problem, and she obtained the following plot of root-mean-square error evaluated on the training set and on an independent testing set for various values of model complexity. Mary is confused about choosing the appropriate model for her problem; please help her match the three areas shown in the figure to the following three cases. [3]
Overfit [ ]
Possible choice [ ]
Underfit [ ]
7. David conducted experiments on classifying two classes of data, denoted by crosses and circles. He used least squares classification and logistic regression. He plotted the decision boundaries of both methods and found that they are close (LEFT plot below).
Then David tried both methods on a similar data set with extra data points added at the bottom right of the diagram, as shown in the RIGHT-HAND plot. Identify the classification method for each of the two decision boundaries (dashed and solid) in this case and explain the reason. [4]
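The original plots are not reproduced here. As a reference only, a minimal sketch of the kind of experiment described (the synthetic data, cluster locations, and the use of NumPy/scikit-learn are all assumptions):

# Compare the decision boundaries of least-squares classification (fit to +/-1
# targets) and logistic regression when extra points are added far from the
# boundary at the bottom right.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1.0, 1.0], 0.5, (40, 2)),   # crosses
               rng.normal([1.0, -1.0], 0.5, (40, 2)),   # circles
               rng.normal([6.0, -6.0], 0.5, (10, 2))])  # extra circles, bottom right
t = np.r_[np.ones(40), -np.ones(50)]                    # +1 = cross, -1 = circle

Xb = np.c_[np.ones(len(X)), X]                          # prepend a bias column
w_ls, *_ = np.linalg.lstsq(Xb, t, rcond=None)           # least-squares classifier
logreg = LogisticRegression().fit(X, t)                 # logistic regression

# Each boundary is the line where the linear score equals zero.
print("least squares  (bias, w1, w2):", w_ls)
print("logistic regr. (bias, w1, w2):", logreg.intercept_, logreg.coef_.ravel())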
8. Amy has two coins, c_1 and c_2. c_1 is a fair coin with P(heads) = 0.5. c_2 is biased in favor of heads with P(heads) = 0.6. Amy blindly tossed one of the coins twice, and each time it came up heads.
A: Use the maximum likelihood (ML) approach to guess which coin Amy tossed. What is the probability of heads on the next toss? [6]
B: Using the above data, and assuming the prior belief that the coin is fair is 0.75 and that it is biased in favor of heads is 0.25, solve the above question using the maximum a posteriori (MAP) estimate. What is the probability of heads on the next toss in this case? [6]
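For reference, a minimal sketch of the two estimates (the probabilities are taken from the question; the variable names are illustrative only, not a model answer):

# ML vs. MAP choice of coin after observing two heads.
p_heads = {"c1 (fair)": 0.5, "c2 (biased)": 0.6}
prior   = {"c1 (fair)": 0.75, "c2 (biased)": 0.25}

likelihood = {c: p ** 2 for c, p in p_heads.items()}            # P(HH | coin)
posterior  = {c: prior[c] * likelihood[c] for c in p_heads}     # unnormalised P(coin | HH)

ml_coin  = max(likelihood, key=likelihood.get)
map_coin = max(posterior,  key=posterior.get)
print("ML  pick:", ml_coin,  "-> P(heads on next toss) =", p_heads[ml_coin])
print("MAP pick:", map_coin, "-> P(heads on next toss) =", p_heads[map_coin])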
Section B:
(answer all of the sub-questions on this sheet, DO NOT answer on scratch paper.)
1. Linear Regression (total 22 marks)
We are given N training data points of dimension D: Data = (X, y) = {x_n, y_n}, n = 1, 2, ..., N, where row n of X is x_n = [x_{n,1}, ..., x_{n,D}], y_n ∈ ℝ, and X ∈ ℝ^{N×D}.
A: The Least Squares solution finds w that minimizes the squared error function. Write down the expression for the error function. (NO need to derive the solution for w.) [2]
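For reference, one standard way of writing this error function under the notation above (the 1/2 factor is a common convention, not something the question requires):
\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y_n - \mathbf{x}_n^{\top}\mathbf{w}\bigr)^{2}
= \frac{1}{2}\,\lVert \mathbf{y} - X\mathbf{w}\rVert^{2}.
\]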
B: Suppose we have already found the optimal weights w. What is the expression for the predicted y_new based on our model at a new data point x_new = [x_{new,1}, ..., x_{new,D}]? [2]
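For reference, under the same notation the prediction is simply the linear model evaluated at the new point:
\[
\hat{y}_{\mathrm{new}} = \mathbf{x}_{\mathrm{new}}^{\top}\mathbf{w}.
\]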
C: Assume a linear regression model of the form t_n = x_n^T w + ε_n, where ε_1, ..., ε_N are noise terms sampled i.i.d. from the same zero-mean Gaussian distribution, that is, ε_n ∼ N(0, σ²). We are trying to use this model to fit the data using Maximum Likelihood (ML) estimation. Write down the formula for calculating the ML estimate of w. Show that the ML estimate is equivalent to the LS minimizer of the squared-error function. (NO need to solve the ML estimation.) [8]
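A sketch of the standard argument, for reference (notation follows the question):
\[
p(\mathbf{t}\mid X,\mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}\!\bigl(t_n \mid \mathbf{x}_n^{\top}\mathbf{w},\, \sigma^{2}\bigr)
\quad\Longrightarrow\quad
\ln p(\mathbf{t}\mid X,\mathbf{w}) = -\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}\bigl(t_n - \mathbf{x}_n^{\top}\mathbf{w}\bigr)^{2} + \text{const},
\]
so maximizing the log-likelihood over w is the same as minimizing the sum of squared errors.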
D: Now consider adding weight decay (a quadratic regularizer). Write down the expression for the error function. (NO need to derive the solution for w.) [2]
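For reference, one standard form, writing the regularization coefficient as λ ≥ 0 (the symbol is an assumption of notation only):
\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{x}_n^{\top}\mathbf{w}\bigr)^{2} + \frac{\lambda}{2}\,\mathbf{w}^{\top}\mathbf{w}.
\]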
E: Now further consider w as a random variable and specify a prior distribution p(w) on w that expresses our prior belief about the weights. Assume that w ∼ N(0, σ_w² I). Then we can estimate w using the MAP (maximum a posteriori) estimate. Show that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function. (NO need to solve the MAP estimation.) [8]
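A sketch of the argument, for reference (notation follows parts C and E above, with σ_w² the prior variance):
\[
\ln p(\mathbf{w}\mid \mathbf{t}, X) = \ln p(\mathbf{t}\mid X,\mathbf{w}) + \ln p(\mathbf{w}) + \text{const}
= -\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}\bigl(t_n - \mathbf{x}_n^{\top}\mathbf{w}\bigr)^{2} - \frac{1}{2\sigma_w^{2}}\,\mathbf{w}^{\top}\mathbf{w} + \text{const},
\]
so maximizing the posterior is the same as minimizing the regularized sum-of-squares error with λ = σ²/σ_w².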
