
Machine Learning & Neural Computation 424 2013/2014

Module leader & Lecturer: Dr Aldo Faisal Imperial College London


Coursework issue date: 26/10/2013
Coursework submission: 18/11/2013 (online via CATE; New submission date with 3 extra days)
Coursework return & feedback: within 14 days after submission
Coursework has to be submitted individually and will be marked as such. You are welcome to discuss
your work with other students and your GTAs during the lab sessions. All text-based files must
contain a commented line identifying the Name and CID of the submitter. Submissions may be
individually checked for plagiarism.
Download and open the coursework zip-file from
https://www.dropbox.com/s/f95x4km5kxbyewf/MLNCAssessedCoursework2013-2014.zip
This coursework focuses on supervised learning and has three questions: A, B and C.
Please read the whole coursework carefully and make sure you understand it before you start working
on it. Your coursework submission should not be longer than 2,000 words (shorter submissions are quite
welcome; code is of course excluded from the word count). All figures should conform to the
guidelines for visualising scientific data explained at the end of this coursework.
Space-time series of 1-bedroom/studio rents in London
You have been provided with a spatio-temporal dataset of over 57,000 rental prices (pcm) for single
bedroom/studio properties in London along with corresponding Lat/Long coordinates. Also provided
is a set of coordinates for all stations on the London Underground network.
This is a real dataset collected using automated scripts from current sources (our crawler has been
running since October 2012). Please note that, as with every real-world dataset, there may be some
corrupted or strange data points. For example, there may be a number of properties that are not located
within Greater London.
The aim of this project is to use supervised learning techniques from the course to learn, from the
supplied sample data, the mapping from geographical location and date to rental cost per calendar month.
For example, for an arbitrary location in Greater London and any date in the recent past or present, we
want to know the rental price predicted by your machine learning system. You can choose any of
the methods discussed in the course, provided that you implement them yourself from first principles.
Since this dataset contains geolocation information, you are invited to use Google Maps
(http://maps.google.com) to search for the specific locations of rental prices (e.g. paste longitude and
latitude numbers into the map search window).

Figure 1: Coordinates of properties (blue dots) in the dataset and tube stations (red circles). Axes: Longitude [°] (horizontal) and Latitude [°] (vertical). This figure was created using the Matlab commands below (footnote 1).

Footnote 1: Matlab commands to generate the figure
plot(rental(:,4),rental(:,3),'.')
hold on
plot(tube.location(:,2),tube.location(:,1),'ro')
axis equal
ylabel('Latitude [^\circ]')
xlabel('Longitude [^\circ]')

In your coursework zip directory you will find the following files:
rental.mat contains the location and price data that you will be using for the exercise,
stored in a variable named rental with four columns: 1. rent (rent per calendar month),
2. time stamp (see footnote 2), 3. latitude and 4. longitude, the geographic location in
degrees. In addition, it contains a struct named tube, which has two fields, station and
location, covering all London Underground stations.

trainRegressor.m an empty stub file that you have to complete

testRegressor.m an empty stub file that you have to complete

sanityCheck.m see explanation in Question B

In the following sub-questions you will be required to generate plots from data or your
results. Note: Plots that do not have appropriate labels (i.e. the name of the dimension and the
units in square brackets, e.g. Rent [£]) will not be counted. Figures should be
exported so that lines and data points are clearly visible and any text in the
figures is clearly readable. Use export figure to export the figure from Matlab and store it
as PNG. Untidy or grainy figures will not be considered.
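
As an illustration of these plotting requirements, here is a minimal sketch of a labelled figure exported as PNG. The data are synthetic placeholders, the file name rent_example.png and the unit label are assumptions, and the line width, font size and resolution are simply taken from the checklist at the end of this coursework.

```matlab
% Synthetic placeholder data (assumption): NOT taken from rental.mat
t    = linspace(0, 12, 50)';            % time in months
rent = 1200 + 10*t + 50*sin(t);         % made-up rents

fig = figure;
plot(t, rent, 'k-', 'LineWidth', 2);    % single colour: black, line at least 2pt
set(gca, 'FontSize', 16);               % tick labels at least 16pt
xlabel('Time [months]', 'FontSize', 16);
ylabel('Rent [GBP pcm]', 'FontSize', 16);   % unit label is an assumption
axis tight;                             % bounds just around the data

% Export as PNG at 400 DPI (hypothetical file name)
print(fig, 'rent_example.png', '-dpng', '-r400');
```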


Footnote 2: This date information is given in Matlab datenum format; use datevec to convert it into normal calendar dates.
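
A small sketch of that conversion, using the coursework issue date as the example value. The variable names are illustrative only.

```matlab
d = datenum(2013, 10, 26);   % serial date number for 26 October 2013
v = datevec(d);              % [year month day hour minute second]
ymd = v(1:3);                % ymd is [2013 10 26]

% datevec also accepts a column of date numbers and returns one row per date,
% so a whole time stamp column can be converted in a single call.
```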
A [20%]
Familiarise yourself with the data set. Plot the rental prices over time. We now want to know whether
the rental prices increased with time or remained constant. Fit a 0th-order and a 1st-order
polynomial basis function set to this plot.
a) Give your ML estimates of all parameters involved, including the noise in the model.
b) Which model is more likely: the 0th-order (flat prices) or the 1st-order (linearly increasing
prices)? Justify your answer with the data.
c) What may be a problem in comparing the 1st-order and 0th-order models using likelihoods?
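
To make the quantities in Question A concrete, the following is a minimal sketch of maximum-likelihood fitting for both models, assuming additive Gaussian noise with constant variance. The data are synthetic stand-ins (a deterministic wiggle takes the place of noise), and all variable names are illustrative. Time is rescaled to [0, 1] because raw datenum values are around 735,000 in 2013, which makes the 1st-order design matrix poorly conditioned.

```matlab
% Synthetic stand-in data (assumption): replace with real time stamps and rents
N = 200;
t = linspace(0, 1, N)';                  % time rescaled to [0, 1]
y = 1200 + 150*t + 80*cos(40*t);         % made-up rents

% Polynomial design matrices
Phi0 = ones(N, 1);                       % 0th order: constant only
Phi1 = [ones(N, 1), t];                  % 1st order: constant and slope

% ML weights are the least-squares solutions
w0 = Phi0 \ y;
w1 = Phi1 \ y;

% ML noise variance is the mean squared residual (divide by N, not N-1)
s0 = mean((y - Phi0*w0).^2);
s1 = mean((y - Phi1*w1).^2);

% Log-likelihood of each model evaluated at its ML parameters
logL0 = -N/2 * (log(2*pi*s0) + 1);
logL1 = -N/2 * (log(2*pi*s1) + 1);
fprintf('0th order: w = %.1f, sigma^2 = %.1f, logL = %.1f\n', w0, s0, logL0);
fprintf('1st order: w = [%.1f %.1f], sigma^2 = %.1f, logL = %.1f\n', w1, s1, logL1);
```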
B [45%]
Next, we are going to pool the data together across time (i.e. we just have price and position):
a) Visualise in a 3D plot (plot3) the rental price raw data for Greater London (rotate the view
appropriately to be as informative as possible).

b) Write the two functions trainRegressor and testRegressor mentioned above and train
your system with the data set. Your task is to write two functions (the function files are
provided for your convenience), so as to be able to answer the sub-questions described
below:
1. trainRegressor.m a Matlab function that accepts training inputs trainIn (a
two-column matrix of Longitude and Latitude pairs, where the rows correspond to
different training data points) and training outputs trainOut (a column vector of
rent prices, where the rental price in each row corresponds to the location in the same row of trainIn).
The function is to return a single variable (potentially a structure or matrix)
params.
>>params = trainRegressor ( trainIn , trainOut )
2. testRegressor.m a Matlab function that accepts testing inputs testIn (a
two-column matrix of Longitude and Latitude pairs, where the rows correspond to
different test data points) and params (the parameter variable created by
trainRegressor). The function should return a column vector of rental prices for
the corresponding locations called results.
>>results = testRegressor ( testIn , params )
To test your code, sanityCheck.m is a Matlab function that will check whether
your two functions match the specifications we use. Use the function with the
following command in Matlab:
>>sanityCheck ( @trainRegressor ,@testRegressor )
Note: If your functions do not pass the automatic tests performed by this function we
will be unable to automatically evaluate their performance and may deduct marks.
Your code should be documented with sufficient comments so that a stranger is able to follow
what you did and why. Any free parameters that you do not want to learn, or cannot learn, from the
provided data set have to be set as constants within the trainRegressor function.
Explain and discuss your solution approach and how you chose any parameters that you set as
constants. We suggest that you use Gaussian basis functions.
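
For orientation, here is a minimal sketch of linear regression with Gaussian basis functions on two-column inputs. It is one possible setup under stated assumptions, not the expected solution: the data are synthetic, the helper names sqdist and design are hypothetical, the centre grid and width are arbitrary constants, and the small ridge term lambda is an addition for numerical stability that the coursework does not prescribe.

```matlab
% Hypothetical helpers: squared distances and design matrix (bias + Gaussians)
% X: N-by-2 inputs, C: M-by-2 centres, s: common basis width
sqdist = @(X, C) bsxfun(@plus, sum(X.^2, 2), sum(C.^2, 2)') - 2*(X*C');
design = @(X, C, s) [ones(size(X,1),1), exp(-sqdist(X, C) / (2*s^2))];

% Synthetic stand-in data (assumption), shaped like Longitude/Latitude pairs
[lon, lat] = meshgrid(linspace(-0.5, 0.3, 25), linspace(51.3, 51.7, 25));
trainIn  = [lon(:), lat(:)];
trainOut = 800 + 1500*exp(-((trainIn(:,1)+0.1).^2 + (trainIn(:,2)-51.5).^2) / (2*0.1^2));

% Constants that are NOT learned from the data (illustrative choices)
[cx, cy] = meshgrid(linspace(-0.5, 0.3, 6), linspace(51.3, 51.7, 6));
params.centres = [cx(:), cy(:)];         % 36 basis centres on a grid
params.width   = 0.1;                    % basis width in degrees
lambda         = 1e-6;                   % small ridge term (my addition)

% "Training": regularised least squares for the weights
Phi = design(trainIn, params.centres, params.width);
params.w = (Phi'*Phi + lambda*eye(size(Phi,2))) \ (Phi'*trainOut);

% "Testing": predictions are a weighted sum of the same basis functions
testIn  = [-0.1, 51.5; 0.25, 51.35];
results = design(testIn, params.centres, params.width) * params.w;
trainRmse = sqrt(mean((Phi*params.w - trainOut).^2));
```

Keeping the centres and width inside params means the test step needs nothing beyond testIn and params, which matches the calling convention above.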

c) Test your two functions by calculating the root mean square error (RMSE) of your rental
price predictions, by performing cross-validation (you will have to choose suitable sizes for
the training and testing data chunks). How does the number of training points affect the
accuracy of your regression? Hint: an acceptable solution will typically have an RMSE lower than
950 per calendar month.
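
A minimal sketch of K-fold cross-validation for the RMSE follows. It assumes a placeholder regressor (plain linear least squares) in place of trainRegressor and testRegressor, synthetic data, and K = 5 chosen arbitrarily. The handles trainFn and testFn are hypothetical names that mirror the two calling conventions.

```matlab
% Placeholder regressor (assumption): linear least squares standing in
% for trainRegressor / testRegressor
trainFn = @(X, y) [ones(size(X,1),1), X] \ y;
testFn  = @(X, p) [ones(size(X,1),1), X] * p;

% Synthetic stand-in data (assumption)
N = 300;
X = [linspace(-0.5, 0.3, N)', 51.3 + 0.4*mod((1:N)'*0.618, 1)];
y = 1000 + 500*X(:,1) + 300*(X(:,2) - 51.5) + 20*sin(50*X(:,1));

K = 5;                                   % number of folds (illustrative)
fold = mod((0:N-1)', K) + 1;             % deterministic fold labels 1..K;
                                         % shuffle with randperm on real data
rmse = zeros(K, 1);
for k = 1:K
    isTest  = (fold == k);               % held-out chunk for this fold
    p       = trainFn(X(~isTest,:), y(~isTest));
    pred    = testFn(X(isTest,:), p);
    rmse(k) = sqrt(mean((pred - y(isTest)).^2));
end
fprintf('RMSE over %d folds: %.1f +/- %.1f\n', K, mean(rmse), std(rmse));
```

Repeating the loop while training on only a subset of the non-test rows gives a learning curve of RMSE against the number of training points.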

d) Calculate the predicted rental price at all London tube stations. What are the mean and standard
deviation of these rental prices?

e) Use a bar chart (bar) to visualise the rent profile for the Central Line from Ealing to
Stratford.

f) What is the predicted rent at Imperial College Campus (Latitude 51.499019, Longitude
-0.176256)? Discuss whether you believe the predicted value, based on the data and the machine
learning strategy you used.

g) What is the predicted rent at Upminster station? Discuss whether you believe the predicted value,
based on the data and the machine learning strategy you used.

h) Generate a finely spaced grid of sample points (e.g. you could use the Matlab function
meshgrid) and visualise the landscape of London rental prices using the function surf.
Add the tube stations (but not necessarily all tube station names) to the plot (using
plot3 and text). Rotate the view so that you see London from above; you will thus get a
heatmap. Use colorbar to further annotate the plot. This allows you to answer the
following questions:
1. What is your predictor's most expensive location in London and where is it located?
Provide price, latitude and longitude and use Google Maps to visualise your location.
2. Where is the price gradient largest, i.e. where is the transition from an expensive area
to a cheaper area steepest?
i) How well does your machine learner predict your own rent and what is the relative error?
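
For sub-question h), the grid evaluation and top-down view can be sketched as below. The predictor is a made-up closed-form function standing in for testRegressor, and the station coordinates and names are invented for illustration. The column order [latitude, longitude] for stations follows the plotting commands given for Figure 1.

```matlab
% Placeholder predictor (assumption): stands in for testRegressor(testIn, params)
predict = @(P) 800 + 1500*exp(-((P(:,1)+0.1).^2 + (P(:,2)-51.5).^2) / (2*0.1^2));

% Made-up stations (assumption): columns are [latitude, longitude]
stationLoc  = [51.50, -0.10; 51.55, 0.10];
stationName = {'Station A', 'Station B'};

% Finely spaced grid of [Longitude, Latitude] sample points
[lonG, latG] = meshgrid(linspace(-0.5, 0.3, 100), linspace(51.3, 51.7, 100));
priceG = reshape(predict([lonG(:), latG(:)]), size(lonG));

figure;
surf(lonG, latG, priceG, 'EdgeColor', 'none');
view(2);                                 % look straight down: reads as a heatmap
hold on;
zTop = (max(priceG(:)) + 1) * ones(size(stationLoc, 1), 1);  % above the surface
plot3(stationLoc(:,2), stationLoc(:,1), zTop, 'ko', 'LineWidth', 2);
text(stationLoc(:,2), stationLoc(:,1), zTop, stationName, 'FontSize', 16);
cb = colorbar;
ylabel(cb, 'Rent [GBP pcm]');            % unit label is an assumption
xlabel('Longitude [^\circ]');
ylabel('Latitude [^\circ]');

% Most expensive grid point
[maxPrice, idx] = max(priceG(:));
maxLat = latG(idx);
maxLon = lonG(idx);

% Gradient magnitude on the grid, in price units per degree
[dLon, dLat] = gradient(priceG, lonG(1,2) - lonG(1,1), latG(2,1) - latG(1,1));
slope = sqrt(dLon.^2 + dLat.^2);
[maxSlope, idxSlope] = max(slope(:));
```

Note that a degree of longitude and a degree of latitude cover different distances on the ground at London's latitude, so a slope in price per degree is only a rough proxy for steepness per kilometre.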
C [35%]
Now, we want to account for the passage of time and changes in rental prices. To this end we
will augment our data space to include position, price and time.
a) There are three main ways to achieve this:
1. Augment your sum of basis functions that operate on space (but not on time) with
additional basis functions that operate on time (but not on space).
2. Use basis functions that all take space and time as a joint argument.
3. Chunk the data in time and repeat the analysis of B on each chunk.
Discuss each approach individually and sketch out a solution approach in a paragraph each.
What could be challenges or problems to explain the data using each approach?
Choose approach 1 or 2, implement it, and show and compare your results with those of
approach 3. If your choice requires you to write new regressor functions (e.g. ones
that take 3-column input arguments, etc.), please call them trainRegressorTime
and testRegressorTime (you will have to generate the files yourself,
following the template from the earlier question). Plot the change in prices for each month of the
data set, as predicted by your regressor, for each station of the Central Line (choose a
suitable way of visualising all this in a single figure).

1. trainRegressorTime.m a Matlab function that accepts training inputs
trainIn (a three-column matrix of 1. Datevec dates, 2. Longitude and 3. Latitude,
where the rows correspond to different training data points) and training outputs
trainOut (a column vector of rent prices, where the rental price in each row corresponds
to the location in the same row of trainIn). The function is to return a single variable
(potentially a structure or matrix) params.
>>params = trainRegressorTime ( trainIn , trainOut )
2. testRegressorTime.m a Matlab function that accepts testing inputs
testIn (a three-column matrix of 1. Datevec dates, 2. Longitude and 3. Latitude,
where the rows correspond to different test data points) and params (the parameter
variable created by trainRegressorTime). The function should return a column vector
of rental prices for the corresponding locations called results.
>>results = testRegressorTime ( testIn , params )
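
To illustrate how approaches 1 and 2 differ at the level of the design matrix, here is a minimal sketch on three-column inputs. Everything in it is an assumption made for illustration: the synthetic data, the helper names sqdist and gauss, the centre grids, the widths, the rescaling of time to [0, 1], and the small ridge term.

```matlab
% Hypothetical helpers: squared distances and Gaussian basis activations
sqdist = @(X, C) bsxfun(@plus, sum(X.^2, 2), sum(C.^2, 2)') - 2*(X*C');
gauss  = @(X, C, s) exp(-sqdist(X, C) / (2*s^2));

% Synthetic stand-in data (assumption): columns are time, Longitude, Latitude
N   = 400;
t   = linspace(0, 1, N)';                      % time rescaled to [0, 1]
lon = -0.5 + 0.8*mod((1:N)'*0.7549, 1);        % quasi-random locations
lat = 51.3 + 0.4*mod((1:N)'*0.5698, 1);
trainIn  = [t, lon, lat];
trainOut = 800 + 1500*exp(-((lon+0.1).^2 + (lat-51.5).^2)/(2*0.1^2)) + 100*t;

% Fixed constants (illustrative): spatial and temporal centres and widths
[cx, cy] = meshgrid(linspace(-0.5, 0.3, 5), linspace(51.3, 51.7, 5));
Cs = [cx(:), cy(:)];   ss = 0.15;              % 25 spatial centres
Ct = linspace(0, 1, 4)';   st = 0.3;           % 4 temporal centres
lambda = 1e-6;                                 % small ridge term (my addition)

% Approach 1: spatial-only basis functions plus temporal-only basis functions
Phi1 = [ones(N,1), gauss(trainIn(:,2:3), Cs, ss), gauss(trainIn(:,1), Ct, st)];
w1   = (Phi1'*Phi1 + lambda*eye(size(Phi1,2))) \ (Phi1'*trainOut);
rmse1 = sqrt(mean((Phi1*w1 - trainOut).^2));

% Approach 2: every basis function takes (time, lon, lat) jointly.
% Dividing each column by its own width lets one unit-width Gaussian serve.
[ci, cj] = meshgrid(1:size(Cs,1), 1:numel(Ct));
C2    = [Ct(cj(:)), Cs(ci(:),:)];              % 100 joint centres
scale = [st, ss, ss];
Phi2  = [ones(N,1), gauss(bsxfun(@rdivide, trainIn, scale), ...
                          bsxfun(@rdivide, C2, scale), 1)];
w2    = (Phi2'*Phi2 + lambda*eye(size(Phi2,2))) \ (Phi2'*trainOut);
rmse2 = sqrt(mean((Phi2*w2 - trainOut).^2));
```

The additive form needs 25 + 4 basis functions here, while the joint form needs 25 x 4, which is one of the trade-offs worth discussing.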

b) Which coordinates displayed the highest increases in rental prices? Show and state your
results visualised on a London map.

c) Is the temporal change in rental prices a strong effect? We can test this by building a
naive Bayesian classifier that, given a rental price and location (and using your regressors
from the previous sub-questions to give you a likelihood in time), can predict whether it
was from the early (e.g. first half) or late (e.g. second half) part of the data set. Choose a
suitable and convincing way to test your classifier on the data.
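
One possible reading of this classifier is sketched below, under loudly stated assumptions: a separate regressor is fitted to the early and to the late data, each gives a Gaussian likelihood of the observed price at a location, and the class with the higher posterior wins. The linear placeholder regressor, the synthetic data and the hold-out split are all invented for illustration.

```matlab
% Placeholder regressor (assumption) and Gaussian log-density helper
trainFn  = @(X, y) [ones(size(X,1),1), X] \ y;
testFn   = @(X, p) [ones(size(X,1),1), X] * p;
logGauss = @(y, mu, s2) -0.5*log(2*pi*s2) - (y - mu).^2 ./ (2*s2);

% Synthetic stand-in data (assumption): late rents are 200 higher than early
N      = 400;
n      = (1:N)';
loc    = [-0.5 + 0.8*mod(n*0.7549, 1), 51.3 + 0.4*mod(n*0.5698, 1)];
isLate = (mod(n, 2) == 0);
price  = 1000 + 500*loc(:,1) + 200*isLate + 30*sin(77*n);

% Hold out half of the points, containing both classes
isTest = (mod(n, 4) < 2);
trE = ~isTest & ~isLate;                 % early training rows
trL = ~isTest &  isLate;                 % late training rows

% One regressor and one ML noise variance per class
pE  = trainFn(loc(trE,:), price(trE));
pL  = trainFn(loc(trL,:), price(trL));
s2E = mean((testFn(loc(trE,:), pE) - price(trE)).^2);
s2L = mean((testFn(loc(trL,:), pL) - price(trL)).^2);
priorL = mean(isLate(~isTest));          % class prior from training frequencies

% Log-posterior scores (up to a shared constant) on the held-out points
scoreE = log(1 - priorL) + logGauss(price(isTest), testFn(loc(isTest,:), pE), s2E);
scoreL = log(priorL)     + logGauss(price(isTest), testFn(loc(isTest,:), pL), s2L);
predLate = (scoreL > scoreE);
accuracy = mean(predLate == isLate(isTest));
fprintf('Held-out accuracy: %.2f (chance level is about 0.50)\n', accuracy);
```

On real data, an accuracy close to chance level would suggest that the temporal effect is weak relative to the spread of prices.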

IT IS VERY IMPORTANT THAT YOU FOLLOW THE FUNCTION CALL CONVENTIONS
THAT ARE SPECIFIED. WE USE AUTOTESTING SCRIPTS TO TEST YOUR CODE. IF YOUR
CODE BREAKS DOWN IN AUTOTESTING BECAUSE IT IS NOT FOLLOWING THE
SPECIFIED CONVENTIONS, THE GTAS CANNOT BE EXPECTED TO FIX YOUR CODE TO
MAKE IT WORK. CODE THAT DOES NOT WORK MAY RESULT IN A DOWNGRADE OF
YOUR GRADE.
PLEASE SUBMIT ALL THE CODE THAT GENERATED YOUR FIGURES OR RESULTS,
ALONGSIDE THE EXPLICITLY REQUESTED MATLAB FILES. YOUR REPORT SHOULD BE
SUBMITTED AS A PDF FILE.



How to visualise and present your scientific data for this coursework
(and many other applications):

A figure says more than 1000 words: ideally about what it should say, but otherwise it raises, in as
many words, doubts about the authors' abilities. Therefore, each figure deserves to look good. So make
figures in Matlab/Illustrator/Powerpoint/GIMP/TGIF etc.; it is well worth making a good-looking figure.
A figure is almost always preferable to text. Figures are usually viewed in a small form, so use large
fonts. It is often a good idea for the caption to let the reader understand the conclusion from the figure.
Ideally, the figures allow the reader to understand the entire paper without reading the text. Use the
same font across all figures and at most 3 sizes. A figure should not have more than 8 panels and
usually has at least 2. Reference all your figures and subfigures in your main text (everything here also
holds for tables, numbered equations, etc.). A figure that is not referenced will be considered invisible
by the reader. Also, if there is no meaningful way to reference a figure/subfigure in your main text,
this perhaps suggests that you do not need that figure. Your figure caption should explain in direct
terms what one can see in the figure. If it is a graph containing data, the type of graph (e.g. scatter
plot), the axes, and the data markers (e.g. circles, dots) should be linked to the entities discussed. The
main text should refer to the figure and explain first what the reader is supposed to see in the figure
(imagine the reader is blind: what message should they be able to extract from the figure?).

Ask yourself: Does each figure present evidence in favor of exactly one point that you are trying to
make with it, and does it present such evidence as clearly as you can imagine? E.g., if you are
conveying that one thing does better than another, are you showing relative performance, delta
performance or absolute performance? Absolute performance requires that the reader subtract in their
head, but may be important when absolute performance is in itself meaningful (e.g. patient survival);
relative performance imposes no such requirement, but the absolute value is lost; delta
performance may be helpful to convey that conditions differ from each other. This
is certainly the most important consideration: the only question I'd answer if I were standing on one
foot. The rest are details.

Check list for your figures:

Are all axes labeled, e.g. "Height"? Are the axes tipped with arrows?

Do all axis labels have units, and do you specify them using the square bracket
convention, e.g. "Height [cm]"?

If there is 1 color, is it black? If there are 2, are they black and grey
(or blue and red, but do not use red and green, please)?

If there are multiple lines/dots, is each a different line style and color?

Are all lines sufficiently thick? (If you used Matlab, and they are the default
thickness, the answer is no, it should be at least 2pt)

Are all font sizes legible? (If you used Matlab, and they are the default
font size, the answer is no; it should be at least 16-18pt.)

Is the figure exported at sufficiently high DPI (400+)?

If there are multiple lines, is there a legend, or other textual description
of each line?

If errorbars make sense, are they there? If there, does the caption explain
whether they are standard error or standard deviation? If they are not
either, is there a good justification provided for that?

Is your method in the figure named something other than "proposed method"
or "our method"? If not, name it and use that name throughout.

Are your axes tight (that is, are the bounds of the axes just larger than the
max and min of what you want to show)? If not, do you have a good reason
for the excess, e.g. because you need to show absolute levels?

Are all graphics that can be vector graphics actually vector graphics (unless
you export 600 DPI bitmaps, which may actually be the better way
sometimes in Matlab)?

Do all axes have either clear tick marks or gridlines indicating magnitudes
of everything?

Are all the letters/numbers fully visible (i.e., not obscured by part of the
figure)?

Is the aspect ratio correct? For data where the different axes have the same units, it may be important
to show equal units as equal lengths on all axes. (Hint: if you rescaled the width and height
separately, it is probably not.)

If the data are 2D, are you displaying it in 2D? (If not, remove that additional
3rd dimension, it is just confusing and obfuscatory.)

Does the caption begin with a sentence (fragment) stating what the figure
is demonstrating (i.e., why it is there)?

Does the caption end by pointing out particularly interesting aspects of the
figure that one should note?

Does the caption define all acronyms used in the figure (especially if they
are not used anywhere else)?