1 Introduction
Many people use Expedia to plan their vacations. Using data from previous bookings,
Expedia wants to improve its recommendation system in order to offer better service to
its users. This is a challenge for every computer engineer. Recommendation is a problem
not only for Expedia but also for other large companies that try to offer excellent
service to their customers, such as Amazon, Facebook, and Yelp.
2 Problem statement
In this project, we work with a large-scale dataset to predict users' potential
hotel bookings. Based on the features of previous users and hotels, the system
assigns each new user to the hotel cluster they are most likely to book.
3
Backward elimination is a form of stepwise regression. It first fits the model with all
variables to obtain a baseline result. The algorithm then deletes the variable whose
presence decreases the performance of the model. This process is iterated until no
further improvement is made.
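The elimination loop described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation (the report mentions Matlab). The `score` callback and the toy weights are hypothetical stand-ins for training and validating a real model on a feature subset.

```python
from typing import Callable, List, Sequence


def backward_elimination(features: Sequence[str],
                         score: Callable[[List[str]], float]) -> List[str]:
    """Greedy backward elimination.

    `score` is a hypothetical callback returning a validation score
    (higher is better) for a model trained on the given feature subset.
    """
    selected = list(features)
    best = score(selected)  # baseline result using every variable
    while len(selected) > 1:
        # Score each subset obtained by dropping exactly one variable.
        candidates = [(score([f for f in selected if f != drop]), drop)
                      for drop in selected]
        cand_score, drop = max(candidates)
        if cand_score <= best:  # no removal improves the model: stop
            break
        selected.remove(drop)
        best = cand_score
    return selected


# Toy score (assumption): "a" and "b" help, "noise" hurts.
def toy_score(subset: List[str]) -> float:
    weights = {"a": 0.3, "b": 0.2, "noise": -0.1}
    return sum(weights[f] for f in subset)


result = backward_elimination(["a", "b", "noise"], toy_score)
print(result)
```

In practice the score would come from a held-out validation set, so that a variable is removed only when the model generalizes better without it.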
In our project, we use the following model:
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{r-1} X_{r-1} + \varepsilon$.
X is the input vector, and $\beta$ holds the weight of each element in the input rows. We use
the Moore-Penrose pseudoinverse to calculate $\beta$: $\beta = A^T (A A^T)^{-1} y$, where y is
the initial output setting.
As we learned in class, softmax regression is the generalization of logistic regression
to multi-class problems, whereas logistic regression is the typical method for separating
two classes. The input matrix, after we rearranged it, is similar to that of MNIST digit
classification, except that the goal here is to distinguish one hundred different clusters.
The logistic hypothesis is
$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$
and the softmax cost function is
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y_i = k\} \log \frac{\exp(\theta_k^T x_i)}{\sum_{j=1}^{K} \exp(\theta_j^T x_i)}$
The gradient is:
$\nabla_{\theta_k} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} x_i \left( 1\{y_i = k\} - P(y_i = k \mid x_i; \theta) \right)$
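To make the softmax cost and gradient concrete, the sketch below evaluates both on a tiny made-up dataset. It is an illustrative Python version under stated assumptions: classes are numbered from 0, `theta[k]` is the weight vector of class k, and the data are invented for the example. It is not the code used in the project.

```python
import math
from typing import List, Tuple


def softmax_probs(theta: List[List[float]], x: List[float]) -> List[float]:
    """P(y = k | x; theta) for every class k."""
    scores = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in theta]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cost_and_gradient(theta: List[List[float]],
                      X: List[List[float]],
                      y: List[int]) -> Tuple[float, List[List[float]]]:
    """Average cross-entropy cost J(theta) and its gradient per class."""
    m, K, n = len(X), len(theta), len(X[0])
    cost = 0.0
    grad = [[0.0] * n for _ in range(K)]
    for x_i, y_i in zip(X, y):
        p = softmax_probs(theta, x_i)
        cost -= math.log(p[y_i]) / m
        for k in range(K):
            indicator = 1.0 if y_i == k else 0.0
            for j in range(n):
                # -(1/m) * x_i * (1{y_i = k} - P(y_i = k | x_i; theta))
                grad[k][j] -= x_i[j] * (indicator - p[k]) / m
    return cost, grad


# Made-up data (assumption): 3 classes, 2 features, all-zero initial weights.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [0, 1, 2]
theta = [[0.0, 0.0] for _ in range(3)]
cost, grad = cost_and_gradient(theta, X, y)
print(cost, grad)
```

With all-zero weights every class gets the same probability, so the cost equals log K; a gradient descent step would then update each `theta[k]` by subtracting a multiple of `grad[k]`.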
3.2.2
Although this is a supervised learning problem, we also tried an unsupervised
learning method. To implement K-means clustering, we first used PCA for dimensionality
reduction on the data, including the label features, and then used the K-means package in
Matlab to do the computing. After computing, we get a cluster label for each user
vector, which serves as the hotel cluster.
The K-means clustering method is based on the following objective:
$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2$,
where $\mu_i$ is the mean of cluster $S_i$. After we get the mean vector of each cluster, we
compute the distance between each testing vector and each mean vector, and then assign the
label of the closest cluster to the testing item.
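The final assignment step, computing each cluster mean and labeling a test vector with the closest one, can be sketched as below. This is a small Python illustration with invented points and labels. It assumes the cluster labels are already available, and it does not reproduce the PCA step or the Matlab K-means call.

```python
import math
from typing import Dict, List, Sequence


def cluster_means(points: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean vector of every cluster, keyed by cluster label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for p, lab in zip(points, labels):
        acc = sums.setdefault(lab, [0.0] * len(p))
        for j, v in enumerate(p):
            acc[j] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def assign(x: Sequence[float], means: Dict[int, List[float]]) -> int:
    """Label of the cluster whose mean is closest to x (Euclidean distance)."""
    return min(means, key=lambda lab: math.dist(x, means[lab]))


# Made-up training vectors and cluster labels (assumption).
train = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 1, 1]
means = cluster_means(train, labels)
predicted = assign([9.0, 9.0], means)
print(means, predicted)
```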
3.2.3
The second approach is the item-based K-nearest neighbor (KNN) algorithm. KNN is a non-parametric method used for classification and regression. In both cases, the input consists of
the k closest training examples in the feature space. The output depends on whether KNN is
used for classification or regression: a class chosen by majority vote among the neighbors,
or the average of the neighbors' values.
Since the dataset already provides a label for each user, we can use the labeled
training data as sample data and compute the distance between the testing data and the
training data with the pdist2 function.
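A minimal KNN classifier along these lines is sketched below in Python. Here `math.dist` plays the role that pdist2 plays in the Matlab workflow; the training points and labels are invented for illustration, and Euclidean distance is an assumption.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_X: Sequence[Sequence[float]],
                train_y: Sequence[str],
                x: Sequence[float],
                k: int) -> str:
    """Majority label among the k training points closest to x."""
    # Sort training indices by Euclidean distance to the query point.
    order: List[int] = sorted(range(len(train_X)),
                              key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up data (assumption): three "A" points near the origin, two "B" points.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["A", "A", "A", "B", "B"]
query = [4.6, 4.6]
print(knn_predict(train_X, train_y, query, k=1))
print(knn_predict(train_X, train_y, query, k=5))
```

On this toy data the nearest neighbor is a "B" point, but with k=5 the three distant "A" points outvote it, which shows how a large k can pull in neighbors that are not actually close to the query.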
3.2.4
Classification Tree
A decision tree can visually represent decision-making results based on all of the input
features. Each internal node represents a decision on a feature, and each leaf node contains
the response for the inputs that reach it. A classification tree can divide each feature into
many different sections. Since some of Expedia's input features differ by large orders of
magnitude, and tree splits depend only on the ordering of values rather than their scale,
a decision tree is well suited to this data.
3.2.5
Since a single decision tree is quite likely to give a wrong prediction, building several
decision trees from different sets of training data can improve the results. Using
the idea of k-fold cross validation, a decision tree is generated for each of the k folds,
and the most accurate tree is recorded. At the same time, all of the generated decision trees
are saved for future use. Then, for each input, results are predicted by all of the decision
trees, and the most frequent result is used as the final prediction. If no single result is
the most frequent, the prediction of the most accurate decision tree from the k-fold step
is used instead.
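The voting rule can be sketched as follows. This Python fragment assumes the per-tree predictions for one input are already available as a list, and it interprets the fallback case as a tie for the most frequent result; both the function name and that interpretation are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def ensemble_predict(tree_predictions: List[int], best_tree_index: int) -> int:
    """Majority vote over the trees, falling back to the most accurate tree.

    `tree_predictions[i]` is the cluster predicted by the tree built on fold i;
    `best_tree_index` points at the tree with the highest fold accuracy.
    """
    counts = Counter(tree_predictions).most_common()
    top_label, top_count = counts[0]
    # If another label is equally frequent, no result wins the vote.
    if len(counts) > 1 and counts[1][1] == top_count:
        return tree_predictions[best_tree_index]
    return top_label


print(ensemble_predict([3, 7, 3, 9], best_tree_index=1))
print(ensemble_predict([3, 7, 9], best_tree_index=1))
```

The first call has a clear majority, so the vote decides; in the second every tree disagrees, so the prediction of the tree at `best_tree_index` is returned.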
3.2.6
The Support Vector Machine (SVM) produces nonlinear boundaries by constructing a linear
boundary in a large, transformed version of the feature space. SVMs are supervised learning
models with associated learning algorithms that analyze data for classification and
regression analysis. An SVM builds a model from the training data and assigns
new examples to one category or the other, making it a non-probabilistic binary linear
classifier.
In this project, we used a polynomial kernel as the SVM kernel:
$K(x, x') = (1 + \langle x, x' \rangle)^d$,
where d is the degree of the polynomial. The classifier of the SVM is then
$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + \beta_0$,
which is the solution function; we use this function to assign the label for each user. In this
equation, $\alpha$ can be solved with the Matlab function quadprog, which finds the minimum of a
specified quadratic programming problem, and $\beta$ can be solved by
$\beta = \sum_{i=1}^{N} \alpha_i y_i x_i$, with $y_i \in \{-1, 1\}$.
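Once the multipliers are known, evaluating the classifier is straightforward. The Python sketch below evaluates the polynomial-kernel solution function for two labels in {-1, 1}. The support vectors, multipliers, offset, and degree are made-up values for illustration; in the project they would come from solving the quadratic program (for example with quadprog), which is not reproduced here.

```python
from typing import List, Sequence


def poly_kernel(x: Sequence[float], z: Sequence[float], d: int = 2) -> float:
    """Polynomial kernel (1 + <x, z>)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** d


def decision(x: Sequence[float],
             support_vectors: List[List[float]],
             labels: List[int],
             alphas: List[float],
             beta0: float,
             d: int = 2) -> float:
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + beta0."""
    return sum(a * y * poly_kernel(x, sv, d)
               for a, y, sv in zip(alphas, labels, support_vectors)) + beta0


def classify(x: Sequence[float], *model) -> int:
    """Sign of the solution function, mapped to a label in {-1, 1}."""
    return 1 if decision(x, *model) >= 0 else -1


# Hypothetical trained model (assumption): values are not from a real solver.
model = ([[1.0, 1.0], [-1.0, -1.0]],  # support vectors x_i
         [1, -1],                     # labels y_i
         [0.5, 0.5],                  # multipliers alpha_i
         0.0)                         # offset beta0
print(decision([2.0, 1.0], *model), classify([2.0, 1.0], *model))
print(decision([-2.0, -1.0], *model), classify([-2.0, -1.0], *model))
```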
4 Results
a. Without feature selection
Method                                                    Dataset               Accuracy
KNN, K=100                                                100 Classes; 20000    4%
KNN with confusion matrix, K=1                            100 Classes; 20000    24.49%
K-means cluster                                           100 Classes; 200000   16.59%
Classification Tree                                       100 Classes; 20000    10.55%
Classification Tree                                       100 Classes; 200000   21.04%
Classification Tree with k-fold cross validation (k=18)   -                     22.38%
Method                                                    Dataset               Accuracy
-                                                         100 Classes; 20000    9.25%
-                                                         100 Classes; 20000    24.69%
-                                                         100 Classes; 200000   21.22%
Classification Tree                                       -                     20.845%
Classification Tree with k-fold cross validation (k=18)   -                     22.25%
Dataset: 100 Classes; 20000 | 100 Classes; 20000 | 10 Classes; 20000 | 100 Classes; 200000 | 100 Classes; 200000
Accuracy: 13.5% | 40.95% | 14.1% | 31.05% | 21.98% | 23.97%
The Classification Tree without k-fold cross validation runs much faster than the one with k-fold
cross validation. As the k value increases, the output predictions become more accurate.
In our results, increasing k from 1 to 20 improves the accuracy by about 1.3% to 2%, so
larger k values might give even better results. Increasing the amount of input data also
improves the predictions dramatically: when the number of data points increases from 20,000 to
200,000, the accuracy doubles. The method therefore handles large input data well.
The best of all the methods is the KNN algorithm with PCA feature selection. When assigning
the labels, K=100 produces more errors than K=1, so the accuracy for K=1 is better than
for K=100.
Overall, the learning methods reached accuracies between 4% and 40.95%, compared with a chance
level of 1% for 100 equally likely clusters.