
Expedia Hotel Recommender System

Bing Zhang, Qingxuan Li


Department of Electrical and Computer Engineering
University of California, Davis

1 Introduction
Many people use Expedia to plan their vacations. Using its historical booking data, Expedia wants to improve its recommendation system in order to offer better service to its users. This is a challenge for every computer engineer, and recommendation systems are not only a problem for Expedia but also for other large companies that try to offer excellent services to their customers, such as Amazon, Facebook, and Yelp.

2 Problem statement
In this project, we deal with a large-scale dataset to predict users' potential hotel bookings. Based on previous users' features and hotel features, the system assigns each new user to the hotel cluster they are most likely to book.
3 Feature Selection, Algorithm and Implementation
3.1 Feature Selection
There is a huge amount of data from Expedia that needs to be analyzed. Since not all features are necessary, some of them would decrease the accuracy, and choosing the best dimensions of the data is an important step in this project: a good feature combination can improve the accuracy. Here are the feature selection methods used in our design.
3.1.1 Backward elimination algorithm

Backward elimination is part of stepwise regression: the model starts with all variables to obtain a baseline result, then removes the variable whose deletion most improves the performance of the model, iterating until no further improvement is made.
In our project, we use the following linear model:

$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{r-1} X_{r-1}$

where $X$ is the input vector and the $\beta$ coefficients are the weights of the elements in each vector row. We use the Moore-Penrose pseudoinverse to calculate $\beta = (A^T A)^{-1} A^T y$, where $A$ stacks the input vectors and $y$ is the vector of observed outputs.
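A minimal Matlab sketch of this elimination loop, assuming validation error as the performance measure (A, y, Aval, and yval are our own placeholder names for the training and validation data):

    keep = 1:size(A, 2);                       % start from all features
    beta = pinv(A(:, keep)) * y;               % Moore-Penrose least squares
    best = norm(Aval(:, keep) * beta - yval);  % baseline validation error
    improved = true;
    while improved && numel(keep) > 1
        improved = false;
        for j = 1:numel(keep)
            trial = keep([1:j-1, j+1:end]);    % drop candidate feature j
            beta = pinv(A(:, trial)) * y;
            err = norm(Aval(:, trial) * beta - yval);
            if err < best                      % removal helped, keep it out
                best = err;  keep = trial;
                improved = true;  break;
            end
        end
    end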

3.1.2 PCA dimension reduction


It is very hard for any algorithm to train and test on high-dimensional data. The dataset in this project has more than twenty features, which makes it hard for each method to train and test with good accuracy, because some of those features hurt the accuracy of the output.
PCA is one of the standard methods for dimension reduction. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. In this project, we use the PCA package from Matlab to do the dimension reduction, which yields better output with the k-nearest-neighbor algorithm.
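As a rough sketch, the Matlab call looks like the following (retaining five components is our assumption, not a number taken from the project):

    [coeff, ~, ~, ~, explained] = pca(Xtrain);  % principal directions
    k = 5;                           % assumed number of retained components
    mu = mean(Xtrain, 1);            % pca centers the data around this mean
    Ztrain = (Xtrain - mu) * coeff(:, 1:k);   % reduced training features
    Ztest  = (Xtest  - mu) * coeff(:, 1:k);   % identical projection for tests
    % explained(1:k) reports the variance captured by the kept components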
3.2 Implementation method
This project can be approached in two ways: supervised learning and unsupervised learning. Since the dataset already has labels, supervised learning is the natural fit for this kind of problem. However, after rearranging the data, we can also apply an unsupervised method. For supervised learning we use the k-nearest-neighbor algorithm, softmax regression, support vector machines, and classification trees; for unsupervised learning we use k-means clustering.
3.2.1 Softmax regression (Multinomial logistic regression)

As we learned in class, softmax regression is the generalization of logistic regression that can address multi-class problems; logistic regression itself is a typical method for separating two classes. The input matrix after rearrangement resembles the MNIST digit classification task, except that the goal here is to distinguish one hundred different clusters.
We use the same hypothesis as logistic regression:

$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$

and the model parameters are trained to minimize the cost function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]$

For $K$ classes, the cost function is rewritten as:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y_i = k\} \log \frac{\exp(\theta_k^T x_i)}{\sum_{j=1}^{K} \exp(\theta_j^T x_i)}$

and its gradient is:

$\nabla_{\theta_k} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} x_i \left( 1\{y_i = k\} - P(y_i = k \mid x_i; \theta) \right)$
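A minimal batch gradient-descent sketch of this update in Matlab (the learning rate, iteration count, and variable names are our assumptions):

    % X: m-by-d inputs, y: m-by-1 labels in 1..K, Theta: d-by-K weights.
    [m, d] = size(X);
    K = max(y);                               % number of classes
    Theta = zeros(d, K);
    alpha = 0.1;                              % assumed learning rate
    for it = 1:500                            % assumed iteration count
        S = X * Theta;
        S = S - max(S, [], 2);                % stabilize the exponentials
        P = exp(S) ./ sum(exp(S), 2);         % P(y_i = k | x_i; Theta)
        Y = full(sparse((1:m)', y, 1, m, K)); % one-hot indicator 1{y_i = k}
        Theta = Theta + (alpha/m) * (X' * (Y - P));  % step against gradient
    end
    [~, yhat] = max(X * Theta, [], 2);        % predicted cluster per row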

3.2.2 K-means Cluster Method

Although this is a supervised learning problem, we also try to solve it with an unsupervised method. To implement k-means clustering, we first use PCA for dimension reduction (including the label features), then use the k-means package in Matlab to do the computing. Afterwards we obtain a cluster label for each user vector, which serves as the hotel cluster.
K-means clustering is based on the following objective:

$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $\mu_i$ is the mean of cluster $S_i$. After we get the mean vector for each cluster, we compute the distance between each testing vector and each mean vector, and then assign the closest cluster's label to the testing item.
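A sketch of this pipeline using Matlab's kmeans and pdist2, where Ztrain and Ztest are the PCA-reduced features from Section 3.1.2 (the majority-vote mapping from cluster index to hotel-cluster label is our own reading of the procedure):

    k = 100;                                  % one cluster per hotel cluster
    [idx, C] = kmeans(Ztrain, k, 'Replicates', 3);  % centroids of PCA data
    labelOf = zeros(k, 1);
    for c = 1:k
        labelOf(c) = mode(ytrain(idx == c));  % majority label in cluster c
    end
    [~, nearest] = min(pdist2(Ztest, C), [], 2);  % closest centroid per row
    yhat = labelOf(nearest);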
3.2.3 Item-Based K Nearest Neighbor (KNN) Algorithm

The second approach is the item-based k-nearest neighbor (KNN) algorithm. KNN is a nonparametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space; the output depends on whether KNN is used for classification or regression.
Since the dataset already provides a label for each user, we can use the labeled training data as sample data and compute the distance between testing data and training data with the pdist2 function.
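A sketch of that distance computation and label assignment (variable names are placeholders):

    D = pdist2(Xtest, Xtrain);              % pairwise Euclidean distances
    [~, order] = sort(D, 2);                % nearest training rows first
    K = 1;                                  % K = 1 gave our best accuracy
    yhat = mode(ytrain(order(:, 1:K)), 2);  % majority label of K neighbors
    accuracy = mean(yhat == ytest);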

3.2.4 Classification Tree

A decision tree can visually represent decision-making over all of the input features: each internal node represents a decision on one feature, and each leaf node contains the response for that input. A classification tree can divide each feature into many different sections. Since some of Expedia's input features differ by large orders of magnitude, a decision tree is well suited to this data.
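A single tree along these lines can be grown with Matlab's fitctree (a sketch, not our exact script):

    tree = fitctree(Xtrain, ytrain);   % CART-style splits on raw features
    yhat = predict(tree, Xtest);
    accuracy = mean(yhat == ytest);
    view(tree, 'Mode', 'graph');       % visual representation of the splits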

3.2.5 Classification Tree with k-fold cross validation

Since a single decision tree is likely to produce wrong predictions, training additional decision trees on different subsets of the training data should improve the results. Using k-fold cross validation, one decision tree is generated per fold and the most accurate tree is identified; at the same time, all of the generated trees are saved for later use. Then, for each input, predictions are collected from all of the trees, and the most frequently repeated result becomes the final prediction. If all of the trees disagree, the most accurate tree from the k-fold evaluation is used instead, as in the sketch below.
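A sketch of this voting scheme with Matlab's cvpartition and fitctree (assuming numeric hotel-cluster labels):

    k = 18;
    cv = cvpartition(ytrain, 'KFold', k);
    trees = cell(k, 1);  foldAcc = zeros(k, 1);
    for f = 1:k
        tr = training(cv, f);  te = test(cv, f);
        trees{f} = fitctree(Xtrain(tr, :), ytrain(tr));
        foldAcc(f) = mean(predict(trees{f}, Xtrain(te, :)) == ytrain(te));
    end
    votes = zeros(size(Xtest, 1), k);
    for f = 1:k
        votes(:, f) = predict(trees{f}, Xtest);  % one prediction per tree
    end
    [yhat, freq] = mode(votes, 2);               % most repeated prediction
    [~, best] = max(foldAcc);                    % most accurate fold's tree
    tie = (freq == 1);                           % all k trees disagreed
    yhat(tie) = votes(tie, best);                % fall back to the best tree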
3.2.6 Support Vector Machine

A support vector machine produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. SVMs are supervised learning models with associated learning algorithms that analyze data for classification and regression. An SVM builds a model from the training data and assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
In this project, we used a polynomial kernel for the SVM:

$K(x, x') = (1 + \langle x, x' \rangle)^d$

The resulting classifier is:

$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + \beta_0, \qquad y_i \in \{-1, 1\}$

and this solution function assigns the label for each user. The $\alpha_i$ in this equation can be solved with the quadprog function in Matlab, which finds the minimum of a specified quadratic program, and the weight vector can then be recovered as $\beta = \sum_{i=1}^{N} \alpha_i y_i x_i$.
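For one binary subproblem, the dual can be handed to quadprog roughly as follows (the kernel degree d and box constraint C are assumed values, and ytrain is a column of -1/+1 labels):

    d = 2;  C = 1;                         % assumed kernel degree and penalty
    K = (1 + Xtrain * Xtrain').^d;         % polynomial kernel matrix
    H = (ytrain * ytrain') .* K;           % quadratic term of the dual
    f = -ones(size(ytrain));
    alpha = quadprog(H, f, [], [], ytrain', 0, ...   % sum_i alpha_i y_i = 0
                     zeros(size(f)), C * ones(size(f)));
    sv = alpha > 1e-6;                     % support vectors
    on = sv & alpha < C - 1e-6;            % points exactly on the margin
    b0 = mean(ytrain(on) - K(on, sv) * (alpha(sv) .* ytrain(sv)));
    Ktest = (1 + Xtest * Xtrain(sv, :)').^d;
    yhat = sign(Ktest * (alpha(sv) .* ytrain(sv)) + b0);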

4 Results
a. Without feature selection

    Method                                                      Dataset                 Accuracy
    KNN (K = 100)                                               100 classes; 20,000     4%
    KNN with confusion matrix (K = 1)                           100 classes; 20,000     24.49%
    K-means cluster                                             100 classes; 200,000    16.59%
    Classification Tree                                         100 classes; 20,000     10.55%
    Classification Tree                                         100 classes; 200,000    21.04%
    Classification Tree with k-fold cross validation (k = 18)   100 classes; 200,000    22.38%

b. With backward elimination algorithm

    Method                                                      Dataset                 Accuracy
    KNN (K = 100)                                               100 classes; 20,000     9.25%
    KNN with confusion matrix (K = 1)                           100 classes; 20,000     24.69%
    K-means cluster                                             100 classes; 200,000    21.22%
    Classification Tree                                         100 classes; 200,000    20.845%
    Classification Tree with k-fold cross validation (k = 18)   100 classes; 200,000    22.25%

c. With PCA dimension reduction

    Method                                                      Dataset                 Accuracy
    KNN (K = 100)                                               100 classes; 20,000     13.5%
    KNN with confusion matrix (K = 1)                           100 classes; 20,000     40.95%
    Softmax regression                                          10 classes; 20,000      14.1%
    K-means cluster                                             100 classes; 200,000    31.05%
    Classification Tree                                         100 classes; 200,000    21.98%
    Classification Tree with k-fold cross validation (k = 18)   100 classes; 200,000    23.97%

5 Conclusion and Evaluation


Due to the limited performance of our computers, we were not able to train and test on all of the data provided by Expedia, but we can still see the differences among the algorithms.
The results in the first table were obtained without any feature optimization. Compared with the other two tables, its accuracy is the lowest for the softmax, KNN, and k-means algorithms, which shows that both optimization methods improve accuracy for these three learning methods. However, backward elimination hurts the accuracy of the classification tree methods, while PCA improves it. PCA also performs better than backward elimination for the KNN and k-means methods. In this case, we conclude that PCA dimension reduction is the better feature selection method.
The SVM algorithm is not included in the result tables because classifying the data takes too much time; with so many classes, training also takes a very long time. This timing issue indicates that SVM is not a good method for training and testing over a huge number of classes.
Softmax regression is the method used for the MNIST digits in class. Its low accuracy here may be caused by the poor input data features: Expedia's inputs vary over ranges of hundreds, and such non-unified data hurts the prediction results considerably. Even after applying backward elimination and PCA, there are still noisy features in the input data that can hurt the final results. In addition, softmax contains only one layer of learning; it is too hard for it to learn all possible combinations for an output of one hundred outcomes. From our results, softmax is clearly not a good method for problems with many possible outcomes.
For k-means clustering, we delete the labels first, then use the Matlab package to compute the mean vector for each cluster, and finally find similar data to assign the labels. This is not the standard way to solve a supervised learning problem; however, compared with the others, k-means can handle a large number of classes.

The classification tree without k-fold cross validation runs much faster than the one with it. As k increases, the predictions become more accurate: in our results, increasing k from 1 to 20 improves the accuracy by about 1.3% to 2%, so even larger k values might do better. Increasing the amount of input data improves the predictions dramatically; when the data size grows from 20,000 to 200,000, the accuracy doubles. This makes it a good method for handling large inputs.
The best of all the methods is the KNN algorithm with PCA feature selection. When assigning labels, K = 100 produces more erroneous assignments than K = 1, so the accuracy for K = 1 is better than for K = 100.
Overall, all of the learning methods achieved good predictions.
