
PREDICTIVE MODEL FOR ONLINE PURCHASES - An Implementation Overview

1. PROBLEM DESCRIPTION
Consumers today go through a complex decision-making process before buying a product or service, online or offline. Most consumers spend time researching products, reading expert and user reviews, comparing prices, and even searching for coupons and incentives. It is therefore becoming increasingly important for Retailers to precisely target and convert a consumer in the shortest time possible. This paper discusses a model that helps Online Retailers tap into a wide variety of consumer behavioural and transactional data to improve conversion rates. The main objective of our Predictive Model was to predict, with high precision, the likelihood of online purchases by consumers using site visitor data, past purchase/transaction data, product catalog, and pricing. A secondary objective was to identify the features (also known as variables or parameters) that significantly impact the purchase decision from among all candidate features.

2. DATA DESCRIPTION
The training data set had 27,855 online visits and 15 predictor variables; the test data set had 9,285 online visits and the same 15 predictors. The predictor variables covered a wide gamut of behaviour and purchase attributes:

- Customer ID and customer demographics (Var1, Var2)
- Normalized product and past purchase attributes (Var3 to Var12)
- Visit-related attributes from weblogs (Var13 to Var15)

The Predicted Response (Target) variable took the value 1 when the visit resulted in a purchase and 0 when it did not.

3. TECHNOLOGY
R and WEKA were used for model building. The input and output files were in CSV format, as in the sketch below.
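
For illustration, loading such data in R might look like the following minimal sketch; the file names and the Target column name are assumptions, not details taken from our implementation.

    # Load the training and test sets from CSV (file names are illustrative)
    train <- read.csv("train.csv")   # 27,855 visits, 15 predictors, Target
    test  <- read.csv("test.csv")    # 9,285 visits, 15 predictors (labels if available)
    train$Target <- as.factor(train$Target)  # treat 0/1 as class labels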

4. STATISTICAL MODEL
The aim of our predictive model was to classify the target variable as 1 when the consumer makes a purchase and 0 when the consumer does not, i.e. browses without buying. Our hybrid Predictive Model used Logistic Regression to identify the significant variables and Random Forests to improve the accuracy of prediction. In Random Forest, we fine-tuned the following parameters to improve prediction accuracy (a code sketch follows the list):

- mtry: the number of predictors sampled at each split. The default value is sqrt(p) for classification and p/3 for regression.
- nodesize: the minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus takes less time). Note that the default values differ for classification (1) and regression (5).
- ntree: the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
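
A minimal R sketch of the hybrid setup, continuing from the data frames loaded in Section 3; the tuning values are illustrative assumptions, not the exact values used in our study.

    library(randomForest)

    # Step 1: logistic regression flags the significant predictors
    logit <- glm(Target ~ ., data = train, family = binomial)
    summary(logit)  # p-values suggest which of Var1..Var15 matter

    # Step 2: random forest with the tuned parameters
    set.seed(42)
    rf <- randomForest(Target ~ ., data = train,
                       mtry = 4,      # ~sqrt(15) predictors tried per split
                       nodesize = 1,  # classification default
                       ntree = 500)   # every row gets predicted several times
    print(rf)       # out-of-bag error estimate and confusion matrix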

We also used Bagging and Boosting techniques to improve the predictive power of our classification, using the R libraries ipred and gbm, as sketched below.
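
A hedged sketch of both techniques on the same data; the hyperparameter values are assumptions, not the settings tuned in our study.

    library(ipred)  # bagging
    library(gbm)    # boosting

    # Bagging: 25 bootstrap replicates of classification trees
    bag <- bagging(Target ~ ., data = train, nbagg = 25)

    # Boosting: gbm expects a numeric 0/1 response for "bernoulli"
    train$Target01 <- as.numeric(as.character(train$Target))
    boost <- gbm(Target01 ~ . - Target, data = train,
                 distribution = "bernoulli",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)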

5. FEATURE SELECTION
As a classifier, Random Forest performs an implicit feature selection, using only a small subset of "strong" variables for the classification, which underlies its superior performance on high-dimensional data. The outcome of this implicit feature selection can be visualized via the "Gini importance" and used as a general indicator of feature relevance. The Gini importance indicates how often a particular feature was selected for a split and how large its overall discriminative value was for the classification problem under study.

The variable importance plot shows that five variables impact the performance of our algorithm the most. Applying the principle of parsimony, we selected these five predictors for our final Model, since adding further predictors did not bring down the error rate. The sketch below shows how this reads off the fitted forest.
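
In R, the Gini importance comes straight from the fitted forest; this sketch continues from the rf object above, and the five-predictor refit is illustrative.

    # Mean decrease in Gini impurity per predictor
    importance(rf)
    varImpPlot(rf)  # ranked dot plot; the top five stand out

    # Refit on the five most important predictors only
    gini <- importance(rf)[, "MeanDecreaseGini"]
    top5 <- names(sort(gini, decreasing = TRUE))[1:5]
    rf5  <- randomForest(Target ~ ., data = train[, c(top5, "Target")],
                         mtry = 2, ntree = 500)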

6. PERFORMANCE OF PREDICTION

We used the following error measure to calculate the accuracy of prediction:

E = \sum_{\text{test data}} \big( 0.8\, I_{\{\text{true value} = 1\}} + 0.2\, I_{\{\text{true value} = 0\}} \big) \, \big| \text{scaled predicted value} - \text{true value} \big|
Here I denotes the indicator function. The penalty for misclassifying a purchase as a visit is four times the penalty for misclassifying a visit as a purchase, per the 0.8 and 0.2 weights above (a 4:1 penalty ratio). By aggregating this weighted classification error across all test observations, we can determine the performance of the prediction.
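
Assuming the test set carries true labels in a Target column (an assumption, since competition-style test sets may omit them), the weighted error can be computed as in this sketch:

    # Weighted prediction error over the test set (column name assumed)
    truth <- test$Target  # 0/1
    prob  <- predict(rf5, newdata = test, type = "prob")[, "1"]
    E <- sum(ifelse(truth == 1, 0.8, 0.2) * abs(prob - truth))
    E  # lower is better; missed purchases are penalized 4:1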

Table indicating the performance of our Model across all techniques:


METHOD                CLASSIFICATION ACCURACY    PREDICTION ERROR
Logistic Regression   81%                        814.6
Random Forest         85.8%                      267
IBk                   84.6%                      377
J48                   85.2%                      NA
LMT                   82.6%                      NA

(IBk, J48, and LMT are WEKA implementations of k-nearest-neighbour, the C4.5 decision tree, and the logistic model tree, respectively.)

7. RESULTS ACHIEVED
- Our hybrid model correctly classified Buy vs. Browse in more than 85% of instances.
- Random Forest with five predictors gave the best prediction error, at 267.
- Five of the 15 candidate predictors were significant.

8. GENERAL APPLICABILITY OF THIS MODEL
A variation of our Predictive Model can be adapted to various Customer Segmentation problems such as Recommendation engines, Targeted Emails/Incentives, and Customer Churn/Retention.

About Serendio
The name Serendio is derived from the expression "Serendipitous Discovery". Unearthing the not-so-obvious and seemingly unrelated from data, both structured and unstructured, is a hallmark of our technology, and our name reinforces that. Our Big Data Science solutions help drive decisions and actions for a wide variety of businesses in Retail, Insurance, Media, Education, and Healthcare. Our mission is to help every company transform itself into a data-driven organization.

Contact Us
Website: www.serendio.com
Email: info@serendio.com
Headquarters: 4677 Old Ironsides Drive, Suite 450, Santa Clara, CA 95054, United States
Phone: +1 408 496 9930
