Data mining Report

Analysis of customers' buying behavior on an e-commerce website

Data Mining Project Report

Contents

1. Executive Summary
2. Introduction
3. Understanding the problem
4. Literature Review
5. Data Collection
6. Data Preprocessing
7. Data Modeling
   a. Modeling Steps
8. PASW Modeler
   a. Analysis with PASW Modeler
9. Summary of Results
10. Data Analysis
11. Managerial & Business Applications
12. References

Executive Summary
India's internet user base has grown steadily and stood at about 150 million as of 2013. India now has the third-largest internet population in the world, after China (575 million) and the US (275 million). This has driven the rise of the country's online shopping industry over the last five years. Internet penetration is still low, at about 12%, which points to further opportunities in this area. India is riding the e-commerce wave at present: the market size was pegged at $14 billion in 2012, with a year-on-year growth rate of about 70%.
Against this backdrop, identifying customer segments and their behavioral patterns over different time intervals is an important application for businesses, especially in the last tier of the online retail chain, the electronic business-to-customer (B2C) relationship. This is particularly important in dynamic markets, where customers are driven by constantly changing competition and demand. Such analysis can support the prediction of churn, that is, which customers are about to abandon their loyalty to the company. Customized service is also vital for a company to establish a lasting and pleasant relationship with its consumers. It has been observed that keeping old customers generates more profit than attracting new ones, so customer retention is a major factor too. Managers therefore always have to optimize a trade-off between customer benefits and transaction costs. The increased availability of individual consumer data makes it possible to target individual customers directly: the abundance of customer information enables marketers to use individual-level purchase models for direct marketing and targeting decisions.
However, such an enormous amount of data can be cluttered, and drawing meaningful conclusions from it in raw form is cumbersome. This is where customer behavior prediction using data mining techniques becomes useful. The purpose of this project is to study, implement and analyze various data mining tools and techniques, and then to analyze the sample (raw) data to obtain a meaningful interpretation. The data mining methods used are Neural Networks and Decision Tree Analysis. C&RT, a recursive partitioning method, builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). RFM analysis is based on both sound reasoning and empirical evidence of customer behavior. It evaluates customers on the basis of recency, frequency and monetary value spent. For example, people who have bought from you recently are much more likely to respond to a new offer than someone who made a purchase in the distant past. A neural network model was trained and the hypothesis was tested. An accuracy of 80% was achieved when the relation between product type and payment method was evaluated.
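The RFM idea described above can be illustrated with a minimal sketch. It assumes each transaction is a (customer id, purchase date, amount) tuple; the function name `rfm_scores`, the sample records and the reference date are hypothetical and are not taken from the Next.go.in data.

```python
from datetime import date

def rfm_scores(transactions, today):
    """Return {customer: (recency_in_days, frequency, monetary_total)}."""
    summary = {}
    for customer, day, amount in transactions:
        # Start from (this purchase date, 0 purchases, 0.0 spent) for a new customer.
        last, count, total = summary.get(customer, (day, 0, 0.0))
        summary[customer] = (max(last, day), count + 1, total + amount)
    return {
        customer: ((today - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }

# Hypothetical sample records: (customer id, purchase date, amount spent)
sample = [
    ("C1", date(2012, 12, 3), 1200.0),
    ("C1", date(2012, 12, 28), 800.0),
    ("C2", date(2012, 12, 5), 450.0),
]
rfm = rfm_scores(sample, today=date(2012, 12, 31))
```

In this sketch a lower recency value means a more recent purchase, so customer C1 (recent, frequent, higher spend) would rank above C2 on all three dimensions.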

Introduction

E-commerce has changed the face of most business functions in competitive enterprises. Internet technologies have seamlessly automated interface processes between customers and retailers, retailers and distributors, distributors and factories, and factories and their myriad suppliers. In general, e-commerce and e-business (henceforth referred to as e-commerce) have enabled what we call online transactions. Generating large-scale real-time data has also never been easier. With data on many views of business transactions readily available, it is only fitting to use data mining to make business sense of these data sets. In the last few years, Next.go.in grew from a simple website for online auctions into a full-scale e-commerce enterprise that processes huge volumes of data to create a better shopping experience. Data mining is an essential element in creating a great experience at Next.go.in. Data mining is a systematic way of extracting information from very large data sources. Techniques include pattern mining, trend discovery and prediction. For Next.go.in, data mining plays an important role in the following areas: identifying consumer behavior; identifying associations between purchase value and factors such as the location, age and gender of the customer; and finding out which types of customers are more likely to respond to a certain promotion mailing. All this information helps to generate content targeted at the high-value consumer. Every organization needs to understand its business operation, inventory and sales patterns in order to see the results of its strategy and formulate its future course of action. Inventory intelligence requires data mining to process items and map them to the correct product category. This involves text mining, natural language understanding and machine learning techniques.
Successful inventory classification also helps provide a better search experience and gives the user the most relevant product. There is a growing need for data mining, and it has huge potential for e-commerce sites. The success of an e-commerce company is determined by the experience it offers its users, which these days is linked to understanding data. Next.go.in is an e-commerce website that offers varied products such as laptops, portable devices and accessories. In this project we have tried to find the strategic correlation between the various attributes in the transactional data of Next.go.in. The business objective was to discover the buying behavior of the consumer on the one hand and, on the other, to find the appropriate positioning of advertisements for revenue generation, customer profiling and avenues for launching new products. The correlation derived would help in making crucial business decisions.

Understanding the problem: Project Objectives


The data of www.next.go.in, an e-commerce website, was obtained. The following issues were faced and steps taken:
1. The data was not preprocessed and thus required thorough checking and the removal of outliers and missing data.
2. The data was then analyzed for customers' buying patterns according to gender, age, payment method and product type, using PASW Modeler as the software and Neural Networks and Decision Trees as the data mining and classification techniques.
3. Patterns were tested and a positive relation was established between a consumer's gender, age and location and the purchase amount and product type.
4. The model was used to answer questions such as: Who are the most and least valuable customers to the business? What are their distinct characteristics? What are customers' purchase behavior patterns? Which types of customers are more likely to respond to a certain promotion mailing? What are the sales patterns from various perspectives such as products / items, regions and time (weekly, monthly, quarterly, yearly and seasonally)?

Literature Review

A multi-technique data mining approach to exploring consumer behaviours. Kuriakose Athappilly, Muhammad A. Razi and J. Michael Tarn.
Learnings: Discriminant analysis works well for solving relatively less complex problems. Neural networks work best for predictive modelling and market basket analysis.

Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Daqing Chen, Sai Laing Sain and Kun Guo.
Learnings: Understanding the data preprocessing technique. RFM model-based clustering analysis.

Data Collection
1. The data used for analysis is the entire transactional data of www.next.go.in for one month, from 1st Dec, 2012 to 31st Dec, 2012, amounting to 2034 valid transactions in total.
2. The data is primary data from the company's records.
3. The data set consists of independent and dependent variables:
   Independent Variables: Age, Location, Gender
   Dependent Variables: Purchase Amount, Product Type

Figure 1: A snapshot of the data file with the fields visible

Data Preprocessing
The data that was procured had many anomalies in the form of noise, missing fields, outliers and inconsistent data. It therefore could not be used as it was and had to be processed, or cleaned, to correct and remove all such anomalies before it was used to create the models.
Data Cleaning: Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. (Han & Kamber, 2006)
To prepare the data, we started by cleaning it and making it ready to be used as a mini-database. A few instances of garbage data were found, such as random values (noise), missing values and incomplete values. Such values were identified and removed. If a record was missing data essential to the analysis, the entire row was removed to keep the data unbiased. White spaces were removed, and the data was analyzed to identify the attributes that could serve as inputs and outputs for testing our hypotheses; that information was then extracted into separate sheets for computation.
The Date attribute in the original raw data was broken into two parts, Date and Time, forming two new attributes. This was necessary for the analysis because it allows two transactions made on a single day by a single consumer to be counted as two distinct transactions. The time of purchase can also be recorded as Morning, Midday, Evening or Night for each consumer.
A Purchase Value parameter was formed by multiplying the quantity of items purchased by the price.
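The cleaning and derivation steps described above can be sketched as follows. This is an illustration only, not the procedure actually run for the report: the field names (`customer`, `timestamp`, `quantity`, `price`), the timestamp format and the hour boundaries for Morning, Midday, Evening and Night are all assumptions.

```python
from datetime import datetime

def time_of_day(hour):
    # Bucket boundaries are assumptions; the report does not state them.
    if 5 <= hour < 11:
        return "Morning"
    if 11 <= hour < 16:
        return "Midday"
    if 16 <= hour < 21:
        return "Evening"
    return "Night"

def preprocess(rows):
    cleaned = []
    for row in rows:
        # Strip white space from every string field.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # Drop the entire record if any field is missing or empty.
        if any(v is None or v == "" for v in row.values()):
            continue
        # Split the combined timestamp into separate Date and Time attributes.
        stamp = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M")
        cleaned.append({
            "customer": row["customer"],
            "date": stamp.date().isoformat(),
            "time": stamp.strftime("%H:%M"),
            "time_of_day": time_of_day(stamp.hour),
            # Purchase value = quantity of items purchased x price.
            "purchase_value": int(row["quantity"]) * float(row["price"]),
        })
    return cleaned

# Hypothetical raw records; the third has a missing quantity and is dropped.
raw = [
    {"customer": " C1 ", "timestamp": "2012-12-03 09:15", "quantity": "2", "price": "499.50"},
    {"customer": "C1", "timestamp": "2012-12-03 19:40", "quantity": "1", "price": "1200"},
    {"customer": "C2", "timestamp": "2012-12-05 13:05", "quantity": "", "price": "450"},
]
clean = preprocess(raw)
```

Because the time is kept as its own attribute, the two purchases by C1 on the same date remain two distinct transactions.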

Modeling
Models Used: Neural Networks and Decision Tree Analysis
Why Neural Networks?
1. They are used extensively in the business world as predictive models
2. Each neuron takes many inputs and generates an output that is a non-linear function of the weighted sum of those inputs
3. They mimic the way learning occurs in the brain

Why Decision Tree Analysis?
1. It is extensively used in predictive models
2. It is used to map observations about an item to conclusions about its target value

Modeling Steps:
1. Generate the data stream for each hypothesis, first importing the data from an Excel sheet.
2. Use Auto Data Prep to prepare the data: manipulating, cleaning and transforming it.
3. Partition the data into 80% for training and 20% for testing.
4. Apply the modeling techniques of Neural Networks and Decision Tree Analysis, adding table and analysis nodes to them to view the output in different forms.
5. Analyze the most important factors responsible for predicting the buying behavior of a consumer.
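Two of the ideas above, the 80/20 partition and a neuron as a non-linear function of a weighted sum, can be sketched in a few lines. This is not what PASW Modeler does internally: the shuffle seed, the sigmoid activation and the example weights are assumptions chosen for illustration.

```python
import math
import random

def partition(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def neuron_output(inputs, weights, bias):
    """Logistic (sigmoid) function of the weighted sum of the inputs."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Stand-in for the 2034 valid transactions: record indices only.
train, test = partition(range(2034))

# Hypothetical inputs and weights for a single neuron.
activation = neuron_output([0.5, 1.0], weights=[0.4, -0.2], bias=0.0)
```

With 2034 records, this split yields 1627 training and 407 testing records; a fixed seed keeps the split reproducible between runs.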

PASW Modeler
PASW Modeler uses a three-tier design. Users manipulate icons and options in the front-end application on Windows operating systems. This front-end client application then communicates with the Clementine Server software, or directly with a database or dataset. The most common configuration in large corporations is to house the Clementine Server software on a powerful analytical server box (Windows, UNIX, Linux), which then connects to the corporate data warehouse. Data processing commands are automatically converted from the icon-based user interface into command code (which is not visible) and sent to the Clementine Server for processing. Where possible, this command code is further compiled into SQL and processed on the data warehouse.
According to Rexer's Annual Data Miner Survey in 2010: [2]
1. PASW (along with STATISTICA Data Miner and R) received the strongest satisfaction ratings as a data mining tool in both 2010 and 2009.
2. PASW Modeler and PASW Text Analytics (now called PASW Modeler Premium) were rated the second (17%) and fourth (7%) most used text mining software, respectively.

Analysis with PASW Modeler


Step 1: Feeding data through Variable File

Step 2: Data Preparation

Step 3: Input

Step 4: Setting Nodes

Step 5: Stream Generation
1. Neural Network Model
2. Decision Tree Model

Step 6: Result. Location is the most important variable in determining customer purchase value, followed by payment mode and purchase time.

Summary of Results
The results found from the data set using the modeling techniques were as follows:
1. The model developed helped in identifying the factors influencing the purchase variable.
2. Location is the single largest factor influencing the customer purchase value.
3. The decision tree model helps retailers cater to the exact needs of their customers.

Data Analysis
To start with, we did a visual analysis of the raw data to get familiar with it, understand it and extract the information that mattered to us. We also looked for trends in age/gender vs. products and age/gender vs. price. We used the neural network modeling technique because we were trying to predict how sales are related to independent factors such as age, location and payment mode. We used a Decision Tree because it gives tree-based logic for the way a consumer makes a decision. We then compared the predicted data against the original data, observed how the model trained itself and which transformations it applied, and from this worked out how the models predicted the results. Finally, we analyzed the output and saw that purchase value depends most strongly on the location of the customer, followed by purchase time and payment mode. The website's content, promotions and policies therefore need to take these factors into account, so that the majority of its high-value customers are served and the rate of new customer additions is high.

Managerial Implications
1. Rough-cut prediction of future revenues and profit margin based on the analysis of customer purchase value
2. Given the customer profile, we can predict the purchase item
3. Identify key markets and devise targeted marketing strategies to consolidate the site's position in those key markets
4. The sales profile across each location aids in better inventory control and supply chain optimization
5. Identify the weaknesses in areas where the site is not the preferred choice for e-retail and implement measures to improve market share

References
1. Data mining by Han & Kamber
2. Multiple Regression model for market capitalization: Ko, Kenneth. Journal of Global Business Issues, Summer/Fall 2009, Vol. 3 Issue 2.
3. Annual Report TMKPL
4. The Brand Manager's Statistical Package: SPSS's Categories Module. By: Schibrowksy, John A.; Collins, Robert H. Journal of Personal Selling & Sales Management, Summer 1990, Vol. 10 Issue 3
5. Research Paper 4 SPSS PASW Modeler
6. SPSS for Psychologists
