The e-commerce domain provides all the right ingredients for successful data
mining, and it has been claimed to be a killer domain for data mining. With the
proliferation of electronic commerce, e-purchasing has
become a daily practice for many purchasing organizations. Data
mining has been used in e-commerce for some time already. It
has many applications in this field, such as searching for patterns
in transactional data and preparing personalization applications
[1]. However, before this kind of data analysis is
possible, the e-commerce system itself needs to be successfully
implemented.
INTRODUCTION
Data mining is the process of extracting knowledge hidden in large volumes of
raw data.
Human analysts with no special tools can no longer make sense of enormous
volumes of data that require processing in order to make informed business
decisions. Data mining automates the process of finding relationships and patterns
in raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.
Data mining and e-commerce go hand in hand, as many examples have
shown. With the help of data mining we can predict future events, which in turn
allows earlier decisions to be reassessed.
These are all the questions that can probably be answered if information hidden
among megabytes of data in your database can be found explicitly and utilized.
Modeling the investigated system and discovering the relations that connect variables in a
database are the subject of data mining.
CHAPTER 1
WHY USE DATA MINING?
Data might be one of the most valuable assets of your corporation – but only if
you know how to reveal valuable knowledge hidden in raw data. Data mining
allows you to extract diamonds of knowledge from your historical data and predict
outcomes of future situations. It will help you optimize your business decisions,
increase the value of each customer and communication, and improve customer
satisfaction with your services.
In all these cases data mining can help you reveal knowledge hidden in data
and turn this knowledge into a crucial competitive advantage. Today, more and
more companies acknowledge the value of this new opportunity and turn to
Megaputer for leading-edge data mining tools and solutions that help optimize
their operations and increase their bottom line.
CHAPTER 2
AN OVERVIEW OF E-COMMERCE
Web merchandising, as distinct from marketing for example, focuses on how to
acquire products and how to make them available. Electronic commerce affects
the acquisition of products because (as illustrated best by Dell Computer
Corporation) the supply chain can be integrated tightly with the customer interface.
Even more intriguing from the data mining perspective, since customers
interact with the computer directly, product assortments, virtual product
displays, and other merchandising interfaces can be modified dynamically, and
can even be personalized for individual customers, as shown in Figure 1.
CHAPTER 3
TASKS SOLVED BY DATA MINING
3.1 Predicting
A task of learning a pattern from examples and using the developed model to
predict future values of the target variable.
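As a minimal illustration of the predicting task, the sketch below learns a straight-line model from a handful of examples and uses it to predict the target for an unseen input. Ordinary least squares is only one of many possible learning methods; it is chosen here for brevity. The function names and the training data (site visits versus orders) are hypothetical and do not come from the report.

```python
def fit_line(xs, ys):
    """Learn y = slope * x + intercept from examples by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new value of the input variable."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training examples: (site visits, orders placed).
visits = [1, 2, 3, 4]
orders = [3, 5, 7, 9]

model = fit_line(visits, orders)
print(predict(model, 5))  # predicted target value for an unseen input
```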
3.3 Clustering
A task of identifying groups of records that are similar between themselves but
different from the rest of the data. Often, the variables providing the best
clustering should be identified as well.
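A small sketch of the clustering task follows. It uses k-means, which is one common clustering technique and is not named in this report. The function names, the choice of the first k points as initial centroids, and the two-variable customer data are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (assignment, centroids)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment, centroids


# Hypothetical records described by two numeric variables.
customers = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centers = k_means(customers, 2)
print(labels)
```

Records with the same label are similar to each other and different from the records in the other group, which is the property the definition above asks for.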
CHAPTER 4
DATA MINING APPLICATIONS
4.2 Searching for Patterns in Transactional Data
Data mining algorithms often produce a mass of patterns, much smaller than the
original mountain of data, but still in need of post-processing.
Creating individual consumer profiles for personalized recommendation (or
for other purposes, such as providing dynamic content or tailored advertising)
exacerbates this problem, because now one may be searching for patterns
individually for each of millions of consumers.
As we’ve mentioned, electronic commerce systems allow unprecedented
flexibility in merchandising. However, flexibility is not a benefit unless one
knows how to map the many options to different situations. For example, how
should different product assortments or merchandising cues be chosen? Answering
such questions requires the analysis and evaluation of web merchandising [1]. Specifically,
one can analyze “clickstreams”, the series of links followed by customers on a site.
The thesis in [1] is that the effectiveness of many on-line merchandising tactics can be
analyzed by a combination of specified metrics and visualization techniques
applied to clickstreams.
A detailed case study of the analysis of clickstream data from a
web retailer is provided in [1]. The study shows how the breakdown of clickstreams into
subsegments can highlight potential problems in merchandising. For example, one product has many
click-throughs but a low click-to-buy rate. Subsequent analysis shows that it has a
high basket-to-buy rate, but a low click-to-basket rate. This analysis would allow
merchandisers to begin to develop informed hypotheses about how performance
might be improved. For example, since this is a high-priced product, one might
hypothesize that customers were lured to the product page and then turned off by
the product’s high price. If this were true, there are several different actions that
might be appropriate (reduce the price, convince the customer that the product is
worth its high price, target the lure better so as not to “waste clicks”, etc.).
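The relationship between the three rates in this example can be made concrete with a short calculation. The sketch below assumes the usual reading of the terms: click-to-basket is baskets divided by clicks, basket-to-buy is purchases divided by baskets, and click-to-buy is purchases divided by clicks. These definitions, the function name, and the counts are assumptions for illustration, not figures from the study in [1].

```python
def micro_conversion_rates(clicks, baskets, buys):
    """Break the click-to-buy rate of a product into its two stages."""
    click_to_basket = baskets / clicks if clicks else 0.0
    basket_to_buy = buys / baskets if baskets else 0.0
    click_to_buy = buys / clicks if clicks else 0.0
    return {
        "click_to_basket": click_to_basket,
        "basket_to_buy": basket_to_buy,
        "click_to_buy": click_to_buy,
    }


# Hypothetical counts for one high-priced product.
rates = micro_conversion_rates(clicks=2000, baskets=40, buys=32)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Because click-to-buy is the product of the other two rates, a low click-to-buy rate combined with a high basket-to-buy rate points to the click-to-basket stage as the place where customers are lost.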
We also study measuring and improving the success of web sites. In
particular, success should be evaluated in terms of the
business goal of the web site (e.g., retail sales), and treatments should not be
limited to measurement alone, but should also suggest concrete avenues for
improvement. We discuss the discovery of navigation patterns, presenting a brief
but comprehensive survey of the state of the art, and also presenting a method
that addresses some of its deficiencies.
CHAPTER 5
A GENETIC ALGORITHM BASED DATA MINING
5.1.1 Assumptions
The main goal of this project was to identify the profiles of
e-purchasing adopters. It was also very important to determine the
most significant factors in terms of the firm’s perceived
importance of managerial benefits as well as the obstacles
related to the possible implementation of an e-purchasing
system. All the knowledge that can be discovered with this
approach is hidden in the data and can be extracted with virtually no
assumptions. The only requirement that must be fulfilled prior to
the investigation of this kind of relation in the data is to divide
the attribute space into two disjoint subspaces representing the
premise and consequent parts of the decision rules.
5.1.2 Searching For Patterns
The theory of genetic algorithms is based on the process of natural selection,
according to which nature aims at creating organisms that are
adapted to the surrounding environment in the best possible way.
In genetic algorithms, possible solutions to the analyzed problem
are encoded into so-called chromosomes. Chromosomes consist
of genes that represent a solution numerically. Possible values
that can be assigned to a particular gene are determined by its
allele (domain).
We assume that the chromosomes produced and
modified in the course of evolution (a sequence of
generations) represent patterns covering records in the data set.
Each such pattern has a coverage in the data
(support), which is given by the number of records matching the
pattern.
For example (an asterisk marks a gene whose value is not set):
***1******5*1********3*****
Figure 3: Example of a chromosome (set positions: genes
no. 4, 11, 13, 22).
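The chromosome in Figure 3 can be read mechanically. The sketch below treats a chromosome as a string in which an asterisk is an unset gene, lists the set positions, and counts the records matching a pattern (its support). The single-character encoding of attribute values, the function names, and the tiny data set are assumptions made for illustration.

```python
WILDCARD = "*"


def set_positions(chromosome):
    """Return the 1-based positions of genes that carry a concrete value."""
    return [i for i, gene in enumerate(chromosome, start=1) if gene != WILDCARD]


def matches(chromosome, record):
    """A record matches when it agrees with the pattern on every set gene."""
    return all(g == WILDCARD or g == r for g, r in zip(chromosome, record))


def support(chromosome, dataset):
    """Support is the number of records matching the pattern."""
    return sum(matches(chromosome, record) for record in dataset)


figure_3 = "***1******5*1********3*****"
print(set_positions(figure_3))  # [4, 11, 13, 22]

# Hypothetical data set with four attributes, one character per attribute value.
dataset = ["1213", "1313", "2213"]
print(support("1*1*", dataset))  # 2
```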
t := 0;
P(t) := InitializePopulation(no_of_attributes, attributes_domains);
while (t < max_number_of_generations) do
    EvaluateFitness(P(t), dataset);
    t := t + 1;
    P(t) := Select(P(t-1));
    Crossover(P(t));
    Mutate(P(t));
end while;
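The loop above can be sketched in runnable form as follows. The report does not specify the selection, crossover or mutation operators at this point, so binary tournament selection, one-point crossover of neighbouring pairs and per-gene random resetting are used here purely as placeholder assumptions. All function names, parameters and default values are hypothetical, and the fitness function is passed in by the caller.

```python
import random

WILDCARD = "*"


def initialize_population(size, domains, rng):
    """Each gene is either unset or a random value from its own domain."""
    return [[rng.choice([WILDCARD] + list(d)) for d in domains] for _ in range(size)]


def select(population, scores, rng):
    """Binary tournament (an assumption): the fitter of two random picks survives."""
    chosen = []
    for _ in population:
        a, b = rng.randrange(len(population)), rng.randrange(len(population))
        winner = a if scores[a] >= scores[b] else b
        chosen.append(list(population[winner]))
    return chosen


def crossover(population, rng):
    """One-point crossover of neighbouring pairs, in place (needs >= 2 genes)."""
    for i in range(0, len(population) - 1, 2):
        a, b = population[i], population[i + 1]
        cut = rng.randrange(1, len(a))
        a[cut:], b[cut:] = b[cut:], a[cut:]


def mutate(population, domains, rate, rng):
    """Reset each gene with probability `rate` to a random allowed value."""
    for chromosome in population:
        for g, domain in enumerate(domains):
            if rng.random() < rate:
                chromosome[g] = rng.choice([WILDCARD] + list(domain))


def evolve(fitness, domains, pop_size=20, max_generations=30,
           mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = initialize_population(pop_size, domains, rng)
    for _ in range(max_generations):
        scores = [fitness(c) for c in population]   # EvaluateFitness(P(t))
        population = select(population, scores, rng)  # P(t) := Select(P(t-1))
        crossover(population, rng)
        mutate(population, domains, mutation_rate, rng)
    return population


# Toy fitness for demonstration only: reward chromosomes with more set genes.
final = evolve(lambda c: sum(g != WILDCARD for g in c), domains=["12", "123", "12"])
print("".join(final[0]))
```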
FitnessEvaluation(chromosome)
    set_positions := CountSetPositions(chromosome);
    if (HasSupportinData(chromosome)) then
        support := CalculateSupport(chromosome);
        fitness := support * set_positions;
    else
        partial_support := CalculatePartialSupport(chromosome);
        fitness := partial_support * threshold_support;
    end if;
    FitnessEvaluation := fitness;
end FitnessEvaluation;
CalculatePartialSupport(chromosome)
    partial_support := 0;
    for each record in dataset
        matching := CountGenesMatchingRecord(chromosome, record);
        matching_ratio := matching / length_of_chromosome;
        partial_support := partial_support + matching_ratio;
    end for;
    CalculatePartialSupport := partial_support / no_of_records;
end CalculatePartialSupport;
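The two routines above can be written as a short runnable sketch. Several details are not stated in the pseudocode and are assumed here: a chromosome is a string with an asterisk for an unset gene, "has support in data" means that at least one record matches the whole pattern, and an unset gene counts as matching any record value when partial support is computed. The Python names mirror the pseudocode but are otherwise hypothetical.

```python
WILDCARD = "*"


def count_set_positions(chromosome):
    return sum(gene != WILDCARD for gene in chromosome)


def gene_matches(gene, value):
    # Assumption: an unset gene is compatible with any record value.
    return gene == WILDCARD or gene == value


def calculate_support(chromosome, dataset):
    """Number of records that agree with the pattern on every gene."""
    return sum(
        all(gene_matches(g, v) for g, v in zip(chromosome, record))
        for record in dataset
    )


def calculate_partial_support(chromosome, dataset):
    """Average fraction of genes matched per record."""
    partial_support = 0.0
    for record in dataset:
        matching = sum(gene_matches(g, v) for g, v in zip(chromosome, record))
        partial_support += matching / len(chromosome)
    return partial_support / len(dataset)


def fitness_evaluation(chromosome, dataset, threshold_support):
    set_positions = count_set_positions(chromosome)
    support = calculate_support(chromosome, dataset)
    if support > 0:  # assumed meaning of HasSupportinData
        return support * set_positions
    return calculate_partial_support(chromosome, dataset) * threshold_support


# Hypothetical data set: four attributes, one character per value.
dataset = ["1213", "1313", "2213"]
print(fitness_evaluation("1*1*", dataset, threshold_support=1.0))  # 4
print(fitness_evaluation("3*3*", dataset, threshold_support=1.0))  # 0.5
```

Patterns with no matching record still receive a small, graded fitness, so the search can move toward supported patterns instead of treating all unsupported chromosomes as equally bad.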
Another very important feature of the proposed genetic
algorithm is a multi-point crossover option. In many experiments
on different types of data, this approach
was found to be much more effective with respect to both the
number of discovered patterns, and the time of convergence.
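A generic multi-point crossover can be sketched as below: several cut points are drawn at random and the two parents exchange every other segment. The report does not give the number of cut points or how they are chosen, so the uniform random choice, the function name and its parameters are assumptions for illustration.

```python
import random


def multi_point_crossover(parent_a, parent_b, points, rng=random):
    """Exchange alternating segments of two equal-length string chromosomes."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    # Distinct cut positions strictly inside the chromosome.
    cuts = sorted(rng.sample(range(1, len(parent_a)), points))
    child_a, child_b = [], []
    swap = False
    previous = 0
    for cut in cuts + [len(parent_a)]:
        source_a, source_b = (parent_b, parent_a) if swap else (parent_a, parent_b)
        child_a.extend(source_a[previous:cut])
        child_b.extend(source_b[previous:cut])
        swap = not swap  # switch donor parent after every cut point
        previous = cut
    return "".join(child_a), "".join(child_b)


# Two easily distinguishable hypothetical parents.
first, second = multi_point_crossover("1111111", "2222222", points=3,
                                      rng=random.Random(1))
print(first, second)
```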
As an outcome of several evolutions modeled by this genetic
algorithm, a set of data patterns was created. Those patterns,
along with the information about the level of their support, were
then used as an input to the second algorithm that generated
pseudo-association rules.
particular e-commerce solution(s), described by a set of
attributes Y, is determined by the level of importance
of the managerial benefits it looks for and concerns
with some obstacles V (Y)”
and:
“If a given company perceives managerial benefits X
important to the extent of V (X), then its likelihood of
adoption of a particular e-commerce solution(s) Y is
determined by the level of concern with some obstacles
V (Y)”,
this approach seems to be ideal.
comparison of each chromosome in the population against each
record in the database, the total number of such comparisons (C)
is given by:
C = N × A × P × G.
As for the second part of the methodology presented in this
paper, rule generation, its efficiency depends on the
number of attributes included in a given pattern and on the
constraints concerning the rule structure in terms of its premise
and consequent parts (provided by the user).
5.2 Results
5.2.2 Searching for patterns
In order to increase the variety of patterns, the genetic algorithm
was launched on several computers simultaneously.
Because of the relatively small size of the data sample, as well
as the conclusions derived from the preliminary statistical data
analysis (which indicated low diversity in the data), the support threshold for
desired patterns was lowered to 3-5%.
As a result of several evolutions (each consisting of a comparable
number of generations) of the genetic algorithm, 1564 patterns
were found in the database. Each of those patterns had at least
two “set” values for the corresponding attributes.
of 79%.
RULE 2:
IF a company uses electronic commerce in purchase orders
frequently, THEN the reduction of transaction time is extremely
important for them WITH support of 21% and confidence of 81%.
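For readers unfamiliar with the two measures attached to each rule, the sketch below computes them under the standard association-rule definitions: support is the fraction of all records that satisfy both premise and consequent, and confidence is the fraction of records satisfying the premise that also satisfy the consequent. Whether the pseudo-association rules in this report use exactly these definitions is an assumption. The attribute names and survey records are hypothetical and do not reproduce the study's data.

```python
def rule_metrics(records, premise, consequent):
    """Return (support, confidence) of the rule IF premise THEN consequent."""

    def holds(record, conditions):
        return all(record.get(key) == value for key, value in conditions.items())

    n_premise = sum(holds(r, premise) for r in records)
    n_both = sum(holds(r, premise) and holds(r, consequent) for r in records)
    support = n_both / len(records)
    confidence = n_both / n_premise if n_premise else 0.0
    return support, confidence


# Hypothetical survey answers from ten companies.
records = (
    [{"po_usage": "frequent", "time_reduction": "extreme"}] * 3
    + [{"po_usage": "frequent", "time_reduction": "moderate"}]
    + [{"po_usage": "rare", "time_reduction": "extreme"}] * 2
    + [{"po_usage": "rare", "time_reduction": "moderate"}] * 4
)

support, confidence = rule_metrics(
    records,
    premise={"po_usage": "frequent"},
    consequent={"time_reduction": "extreme"},
)
print(f"support={support:.0%} confidence={confidence:.0%}")  # support=30% confidence=75%
```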
5.3 Summary
All of the generated rules were reasonable; however,
not all of them were obvious, and some could not have been easily
anticipated. Some of them simply confirmed the conclusions
drawn from the statistical data analysis, while others produced an
interesting and novel insight into the problem of profiling
adopters and nonadopters of e-purchasing. On the basis of these
results it can be stated that this approach is appropriate and
useful for the discovery of the profile description hidden in data.
6. CONCLUSION
BIBLIOGRAPHY
APPENDIX