
ABSTRACT

The e-commerce domain provides all the right ingredients for successful data
mining, and it has been claimed to be a killer domain for data mining. With the
proliferation of electronic commerce, e-purchasing has become a daily practice
for many purchasing organizations. Data mining has been used in e-commerce for
some time already. It has many applications in this field, such as searching for
patterns in transactional data and preparing personalization applications [1].
However, before this kind of data analysis is possible, an e-commerce system
itself needs to be successfully implemented.

Data mining applications are discussed briefly, based on a project involving
IBM and Britain’s Safeway supermarkets.

INTRODUCTION

Data mining is the process of extracting knowledge hidden in large volumes of
raw data.

Human analysts with no special tools can no longer make sense of enormous
volumes of data that require processing in order to make informed business
decisions. Data mining automates the process of finding relationships and patterns
in raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.

In a sense, data mining and e-commerce go hand in hand, as many examples have
shown. With the help of data mining we can predict future events and reassess
decisions accordingly. Typical questions include:

• What goods should be promoted to this customer?
• Will this customer default on a loan or pay back on schedule?
• What medical diagnosis should be assigned to this patient?

These are all questions that can probably be answered if the information hidden
among megabytes of data in your database can be made explicit and utilized.
Modeling the investigated system and discovering the relations that connect
variables in a database are the subject of data mining.

CHAPTER 1
WHY USE DATA MINING?

Data might be one of the most valuable assets of your corporation, but only if
you know how to reveal valuable knowledge hidden in raw data. Data mining
allows you to extract diamonds of knowledge from your historical data and predict
outcomes of future situations. It will help you optimize your business decisions,
increase the value of each customer and communication, and improve customer
satisfaction with your services.

The data that require analysis differ for companies in different industries.
Examples include:

• Sales and contact histories
• Demographic data on your customers and prospects
• Clickstream and transactional data from your website

In all these cases data mining can help you reveal knowledge hidden in data
and turn this knowledge into a crucial competitive advantage. Today more and
more companies acknowledge the value of this new opportunity and turn to
Megaputer for leading-edge data mining tools and solutions that help optimize
their operations and increase their bottom line.

CHAPTER 2
AN OVERVIEW OF E-COMMERCE

Compared to ancient or shielded legacy systems, data collection in electronic
commerce can be controlled to a large extent. We now have the opportunity to
design systems that collect data for the purposes of data mining, rather than
having to struggle with translating and mining data collected for other
purposes. Data are collected electronically, rather than manually, so less noise
is introduced from manual processing. Electronic commerce data are rich,
containing information on prior purchase activity and detailed demographic data.
In addition, some data that previously were very difficult to collect are now
easily accessible. For example, electronic commerce systems can record the
actions of customers in the virtual “store”, including what they look at, what
they put into their shopping cart and do not buy, and so on. Previously, in order
to obtain such data, companies had to trail customers (in person), surreptitiously
recording their activities, or had to undertake complicated analysis of in-store
videos (Underhill 2000). It was not cost-effective to collect such data in bulk,
and correlating them with individual customers was practically impossible. For
e-commerce systems, massive amounts of data can be collected inexpensively.
Unlike in many data mining applications, the vehicle for capitalizing on the
results of mining, the electronic commerce system, is already automated.
Therefore the hurdles of system building are substantially lower, as are the
political and social hurdles involved with automating a manual process. Also,
because the mined models will fit well with the existing system, computing return
on investment can be much easier.

Web merchandising, as distinct from marketing, for example, focuses on how to
acquire products and how to make them available. Electronic commerce affects
the acquisition of products, because (as illustrated best by Dell Computer
Corporation) the supply chain can be integrated tightly with the customer
interface. Even more intriguing from the data mining perspective, since customers
interact with the computer directly, product assortments, virtual product
displays, and other merchandising interfaces can be modified dynamically, and
can even be personalized for individual customers, as shown in Figure 1.

CHAPTER 3
TASKS SOLVED BY DATA MINING

3.1 Predicting
A task of learning a pattern from examples and using the developed model to
predict future values of the target variable.

3.2 Detection of relations
A task of searching for the most influential independent variables for a selected
target variable.

3.3 Clustering
A task of identifying groups of records that are similar to one another but
different from the rest of the data. Often, the variables providing the best
clustering should be identified as well.

3.4 Market Basket Analysis
A task of processing transactional data in order to find groups of products that
are frequently sold together. One also searches for directed association rules
identifying the best product to offer with a current selection of purchased
products.
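
As an illustration of market basket analysis, the following minimal Python
sketch counts how often pairs of items appear together and derives directed
rules with the standard support and confidence measures. The function name
pair_rules, the min_support parameter, and the sample baskets are hypothetical
and are not taken from any project described in this report.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support):
    """Return directed rules (antecedent, consequent, support, confidence)
    for every item pair whose support reaches min_support."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n               # share of baskets containing both
        if support < min_support:
            continue
        # confidence(a -> b) = share of baskets with a that also contain b
        rules.append((a, b, support, together / item_counts[a]))
        rules.append((b, a, support, together / item_counts[b]))
    return rules


# Hypothetical baskets, for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
]
for a, b, support, confidence in pair_rules(baskets, min_support=0.5):
    print(f"IF {a} THEN {b}: support={support:.2f}, confidence={confidence:.2f}")
```

The directed rule with the highest confidence for a given antecedent is a
natural candidate for "the best product to offer" alongside it; real systems
typically extend this to larger item sets.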

CHAPTER 4
DATA MINING APPLICATIONS

4.1 Preparation Of Personalization Technique


Personalization techniques make shopping easier for customers by recommending
products. Here we study a project involving IBM and Britain’s Safeway
supermarkets, in which customers use palm-top PDAs to compose shopping lists
(based to a large extent on the products they have purchased previously). The
use of PDAs increases customer convenience, because customers don’t have to walk
the aisles for these purchases; they simply pick them up at the store. However,
it reduces the company’s ability to “recommend” products via in-store displays
and the like.
Here we show how recommendations can be made instead on the PDA, using a
combination of data mining techniques. The recommendations were made to
actual customers in two field trials. After incorporating “interestingness”
knowledge learned from the first trial, in the second trial (in a different
store) the results were encouraging, notwithstanding several application
challenges. Specifically, 25% of orders included something from the
recommendation list, corresponding to a revenue boost of 1.8% (respectable as
compared to other promotions). Perhaps more important, the results show that
customers are significantly more likely to choose high-ranked recommendations
than low-ranked ones, indicating that the algorithms do well at modeling the
likelihood of purchasing items not previously purchased. The study shows
intuitive rules, clusters, and relative preferences, demonstrating the potential
of data mining for improving understanding of the business, which may be useful
even where recommendations are not implemented (or are not effective).

4.2 Searching For Patterns In Transactional Data
Data mining algorithms often produce a mass of patterns, much smaller than the
original mountain of data, but still in need of post-processing.
Creating individual consumer profiles for personalized recommendation (or
for other purposes, such as providing dynamic content or tailored advertising)
exacerbates this problem, because now one may be searching for patterns
individually for each of millions of consumers.
As we’ve mentioned, electronic commerce systems allow unprecedented
flexibility in merchandising. However, flexibility is not a benefit unless one
knows how to map the many options to different situations. For example, how
should different product assortments or merchandising cues be chosen? Here we
focus on the analysis and evaluation of web merchandising [1], specifically on
the analysis of “clickstreams”, the series of links followed by customers on a
site. The thesis is that the effectiveness of many on-line merchandising tactics
can be analyzed by a combination of specified metrics and visualization
techniques applied to clickstreams.
We consider a detailed case study of the analysis of clickstream data from a
web retailer [1]. The study shows how the breakdown of clickstreams into
sub-segments can highlight potential problems in merchandising. For example, one
product has many click-throughs but a low click-to-buy rate. Subsequent analysis
shows that it has a high basket-to-buy rate, but a low click-to-basket rate. This
analysis would allow merchandisers to begin to develop informed hypotheses about
how performance might be improved. For example, since this is a high-priced
product, one might hypothesize that customers were lured to the product page and
then turned off by the product’s high price. If this were true, there are several
different actions that might be appropriate (reduce the price, convince the
customer that the product is worth its high price, target the lure better so as
not to “waste clicks”, etc.).
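
The rates mentioned in this case study can be sketched as simple ratios of
event counts. The short Python example below is only an illustration: the
function name funnel_rates and the counts are hypothetical and are not figures
from the study in [1].

```python
def funnel_rates(clicks, basket_adds, purchases):
    """Return the three conversion rates of one product's clickstream funnel."""
    return {
        # share of product-page visits that led to a shopping-cart insertion
        "click_to_basket": basket_adds / clicks if clicks else 0.0,
        # share of cart insertions that ended in a purchase
        "basket_to_buy": purchases / basket_adds if basket_adds else 0.0,
        # overall share of product-page visits that ended in a purchase
        "click_to_buy": purchases / clicks if clicks else 0.0,
    }


# Hypothetical counts for a product with many click-throughs but few purchases.
rates = funnel_rates(clicks=1000, basket_adds=20, purchases=18)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Because the click-to-buy rate is the product of the other two rates, splitting
it this way shows at which step customers drop out: here almost everyone who
puts the item in the basket buys it, but very few visitors put it there.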
We also consider measuring and improving the success of web sites. In
particular, success should be evaluated in terms of the business goal of the web
site (e.g., retail sales), and treatments should not be limited to measurement
alone, but should also suggest concrete avenues for improvement. We discuss the
discovery of navigation patterns, presenting a brief but comprehensive survey of
the state of the art, and also presenting a method that addresses some of its
deficiencies.

CHAPTER 5
A GENETIC ALGORITHM BASED DATA MINING

5.1 Research methodology


One of the data mining techniques that aim at the discovery of
hidden knowledge is the extraction of association or
pseudo-association (attributes are not limited to the binary domain)
rules in the form of IF-THEN statements [2].

We present a slightly different approach that combines the
power of genetic algorithms with the simplicity of standard
association rule generation algorithms. This approach applies a
genetic algorithm to the problem of searching for repetitive
patterns hidden in data and then simply generates
pseudo-association rules based on those patterns.

5.1.1 Assumptions
The main goal of this project was to identify the profiles of
e-purchasing adopters. It was also very important to determine the
most significant factors in terms of the firm’s perceived
importance of managerial benefits as well as the obstacles
related to the possible implementation of an e-purchasing
system. All the knowledge that can be discovered with this
approach is hidden in the data and can be extracted with virtually
no assumptions. The only requirement that must be fulfilled prior to
the investigation of this kind of relation in the data is to divide
the attribute space into two disjoint subspaces representing the
premise and consequent parts of the decision rules.

5.1.2 Searching For Patterns
The theory of genetic algorithms is based on the process of natural selection,
according to which nature aims at the creation of organisms that are adjusted
to the surrounding environment in the best possible way.
In genetic algorithms, possible solutions to the analyzed problem
are encoded into so-called chromosomes. Chromosomes consist
of genes that represent a solution numerically. The possible values
that can be assigned to a particular gene are determined by its
allele (domain).
We assume that the chromosomes produced and modified in the
course of evolution (a sequence of generations) represent
patterns covering records in the data set. Each such pattern has
a possible coverage in the data (support), which is given by the
number of records matching the pattern.
For the example shown in Figure 3, this is the number of all records
containing the given values at the fourth, eleventh, thirteenth, and
twenty-second positions, no matter what all the other values are.

***1******5*1********3*****

Figure 3: Example of a chromosome (set positions: genes
no. 4, 11, 13, 22).

Obviously, we are mostly interested in patterns that have relatively
high support, and this is the main feature of the fitness function
used for this algorithm. The minimal desired level of support in the
data can be specified before the execution of the genetic
algorithm, so that patterns with less coverage are not
included in the result at all. Pseudo-code for the general scheme
of the genetic algorithm is shown in Figure 4.

t := 0;
P (t) := InitializePopulation (no_of_attributes, attributes_domains);
while (t < max_number_of_generations) do
    EvaluateFitness (P (t), dataset);
    t := t + 1;
    P (t) := Select (P (t-1));
    Crossover (P (t));
    Mutate (P (t));
end while;

Figure 4: Pseudo-code for the general scheme of the
genetic algorithm.
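
The notion of pattern support described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: records and
patterns are represented as equal-length strings with one character per
attribute, "*" marks an unset gene, and the names matches and support_count as
well as the sample records are hypothetical.

```python
WILDCARD = "*"  # assumed marker for a gene that is not set


def matches(pattern, record):
    """True if the record agrees with the pattern at every set position."""
    return all(gene == WILDCARD or gene == value
               for gene, value in zip(pattern, record))


def support_count(pattern, records):
    """Number of records covered by the pattern."""
    return sum(matches(pattern, record) for record in records)


# Hypothetical four-attribute records, one character per attribute value.
records = ["2135", "0105", "2235", "3124"]
pattern = "*1*5"  # set positions: genes no. 2 and 4

covered = support_count(pattern, records)
print(covered, covered / len(records))
```

Dividing the count by the number of records gives the relative support that can
be compared against a minimum support threshold.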

A more detailed description of the algorithm for the fitness function
evaluation is presented by the pseudo-code in Figure 5.

FitnessEvaluation (chromosome)
    set_positions := CountSetPositions (chromosome);
    if (HasSupportInData (chromosome)) then
        support := CalculateSupport (chromosome);
        fitness := support * set_positions;
    else
        partial_support := CalculatePartialSupport (chromosome);
        fitness := partial_support * threshold_support;
    end if;
    FitnessEvaluation := fitness;
end FitnessEvaluation;

CalculatePartialSupport (chromosome)
    partial_support := 0;
    for each record in dataset
        matching := CountGenesMatchingRecord (chromosome, record);
        matching_ratio := matching / length_of_chromosome;
        partial_support := partial_support + matching_ratio;
    end for;
    CalculatePartialSupport := partial_support / no_of_records;
end CalculatePartialSupport;

Figure 5: Pseudo-code for the fitness function evaluation.
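
For readers who prefer runnable code, the fitness evaluation of Figure 5 might
be sketched in Python as follows. Several details are assumptions, not taken
from the source: unset genes are represented by None, a chromosome "has support
in data" when its relative support reaches threshold_support, and an unset gene
counts as matching any record value when the partial support is computed.

```python
def count_set_positions(chromosome):
    """Number of genes that carry a concrete value."""
    return sum(gene is not None for gene in chromosome)


def count_genes_matching(chromosome, record):
    """Genes agreeing with the record; unset genes match anything (assumption)."""
    return sum(gene is None or gene == value
               for gene, value in zip(chromosome, record))


def fitness(chromosome, records, threshold_support):
    n = len(records)
    length = len(chromosome)
    # relative support: share of records matched at every position
    support = sum(count_genes_matching(chromosome, r) == length
                  for r in records) / n
    if support >= threshold_support:          # assumed meaning of HasSupportInData
        return support * count_set_positions(chromosome)
    # otherwise reward chromosomes that almost match many records
    partial = sum(count_genes_matching(chromosome, r) / length
                  for r in records) / n
    return partial * threshold_support


# Hypothetical records with three attributes each.
records = [(1, 2, 3), (1, 2, 4), (0, 2, 3), (1, 5, 3)]
print(fitness((1, 2, None), records, threshold_support=0.25))
print(fitness((0, 5, None), records, threshold_support=0.25))
```

The first chromosome is fully supported by two of the four records, so its
fitness grows with both its support and the number of set positions; the second
matches no record completely and receives only the smaller partial-support
score.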

Another very important feature of the proposed genetic
algorithm is a multi-point crossover option. In many experiments
on different types of data, this approach was found to be much
more effective with respect to both the number of discovered
patterns and the time of convergence.
As an outcome of several evolutions modeled by this genetic
algorithm, a set of data patterns was created. Those patterns,
along with the information about the level of their support, were
then used as an input to the second algorithm that generated
pseudo-association rules.
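
The source does not give details of its multi-point crossover operator. The
following Python sketch shows one common textbook form, in which the parents
exchange alternating segments between randomly chosen cut points; the function
name, the n_points parameter, and the sample chromosomes are assumptions made
for illustration.

```python
import random


def multi_point_crossover(parent_a, parent_b, n_points, rng=random):
    """Exchange alternating segments of two equal-length chromosomes."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    length = len(parent_a)
    # distinct cut points strictly inside the chromosome, in ascending order
    points = sorted(rng.sample(range(1, length), n_points))

    child_a, child_b = [], []
    swap = False
    start = 0
    for end in points + [length]:
        src_a, src_b = (parent_b, parent_a) if swap else (parent_a, parent_b)
        child_a.extend(src_a[start:end])
        child_b.extend(src_b[start:end])
        swap = not swap                      # switch source after every cut
        start = end
    return child_a, child_b


# Hypothetical parents; real chromosomes would hold attribute values.
a = list("AAAAAAAA")
b = list("BBBBBBBB")
child_a, child_b = multi_point_crossover(a, b, n_points=3, rng=random.Random(0))
print("".join(child_a), "".join(child_b))
```

With a single cut point this reduces to ordinary one-point crossover; more cut
points mix the parents' genes more thoroughly in one step.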

5.1.3 Rule generation

Association rules expose the existence of relations between
attributes in data (binary domain: exists vs. does not exist),
while pseudo-association rules discover relations between the
values of those attributes (the domains of the attributes
themselves). Basically, pseudo-association rules are
simple IF-THEN statements:
“If the set of attributes X, included in the premise
part of the rule, has some values, described by a set of
values V (X), then the set of attributes Y, included in
the consequent part, tends to have values, described by
another set of values V (Y)”.
This approach seems to be ideal in our case, since we aim at the
derivation of rules of the type:
“If a given company’s profile, in a sense of attributes
X, is described by the values V (X), then the
likelihood of the company being involved in adoption of
particular e-commerce solution(s), described by a set of
attributes Y, is determined by the level of importance
of the managerial benefits it looks for and concerns
with some obstacles V (Y)”
and:
“If a given company perceives managerial benefits X
important to the extent of V (X), then its likelihood of
adoption of a particular e-commerce solution(s) Y is
determined by the level of concern with some obstacles
V (Y)”.

An algorithm for the extraction of association rules usually consists of two
parts: searching for patterns hidden in data (in this project achieved by
application of the genetic algorithm) and generation of rules based on those
patterns.
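
The second part, turning a discovered pattern into a rule, can be sketched as
follows. The sketch assumes the standard definitions of support (share of
records matching the whole pattern) and confidence (share of records matching
the premise that also match the consequent); the function names and the tiny
data set are hypothetical and are not the survey data used in [2].

```python
def matches(pattern, record):
    """True if the record agrees with every set (non-None) gene."""
    return all(gene is None or gene == value
               for gene, value in zip(pattern, record))


def rule_from_pattern(pattern, premise_positions, records):
    """Split a pattern into premise and consequent; return (support, confidence)."""
    # keep only the premise genes, unset everything else
    premise = tuple(gene if i in premise_positions else None
                    for i, gene in enumerate(pattern))
    both = sum(matches(pattern, r) for r in records)
    premise_only = sum(matches(premise, r) for r in records)
    support = both / len(records)
    confidence = both / premise_only if premise_only else 0.0
    return support, confidence


# Hypothetical records: (uses_edi, willing_to_help_suppliers)
records = [("yes", "yes"), ("yes", "yes"), ("yes", "no"), ("no", "no")]
support, confidence = rule_from_pattern(("yes", "yes"), {0}, records)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Choosing a different set of premise positions for the same pattern yields a
different rule, which is how one set of patterns can serve several points of
view.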

5.1.4 Algorithm complexity

Apart from the execution of the standard genetic algorithm
operators, the comparison of the chromosomes to the data
records is the most critical part of the algorithm. The efficiency of
this portion of the algorithm strongly depends on four
parameters: the number of records in the database (N), the
number of attributes in the database (A), the size of the
population (P), and the number of generations (G) within a given
evolution process.
Since the matching is nothing more than a value-to-gene
comparison of each chromosome in the population against each
record in the database, the total number of such comparisons (C)
is given by:

C = N × A × P × G.

As for the second part of the methodology presented in this
paper, rule generation, its efficiency obviously depends on the
number of attributes included in a given pattern and on the
constraints concerning the rule structure in terms of its premise
and consequent parts (provided by the user).
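
To give a feel for the magnitude of C = N × A × P × G, the snippet below
evaluates it for one set of parameter values. All four values are hypothetical
and chosen only for illustration; the source does not report the sizes used in
the project.

```python
def total_comparisons(n_records, n_attributes, population_size, n_generations):
    """Value-to-gene comparisons performed during one evolution process."""
    return n_records * n_attributes * population_size * n_generations


# Hypothetical sizes: 1000 records, 27 attributes, 100 chromosomes, 500 generations.
c = total_comparisons(1000, 27, 100, 500)
print(f"{c:,} comparisons")
```

Since the count grows linearly in each parameter, doubling the population size
or the number of generations doubles the matching cost.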

Because of the emphasis on the fast discovery of patterns by the
genetic algorithm in the first part, the rule generation is rather
fast. What is even more important, it is very flexible and allows
the user to generate different sets of rules (i.e., different points
of view) on the basis of the same set of patterns.

5.2 Results

5.2.1 Preliminary statistical data analysis


The importance of the data mining approach here is that it determines the
levels of support and confidence of potentially discovered patterns and
rules (the levels of support and confidence of the rules are
directly related to the frequency of occurrence of the
attribute values included in those rules). Accordingly, the
results of this preliminary statistical data analysis were carefully
analyzed and taken into account when determining the genetic
algorithm’s parameters.

5.2.2 Searching for patterns
In order to increase the variety of patterns, the genetic algorithm
was launched on several computers simultaneously.
Because of the relatively small size of the data sample, as well
as the conclusions derived from the preliminary statistical data
analysis (about the small diversity of the data), the support
threshold of desired patterns was lowered to 3-5%.
As a result of several evolutions (each consisting of a comparable
number of generations) of the genetic algorithm, 1564 patterns
were found in the database. Each of those patterns had at least
two “set” values for the corresponding attributes.

5.2.3 Rule generation


Patterns discovered and prepared in the previous step were then
used as the basis for rule generation. At this level, the sets of
attributes were divided into premise and consequent parts of the
rules according to the problem specification. The input patterns
were also checked for overlapping, and those that were covered
by another pattern were removed from consideration. After this
verification, 633 effective patterns were preserved. On the basis
of this final set of patterns, a number of rules of given support
and confidence were generated.
A few examples of those rules are presented below:
RULE 1:
IF a company uses an EDI system, THEN the company is willing to
help suppliers to establish an electronic commerce network WITH
support of 32% and confidence of 79%.
RULE 2:
IF a company uses electronic commerce in purchase orders
frequently, THEN the reduction of transaction time is extremely
important for them WITH support of 21% and confidence of 81%.

5.3 Summary
All of the generated rules were reasonable; however, not all of
them were obvious, and some could not have been easily
anticipated. Some of them simply confirmed the conclusions
drawn from the statistical data analysis, while others produced an
interesting and novel insight into the problem of profiling
adopters and non-adopters of e-purchasing. On the basis of these
results it can be stated that this approach is appropriate and
useful for the discovery of the profile description hidden in data.

6. CONCLUSION

We have argued that company preference knowledge must be incorporated: the task
is not just to recommend what the customer will most like, but also what the
store would like to sell. It should also be kept in mind that there is more to
data mining than just building an automated recommendation system. If indeed one
is participating in a knowledge discovery process, the knowledge that is
discovered may be used for various purposes. However, when it comes to improving
the efficiency of the knowledge discovery process as a whole, additional research
on efficient mining algorithms will have diminishing returns if the rest of the
process remains difficult and manual.
In all, we highlight that although electronic commerce systems are an ideal
application for data mining, much research is still needed, mostly in areas
of the knowledge discovery process other than the algorithmic phase.

BIBLIOGRAPHY

[1] Kohavi R., Provost F., “Applications of Data Mining to Electronic
Commerce”, Data Mining and Knowledge Discovery - International Journal,
Special Issue on E-Commerce and Data Mining, Kluwer Academic Publishers,
Boston, 2001.
[2] Min H., Smolinski G. T., Boratyn M. G., “A Genetic Algorithm-based
Data Mining Approach to Profiling the Adopters and Non-Adopters of
E-Purchasing”, Logistics and Distribution Institute, University of Louisville,
Louisville, KY 40292.

APPENDIX
