ABSTRACT
Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.
Keywords: data mining, world wide web, web usage mining.
1. INTRODUCTION
which also describes the architecture of the WebMiner system [42], one of the first systems for Web Usage mining. The proceedings of the recent WebKDD workshop [41], held in conjunction with the KDD-1999 conference, provide a sampling of some of the current research being performed in the area of Web Usage Analysis, including Web Usage mining.
This paper provides an up-to-date survey of Web Usage mining, including both academic and industrial research efforts, as well as commercial offerings. Section 2 describes the various kinds of Web data that can be useful for Web Usage mining. Section 3 discusses the challenges involved in discovering usage patterns from Web data. The three phases are preprocessing, pattern discovery, and pattern analysis. Section 4 provides a detailed taxonomy and survey of the existing efforts in Web Usage mining, and Section 5 gives an overview of the WebSIFT system [31], as a prototypical example of a Web Usage mining system. Finally, Section 6 discusses privacy concerns and Section 7 concludes the paper.
2. WEB DATA
1999 ACM
SIGKDD Explorations. Copyright
User Profile: Data that provides demographic information about users of the Web site. This includes registration data and customer profile information.
The usage data collected at the different sources will represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns.
3.
As shown in Figure 1, there are three main tasks for performing Web Usage Mining or Web Usage Analysis. This section presents an overview of the tasks for each step and discusses the challenges involved.
3.1 Preprocessing
Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout is often used as the default method of breaking a user's click-stream into sessions. The thirty minute timeout is based on the results of [23]. When a session ID is embedded in each URI, the definition of a session is set by the content server.
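The timeout heuristic can be illustrated with a minimal sketch. It assumes that the thirty minutes are measured between consecutive requests of a single, already identified user; the function name `split_into_sessions` and the third timestamp in the example are hypothetical and not taken from the paper.

```python
from datetime import datetime, timedelta

def split_into_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Split one user's request timestamps into sessions.

    Assumption: a new session starts whenever the gap between two
    consecutive requests exceeds `timeout`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # first request, or gap too long
    return sessions

clicks = [
    datetime(1998, 4, 25, 3, 5, 34),
    datetime(1998, 4, 25, 3, 6, 2),
    datetime(1998, 4, 25, 5, 5, 22),  # hypothetical later request
]
sessions = split_into_sessions(clicks)
```

With these inputs the first two requests fall into one session and the later request starts a second one. A different reading of the heuristic (for example, a cap on total session duration) would need a different comparison.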
While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. As discussed in Section 2.2, the only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.
Figure 2 shows a sample log that illustrates several of the problems discussed above (the first column would not be present in an actual server log, and is for illustrative purposes only). IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.456.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session, giving A-B-F-O-F-B-G, and one reference to the third session, giving A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.
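The path completion step described above can be sketched as follows. This is an illustrative reading, not the algorithm of any cited system: whenever the referrer of a request is not the page the user is currently on, the sketch assumes the user pressed "back" through cached pages until reaching the referrer. The function name `complete_path` is hypothetical, and the referrers given for pages O and G are assumptions chosen to be consistent with the completed paths quoted in the text.

```python
def complete_path(requests):
    """requests: time-ordered (page, referrer) pairs for one session.

    Returns the page sequence with inferred cached page views inserted.
    """
    stack = []  # current forward path, i.e. what "back" would walk through
    path = []   # completed path, including inferred cached views
    for page, referrer in requests:
        if stack and referrer is not None and referrer != stack[-1] and referrer in stack:
            stack.pop()                      # leave the current page
            while stack[-1] != referrer:     # back through intermediate pages
                path.append(stack.pop())
            path.append(referrer)            # cached view of the referrer
        stack.append(page)
        path.append(page)
    return path

# First and third sessions from the text; referrers for O and G are assumed.
first = [("A", None), ("B", "A"), ("F", "B"), ("O", "F"), ("G", "B")]
third = [("A", None), ("B", "A"), ("C", "A"), ("J", "C")]
completed_first = complete_path(first)
completed_third = complete_path(third)
```

Under these assumptions the sketch yields A-B-F-O-F-B-G for the first session and A-B-A-C-J for the third, matching the completed paths given above.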
[Figure 1: diagram labels. Inputs: Site Files, Raw Logs. Stages: Preprocessing, Pattern Discovery, Pattern Analysis. Intermediate and final results: Preprocessed Clickstream Data; Rules, Patterns, and Statistics; "Interesting" Rules, Patterns, and Statistics.]
[Figure 2: sample server log. Header columns: IP Address, Userid, Time, Agent. Lines 1 through 11 come from IP address 123.456.78.9; lines 12 and 13 come from 209.456.78.2 and 209.456.78.3. Entries for lines 2, 4, 6, 8, and 10:]
2  123.456.78.9 [25/Apr/1998:03:05:34 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)
4  123.456.78.9 [25/Apr/1998:03:06:02 -0500] "GET F.html HTTP/1.0" 200 5096 B.html Mozilla/3.04 (Win95, I)
6  123.456.78.9 [25/Apr/1998:03:07:42 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
8  123.456.78.9 [25/Apr/1998:03:09:50 -0500] "GET C.html HTTP/1.0" 200 1820 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
10 123.456.78.9 [25/Apr/1998:03:10:45 -0500] "GET J.html HTTP/1.0" 200 9430 C.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult references such as [33; 24]. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed from other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).
example, association rule discovery using the Apriori algorithm [18] (or one of its variants) may reveal a correlation between users who visited a page containing electronic products to those who access a page about sporting equipment. Aside from being applicable for business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.
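To make the notions of support and confidence concrete, the following sketch counts page pairs by brute force over a handful of sessions. It is not Apriori itself (there is no candidate pruning), and the function name `pair_rules`, the page URLs, and the threshold are hypothetical values chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for page pairs."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = set(session)  # order and repeats are ignored, as in market-basket analysis
        page_count.update(pages)
        pair_count.update(combinations(sorted(pages), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / page_count[a]))
            rules.append((b, a, support, count / page_count[b]))
    return rules

# Hypothetical sessions, each a list of requested pages.
example_sessions = [
    ["/electronics", "/sports"],
    ["/electronics", "/sports", "/books"],
    ["/electronics"],
    ["/books"],
]
rules = pair_rules(example_sessions)
```

In this toy data, every session that visits `/sports` also visits `/electronics`, so the rule from `/sports` to `/electronics` has confidence 1.0, while the reverse rule has confidence 2/3; both share a support of 0.5.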
3.2.3 Clustering
Clustering is a technique to group together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.
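As a rough illustration of usage clustering, the sketch below groups sessions whose sets of visited pages overlap strongly. It uses a simple single-pass "leader" scheme with Jaccard similarity, which is an assumption made for brevity and not the method of any system surveyed here; the names `jaccard` and `cluster_sessions`, the threshold, and the sample sessions are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two page sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_sessions(sessions, threshold=0.5):
    """Assign each session to the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader page set, member session indices)
    for idx, session in enumerate(sessions):
        pages = set(session)
        for leader, members in clusters:
            if jaccard(pages, leader) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((pages, [idx]))  # session becomes a new leader
    return [members for _, members in clusters]

# Hypothetical sessions.
usage_clusters = cluster_sessions([
    ["A", "B", "C"],
    ["A", "B"],
    ["X", "Y"],
    ["X", "Y", "Z"],
])
```

Here the first two sessions end up in one cluster and the last two in another. The result of a leader scheme depends on input order, which is one reason real systems tend to use more robust clustering algorithms.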
3.2.4 Classification
Classification is the task of mapping a data item into one of several predefined classes [33]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.
3.2.5 Sequential Patterns
The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis.
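A much simplified version of this idea is sketched below: for each user, it looks for a page in one session followed by a page in a strictly later session, and reports the pairs that hold for a minimum fraction of users. Restricting patterns to single-page antecedents is an assumption made to keep the example short; the function name `sequential_pairs` and the sample histories are hypothetical.

```python
from collections import Counter

def sequential_pairs(user_histories, min_support=0.5):
    """user_histories: one time-ordered list of sessions per user.

    Returns {(a, b): support} where page a occurs in an earlier session
    than page b for at least `min_support` of the users.
    """
    counts = Counter()
    for sessions in user_histories:
        seen_before = set()  # pages from strictly earlier sessions
        found = set()        # patterns present for this user (counted once)
        for session in sessions:
            pages = set(session)
            found.update((a, b) for a in seen_before for b in pages)
            seen_before |= pages
        counts.update(found)
    n = len(user_histories)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical histories for three users.
histories = [
    [["A"], ["B"]],
    [["A", "C"], ["B"]],
    [["B"], ["A"]],
]
patterns = sequential_pairs(histories)
```

For this data only the pattern "A, then B in a later session" reaches the threshold, with a support of 2/3.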
3.2.6 Dependency Modeling
Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested to build a model representing the different stages a visitor undergoes while shopping in an online store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learn-
4.
forward reference to characterize user episodes for the mining of traversal patterns. A maximal forward reference is the sequence of pages requested by a user up to the last page before backtracking occurs during a particular server session. The SpeedTracer project [56] from IBM Watson is built on the work originally reported in [25]. In addition to episode identification, SpeedTracer makes use of referrer and agent information in the preprocessing routines to identify users and server sessions in the absence of additional client side information. The Web Utilization Miner (WUM) system [55] provides a robust mining language in order to specify characteristics of discovered frequent paths that are interesting to the analyst. In their approach, individual navigation paths, called trails, are combined into an aggregated tree structure. Queries can be answered by mapping them into the intermediate nodes of the tree structure. Han et al. [58] have loaded Web server logs into a data cube structure in order to perform data mining as well as On-Line Analytical Processing (OLAP) activities such as roll-up and drill-down of the data. Their WebLogMiner system has been used to discover association rules, perform classification and time-series analysis (such as event sequence analysis, transition analysis and trend analysis). Shahabi et al. [53; 59] have one of the few Web Usage mining systems that relies on client side data collection. The client side agent sends back page request and time information to the server every time a page containing the Java applet (either a new page or a previously cached page) is loaded or destroyed.
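The definition of a maximal forward reference given at the start of this passage can be illustrated with a short sketch. It assumes that a request for a page already on the current forward path counts as backtracking; the function name `maximal_forward_references` is hypothetical, and the input is the completed path A-B-F-O-F-B-G from Section 3.1.

```python
def maximal_forward_references(path):
    """Split a session's page sequence into maximal forward references."""
    current = []       # current forward path
    result = []
    extending = False  # True while the last move was a forward move
    for page in path:
        if page in current:
            # Backward move: emit the forward path once, then truncate it.
            if extending:
                result.append(list(current))
            current = current[: current.index(page) + 1]
            extending = False
        else:
            current.append(page)
            extending = True
    if extending and current:
        result.append(list(current))  # path still growing at session end
    return result

mfrs = maximal_forward_references(list("ABFOFBG"))
```

For this session the sketch produces two maximal forward references, A-B-F-O and A-B-G.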
4.2.1 Personalization
Personalizing the Web experience for a user is the holy grail of many Web-based applications, e.g. individualized marketing for e-commerce [4]. Making dynamic recommendations to a Web user, based on her/his profile in addition to usage behavior is very attractive to many applications, e.g. cross-sales and up-sales in e-commerce. Web usage mining is an excellent approach for achieving this goal, as illustrated in [43]. Existing recommendation systems, such as [8; 6], do not currently use data mining for recommendations, though there have been some recent proposals [16].
The WebWatcher [37], SiteHelper [45], Letizia [39], and clustering work by Mobasher et al. [43] and Yan et al. [57] have all concentrated on providing Web Site personalization based on usage information. Web server logs were used by Yan et al. [57] to discover clusters of users having similar access patterns. The system proposed in [57] consists of an offline module that will perform cluster analysis and an online module which is responsible for dynamic link generation of Web pages. Every site user will be assigned to a single cluster based on their current traversal pattern. The links that are presented to a given user are dynamically selected based on what pages other users assigned to the same cluster have visited. The SiteHelper project learns a user's preferences by looking at the page accesses for each user. A list of keywords from pages that a user has spent a significant amount of time viewing is compiled and presented to the user. Based on feedback about the keyword list, recommendations for other pages within the site are made. WebWatcher "follows" a user as he or she browses the Web and identifies links that are potentially interesting to the user. The WebWatcher starts with a short description of a user's interest. Each page request is routed through the WebWatcher proxy server in order to easily track the
[Table: Web usage mining projects and their application focus. The table's remaining column groups are Data Source (Server, Proxy, Client), Data Type (Structure, Content, Usage, Profile), User (Single, Multi), and Site (Single, Multi).]

Project                            Application Focus
WebSIFT (CTS99)                    General
SpeedTracer (WYB98, CPY96)         General
WUM (SF98)                         General
Shahabi (SZAS97, ZASS97)           General
Site Helper (NW97)                 Personalization
Letizia (Lie95)                    Personalization
Web Watcher (JFM97)                Personalization
Krishnapuram (NKJ99)               Personalization
Analog (YJGD96)                    Personalization
Mobasher (MCS99)                   Personalization
Tuzhilin (PT98)                    Business
SurfAid                            Business
Buchner (BM98)                     Business
WebTrends, Hitlist, Accrue, etc.   Business
WebLogMiner (ZXH98)                Business
PageGather, SCML (PE98, PE99)      Site Modification
Manley (Man97)                     Characterization
Arlitt (AW96)                      Characterization
Pitkow (PIT97, PIT98)              Characterization
Almeida (ABC96)                    Characterization
Rexford (CKR98)                    System Improve.
Schechter (SKS98)                  System Improve.
Aggarwal (AY97)                    System Improve.
[Figure: Web Usage Mining application areas, diagram labels. Personalization: Site Helper, Letizia, Web Watcher, Mobasher, Analog, Krishnapuram. System Improvement: Rexford, Schechter, Aggarwal. Site Modification: Adaptive Sites. Business Intelligence: SurfAid, Buchner, Tuzhilin. Usage Characterization: Pitkow, Arlitt, Manley, Almeida. Shown without a category label: WebSIFT, WUM, SpeedTracer, WebLogMiner, Shahabi.]
user session across multiple Web sites and mark any interesting links. WebWatcher learns based on the particular user's browsing plus the browsing of other users with similar interests. Letizia is a client side agent that searches the Web for pages similar to ones that the user has already viewed or bookmarked. The page recommendations in [43] are based on clusters of pages found from the server log for a site. The system recommends pages from clusters that most closely match the current session. Pages that have not been viewed and are not directly linked from the current page are recommended to the user. [44] attempts to cluster user sessions using a fuzzy clustering algorithm. [44] allows a page or user to be assigned to more than one cluster.
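The cluster-based recommendation step described for [43] might look roughly like the sketch below. The matching score (number of pages shared between the session and a cluster) is an assumption, since the text does not specify how "most closely match" is measured; the function name `recommend` and the sample data are hypothetical.

```python
def recommend(current_session, page_clusters, links_from_current):
    """Recommend unseen, not directly linked pages from the best-matching cluster."""
    viewed = set(current_session)
    # Assumed match score: how many viewed pages the cluster contains.
    best = max(page_clusters, key=lambda cluster: len(viewed & cluster), default=set())
    if not viewed & best:
        return []  # no cluster overlaps the session at all
    return sorted(best - viewed - set(links_from_current))

# Hypothetical page clusters mined offline from a server log.
clusters = [{"A", "B", "C", "D"}, {"X", "Y"}]
# The user has viewed A and B; the current page B links directly to C.
suggestions = recommend(["A", "B"], clusters, links_from_current={"C"})
```

In this example only page D is recommended: C is dropped because it is directly linked from the current page, and A and B have already been viewed.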
discovery techniques: customer attraction, customer retention, cross sales and customer departure. There are several commercial products, such as SurfAid [11], Accrue [1], NetGenesis [7], Aria [3], Hitlist [5], and WebTrends [13] that provide Web traffic analysis mainly for the purpose of gathering business intelligence. Accrue, NetGenesis, and Aria are designed to analyze e-commerce events such as products bought and advertisement click-through rates in addition to straightforward usage statistics. Accrue provides a path analysis visualization tool and IBM's SurfAid provides OLAP through a data cube and clustering of users in addition to page view statistics. Padmanabhan et al. [46] use Web server logs to generate beliefs about the access patterns of Web pages at a given Web site. Algorithms for finding interesting rules based on the unexpectedness of the rule were also developed.
5. WEBSIFT OVERVIEW
The WebSIFT system [31] is designed to perform Web Usage Mining from server logs in the extended NCSA format (includes referrer and agent fields). The preprocessing algorithms include identifying users, server sessions, and inferring cached page references through the use of the referrer field. The details of the algorithms used for these steps are contained in [30]. In addition to creating a server session file, the WebSIFT system performs content and structure preprocessing, and provides the option to convert server sessions into episodes. Each episode is either the subset of all content pages in a server session, or all of the navigation pages up to and including each content page. Several algorithms for identifying episodes (referred to as transactions in the paper) are described and evaluated in [28].
The server session or episode files can be run through sequential pattern analysis, association rule discovery, clustering, or general statistics algorithms, as shown in Figure 5.
The results of the various knowledge discovery tools can be analyzed through a simple knowledge query mechanism, a visualization tool (association rule map with confidence and support weighted edges), or the information filter (OLAP tools such as a data cube are possible as shown in Figure 5, but are not currently implemented). The information filter makes use of the preprocessed content and structure information to automatically filter the results of the knowledge discovery algorithms for patterns that are potentially interesting. For example, usage clusters that contain page views from multiple content clusters are potentially interesting, whereas usage clusters that match content clusters may not be interesting. The details of the method the information filter uses to combine and compare evidence from the different data sources are contained in [31].
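The usage-cluster example in the paragraph above can be sketched as a simple filter. This illustrates only that one heuristic, not the evidence-combination method of [31]; the function name `interesting_usage_clusters`, the page names, and the content cluster labels are hypothetical.

```python
def interesting_usage_clusters(usage_clusters, content_cluster_of):
    """Keep usage clusters whose pages span more than one content cluster."""
    flagged = []
    for cluster in usage_clusters:
        topics = {content_cluster_of[page] for page in cluster}
        if len(topics) > 1:
            flagged.append(cluster)
    return flagged

# Hypothetical content clustering: page -> content cluster label.
content_cluster_of = {"A": "music", "B": "music", "C": "sports"}
flagged = interesting_usage_clusters([{"A", "B"}, {"A", "C"}], content_cluster_of)
```

The cluster containing A and B mirrors a single content cluster and is filtered out, whereas the cluster containing A and C crosses content clusters and is kept as potentially interesting.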
6. PRIVACY ISSUES
7. CONCLUSIONS
8. REFERENCES
[1] Accrue. http://www.accrue.com.
[2] Alladvantage. http://www.alladvantage.com.
[3] Andromedia aria. http://www.andromedia.com.
[4] Broadvision. http://www.broadvision.com.
[5] Hit list commerce. http://www.marketwave.com.
[6] Likeminds. http://www.andromedia.com.
[7] Netgenesis. http://www.netgenesis.com.
[8] Netperceptions. http://www.netperceptions.com.
[9] Netzero. http://www.netzero.com.
[10] Platform for privacy project. http://www.w3.org/P3P/.
[Figure 5: diagram labels. INPUT: Site Files, Access Log, Referrer Log, Agent Log, Registration or Remote Agent Data. PREPROCESSING: Site Spider, Data Cleaning, User Identification, Session Identification, Path Completion, Classification Algorithm, Episode Identification; outputs: Site Topology, Site Content, Page Classification, Episode File. PATTERN DISCOVERY: Sequential Pattern Mining, Association Rule Mining, Clustering, Standard Statistics Package; outputs: Sequential Patterns, Association Rules, Page Clusters, User Clusters, Usage Statistics. PATTERN ANALYSIS: Information Filter, OLAP/Visualization, Knowledge Query Mechanism.]
[29] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. In International Conference on Tools with Artificial Intelligence, pages 558-567, Newport Beach, 1997. IEEE.
[30] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1), 1999.
[46] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering unexpected patterns. In Fourth International Conference on Knowledge Discovery and Data Mining, pages 94-100, New York, New York, 1998.
[47] M. Pazzani, L. Nguyen, and S. Mantik. Learning from hotlists and coldlists: Towards a www information filtering and seeking agent. In IEEE 1995 International Conference on Tools with Artificial Intelligence, 1995.
[48] Mike Perkowitz and Oren Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.
[49] Mike Perkowitz and Oren Etzioni. Adaptive web sites: Conceptual cluster mining. In Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.
[50] Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the web. In CHI-96, Vancouver, 1996.
[51] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[52] S. Schechter, M. Krishnan, and M. D. Smith. Using path profiles to predict http requests. In 7th International World Wide Web Conference, Brisbane, Australia, 1998.
[53] Cyrus Shahabi, Amir M Zarkesh, Jafar Adibi, and Vishal Shah. Knowledge discovery from users web-page navigation. In Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.
[54] E. Spertus. Parasite: Mining structural information on the web. Computer Networks and ISDN Systems: The International Journal of Computer and Telecommunication Networking, 29:1205-1215, 1997.
[55] Myra Spiliopoulou and Lukas C Faulstich. Wum: A web utilization miner. In EDBT Workshop WebDB98, Valencia, Spain, 1998. Springer Verlag.
[56] Kun-lung Wu, Philip S Yu, and Allen Ballman. Speedtracer: A web usage mining and analysis tool. IBM Systems Journal, 37(1), 1998.
[57] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal. From user access patterns to dynamic hypertext linking. In Fifth International World Wide Web Conference, Paris, France, 1996.
[58] O. R. Zaiane, M. Xin, and J. Han. Discovering web access patterns and trends by applying olap and data mining technology on web logs. In Advances in Digital Libraries, pages 19-29, Santa Barbara, CA, 1998.
[59] Amir Zarkesh, Jafar Adibi, Cyrus Shahabi, Reza Sadri, and Vishal Shah. Analysis and design of server informative www-sites. In Sixth International Conference on Information and Knowledge Management, Las Vegas, Nevada, 1997.